Basic Data Wrangling with NumPy

Data wrangling, the process of cleaning, transforming, and organizing raw data into a usable format, is a fundamental skill in data science. NumPy's array operations handle many of these tasks concisely and without explicit Python loops. In this article, we’ll cover the essential techniques for handling data with NumPy: reshaping arrays, filtering data, and combining arrays.


1. Reshaping Arrays

Reshaping is the process of changing the dimensions of an array without changing its data. This is often necessary when preparing data for machine learning models or when aligning data for operations.

1.1 Reshaping with np.reshape()

You can use the np.reshape() function to change the shape of an array. The total number of elements must remain the same.

import numpy as np

# Create a 1D array
arr = np.array([1, 2, 3, 4, 5, 6])

# Reshape the array to a 2x3 matrix
reshaped_arr = np.reshape(arr, (2, 3))
print("Reshaped array (2x3):\n", reshaped_arr)
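
When only one dimension is known in advance, NumPy can infer the other: passing -1 for a dimension tells reshape to compute it from the total number of elements. The short sketch below (the array and the variable name `inferred` are illustrative, not from the example above) also shows what happens when the requested shape does not match the element count.

```python
import numpy as np

# A 1D array with 12 elements: 1 through 12
arr = np.arange(1, 13)

# -1 asks NumPy to infer that dimension from the total size (12 / 3 = 4)
inferred = arr.reshape(3, -1)
print("Inferred shape:", inferred.shape)

# A shape whose product differs from the element count raises ValueError
try:
    arr.reshape(5, 3)
except ValueError as exc:
    print("Cannot reshape:", exc)
```

Only one dimension can be given as -1 in a single call.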

1.2 Flattening an Array

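Flattening converts a multi-dimensional array into a 1D array. This is useful when you need to collapse all elements into a single dimension. The flatten() method always returns a copy, so modifying the result does not change the original array.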

# Flatten the 2x3 array back to a 1D array
flattened_arr = reshaped_arr.flatten()
print("Flattened array:\n", flattened_arr)
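
A related method, ravel(), also produces a 1D array but returns a view of the original data when it can, which avoids a copy. The following sketch (with illustrative variable names) contrasts the two; it assumes a freshly created, contiguous array, which is the case where ravel() returns a view.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = matrix.flatten()  # always a new, independent array
flat_view = matrix.ravel()    # a view here, because matrix is contiguous

flat_copy[0] = 100  # does not affect matrix
flat_view[1] = 200  # writes through to matrix

print("Original after edits:\n", matrix)
print("Shares memory with copy:", np.shares_memory(matrix, flat_copy))
print("Shares memory with view:", np.shares_memory(matrix, flat_view))
```

Prefer flatten() when you need an independent copy, and ravel() when you only read the values or want to avoid the extra memory.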

2. Filtering Data

Filtering is the process of selecting elements from an array that meet specific criteria. This is commonly used in data cleaning and preprocessing.

2.1 Boolean Indexing

You can filter an array using boolean conditions, where a condition is applied to each element of the array.

# Create a sample array
arr = np.array([10, 20, 30, 40, 50])

# Filter elements greater than 25
filtered_arr = arr[arr > 25]
print("Elements greater than 25:", filtered_arr)
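
Several conditions can be combined in one mask. A minimal sketch, reusing the same sample values: note that NumPy uses the element-wise operators & (and), | (or), and ~ (not) rather than Python's and/or/not, and that each condition needs its own parentheses.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Elements strictly between 15 and 45
in_range = arr[(arr > 15) & (arr < 45)]
print("Between 15 and 45:", in_range)

# Elements below 15 or above 45
outside = arr[(arr < 15) | (arr > 45)]
print("Below 15 or above 45:", outside)

# Everything except 30
not_thirty = arr[~(arr == 30)]
print("Not equal to 30:", not_thirty)
```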

2.2 Using np.where()

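The np.where() function offers another way to select elements based on a condition. Called with only a condition, it returns the indices of the elements that satisfy it, as a tuple containing one index array per dimension. Those indices can then be used to select the matching elements.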

# Use np.where to find indices where the condition is met
indices = np.where(arr > 25)
print("Indices where elements are greater than 25:", indices)

# Select elements based on the condition
selected_elements = arr[indices]
print("Selected elements:", selected_elements)
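
np.where() also accepts three arguments, np.where(condition, x, y), in which case it returns an array that takes its values from x where the condition is true and from y elsewhere. The sketch below uses this form to cap values and to label them; the threshold of 25 and the labels are illustrative choices.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Replace every value above 25 with 25, keep the rest unchanged
capped = np.where(arr > 25, 25, arr)
print("Capped at 25:", capped)

# Build a label for each element based on the same condition
labels = np.where(arr > 25, "high", "low")
print("Labels:", labels)
```

Unlike boolean indexing, this form keeps the result the same shape as the input, which is convenient when cleaning a column in place.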

3. Combining Arrays

Combining arrays involves concatenating or stacking arrays to create larger datasets. This is useful when merging data from different sources or when preparing datasets for analysis.

3.1 Concatenating Arrays with np.concatenate()

The np.concatenate() function is used to join two or more arrays along an existing axis.

# Create two sample arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Concatenate the arrays along the first axis
combined_arr = np.concatenate((arr1, arr2))
print("Concatenated array:", combined_arr)
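
For 2D arrays, the axis argument controls the direction of the join. In this sketch (the arrays a and b are illustrative), axis=0 appends rows and axis=1 appends columns; the arrays must match in every dimension except the one being joined.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])   # shape (1, 2)

# axis=0: add b as a new row (column counts must match)
rows = np.concatenate((a, b), axis=0)
print("Joined along axis 0:\n", rows)

# axis=1: add b as a new column; b.T has shape (2, 1) so row counts match
cols = np.concatenate((a, b.T), axis=1)
print("Joined along axis 1:\n", cols)
```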

3.2 Stacking Arrays

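Stacking combines arrays into a larger array. np.vstack() stacks arrays vertically (row-wise) and np.hstack() stacks them horizontally (column-wise). For 1D inputs, np.vstack() treats each array as a row and returns a 2D array, while np.hstack() joins the arrays end to end into a longer 1D array.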

# Create two 1D arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Vertical stacking (rows)
vstacked_arr = np.vstack((arr1, arr2))
print("Vertically stacked array:\n", vstacked_arr)

# Horizontal stacking (for 1D inputs, the arrays are joined end to end)
hstacked_arr = np.hstack((arr1, arr2))
print("Horizontally stacked array:", hstacked_arr)
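
When the goal is to join arrays along a genuinely new axis, np.stack() does so explicitly, and np.column_stack() turns 1D arrays into the columns of a 2D array. A brief sketch with the same two sample arrays (the result names are illustrative):

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# New axis at position 0: each input becomes a row, shape (2, 3)
stacked_rows = np.stack((arr1, arr2), axis=0)
print("np.stack, axis=0:\n", stacked_rows)

# New axis at position 1: each input becomes a column, shape (3, 2)
stacked_cols = np.stack((arr1, arr2), axis=1)
print("np.stack, axis=1:\n", stacked_cols)

# column_stack gives the same column layout for 1D inputs
columns = np.column_stack((arr1, arr2))
print("np.column_stack:\n", columns)
```

np.stack() requires all inputs to have the same shape, whereas np.concatenate() only requires them to match outside the joined axis.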

Conclusion

Basic data wrangling with NumPy provides the foundational skills needed to manipulate and organize data effectively. By mastering reshaping, filtering, and combining arrays, you’ll be well-prepared to tackle more complex data manipulation tasks in your data science projects. These techniques are essential for cleaning and preparing data before feeding it into machine learning models, making them a critical part of your data science toolkit.