Skip to main content

Data Manipulation in pandas

Data manipulation is at the core of data analysis, allowing you to transform raw data into meaningful insights. Pandas offers a wide range of tools to manipulate data efficiently, including sorting, grouping, merging, and reshaping DataFrames. In this article, we’ll explore these essential data manipulation techniques in detail.


1. Sorting Data

Sorting data is a fundamental task when analyzing and preparing data for further processing. Pandas allows you to sort data in DataFrames based on the values in one or more columns.

1.1 Sorting by a Single Column

You can sort a DataFrame by a single column using the sort_values() method.

import pandas as pd

# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28],
'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
}
df = pd.DataFrame(data)

# Sorting by Age
df_sorted = df.sort_values(by='Age')
print("DataFrame sorted by Age:\n", df_sorted)

1.2 Sorting by Multiple Columns

You can sort by multiple columns by passing a list of column names to sort_values().

# Sorting by Age and then by Name
df_sorted_multi = df.sort_values(by=['Age', 'Name'])
print("DataFrame sorted by Age and Name:\n", df_sorted_multi)

1.3 Sorting in Descending Order

To sort in descending order, set the ascending parameter to False.

# Sorting by Age in descending order
df_sorted_desc = df.sort_values(by='Age', ascending=False)
print("DataFrame sorted by Age (descending):\n", df_sorted_desc)

2. Grouping Data

Grouping data is a powerful technique for summarizing and aggregating data. The groupby() method in pandas is used to group data based on one or more columns.

2.1 Grouping and Aggregating

You can group data by one or more columns and then apply aggregate functions, such as sum(), mean(), or count().

# Sample DataFrame with sales data
sales_data = {
'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
'Region': ['North', 'North', 'South', 'South', 'West', 'West'],
'Sales': [100, 150, 200, 250, 300, 350]
}
df_sales = pd.DataFrame(sales_data)

# Grouping by Product and summing the Sales
grouped_sales = df_sales.groupby('Product')['Sales'].sum()
print("Total sales by Product:\n", grouped_sales)

2.2 Grouping by Multiple Columns

You can group by multiple columns to perform more complex aggregations.

# Grouping by Product and Region, then summing the Sales
grouped_sales_multi = df_sales.groupby(['Product', 'Region'])['Sales'].sum()
print("Total sales by Product and Region:\n", grouped_sales_multi)

2.3 Applying Multiple Aggregations

Pandas allows you to apply multiple aggregation functions at once using agg().

# Applying multiple aggregations: sum and mean
grouped_sales_agg = df_sales.groupby('Product')['Sales'].agg(['sum', 'mean'])
print("Aggregated sales by Product:\n", grouped_sales_agg)

3. Merging and Joining DataFrames

Merging and joining are essential operations when working with multiple DataFrames that need to be combined based on common columns or indexes.

3.1 Merging DataFrames

You can merge two DataFrames using the merge() function, specifying the column(s) on which to merge.

# Sample DataFrames
df1 = pd.DataFrame({
'Key': ['A', 'B', 'C'],
'Value1': [100, 200, 300]
})

df2 = pd.DataFrame({
'Key': ['A', 'B', 'D'],
'Value2': [400, 500, 600]
})

# Merging DataFrames on 'Key' column
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print("Merged DataFrame (inner join):\n", merged_df)

3.2 Types of Joins

Pandas supports various types of joins, including inner, outer, left, and right joins.

# Left join
merged_left = pd.merge(df1, df2, on='Key', how='left')
print("Merged DataFrame (left join):\n", merged_left)

# Outer join
merged_outer = pd.merge(df1, df2, on='Key', how='outer')
print("Merged DataFrame (outer join):\n", merged_outer)

4. Reshaping DataFrames

Reshaping data involves changing the structure of your DataFrame to make it easier to analyze or visualize. Common reshaping operations include pivoting, melting, and stacking/unstacking.

4.1 Pivoting DataFrames

Pivoting is the process of transforming or rotating data in a DataFrame so that the rows become columns and vice versa.

# Sample DataFrame with multi-index
df_pivot = pd.DataFrame({
'Product': ['A', 'A', 'B', 'B'],
'Region': ['North', 'South', 'North', 'South'],
'Sales': [100, 200, 150, 250]
})

# Pivoting the DataFrame
pivoted_df = df_pivot.pivot(index='Product', columns='Region', values='Sales')
print("Pivoted DataFrame:\n", pivoted_df)

4.2 Melting DataFrames

Melting is the reverse of pivoting. It transforms a DataFrame from a wide format to a long format.

# Melting the DataFrame
melted_df = pivoted_df.reset_index().melt(id_vars='Product', value_vars=['North', 'South'])
print("Melted DataFrame:\n", melted_df)

4.3 Stacking and Unstacking DataFrames

Stacking compresses columns into a single column, while unstacking does the opposite by spreading out rows into columns.

# Stacking the DataFrame
stacked_df = pivoted_df.stack()
print("Stacked DataFrame:\n", stacked_df)

# Unstacking the DataFrame
unstacked_df = stacked_df.unstack()
print("Unstacked DataFrame:\n", unstacked_df)

5. Conclusion

Data manipulation is a critical step in the data analysis process, and pandas provides a wide array of tools to sort, group, merge, and reshape data. Mastering these techniques will allow you to prepare and analyze your data more effectively, unlocking deeper insights and driving better decision-making. In the next articles, we’ll explore more advanced pandas functionalities and dive into data cleaning, aggregation, and visualization techniques.