Handling Categorical Data in pandas
Categorical data is common in many datasets, especially in fields like marketing, social sciences, and machine learning. Pandas provides robust tools for handling categorical data efficiently, allowing you to convert columns to categorical types and perform one-hot encoding for machine learning tasks. In this article, we’ll explore how to manage and manipulate categorical data in pandas.
1. Understanding Categorical Data
Categorical data represents discrete values or categories, such as "red", "blue", "green" for colors, or "small", "medium", "large" for sizes. These values can be either nominal (unordered) or ordinal (ordered).
1.1 Why Use Categorical Data Types?
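A categorical column stores each distinct value once and represents every row with a small integer code. When values repeat often, this typically uses far less memory than an object (string) column, and operations such as grouping and sorting can run faster on large datasets.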
2. Converting Columns to Categorical Types
You can convert a column in a DataFrame to a categorical type using the pd.Categorical constructor or the astype('category') method.
2.1 Basic Conversion
Let’s start by converting a column of strings to a categorical type.
import pandas as pd
# Sample DataFrame
data = {
    'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large']
}
df = pd.DataFrame(data)
# Converting the 'Size' column to a categorical type
df['Size'] = pd.Categorical(df['Size'])
print("DataFrame with categorical 'Size' column:\n", df)
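The astype() route mentioned above is not shown in the example, so here is a minimal sketch of it. It assumes a fresh copy of the 'Size' values (the name df_example is only for illustration) and shows that, when no categories are given, pandas infers them from the unique values and sorts them.

```python
import pandas as pd

# Illustrative DataFrame with the same 'Size' values as the sample above
df_example = pd.DataFrame({
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large']
})

# astype('category') infers the categories from the unique values
size_as_category = df_example['Size'].astype('category')

# pd.Categorical builds an equivalent categorical from the same values
size_from_constructor = pd.Series(pd.Categorical(df_example['Size']))

print(size_as_category.dtype)                  # category
print(list(size_as_category.cat.categories))   # ['Large', 'Medium', 'Small']
```

Note that the inferred categories are in alphabetical order, which is rarely the order you want for sizes.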
2.2 Specifying Categories and Order
When converting to categorical data, you can specify the categories and their order, which is particularly useful for ordinal data.
# Converting with specified categories and order
df['Size'] = pd.Categorical(df['Size'], categories=['Small', 'Medium', 'Large'], ordered=True)
print("DataFrame with ordered categorical 'Size' column:\n", df)
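To illustrate what the order buys you, the following sketch uses an ordered dtype built with CategoricalDtype (an alternative way of declaring the same categories, not something the example above requires). Comparisons, sorting, and min/max then follow the declared order rather than alphabetical order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the ordinal scale once and reuse it
size_dtype = CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
sizes = pd.Series(
    ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large']
).astype(size_dtype)

# Comparisons use the declared order: Small < Medium < Large
at_least_medium = sizes[sizes >= 'Medium']

# Sorting and min/max respect the same order
sorted_sizes = sizes.sort_values()

print(sizes.min(), sizes.max())    # Small Large
print(at_least_medium.tolist())    # ['Large', 'Medium', 'Medium', 'Large']
```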
2.3 Checking Memory Usage
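Converting columns to categorical types can reduce memory usage, especially when a column has many rows and few distinct values. You can check the memory usage before and after conversion with memory_usage(deep=True). On a very small sample like this one, the saving may be negligible or even absent, because storing the categories themselves carries a fixed overhead.
# Memory usage while 'Product' is still an object column ('Size' is already categorical)
print("Memory usage before conversion:\n", df.memory_usage(deep=True))
# Converting the 'Product' column to categorical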
df['Product'] = pd.Categorical(df['Product'])
print("Memory usage after conversion:\n", df.memory_usage(deep=True))
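A six-row sample is too small to show much of a difference, so the sketch below repeats the comparison on a hypothetical column of 100,000 rows drawn from three labels. The row count and variable names are assumptions chosen for illustration; exact byte counts depend on your pandas and Python versions, so none are quoted here.

```python
import pandas as pd

# Hypothetical larger column: 100,000 rows cycling through three labels
labels = ['Small', 'Medium', 'Large']
sizes_object = pd.Series([labels[i % 3] for i in range(100_000)], dtype=object)
sizes_category = sizes_object.astype('category')

# deep=True counts the memory of the Python strings themselves
object_bytes = sizes_object.memory_usage(deep=True)
category_bytes = sizes_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes:,} bytes")
print(f"category dtype: {category_bytes:,} bytes")
```

The categorical version stores one small integer code per row plus the three labels once, which is why it comes out much smaller here.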
3. Working with Categorical Data
Once a column is converted to a categorical type, you can use the methods and attributes exposed through the .cat accessor for analysis and manipulation.
3.1 Accessing Categories and Codes
You can access the categories and the underlying integer codes associated with each category.
# Accessing the categories
print("Categories in 'Size':\n", df['Size'].cat.categories)
# Accessing the codes
print("Codes in 'Size':\n", df['Size'].cat.codes)
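As a small illustration of how codes relate to categories, the sketch below builds a standalone ordered categorical that also contains a missing value. Each code is the position of the value in the categories index, and missing values get the code -1.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['Small', 'Large', None, 'Medium'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
))

# Each code is the position of the value in .cat.categories; missing values are -1
codes = sizes.cat.codes
print(codes.tolist())  # [0, 2, -1, 1]

# Map the codes back to labels; -1 has no entry, so it becomes NaN
lookup = dict(enumerate(sizes.cat.categories))
decoded = codes.map(lookup)
print(decoded.tolist())
```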
3.2 Renaming Categories
You can rename the categories of a categorical column using the cat.rename_categories() method.
# Renaming categories
df['Size'] = df['Size'].cat.rename_categories({'Small': 'S', 'Medium': 'M', 'Large': 'L'})
print("DataFrame with renamed categories:\n", df)
3.3 Adding or Removing Categories
Pandas allows you to add new categories or remove unused ones.
# Adding a new category
df['Size'] = df['Size'].cat.add_categories(['Extra Large'])
print("Categories after adding 'Extra Large':\n", df['Size'].cat.categories)
# Removing unused categories
df['Size'] = df['Size'].cat.remove_unused_categories()
print("Categories after removing unused ones:\n", df['Size'].cat.categories)
4. Handling Categorical Data in Real-World Scenarios
Let’s consider a more complex example of handling categorical data in a dataset.
Example: Customer Segmentation
Imagine you have a dataset containing customer information, including each customer's region and loyalty level. Region is nominal, while loyalty level is ordinal (Bronze < Silver < Gold).
# Sample customer data
customer_data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Region': ['North', 'South', 'East', 'West', 'North'],
    'LoyaltyLevel': ['Gold', 'Silver', 'Gold', 'Bronze', 'Silver']
}
df_customers = pd.DataFrame(customer_data)
# Converting 'Region' and 'LoyaltyLevel' to categorical
df_customers['Region'] = pd.Categorical(df_customers['Region'])
df_customers['LoyaltyLevel'] = pd.Categorical(df_customers['LoyaltyLevel'], categories=['Bronze', 'Silver', 'Gold'], ordered=True)
# One-hot encoding the 'Region' column
df_customers_one_hot = pd.get_dummies(df_customers, columns=['Region'])
print("Customer DataFrame with one-hot encoded 'Region':\n", df_customers_one_hot)
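For machine learning input, the two kinds of categorical column are often encoded differently. The sketch below rebuilds the customer data and shows one possible approach: integer codes for the ordered 'LoyaltyLevel' column, and one-hot columns with dtype=int for 'Region'. The 'LoyaltyCode' column name is an illustrative choice, and whether ordinal codes suit your model is a judgment call.

```python
import pandas as pd

df_customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Region': ['North', 'South', 'East', 'West', 'North'],
    'LoyaltyLevel': ['Gold', 'Silver', 'Gold', 'Bronze', 'Silver'],
})
df_customers['LoyaltyLevel'] = pd.Categorical(
    df_customers['LoyaltyLevel'],
    categories=['Bronze', 'Silver', 'Gold'],
    ordered=True,
)

# Ordinal encoding: Bronze=0, Silver=1, Gold=2
df_customers['LoyaltyCode'] = df_customers['LoyaltyLevel'].cat.codes

# One-hot encoding with integer 0/1 columns instead of booleans
encoded = pd.get_dummies(df_customers, columns=['Region'], dtype=int)

print(df_customers['LoyaltyCode'].tolist())  # [2, 1, 2, 0, 1]
print(encoded.columns.tolist())
```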
5. Conclusion
Handling categorical data effectively is crucial for many data analysis and machine learning tasks. Pandas offers powerful tools to manage and manipulate categorical data, from converting columns to categorical types to performing one-hot encoding. By mastering these techniques, you'll be better equipped to work with datasets that contain categorical variables. In the next article, we’ll explore DataFrame indexing and multi-indexing techniques in pandas.