Handling Categorical Data in pandas

Categorical data is common in many datasets, especially in fields like marketing, social sciences, and machine learning. Pandas provides robust tools for handling categorical data efficiently, allowing you to convert columns to categorical types and perform one-hot encoding for machine learning tasks. In this article, we’ll explore how to manage and manipulate categorical data in pandas.

1. Understanding Categorical Data

Categorical data represents discrete values or categories, such as "red", "blue", "green" for colors, or "small", "medium", "large" for sizes. These values can be either nominal (unordered) or ordinal (ordered).
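
As a minimal sketch of the nominal/ordinal distinction, the snippet below builds one unordered and one ordered categorical. The variable names and sample values are illustrative assumptions, not part of any dataset used later in this article.

```python
import pandas as pd

# Nominal: colors have no inherent order
colors = pd.Categorical(['red', 'blue', 'green', 'blue'])

# Ordinal: sizes have a meaningful order, which is declared explicitly
sizes = pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)

print("colors ordered?", colors.ordered)
print("sizes ordered?", sizes.ordered)

# min() and max() are only meaningful for an ordered categorical
print("Smallest size:", sizes.min())
print("Largest size:", sizes.max())
```

Only the ordered categorical supports order-based operations such as min() and max(); pandas raises a TypeError if you request them on an unordered one.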

1.1 Why Use Categorical Data Types?

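Using categorical data types can save memory and improve performance when working with large datasets. A categorical column stores each distinct value once and represents every row with a small integer code, so it is usually much more compact than an object (string) column when the number of distinct values is small relative to the number of rows.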

2. Converting Columns to Categorical Types

You can convert a column in a DataFrame to a categorical type using the pd.Categorical constructor or the astype('category') method.

2.1 Basic Conversion

Let’s start by converting a column of strings to a categorical type.

import pandas as pd

# Sample DataFrame
data = {
    'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large']
}
df = pd.DataFrame(data)

# Converting the 'Size' column to a categorical type
df['Size'] = pd.Categorical(df['Size'])
print("DataFrame with categorical 'Size' column:\n", df)
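
The astype() route mentioned above can be sketched as follows. The df_alt name is a hypothetical stand-in so that the df used elsewhere in this article is left untouched.

```python
import pandas as pd

# Hypothetical DataFrame mirroring the 'Size' column above
df_alt = pd.DataFrame(
    {'Size': ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large']}
)

# astype('category') infers the categories from the values present
df_alt['Size'] = df_alt['Size'].astype('category')

print("dtype of 'Size':", df_alt['Size'].dtype)
print("Inferred categories:", list(df_alt['Size'].cat.categories))
```

When the categories are inferred this way, they come out in sorted (here alphabetical) order and the categorical is unordered, which is why the next section sets the order explicitly.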

2.2 Specifying Categories and Order

When converting to categorical data, you can specify the categories and their order, which is particularly useful for ordinal data.

# Converting with specified categories and order
df['Size'] = pd.Categorical(df['Size'], categories=['Small', 'Medium', 'Large'], ordered=True)
print("DataFrame with ordered categorical 'Size' column:\n", df)
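
To illustrate what an explicit order makes possible, here is a small sketch that sorts and filters by size. It uses CategoricalDtype with astype() as an alternative to pd.Categorical; the sizes and size_dtype names are assumptions made for this example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Reusable dtype that carries both the categories and their order
size_dtype = CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)

sizes = pd.Series(
    ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large']
).astype(size_dtype)

# Sorting follows the declared order rather than alphabetical order
print("Sorted:", sizes.sort_values().tolist())

# Comparisons against a category label work because the type is ordered
at_least_medium = sizes[sizes >= 'Medium']
print("Medium or larger:", at_least_medium.tolist())
```

Defining the dtype once is convenient when several columns or DataFrames should share the same categories and order.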

2.3 Checking Memory Usage

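Converting columns to categorical types can reduce memory usage. You can check the memory usage before and after conversion. In the example below, 'Size' is already categorical, so only the 'Product' column changes between the two measurements, and on a DataFrame this small the difference is minor.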

# Checking memory usage before and after conversion
print("Memory usage before conversion:\n", df.memory_usage(deep=True))

# Converting another column to categorical
df['Product'] = pd.Categorical(df['Product'])

print("Memory usage after conversion:\n", df.memory_usage(deep=True))
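
The effect is easier to see on a larger column. The sketch below compares an object column with its categorical equivalent; the row count, the random seed, and the variable names are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic column: 100,000 rows drawn from only three distinct labels
rng = np.random.default_rng(0)
labels = rng.choice(['Small', 'Medium', 'Large'], size=100_000)

as_object = pd.Series(labels, dtype=object)
as_category = as_object.astype('category')

# deep=True counts the memory held by the string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("Object column bytes:     ", object_bytes)
print("Categorical column bytes:", category_bytes)
```

The saving comes from the low number of distinct values. A column in which almost every value is unique is unlikely to benefit, because each distinct value still has to be stored once alongside the codes.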

3. Working with Categorical Data

Once a column is converted to a categorical type, you can take advantage of pandas' categorical methods for analysis and manipulation.

3.1 Accessing Categories and Codes

You can access the categories and the underlying integer codes associated with each category.

# Accessing the categories
print("Categories in 'Size':\n", df['Size'].cat.categories)

# Accessing the codes
print("Codes in 'Size':\n", df['Size'].cat.codes)
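
The relationship between codes and categories can be sketched as below: each code is a position in the categories index, and a missing value is represented by the code -1. The sample values and variable names are assumptions for this example.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ['Small', None, 'Large', 'Medium'],
        categories=['Small', 'Medium', 'Large'],
        ordered=True,
    )
)

# Each code indexes into the categories; the missing value gets -1
codes = sizes.cat.codes
print("Codes:", codes.tolist())

# Rebuild the labels from the codes and the original dtype
rebuilt = pd.Categorical.from_codes(codes.to_numpy(), dtype=sizes.dtype)
print("Rebuilt values:", rebuilt.tolist())
```

Because -1 is reserved for missing values, it is worth handling it explicitly before using the codes as a numeric feature.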

3.2 Renaming Categories

You can rename the categories of a categorical column using the cat.rename_categories() method.

# Renaming categories
df['Size'] = df['Size'].cat.rename_categories({'Small': 'S', 'Medium': 'M', 'Large': 'L'})
print("DataFrame with renamed categories:\n", df)
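
Besides a full dictionary, cat.rename_categories() also accepts a partial mapping or a callable. The following sketch shows both on a hypothetical Series named sizes.

```python
import pandas as pd

sizes = pd.Series(['Small', 'Large', 'Medium'], dtype='category')

# A callable is applied to every category label
upper = sizes.cat.rename_categories(lambda label: label.upper())
print("Upper-cased categories:", list(upper.cat.categories))

# A partial mapping renames only the listed categories
partial = sizes.cat.rename_categories({'Small': 'S'})
print("Partially renamed categories:", list(partial.cat.categories))
```

Renaming changes only the labels; the integer codes stored for each row stay the same.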

3.3 Adding or Removing Categories

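Pandas allows you to add new categories or remove unused ones. A category that no row uses, such as the one added below, is dropped again by cat.remove_unused_categories().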

# Adding a new category
df['Size'] = df['Size'].cat.add_categories(['Extra Large'])
print("Categories after adding 'Extra Large':\n", df['Size'].cat.categories)

# Removing unused categories
df['Size'] = df['Size'].cat.remove_unused_categories()
print("Categories after removing unused ones:\n", df['Size'].cat.categories)

4. Handling Categorical Data in Real-World Scenarios

Let’s consider a more complex example of handling categorical data in a dataset.

Example: Customer Segmentation

Imagine you have a dataset containing customer information, including each customer's region and loyalty level.

# Sample customer data
customer_data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Region': ['North', 'South', 'East', 'West', 'North'],
    'LoyaltyLevel': ['Gold', 'Silver', 'Gold', 'Bronze', 'Silver']
}
df_customers = pd.DataFrame(customer_data)

# Converting 'Region' and 'LoyaltyLevel' to categorical
df_customers['Region'] = pd.Categorical(df_customers['Region'])
df_customers['LoyaltyLevel'] = pd.Categorical(df_customers['LoyaltyLevel'], categories=['Bronze', 'Silver', 'Gold'], ordered=True)

# One-hot encoding the 'Region' column
df_customers_one_hot = pd.get_dummies(df_customers, columns=['Region'])
print("Customer DataFrame with one-hot encoded 'Region':\n", df_customers_one_hot)
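
The ordered 'LoyaltyLevel' column can also be put to use. The sketch below rebuilds the same sample data, filters customers by loyalty tier, and derives an integer rank from the category codes. The LoyaltyRank column name and the drop_first and dtype choices are assumptions made for illustration, not requirements.

```python
import pandas as pd

# Same sample customer data as above
df_customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Region': ['North', 'South', 'East', 'West', 'North'],
    'LoyaltyLevel': ['Gold', 'Silver', 'Gold', 'Bronze', 'Silver'],
})
df_customers['LoyaltyLevel'] = pd.Categorical(
    df_customers['LoyaltyLevel'],
    categories=['Bronze', 'Silver', 'Gold'],
    ordered=True,
)

# Ordered categories support threshold filters
silver_or_better = df_customers[df_customers['LoyaltyLevel'] >= 'Silver']
print("Silver or better:", silver_or_better['CustomerID'].tolist())

# Ordinal encoding: Bronze=0, Silver=1, Gold=2
df_customers['LoyaltyRank'] = df_customers['LoyaltyLevel'].cat.codes

# One-hot encode 'Region' as integers, dropping the first level
encoded = pd.get_dummies(
    df_customers, columns=['Region'], drop_first=True, dtype=int
)
print(encoded)
```

Dropping the first level avoids a redundant column for models that are sensitive to perfectly collinear features; for other models, keeping every level may be just as suitable.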

5. Conclusion

Handling categorical data effectively is crucial for many data analysis and machine learning tasks. Pandas offers powerful tools to manage and manipulate categorical data, from converting columns to categorical types to performing one-hot encoding. By mastering these techniques, you'll be better equipped to work with datasets that contain categorical variables. In the next article, we’ll explore DataFrame indexing and multi-indexing techniques in pandas.
