Advanced Data Wrangling and Cleaning

Data wrangling and cleaning are critical steps in the data science workflow. While basic techniques allow you to manage straightforward tasks, real-world data often presents more complex challenges. This article explores advanced methods for data wrangling and cleaning, building on your foundational skills in pandas and NumPy to handle complex data scenarios effectively.


1. Introduction

1.1 Overview

Data wrangling and cleaning involve transforming raw data into a usable format by addressing inconsistencies, filling in missing values, handling outliers, and standardizing formats. As datasets grow in complexity and size, advanced techniques become necessary to maintain data quality, which is crucial for accurate analysis and modeling.

1.2 Goals

The goal of this article is to enhance your data cleaning toolkit with advanced methods that can tackle complex data structures, sophisticated imputation strategies, and robust outlier detection techniques—all using pandas and NumPy. These skills are essential for dealing with real-world data, where straightforward methods often fall short.


2. Handling Complex Data Structures

2.1 Dealing with Nested and Hierarchical Data

In many real-world scenarios, data comes in nested or hierarchical formats, such as JSON or XML. Flattening these structures is often necessary to perform traditional data analysis.

Flattening Nested JSON Data

Consider a JSON file containing user information with nested address details:

{
  "user": {
    "id": 1,
    "name": "John Doe",
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "zipcode": "12345"
    }
  }
}

To flatten this structure into a pandas DataFrame:

import pandas as pd
import json

# Example JSON data
json_data = '''
{
  "user": {
    "id": 1,
    "name": "John Doe",
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "zipcode": "12345"
    }
  }
}
'''

# Load JSON data
data = json.loads(json_data)

# Normalize and flatten the data
df = pd.json_normalize(data, sep='_')
print(df)

Output:

   user_id user_name user_address_street user_address_city user_address_zipcode
0        1  John Doe         123 Main St           Anytown                12345
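
When the nested structure also contains lists of records (for example, several orders per user), pd.json_normalize can expand each list element into its own row via record_path while carrying parent fields along as meta. A minimal sketch with hypothetical order data:

import pandas as pd

# Hypothetical nested data: one user with a list of orders
data = {
    "user": {"id": 1, "name": "John Doe"},
    "orders": [
        {"order_id": 101, "amount": 25.0},
        {"order_id": 102, "amount": 40.5}
    ]
}

# Expand the 'orders' list into rows and keep the user fields as metadata columns
orders_df = pd.json_normalize(
    data,
    record_path="orders",
    meta=[["user", "id"], ["user", "name"]],
    sep="_"
)
print(orders_df)

Each order becomes one row, with user_id and user_name repeated alongside it.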

2.2 Working with Unstructured Data

Unstructured data, such as text data, requires specialized techniques to transform it into a structured format.

Text Data Cleaning

Suppose you have a DataFrame with a text column containing customer reviews:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'review': [
        'Great product! 😊 #happy',
        'Terrible experience... 😠 #fail',
        'Average quality, decent price. #ok'
    ]
})

Cleaning Steps:

  1. Remove Hashtags, Emojis, and Special Characters:

    import re

    # Function to remove hashtags, emojis, and other special characters
    def clean_text(text):
        text = re.sub(r'#\w+', '', text)     # Remove hashtags first, while '#' is still present
        text = re.sub(r'[^\w\s]', '', text)  # Remove emojis and punctuation
        text = re.sub(r'\d+', '', text)      # Remove numbers
        text = text.lower()                  # Convert to lowercase
        return text.strip()

    df['clean_review'] = df['review'].apply(clean_text)
    print(df)

    Output:

                                    review                  clean_review
    0            Great product! 😊 #happy                 great product
    1     Terrible experience... 😠 #fail           terrible experience
    2  Average quality, decent price. #ok  average quality decent price
  2. Tokenize Text:

    # Tokenize the clean text
    df['tokens'] = df['clean_review'].str.split()
    print(df[['clean_review', 'tokens']])

    Output:

                       clean_review                             tokens
    0                 great product                   [great, product]
    1           terrible experience             [terrible, experience]
    2  average quality decent price  [average, quality, decent, price]
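
Once the reviews are tokenized, a quick way to see which words dominate the corpus is to explode the token lists into individual rows and count them; a minimal sketch building on the tokens column above:

# Flatten the lists of tokens into one row per token and count occurrences
token_counts = df['tokens'].explode().value_counts()
print(token_counts)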

3. Advanced Techniques for Missing Data

3.1 Understanding Missing Data Patterns

Before handling missing data, it's important to understand the pattern of missingness:

  • MCAR (Missing Completely at Random): Missingness is unrelated to any data.
  • MAR (Missing at Random): Missingness is related to observed data but not the missing data itself.
  • MNAR (Missing Not at Random): Missingness is related to the missing data.

Identifying the pattern helps in choosing the appropriate imputation method.
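
Pandas can help reveal these patterns before you choose a method. A minimal sketch, using a small hypothetical DataFrame with age and income columns, that measures how much is missing per column and checks whether missingness in income is related to the observed age values (a hint of MAR):

import pandas as pd
import numpy as np

# Hypothetical data where income tends to be missing for younger respondents
df = pd.DataFrame({
    'age': [22, 25, 31, 40, 45, 52, 60, 27],
    'income': [np.nan, np.nan, 42000, 50000, np.nan, 61000, 65000, np.nan]
})

# Fraction of missing values per column
print(df.isna().mean())

# Average age for rows with and without a missing income value
print(df.groupby(df['income'].isna())['age'].mean())

A clear difference between the two group means suggests the missingness is related to an observed variable (MAR) rather than completely random.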

3.2 Advanced Imputation Techniques

Simple methods like mean or median imputation might not suffice for complex datasets. Here are more advanced techniques using pandas and NumPy.

Interpolation Methods

Interpolation fills in missing values using linear or non-linear functions based on other data points.

import pandas as pd
import numpy as np

# Sample time series data with missing values
dates = pd.date_range('2023-01-01', periods=10)
data = [1, np.nan, 3, np.nan, 5, 6, np.nan, 8, 9, 10]
df = pd.DataFrame({'Date': dates, 'Value': data})
df.set_index('Date', inplace=True)

# Interpolate missing values
df['Linear_Interpolation'] = df['Value'].interpolate(method='linear')
df['Quadratic_Interpolation'] = df['Value'].interpolate(method='quadratic')  # non-linear methods require SciPy
print(df)

Output:

             Value  Linear_Interpolation  Quadratic_Interpolation
Date
2023-01-01     1.0              1.000000                 1.000000
2023-01-02     NaN              2.000000                 2.000000
2023-01-03     3.0              3.000000                 3.000000
2023-01-04     NaN              4.000000                 4.000000
2023-01-05     5.0              5.000000                 5.000000
2023-01-06     6.0              6.000000                 6.000000
2023-01-07     NaN              7.000000                 7.000000
2023-01-08     8.0              8.000000                 8.000000
2023-01-09     9.0              9.000000                 9.000000
2023-01-10    10.0             10.000000                10.000000

Using Rolling Statistics

Impute missing values using rolling statistics like mean or median.

# Rolling mean imputation
df['Rolling_Mean_Imputation'] = df['Value'].fillna(df['Value'].rolling(window=2, min_periods=1).mean())
print(df)

Output:

             Value  Linear_Interpolation  Quadratic_Interpolation  Rolling_Mean_Imputation
Date
2023-01-01     1.0              1.000000                 1.000000                 1.000000
2023-01-02     NaN              2.000000                 2.000000                 1.000000
2023-01-03     3.0              3.000000                 3.000000                 3.000000
2023-01-04     NaN              4.000000                 4.000000                 3.000000
2023-01-05     5.0              5.000000                 5.000000                 5.000000
2023-01-06     6.0              6.000000                 6.000000                 6.000000
2023-01-07     NaN              7.000000                 7.000000                 6.000000
2023-01-08     8.0              8.000000                 8.000000                 8.000000
2023-01-09     9.0              9.000000                 9.000000                 9.000000
2023-01-10    10.0             10.000000                10.000000                10.000000

3.3 Handling Data Imbalance

Data imbalance is usually discussed in the context of classification tasks, but it matters during data cleaning as well: heavily imbalanced categories can bias summaries and downstream analysis if they go unnoticed.

Addressing Imbalanced Categories

  • Resampling Techniques: Oversample minority classes or undersample majority classes; both approaches are sketched below.

Example using pandas:

# Sample DataFrame with imbalanced classes
df = pd.DataFrame({
    'Category': ['A'] * 90 + ['B'] * 10,
    'Value': np.random.randn(100)
})

# Oversample minority class 'B'
df_minority = df[df['Category'] == 'B']
df_oversampled = df_minority.sample(n=80, replace=True, random_state=42)

# Combine with majority class
df_balanced = pd.concat([df[df['Category'] == 'A'], df_oversampled])

print(df_balanced['Category'].value_counts())

Output:

A    90
B    80
Name: Category, dtype: int64
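
Undersampling the majority class follows the same pattern, but draws a smaller sample without replacement; a minimal sketch reusing the DataFrame above:

# Undersample majority class 'A' down to the size of the minority class
df_majority = df[df['Category'] == 'A']
df_undersampled = df_majority.sample(n=len(df[df['Category'] == 'B']), random_state=42)

# Combine with the untouched minority class
df_balanced_under = pd.concat([df_undersampled, df[df['Category'] == 'B']])
print(df_balanced_under['Category'].value_counts())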

4. Data Transformation and Standardization

4.1 Advanced Scaling and Normalization

Applying appropriate transformations can improve the performance of statistical models and algorithms.

Log Transformation

Useful for reducing skewness in data.

import numpy as np

# Sample data
data = np.array([1, 10, 100, 1000, 10000])

# Log transformation
log_data = np.log10(data)
print(log_data)

Output:

[0. 1. 2. 3. 4.]

Power Transformations with NumPy

When data contains negative values, use transformations that can handle them.

# Sample data with negative values
data = np.array([-100, -10, -1, 0, 1, 10, 100])

# Apply sign function with log to handle negatives
def log_transform(x):
    return np.sign(x) * np.log1p(np.abs(x))

transformed_data = log_transform(data)
print(transformed_data)

Output:

[-4.61512052 -2.39789527 -0.69314718  0.          0.69314718  2.39789527
4.61512052]

4.2 Encoding High-Cardinality Categorical Variables

When dealing with categorical variables with many unique values, One-Hot Encoding becomes impractical.

Frequency Encoding

Replace categories with their frequency counts.

import pandas as pd

# Sample data
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'E', 'F', 'A']
})

# Calculate frequency encoding
frequency_encoding = df['Category'].value_counts()
df['Category_Encoded'] = df['Category'].map(frequency_encoding)
print(df)

Output:

  Category  Category_Encoded
0        A                 3
1        B                 2
2        C                 2
3        A                 3
4        B                 2
5        C                 2
6        D                 1
7        E                 1
8        F                 1
9        A                 3
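
If you want values on a comparable scale regardless of dataset size, the same idea works with relative frequencies; a minimal variation on the example above using value_counts(normalize=True):

# Encode with relative frequencies instead of raw counts
relative_frequency = df['Category'].value_counts(normalize=True)
df['Category_Freq'] = df['Category'].map(relative_frequency)
print(df)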

5. Outlier and Anomaly Detection

5.1 Identifying Outliers

Outliers can significantly impact analysis and need to be identified accurately.

Z-Score Method with Robust Statistics

The modified Z-score uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation, which makes it far less sensitive to the very outliers it is meant to detect.

import pandas as pd
import numpy as np

# Sample data
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 100]) # 100 is an outlier
df = pd.DataFrame({'Value': data})

# Calculate median and MAD
median = df['Value'].median()
mad = np.median(np.abs(df['Value'] - median))

# Compute modified Z-scores
df['Modified_Z_Score'] = 0.6745 * (df['Value'] - median) / mad

# Identify outliers
df['Outlier'] = df['Modified_Z_Score'].abs() > 3.5
print(df)

Output:

   Value  Modified_Z_Score  Outlier
0     10         -1.349000    False
1     12          0.000000    False
2     12          0.000000    False
3     13          0.674500    False
4     12          0.000000    False
5     11         -0.674500    False
6     14          1.349000    False
7     13          0.674500    False
8    100         59.356000     True

5.2 Handling Outliers

Strategies include:

  • Removal: Exclude outliers from the dataset.
  • Transformation: Apply transformations to reduce their impact.
  • Imputation: Replace outliers with statistical measures like the median (a sketch follows the removal example below).

Example:

# Remove outliers
df_cleaned = df[df['Outlier'] == False]
print(df_cleaned)

Output:

   Value  Modified_Z_Score  Outlier
0     10         -1.349000    False
1     12          0.000000    False
2     12          0.000000    False
3     13          0.674500    False
4     12          0.000000    False
5     11         -0.674500    False
6     14          1.349000    False
7     13          0.674500    False
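
Alternatively, the imputation strategy keeps every row and replaces the flagged values with a robust statistic such as the median; a minimal sketch reusing the outlier flags computed above:

# Replace flagged outliers with the median of the non-outlier values
median_value = df.loc[~df['Outlier'], 'Value'].median()
df['Value_Imputed'] = df['Value'].where(~df['Outlier'], median_value)
print(df[['Value', 'Outlier', 'Value_Imputed']])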

6. Data Integration and Aggregation

6.1 Advanced Merging and Joining Techniques

When merging datasets, you might encounter issues like duplicate keys or mismatched data types.

Handling Duplicate Keys

# Sample data
df1 = pd.DataFrame({
    'Key': ['A', 'B', 'B', 'C'],
    'Data1': [1, 2, 3, 4]
})

df2 = pd.DataFrame({
    'Key': ['B', 'B', 'C', 'D'],
    'Data2': [5, 6, 7, 8]
})

# Merge with duplicate keys
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print(merged_df)

Output:

  Key  Data1  Data2
0   B      2      5
1   B      2      6
2   B      3      5
3   B      3      6
4   C      4      7
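
Mismatched key data types are the other common merge pitfall mentioned above: if one table stores the key as an integer and the other as a string, the merge will either raise an error or match nothing, depending on the pandas version. Casting both keys to a common type first avoids this; a minimal sketch with hypothetical tables:

# Hypothetical tables whose join keys have different dtypes
df_left = pd.DataFrame({'Key': [1, 2, 3], 'Data1': [10, 20, 30]})
df_right = pd.DataFrame({'Key': ['1', '2', '4'], 'Data2': [100, 200, 400]})

# Cast both keys to a common type (string) before merging
df_left['Key'] = df_left['Key'].astype(str)
merged = pd.merge(df_left, df_right, on='Key', how='inner')
print(merged)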

6.2 Aggregation and Grouping in Complex Scenarios

Grouping by Multiple Keys

# Sample data
df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Region': ['North', 'South', 'North', 'South'],
    'Sales': [100, 150, 200, 250]
})

# Group by multiple keys and aggregate
grouped_df = df.groupby(['Product', 'Region']).agg({'Sales': 'sum'})
print(grouped_df)

Output:

                Sales
Product Region
A       North     100
        South     150
B       North     200
        South     250
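
The same groupby can compute several statistics at once, and reset_index() turns the hierarchical index back into ordinary columns; a minimal extension of the example above using named aggregation:

# Multiple aggregations per group, flattened back into regular columns
summary = (df.groupby(['Product', 'Region'])
             .agg(total_sales=('Sales', 'sum'), avg_sales=('Sales', 'mean'))
             .reset_index())
print(summary)

On this toy table each group contains a single row, so the two statistics coincide, but the pattern generalizes to larger groups.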

7. Automating Data Cleaning Processes

7.1 Writing Reusable Functions

Creating functions for repetitive tasks ensures consistency and saves time.

Example: Function for Scaling Numeric Columns

def scale_columns(df, columns):
    # Standardize each column to zero mean and unit variance (z-score)
    for col in columns:
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    return df

# Sample DataFrame
df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [1, 2, 3, 4],
    'C': ['X', 'Y', 'Z', 'W']
})

# Scale columns 'A' and 'B'
df_scaled = scale_columns(df, ['A', 'B'])
print(df_scaled)

Output:

          A         B  C
0 -1.161895 -1.161895  X
1 -0.387298 -0.387298  Y
2  0.387298  0.387298  Z
3  1.161895  1.161895  W

7.2 Using pandas' pipe() Method

The pipe() method allows chaining custom functions in a readable manner.

def remove_missing_values(df):
    return df.dropna()

def encode_categories(df):
    df['C'] = df['C'].astype('category').cat.codes
    return df

# Chain functions using pipe
df_cleaned = (df
              .pipe(remove_missing_values)
              .pipe(encode_categories))
print(df_cleaned)

Output:

          A         B  C
0 -1.161895 -1.161895  1
1 -0.387298 -0.387298  2
2  0.387298  0.387298  3
3  1.161895  1.161895  0

8. Best Practices

8.1 Documentation and Reproducibility

  • Comment Your Code: Explain the purpose of complex operations.
  • Version Control: Use Git or other version control systems to track changes.
  • Use Notebooks Wisely: Keep notebooks organized and avoid running cells out of order.

8.2 Common Pitfalls and How to Avoid Them

  • Overlooking Data Types: Ensure data types are appropriate for the operations you perform (a quick check is sketched below).
  • Ignoring Data Quality Issues: Always perform exploratory data analysis to uncover hidden issues.
  • Assuming Data Is Clean: Never assume data from external sources is clean; always validate and clean it.
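
A quick dtype audit at the start of a cleaning script catches many of these problems early; a minimal sketch, assuming a hypothetical DataFrame where a numeric column arrived as text:

import pandas as pd

# Hypothetical raw data where 'Price' was read in as strings
df = pd.DataFrame({'Item': ['pen', 'book', 'lamp'], 'Price': ['1.50', '12.00', 'n/a']})

# Inspect the dtypes pandas inferred
print(df.dtypes)

# Coerce to numeric; unparseable entries become NaN for later handling
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
print(df.dtypes)
print(df)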

9. Conclusion

By mastering these advanced data wrangling and cleaning techniques using pandas and NumPy, you are better equipped to tackle complex datasets. These skills are essential for any data professional aiming to deliver accurate and reliable analyses.

Further Reading