Advanced Data Wrangling and Cleaning
Data wrangling and cleaning are critical steps in the data science workflow. While basic techniques allow you to manage straightforward tasks, real-world data often presents more complex challenges. This article explores advanced methods for data wrangling and cleaning, building on your foundational skills in pandas and NumPy to handle complex data scenarios effectively.
1. Introduction
1.1 Overview
Data wrangling and cleaning involve transforming raw data into a usable format by addressing inconsistencies, filling in missing values, handling outliers, and standardizing formats. As datasets grow in complexity and size, advanced techniques become necessary to maintain data quality, which is crucial for accurate analysis and modeling.
1.2 Goals
The goal of this article is to enhance your data cleaning toolkit with advanced methods that can tackle complex data structures, sophisticated imputation strategies, and robust outlier detection techniques—all using pandas and NumPy. These skills are essential for dealing with real-world data, where straightforward methods often fall short.
2. Handling Complex Data Structures
2.1 Dealing with Nested and Hierarchical Data
In many real-world scenarios, data comes in nested or hierarchical formats, such as JSON or XML. Flattening these structures is often necessary to perform traditional data analysis.
Flattening Nested JSON Data
Consider a JSON file containing user information with nested address details:
{
  "user": {
    "id": 1,
    "name": "John Doe",
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "zipcode": "12345"
    }
  }
}
To flatten this structure into a pandas DataFrame:
import pandas as pd
import json
# Example JSON data
json_data = '''
{
  "user": {
    "id": 1,
    "name": "John Doe",
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "zipcode": "12345"
    }
  }
}
'''
# Load JSON data
data = json.loads(json_data)
# Normalize and flatten the data
df = pd.json_normalize(data, sep='_')
print(df)
Output:
user_id user_name user_address_street user_address_city user_address_zipcode
0 1 John Doe 123 Main St Anytown 12345
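pd.json_normalize also handles nested lists of records. The sketch below uses a made-up payload in which a user carries a list of orders (the orders payload and column names are illustrative assumptions, not part of the example above); record_path expands the list into one row per order, while meta copies parent-level fields onto each row:
# Hypothetical payload: one user with a nested list of orders
payload = {
    "user": {"id": 1, "name": "John Doe"},
    "orders": [
        {"order_id": 101, "total": 25.0},
        {"order_id": 102, "total": 40.5}
    ]
}
# record_path expands the nested list into rows; meta carries parent fields along
orders_df = pd.json_normalize(
    payload,
    record_path='orders',
    meta=[['user', 'id'], ['user', 'name']],
    sep='_'
)
print(orders_df)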
2.2 Working with Unstructured Data
Unstructured data, such as text data, requires specialized techniques to transform it into a structured format.
Text Data Cleaning
Suppose you have a DataFrame with a text column containing customer reviews:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'review': [
        'Great product! 😊 #happy',
        'Terrible experience... 😠 #fail',
        'Average quality, decent price. #ok'
    ]
})
Cleaning Steps:
- Remove Emojis and Special Characters:
import re
# Function to remove hashtags, emojis, punctuation, and numbers
def clean_text(text):
    text = re.sub(r'#\w+', '', text)     # Remove hashtags first, before '#' is stripped as punctuation
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation, emojis, and other special characters
    text = re.sub(r'\d+', '', text)      # Remove numbers
    text = text.lower()                  # Convert to lowercase
    return text.strip()
df['clean_review'] = df['review'].apply(clean_text)
print(df)
Output:
review clean_review
0 Great product! 😊 #happy great product
1 Terrible experience... 😠 #fail terrible experience
2 Average quality, decent price. #ok average quality decent price
- Tokenize Text:
# Tokenize the clean text
df['tokens'] = df['clean_review'].str.split()
print(df[['clean_review', 'tokens']])
Output:
clean_review tokens
0 great product [great, product]
1 terrible experience [terrible, experience]
2 average quality decent price [average, quality, decent, price]
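From here, the token lists can be summarized into a structured frequency table, a common first step toward features for downstream analysis. A minimal sketch, continuing from the df above:
# One row per token, then count occurrences across all reviews
token_counts = (
    df['tokens']
    .explode()
    .value_counts()
    .rename_axis('token')
    .reset_index(name='count')
)
print(token_counts)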
3. Advanced Techniques for Missing Data
3.1 Understanding Missing Data Patterns
Before handling missing data, it's important to understand the pattern of missingness:
- MCAR (Missing Completely at Random): Missingness is unrelated to any data.
- MAR (Missing at Random): Missingness is related to observed data but not the missing data itself.
- MNAR (Missing Not at Random): Missingness is related to the missing data.
Identifying the pattern helps in choosing the appropriate imputation method.
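pandas itself gives a quick first look at these patterns. The sketch below uses made-up age and income data in which income is missing more often for younger respondents; a clear difference in an observed variable between rows with and without missing values suggests the data are not MCAR:
import pandas as pd
import numpy as np
# Hypothetical survey data where income is missing more often for younger respondents
rng = np.random.default_rng(42)
df_survey = pd.DataFrame({
    'age': rng.integers(18, 70, size=200),
    'income': rng.normal(50000, 15000, size=200)
})
young = df_survey['age'] < 30
df_survey.loc[young, 'income'] = np.where(rng.random(young.sum()) < 0.4,
                                          np.nan,
                                          df_survey.loc[young, 'income'])
# Share of missing values per column
print(df_survey.isna().mean())
# Compare an observed variable between rows with and without missing income
print(df_survey.groupby(df_survey['income'].isna())['age'].mean())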
3.2 Advanced Imputation Techniques
Simple methods like mean or median imputation might not suffice for complex datasets. Here are more advanced techniques using pandas and NumPy.
Interpolation Methods
Interpolation fills in missing values using linear or non-linear functions based on other data points.
import pandas as pd
import numpy as np
# Sample time series data with missing values
dates = pd.date_range('2023-01-01', periods=10)
data = [1, np.nan, 3, np.nan, 5, 6, np.nan, 8, 9, 10]
df = pd.DataFrame({'Date': dates, 'Value': data})
df.set_index('Date', inplace=True)
# Interpolate missing values
df['Linear_Interpolation'] = df['Value'].interpolate(method='linear')
df['Quadratic_Interpolation'] = df['Value'].interpolate(method='quadratic')
print(df)
Output:
Value Linear_Interpolation Quadratic_Interpolation
Date
2023-01-01 1.0 1.000000 1.000000
2023-01-02 NaN 2.000000 2.000000
2023-01-03 3.0 3.000000 3.000000
2023-01-04 NaN 4.000000 4.000000
2023-01-05 5.0 5.000000 5.000000
2023-01-06 6.0 6.000000 6.000000
2023-01-07 NaN 7.000000 7.000000
2023-01-08 8.0 8.000000 8.000000
2023-01-09 9.0 9.000000 9.000000
2023-01-10 10.0 10.000000 10.000000
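Because the example uses a DatetimeIndex, method='time' is also available: it weights gaps by the actual elapsed time between observations, which matters when timestamps are irregularly spaced (with evenly spaced daily data it coincides with the linear result):
# Time-based interpolation uses the actual spacing of the DatetimeIndex
time_interp = df['Value'].interpolate(method='time')
print(time_interp)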
Using Rolling Statistics
Impute missing values using rolling statistics like mean or median.
# Rolling mean imputation
df['Rolling_Mean_Imputation'] = df['Value'].fillna(df['Value'].rolling(window=2, min_periods=1).mean())
print(df)
Output:
Value Linear_Interpolation Quadratic_Interpolation Rolling_Mean_Imputation
Date
2023-01-01 1.0 1.000000 1.000000 1.000000
2023-01-02 NaN 2.000000 2.000000 1.000000
2023-01-03 3.0 3.000000 3.000000 3.000000
2023-01-04 NaN 4.000000 4.000000 3.000000
2023-01-05 5.0 5.000000 5.000000 5.000000
2023-01-06 6.0 6.000000 6.000000 6.000000
2023-01-07 NaN 7.000000 7.000000 6.000000
2023-01-08 8.0 8.000000 8.000000 8.000000
2023-01-09 9.0 9.000000 9.000000 9.000000
2023-01-10 10.0 10.000000 10.000000 10.000000
3.3 Handling Data Imbalance
Although data imbalance is most often discussed in the context of classification tasks, imbalanced categories also matter during data cleaning, because they can bias downstream analysis.
Addressing Imbalanced Categories
- Resampling Techniques: Oversample minority classes or undersample majority classes; both approaches are sketched below.
Example using pandas:
# Sample DataFrame with imbalanced classes
df = pd.DataFrame({
    'Category': ['A'] * 90 + ['B'] * 10,
    'Value': np.random.randn(100)
})
# Oversample minority class 'B'
df_minority = df[df['Category'] == 'B']
df_oversampled = df_minority.sample(n=80, replace=True, random_state=42)
# Combine with majority class
df_balanced = pd.concat([df[df['Category'] == 'A'], df_oversampled])
print(df_balanced['Category'].value_counts())
Output:
A 90
B 80
Name: Category, dtype: int64
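The complementary approach, undersampling the majority class, follows the same pattern; note that it discards rows, so it is usually reserved for larger datasets:
# Undersample majority class 'A' down to the size of the minority class
df_majority = df[df['Category'] == 'A']
df_undersampled = df_majority.sample(n=(df['Category'] == 'B').sum(), random_state=42)
# Combine with the untouched minority class
df_balanced_down = pd.concat([df_undersampled, df[df['Category'] == 'B']])
print(df_balanced_down['Category'].value_counts())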
4. Data Transformation and Standardization
4.1 Advanced Scaling and Normalization
Applying appropriate transformations can improve the performance of statistical models and algorithms.
Log Transformation
Useful for reducing skewness in data.
import numpy as np
# Sample data
data = np.array([1, 10, 100, 1000, 10000])
# Log transformation
log_data = np.log10(data)
print(log_data)
Output:
[0. 1. 2. 3. 4.]
Power Transformations with NumPy
When data contains negative values, use transformations that can handle them.
# Sample data with negative values
data = np.array([-100, -10, -1, 0, 1, 10, 100])
# Apply sign function with log to handle negatives
def log_transform(x):
    return np.sign(x) * np.log1p(np.abs(x))
transformed_data = log_transform(data)
print(transformed_data)
Output:
[-4.61512052 -2.39789527 -0.69314718 0. 0.69314718 2.39789527
4.61512052]
4.2 Encoding High-Cardinality Categorical Variables
When dealing with categorical variables with many unique values, One-Hot Encoding becomes impractical.
Frequency Encoding
Replace categories with their frequency counts.
import pandas as pd
# Sample data
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'E', 'F', 'A']
})
# Calculate frequency encoding
frequency_encoding = df['Category'].value_counts()
df['Category_Encoded'] = df['Category'].map(frequency_encoding)
print(df)
Output:
Category Category_Encoded
0 A 3
1 B 2
2 C 2
3 A 3
4 B 2
5 C 2
6 D 1
7 E 1
8 F 1
9 A 3
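A related step that often accompanies frequency encoding (an addition here, not part of the example above) is collapsing rare categories into a single 'Other' bucket so that one-off levels do not clutter the encoding. A minimal sketch, assuming a frequency threshold of 2:
# Collapse categories that appear fewer than 2 times into 'Other'
threshold = 2
counts = df['Category'].value_counts()
rare = counts[counts < threshold].index
df['Category_Grouped'] = df['Category'].where(~df['Category'].isin(rare), 'Other')
print(df[['Category', 'Category_Grouped']])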
5. Outlier and Anomaly Detection
5.1 Identifying Outliers
Outliers can significantly impact analysis and need to be identified accurately.
Z-Score Method with Robust Statistics
Using median and Median Absolute Deviation (MAD) instead of mean and standard deviation.
import pandas as pd
import numpy as np
# Sample data
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 100]) # 100 is an outlier
df = pd.DataFrame({'Value': data})
# Calculate median and MAD
median = df['Value'].median()
mad = np.median(np.abs(df['Value'] - median))
# Compute modified Z-scores
df['Modified_Z_Score'] = 0.6745 * (df['Value'] - median) / mad
# Identify outliers
df['Outlier'] = df['Modified_Z_Score'].abs() > 3.5
print(df)
Output:
Value Modified_Z_Score Outlier
0 10 -1.349000 False
1 12 0.000000 False
2 12 0.000000 False
3 13 0.674500 False
4 12 0.000000 False
5 11 -0.674500 False
6 14 1.349000 False
7 13 0.674500 False
8 100 59.356000 True
5.2 Handling Outliers
Strategies include:
- Removal: Exclude outliers from the dataset.
- Transformation: Apply transformations to reduce their impact.
- Imputation: Replace outliers with statistical measures such as the median (a sketch follows the removal example below).
Example:
# Remove outliers
df_cleaned = df[~df['Outlier']]
print(df_cleaned)
Output:
Value Modified_Z_Score Outlier
0 10 -1.349000 False
1 12 0.000000 False
2 12 0.000000 False
3 13 0.674500 False
4 12 0.000000 False
5 11 -0.674500 False
6 14 1.349000 False
7 13 0.674500 False
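As an alternative to removal, the imputation strategy listed above replaces flagged outliers with a robust statistic such as the median. A small sketch, continuing from the df with the Outlier column:
# Replace flagged outliers with the median of the non-outlier values
median_value = df.loc[~df['Outlier'], 'Value'].median()
df['Value_Imputed'] = df['Value'].where(~df['Outlier'], median_value)
print(df[['Value', 'Outlier', 'Value_Imputed']])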
6. Data Integration and Aggregation
6.1 Advanced Merging and Joining Techniques
When merging datasets, you might encounter issues like duplicate keys or mismatched data types.
Handling Duplicate Keys
# Sample data
df1 = pd.DataFrame({
    'Key': ['A', 'B', 'B', 'C'],
    'Data1': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'Key': ['B', 'B', 'C', 'D'],
    'Data2': [5, 6, 7, 8]
})
# Merge with duplicate keys
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print(merged_df)
Output:
Key Data1 Data2
0 B 2 5
1 B 2 6
2 B 3 5
3 B 3 6
4 C 4 7
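The other merge pitfall mentioned above, mismatched key data types, silently produces empty or partial joins rather than an error. A short sketch with hypothetical frames df3 and df4 whose keys are stored as int and str:
# Keys stored as int in one frame and str in the other will not match
df3 = pd.DataFrame({'Key': [1, 2, 3], 'Data3': ['x', 'y', 'z']})
df4 = pd.DataFrame({'Key': ['1', '2', '4'], 'Data4': [10, 20, 40]})
# Align the key dtypes before merging (casting both sides to str also works)
df4['Key'] = df4['Key'].astype(int)
print(pd.merge(df3, df4, on='Key', how='inner'))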
6.2 Aggregation and Grouping in Complex Scenarios
Grouping by Multiple Keys
# Sample data
df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Region': ['North', 'South', 'North', 'South'],
    'Sales': [100, 150, 200, 250]
})
# Group by multiple keys and aggregate
grouped_df = df.groupby(['Product', 'Region']).agg({'Sales': 'sum'})
print(grouped_df)
Output:
Sales
Product Region
A North 100
South 150
B North 200
South 250
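groupby also supports several aggregations at once with readable output names (named aggregation), which helps in more involved summaries. A small sketch using the same sales data:
# Multiple aggregations per group, with explicit output column names
summary = df.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    n_regions=('Region', 'nunique')
)
print(summary)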
7. Automating Data Cleaning Processes
7.1 Writing Reusable Functions
Creating functions for repetitive tasks ensures consistency and saves time.
Example: Function for Scaling Numeric Columns
def scale_columns(df, columns):
    # Standardize each column in place, using the population standard deviation (ddof=0)
    for col in columns:
        df[col] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    return df
# Sample DataFrame
df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [1, 2, 3, 4],
    'C': ['X', 'Y', 'Z', 'W']
})
# Scale columns 'A' and 'B'
df_scaled = scale_columns(df, ['A', 'B'])
print(df_scaled)
Output:
A B C
0 -1.341641 -1.341641 X
1 -0.447214 -0.447214 Y
2 0.447214 0.447214 Z
3 1.341641 1.341641 W
7.2 Using pandas' pipe() Method
The pipe() method allows chaining custom functions in a readable manner.
def remove_missing_values(df):
    return df.dropna()

def encode_categories(df):
    df['C'] = df['C'].astype('category').cat.codes
    return df

# Chain functions using pipe
df_cleaned = (df
              .pipe(remove_missing_values)
              .pipe(encode_categories))
print(df_cleaned)
Output:
A B C
0 -1.341641 -1.341641 1
1 -0.447214 -0.447214 2
2 0.447214 0.447214 3
3 1.341641 1.341641 0
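pipe() also forwards extra arguments to the function it calls, which keeps parameterized cleaning steps readable. A minimal sketch with a hypothetical threshold filter:
# pipe() passes additional keyword arguments through to the function
def drop_low_values(df, column, threshold):
    return df[df[column] > threshold]

df_filtered = df_cleaned.pipe(drop_low_values, column='A', threshold=0)
print(df_filtered)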
8. Best Practices
8.1 Documentation and Reproducibility
- Comment Your Code: Explain the purpose of complex operations.
- Version Control: Use Git or other version control systems to track changes.
- Use Notebooks Wisely: Keep notebooks organized and avoid running cells out of order.
8.2 Common Pitfalls and How to Avoid Them
- Overlooking Data Types: Ensure data types are appropriate for operations.
- Ignoring Data Quality Issues: Always perform exploratory data analysis to uncover hidden issues.
- Assuming Data Is Clean: Never assume data from external sources is clean; always validate and clean it (a quick first-pass check is sketched below).
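A lightweight first-pass check along these lines costs little and catches many of the issues above before they propagate (assuming df is the DataFrame being inspected):
# Quick sanity checks to run before any analysis
print(df.dtypes)                     # numeric columns stored as strings are a red flag
print(df.isna().sum())               # missing values per column
print(df.duplicated().sum())         # exact duplicate rows
print(df.describe(include='all'))    # ranges and category counts at a glance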
9. Conclusion
By mastering these advanced data wrangling and cleaning techniques using pandas and NumPy, you are better equipped to tackle complex datasets. These skills are essential for any data professional aiming to deliver accurate and reliable analyses.
Further Reading
- "Python for Data Analysis" by Wes McKinney
- Pandas Documentation: https://pandas.pydata.org/docs/
- NumPy Documentation: https://numpy.org/doc/