
Applied Feature Engineering

Feature engineering is the process of creating, transforming, and selecting features that enhance the performance of machine learning models. It is one of the most critical steps in the data science pipeline because the quality of the features you provide to your model often determines the model's performance. This article explores various feature engineering techniques using pandas and NumPy, focusing on creating new features, transforming existing features, and encoding categorical variables.


1. Introduction

1.1 What is Feature Engineering?

Feature engineering involves creating new features or modifying existing ones to improve the predictive power of a model. This can include creating interaction terms, transforming features to better match the assumptions of a model, and encoding categorical variables into numerical formats that machine learning algorithms can process.

info

See our Introduction to Feature Engineering article to learn the basics of feature engineering.

1.2 Why is Feature Engineering Important?

Effective feature engineering can significantly boost the performance of your model by providing it with more informative data. It allows models to make better predictions by capturing the underlying patterns and relationships in the data.


2. Creating New Features

2.1 Interaction Features

Interaction features are created by multiplying or combining two or more existing features. These features can capture relationships between variables that might not be evident when the variables are considered independently.

Example: Creating Interaction Features

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Feature1': [1, 2, 3, 4],
'Feature2': [10, 20, 30, 40]
})

# Creating an interaction feature by multiplying Feature1 and Feature2
df['Interaction'] = df['Feature1'] * df['Feature2']
print(df)

Output:

   Feature1  Feature2  Interaction
0         1        10           10
1         2        20           40
2         3        30           90
3         4        40          160

2.2 Polynomial Features

Polynomial features involve raising existing features to a power (e.g., square, cube) and are particularly useful in linear models where you want to capture non-linear relationships.

Example: Creating Polynomial Features

# Creating polynomial features (e.g., square of Feature1)
df['Feature1_Squared'] = df['Feature1'] ** 2
print(df)

Output:

   Feature1  Feature2  Interaction  Feature1_Squared
0         1        10           10                 1
1         2        20           40                 4
2         3        30           90                 9
3         4        40          160                16
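
To generate many such terms at once, scikit-learn's PolynomialFeatures can produce all polynomial and interaction terms up to a chosen degree. A minimal sketch, assuming scikit-learn is installed:

from sklearn.preprocessing import PolynomialFeatures

# Generate all degree-2 terms: Feature1, Feature2, Feature1^2, Feature1*Feature2, Feature2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['Feature1', 'Feature2']])
print(poly.get_feature_names_out())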

2.3 Binning and Discretization

Binning involves converting continuous variables into categorical variables by grouping them into intervals or bins. This can be useful when the relationship between the feature and the target variable is not linear or when you want to reduce the effect of outliers.

Example: Binning Continuous Data

# Binning Feature2 into 3 equal-width bins
df['Feature2_Binned'] = pd.cut(df['Feature2'], bins=3, labels=['Low', 'Medium', 'High'])
print(df)

Output:

   Feature1  Feature2  Interaction  Feature1_Squared Feature2_Binned
0         1        10           10                 1             Low
1         2        20           40                 4             Low
2         3        30           90                 9          Medium
3         4        40          160                16            High

Note that pd.cut uses right-closed, equal-width intervals over the observed range (here (9.97, 20], (20, 30], and (30, 40]), so the value 20 falls into the Low bin.
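
pd.cut produces equal-width bins, so bin populations can be uneven. If you instead want each bin to hold roughly the same number of observations, pd.qcut bins by quantile; a minimal sketch:

# Equal-frequency (quantile) binning: each bin holds roughly the same number of rows
quantile_bins = pd.qcut(df['Feature2'], q=2, labels=['Low', 'High'])
print(quantile_bins)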

3. Transforming Features

3.1 Logarithmic and Exponential Transformations

Logarithmic transformations are often used to reduce the skewness of a feature's distribution, making it more normal-like, which can improve the performance of linear models. Exponential transformations are less common but can be used to reverse a log transform or to model quantities that grow multiplicatively.

Example: Logarithmic Transformation

import numpy as np

# Applying a logarithmic transformation to Feature2
df['Log_Feature2'] = np.log(df['Feature2'])
print(df)

Output:

   Feature1  Feature2  Interaction  Feature1_Squared Feature2_Binned  Log_Feature2
0         1        10           10                 1             Low      2.302585
1         2        20           40                 4             Low      2.995732
2         3        30           90                 9          Medium      3.401197
3         4        40          160                16            High      3.688879
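
Note that np.log is undefined at zero and for negative values, so for non-negative features that may contain zeros, np.log1p (which computes log(1 + x)) is a common, safer alternative:

# log1p handles zeros gracefully: log1p(0) == 0
log1p_feature2 = np.log1p(df['Feature2'])
print(log1p_feature2)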

3.2 Scaling and Normalization

Scaling and normalization are used to standardize the range of features, which is especially important when features have different units or magnitudes. This ensures that no single feature dominates the others due to its scale.

Example: Scaling Features

# Standardizing Feature1 (zero mean, unit variance)
# ddof=0 uses the population standard deviation, matching scikit-learn's StandardScaler
df['Feature1_Standardized'] = (df['Feature1'] - df['Feature1'].mean()) / df['Feature1'].std(ddof=0)
print(df)

Output:

   Feature1  Feature2  Interaction  Feature1_Squared Feature2_Binned  Log_Feature2  Feature1_Standardized
0         1        10           10                 1             Low      2.302585              -1.341641
1         2        20           40                 4             Low      2.995732              -0.447214
2         3        30           90                 9          Medium      3.401197               0.447214
3         4        40          160                16            High      3.688879               1.341641
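
Another common option is min-max normalization, which rescales a feature to the [0, 1] range; a minimal sketch:

# Min-max normalization: rescale Feature2 into the [0, 1] range
feature2_min, feature2_max = df['Feature2'].min(), df['Feature2'].max()
feature2_normalized = (df['Feature2'] - feature2_min) / (feature2_max - feature2_min)
print(feature2_normalized)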

3.3 Encoding Cyclical Features

Some features, like time (hours, days, months), are cyclical in nature. Encoding these features properly is crucial to maintain the cyclical continuity (e.g., 23 hours and 1 hour are close in time).

Example: Encoding Hour of the Day

# Sample DataFrame with hours of the day
df_time = pd.DataFrame({
'Hour': [0, 6, 12, 18]
})

# Encode as sine and cosine
df_time['Hour_sin'] = np.sin(2 * np.pi * df_time['Hour'] / 24)
df_time['Hour_cos'] = np.cos(2 * np.pi * df_time['Hour'] / 24)
print(df_time)

Output:

   Hour  Hour_sin  Hour_cos
0     0  0.000000  1.000000
1     6  1.000000  0.000000
2    12  0.000000 -1.000000
3    18 -1.000000  0.000000

(Values are rounded for readability; due to floating-point arithmetic, the exact output shows tiny near-zero values such as 6.123234e-17 instead of 0.)
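
A quick check confirms that this encoding preserves cyclical closeness: hours 23 and 1 are only 2 hours apart on the clock, and their encoded points are correspondingly close, while hours 12 apart are maximally distant. A sketch:

# Euclidean distance between hours in (sin, cos) space
def encode_hour(hour):
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

print(np.linalg.norm(encode_hour(23) - encode_hour(1)))   # ~0.52 (2 hours apart: close)
print(np.linalg.norm(encode_hour(23) - encode_hour(11)))  # ~2.00 (12 hours apart: far)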

4. Encoding Categorical Variables

4.1 One-Hot Encoding

One-hot encoding converts categorical variables into a set of binary variables (0 or 1), where each category is represented by a separate column. This method is useful when the categorical variable is nominal (no inherent order).

Example: One-Hot Encoding

# Sample DataFrame with categorical data
df_cat = pd.DataFrame({
'Category': ['A', 'B', 'A', 'C']
})

# One-Hot Encoding (dtype=int gives 0/1 columns; recent pandas versions return booleans by default)
df_cat_encoded = pd.get_dummies(df_cat, columns=['Category'], dtype=int)
print(df_cat_encoded)

Output:

   Category_A  Category_B  Category_C
0           1           0           0
1           0           1           0
2           1           0           0
3           0           0           1
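
For linear models, one dummy column is redundant because it is fully determined by the others (the "dummy variable trap"). Passing drop_first=True removes it:

# Drop the first dummy column to avoid perfect multicollinearity in linear models
df_cat_reduced = pd.get_dummies(df_cat, columns=['Category'], drop_first=True, dtype=int)
print(df_cat_reduced)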

4.2 Label Encoding

Label encoding converts categorical variables into numerical labels. However, this method implies an ordinal relationship between categories, which may be inappropriate for nominal variables.

Example: Label Encoding

# Assigning numerical labels to categories
df_cat['Category_Encoded'] = df_cat['Category'].astype('category').cat.codes
print(df_cat)

Output:

  Category  Category_Encoded
0        A                 0
1        B                 1
2        A                 0
3        C                 2
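
If the variable genuinely is ordinal, it is safer to state the order explicitly instead of relying on the default alphabetical codes; a sketch with a hypothetical Size column:

# Explicitly ordered categories: codes follow the stated order, not alphabetical order
sizes = pd.Series(['M', 'S', 'L', 'M'])
size_codes = pd.Categorical(sizes, categories=['S', 'M', 'L'], ordered=True).codes
print(size_codes)  # [1 0 2 1]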

4.3 Frequency Encoding

Frequency encoding replaces each category with the frequency of its occurrence. This method is useful for high-cardinality categorical variables.

Example: Frequency Encoding

# Frequency Encoding
df_cat['Category_Frequency'] = df_cat['Category'].map(df_cat['Category'].value_counts())
print(df_cat)

Output:

  Category  Category_Encoded  Category_Frequency
0        A                 0                   2
1        B                 1                   1
2        A                 0                   2
3        C                 2                   1
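
A common variant divides by the number of rows, so that the encoding becomes a relative frequency in [0, 1] and is comparable across datasets of different sizes; a minimal sketch:

# Relative frequency: value_counts(normalize=True) returns proportions instead of counts
df_cat['Category_RelFreq'] = df_cat['Category'].map(df_cat['Category'].value_counts(normalize=True))
print(df_cat)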

5. Best Practices in Feature Engineering

5.1 Understanding the Data

Before creating or transforming features, thoroughly explore and understand your data. Use visualizations and statistical summaries to guide your feature engineering decisions.

5.2 Avoiding Data Leakage

Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. Ensure that feature engineering is only based on training data during model building.
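
For example, scaling statistics (mean and standard deviation) should be computed on the training split only and then reused on the test split. A minimal sketch, assuming scikit-learn is installed:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test = train_test_split(df[['Feature1', 'Feature2']], test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # apply training statistics to test data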

5.3 Iterative Process

Feature engineering is an iterative process. Continuously refine your features based on model performance and new insights from the data.

5.4 Domain Knowledge

Incorporate domain knowledge whenever possible. Features that are meaningful within the context of the problem can significantly improve model performance.


6. Conclusion

6.1 Summary of Key Techniques

In this article, we covered various feature engineering techniques using pandas and NumPy, including creating interaction features, polynomial features, and encoding categorical variables. Effective feature engineering can drastically improve model performance by making the data more informative.

6.2 Further Reading