Applied Feature Engineering
Feature engineering is the process of creating, transforming, and selecting features that enhance the performance of machine learning models. It is one of the most critical steps in the data science pipeline because the quality of the features you provide to your model often determines the model's performance. This article explores various feature engineering techniques using pandas and NumPy, focusing on creating new features, transforming existing features, and encoding categorical variables.
1. Introduction
1.1 What is Feature Engineering?
Feature engineering involves creating new features or modifying existing ones to improve the predictive power of a model. This can include creating interaction terms, transforming features to better match the assumptions of a model, and encoding categorical variables into numerical formats that machine learning algorithms can process.
See our Introduction to Feature Engineering article to learn the basics of feature engineering.
1.2 Why is Feature Engineering Important?
Effective feature engineering can significantly boost the performance of your model by providing it with more informative data. It allows models to make better predictions by capturing the underlying patterns and relationships in the data.
2. Creating New Features
2.1 Interaction Features
Interaction features are created by multiplying or combining two or more existing features. These features can capture relationships between variables that might not be evident when the variables are considered independently.
Example: Creating Interaction Features
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4],
    'Feature2': [10, 20, 30, 40]
})
# Creating an interaction feature by multiplying Feature1 and Feature2
df['Interaction'] = df['Feature1'] * df['Feature2']
print(df)
Output:
   Feature1  Feature2  Interaction
0         1        10           10
1         2        20           40
2         3        30           90
3         4        40          160
2.2 Polynomial Features
Polynomial features involve raising existing features to a power (e.g., square, cube) and are particularly useful in linear models where you want to capture non-linear relationships.
Example: Creating Polynomial Features
# Creating polynomial features (e.g., square of Feature1)
df['Feature1_Squared'] = df['Feature1'] ** 2
print(df)
Output:
   Feature1  Feature2  Interaction  Feature1_Squared
0         1        10           10                 1
1         2        20           40                 4
2         3        30           90                 9
3         4        40          160                16
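For datasets with many columns, generating squares and pairwise interactions by hand quickly becomes tedious. If scikit-learn is available in your environment (an assumption; this article otherwise sticks to pandas and NumPy), its PolynomialFeatures transformer generates all such terms at once. A minimal sketch:
Example: Generating Polynomial Features with scikit-learn
from sklearn.preprocessing import PolynomialFeatures
# degree=2 produces x1, x2, x1^2, x1*x2, x2^2; include_bias=False drops the constant term
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_array = poly.fit_transform(df[['Feature1', 'Feature2']])
# get_feature_names_out() yields readable names such as 'Feature1^2' and 'Feature1 Feature2'
df_poly = pd.DataFrame(poly_array, columns=poly.get_feature_names_out())
print(df_poly)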
2.3 Binning and Discretization
Binning involves converting continuous variables into categorical variables by grouping them into intervals or bins. This can be useful when the relationship between the feature and the target variable is not linear or when you want to reduce the effect of outliers.
Example: Binning Continuous Data
# Binning Feature2 into 3 equal-width bins
df['Feature2_Binned'] = pd.cut(df['Feature2'], bins=3, labels=['Low', 'Medium', 'High'])
print(df)
Output:
   Feature1  Feature2  Interaction  Feature1_Squared  Feature2_Binned
0         1        10           10                 1              Low
1         2        20           40                 4              Low
2         3        30           90                 9           Medium
3         4        40          160                16             High
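pd.cut creates equal-width bins, so skewed data can leave some bins nearly empty. An alternative, sketched below, is quantile-based binning with pd.qcut, which places roughly the same number of observations in each bin:
Example: Quantile-Based Binning
# Binning Feature2 into 2 equal-frequency (quantile) bins
df['Feature2_Quantile'] = pd.qcut(df['Feature2'], q=2, labels=['Low', 'High'])
print(df[['Feature2', 'Feature2_Quantile']])  # 10, 20 -> Low; 30, 40 -> High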
3. Transforming Features
3.1 Logarithmic and Exponential Transformations
Logarithmic transformations are often used to reduce the skewness of a feature's distribution, making it more normal-like, which can improve the performance of linear models. Exponential transformations are less common but can be used in cases where you need to model rapid growth.
Example: Logarithmic Transformation
import numpy as np
# Applying a logarithmic transformation to Feature2
df['Log_Feature2'] = np.log(df['Feature2'])
print(df)
Output:
   Feature1  Feature2  Interaction  Feature1_Squared  Feature2_Binned  Log_Feature2
0         1        10           10                 1              Low      2.302585
1         2        20           40                 4              Low      2.995732
2         3        30           90                 9           Medium      3.401197
3         4        40          160                16             High      3.688879
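Note that np.log is undefined at zero (and for negative values), so the example above works only because Feature2 is strictly positive. For data that contains zeros, np.log1p, which computes log(1 + x), is a common alternative; a small sketch with hypothetical values:
Example: Log Transformation with Zeros
# log1p handles zeros gracefully: log1p(0) == 0.0
values = pd.Series([0, 9, 99, 999])
print(np.log1p(values))  # 0.000000, 2.302585, 4.605170, 6.907755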
3.2 Scaling and Normalization
Scaling and normalization are used to standardize the range of features, which is especially important when features have different units or magnitudes. This ensures that no single feature dominates the others due to its scale.
Example: Scaling Features
# Standardizing Feature1 (zero mean, unit variance)
# Note: pandas' .std() defaults to the sample estimate (ddof=1); ddof=0 gives the
# population standard deviation conventionally used for standard scaling
df['Feature1_Standardized'] = (df['Feature1'] - df['Feature1'].mean()) / df['Feature1'].std(ddof=0)
print(df)
Output:
   Feature1  Feature2  Interaction  Feature1_Squared  Feature2_Binned  Log_Feature2  Feature1_Standardized
0         1        10           10                 1              Low      2.302585              -1.341641
1         2        20           40                 4              Low      2.995732              -0.447214
2         3        30           90                 9           Medium      3.401197               0.447214
3         4        40          160                16             High      3.688879               1.341641
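Standardization is one option; the other common choice is min-max normalization, which rescales a feature to the [0, 1] range. A sketch using plain pandas (scikit-learn's MinMaxScaler performs the same computation):
Example: Min-Max Normalization
# Rescaling Feature2 to the [0, 1] range
min_val, max_val = df['Feature2'].min(), df['Feature2'].max()
df['Feature2_Normalized'] = (df['Feature2'] - min_val) / (max_val - min_val)
print(df[['Feature2', 'Feature2_Normalized']])  # 10 -> 0.0, 20 -> ~0.33, 30 -> ~0.67, 40 -> 1.0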
3.3 Encoding Cyclical Features
Some features, like time (hours, days, months), are cyclical in nature. Encoding these features properly is crucial to maintain the cyclical continuity (e.g., 23 hours and 1 hour are close in time).
Example: Encoding Hour of the Day
# Sample DataFrame with hours of the day
df_time = pd.DataFrame({
    'Hour': [0, 6, 12, 18]
})
# Encode as sine and cosine
df_time['Hour_sin'] = np.sin(2 * np.pi * df_time['Hour'] / 24)
df_time['Hour_cos'] = np.cos(2 * np.pi * df_time['Hour'] / 24)
print(df_time)
Output (values shown as exact; in practice, the zeros appear as tiny floating-point residues such as 6.123234e-17):
   Hour   Hour_sin   Hour_cos
0     0   0.000000   1.000000
1     6   1.000000   0.000000
2    12   0.000000  -1.000000
3    18  -1.000000   0.000000
4. Encoding Categorical Variables
4.1 One-Hot Encoding
One-hot encoding converts categorical variables into a set of binary variables (0 or 1), where each category is represented by a separate column. This method is useful when the categorical variable is nominal (no inherent order).
Example: One-Hot Encoding
# Sample DataFrame with categorical data
df_cat = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C']
})
# One-Hot Encoding (dtype=int yields 0/1 columns; recent pandas versions default to True/False booleans)
df_cat_encoded = pd.get_dummies(df_cat, columns=['Category'], dtype=int)
print(df_cat_encoded)
Output:
   Category_A  Category_B  Category_C
0           1           0           0
1           0           1           0
2           1           0           0
3           0           0           1
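Because the one-hot columns always sum to 1, they are linearly dependent, which can cause multicollinearity in linear models. If that is a concern for your model, pd.get_dummies can drop the first category to serve as an implicit baseline, as in this sketch:
Example: One-Hot Encoding with a Dropped Baseline
# drop_first=True keeps k-1 columns for k categories; 'A' becomes the implicit baseline
df_cat_reduced = pd.get_dummies(df_cat, columns=['Category'], drop_first=True, dtype=int)
print(df_cat_reduced)  # Only Category_B and Category_C remain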
4.2 Label Encoding
Label encoding converts categorical variables into numerical labels. However, this method imposes an arbitrary ordinal relationship on the categories, which can mislead models when the variable has no inherent order.
Example: Label Encoding
# Assigning numerical labels to categories
df_cat['Category_Encoded'] = df_cat['Category'].astype('category').cat.codes
print(df_cat)
Output:
  Category  Category_Encoded
0        A                 0
1        B                 1
2        A                 0
3        C                 2
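.cat.codes assigns codes in alphabetical order, which is arbitrary. When the categories do have a genuine order, you can control the mapping explicitly with pd.Categorical; a sketch with hypothetical size labels:
Example: Label Encoding with an Explicit Order
# Explicitly ordered categories: Small=0, Medium=1, Large=2
sizes = pd.Series(['Medium', 'Small', 'Large', 'Small'])
codes = pd.Categorical(sizes, categories=['Small', 'Medium', 'Large'], ordered=True).codes
print(codes)  # [1 0 2 0]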
4.3 Frequency Encoding
Frequency encoding replaces each category with the frequency of its occurrence. This method is useful for high-cardinality categorical variables.
Example: Frequency Encoding
# Frequency Encoding
df_cat['Category_Frequency'] = df_cat['Category'].map(df_cat['Category'].value_counts())
print(df_cat)
Output:
  Category  Category_Encoded  Category_Frequency
0        A                 0                   2
1        B                 1                   1
2        A                 0                   2
3        C                 2                   1
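A useful variant: passing normalize=True to value_counts replaces raw counts with proportions, keeping the encoding on a fixed [0, 1] scale regardless of dataset size:
Example: Relative-Frequency Encoding
# 'A' appears in 50% of rows; 'B' and 'C' in 25% each
df_cat['Category_RelFreq'] = df_cat['Category'].map(df_cat['Category'].value_counts(normalize=True))
print(df_cat[['Category', 'Category_RelFreq']])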
5. Best Practices in Feature Engineering
5.1 Understanding the Data
Before creating or transforming features, thoroughly explore and understand your data. Use visualizations and statistical summaries to guide your feature engineering decisions.
5.2 Avoiding Data Leakage
Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. Ensure that feature engineering is only based on training data during model building.
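As a concrete illustration, the sketch below (using hypothetical train and test splits) computes scaling statistics on the training data only and then applies them to both splits; computing the mean and standard deviation on the full dataset would leak information from the test set:
Example: Leakage-Safe Standardization
# Hypothetical train/test split of a single feature
train = pd.DataFrame({'Feature': [10, 20, 30, 40]})
test = pd.DataFrame({'Feature': [25, 50]})
# Fit the scaling statistics on the training data ONLY
train_mean = train['Feature'].mean()
train_std = train['Feature'].std(ddof=0)
# Apply the training statistics to both splits; never recompute them on the test set
train['Feature_Std'] = (train['Feature'] - train_mean) / train_std
test['Feature_Std'] = (test['Feature'] - train_mean) / train_std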
5.3 Iterative Process
Feature engineering is an iterative process. Continuously refine your features based on model performance and new insights from the data.
5.4 Domain Knowledge
Incorporate domain knowledge whenever possible. Features that are meaningful within the context of the problem can significantly improve model performance.
6. Conclusion
6.1 Summary of Key Techniques
In this article, we covered various feature engineering techniques using pandas and NumPy, including creating interaction features, polynomial features, and encoding categorical variables. Effective feature engineering can drastically improve model performance by making the data more informative.
6.2 Further Reading
- Pandas Documentation: https://pandas.pydata.org/docs/
- NumPy Documentation: https://numpy.org/doc/