CatBoost with PyTorch Example
In this article, we will demonstrate how to integrate CatBoost with PyTorch using the Pima Indians Diabetes dataset. While CatBoost is not natively integrated with PyTorch, it can work alongside PyTorch for efficient data preprocessing and management.
1. Dataset Overview
The Pima Indians Diabetes dataset is a binary classification dataset used to predict whether a patient has diabetes (1) or not (0), based on medical features such as Pregnancies, Glucose, BMI, and others.
2. Importing Libraries
We will import PyTorch for data handling and CatBoost for model training. Additionally, we will use pandas for loading and manipulating the dataset.
# Import necessary libraries
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from catboost import CatBoostClassifier
3. Loading and Preprocessing the Dataset with PyTorch
We will load the dataset using pandas and wrap it in a custom PyTorch Dataset class, which a DataLoader can then use for batching, shuffling, and data transformation.
# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age', 'Outcome']
# Read the dataset into a pandas DataFrame
data = pd.read_csv(url, names=column_names)
# Split data into features (X) and target (y)
X = data.iloc[:, :-1].values # All columns except the last one (features)
y = data.iloc[:, -1].values # The last column (target variable)
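It is worth a quick sanity check before building the pipeline: the Pima dataset has 768 rows with 8 features plus the Outcome column, and the classes are moderately imbalanced (roughly 500 non-diabetic to 268 diabetic).
# Sanity check: dataset shape and class balance
print(data.shape)                       # expected: (768, 9)
print(data['Outcome'].value_counts())   # roughly 0: 500, 1: 268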
Next, we will define a custom PyTorch Dataset class to handle the data pipeline. It accepts an optional transform so that per-sample preprocessing (used in Section 6) can be plugged in.
class DiabetesDataset(Dataset):
    def __init__(self, features, targets, transform=None):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32)
        self.transform = transform  # optional callable applied to each (features, target) sample

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        sample = (self.features[idx], self.targets[idx])
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
# Create PyTorch datasets
train_dataset = DiabetesDataset(X, y)
# DataLoader for batching and shuffling
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
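Before handing the data to CatBoost, it is worth pulling a single batch to confirm the loader produces the shapes we expect (with BATCH_SIZE = 32 and 8 features, a batch of features should be 32 x 8):
# Sanity check: inspect one batch from the DataLoader
batch_features, batch_targets = next(iter(train_loader))
print(batch_features.shape)  # torch.Size([32, 8])
print(batch_targets.shape)   # torch.Size([32])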
4. Defining and Training the CatBoost Model
Since CatBoost requires NumPy arrays, we will extract the data from the PyTorch DataLoader and convert it back into NumPy arrays before passing it to the CatBoost classifier.
# Convert PyTorch DataLoader data to NumPy arrays for CatBoost
X_train, y_train = [], []
for batch_features, batch_targets in train_loader:
    X_train.extend(batch_features.numpy())
    y_train.extend(batch_targets.numpy())
X_train = np.array(X_train)
y_train = np.array(y_train)
# Initialize the CatBoost classifier
catboost_model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, random_state=42, verbose=0)
# Train the CatBoost model
catboost_model.fit(X_train, y_train)
In this example:
- iterations=100: Specifies the number of boosting iterations.
- learning_rate=0.1: Controls the step size of each boosting iteration.
- depth=6: Limits the maximum depth of each tree.
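CatBoost also has a built-in overfitting detector. As a hedged sketch (the validation split below is illustrative and not part of this article's pipeline), you can hold out part of the data and pass it to fit via eval_set together with early_stopping_rounds:
# Sketch: early stopping on a held-out validation set (illustrative split)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
model_es = CatBoostClassifier(iterations=500, learning_rate=0.1, depth=6, random_state=42, verbose=0)
model_es.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=20)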
5. Model Evaluation
After training, we will evaluate the CatBoost model by making predictions on the training data and calculating the accuracy score and confusion matrix.
# Make predictions on the training data
y_pred = catboost_model.predict(X_train)
# Evaluate the model
accuracy = accuracy_score(y_train, y_pred)
conf_matrix = confusion_matrix(y_train, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
Output:
- Accuracy: The fraction of correct predictions on the training data.
- Confusion Matrix: Provides insight into the number of true positives, false positives, true negatives, and false negatives.
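One caveat: the accuracy above is computed on the same data the model was trained on, so it overstates generalization. A minimal sketch of a fairer estimate uses scikit-learn's cross-validation (CatBoost exposes the scikit-learn estimator interface, which Section 7's GridSearchCV also relies on):
# Sketch: 5-fold cross-validated accuracy as a less optimistic estimate
from sklearn.model_selection import cross_val_score
cv_model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, random_state=42, verbose=0)
cv_scores = cross_val_score(cv_model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")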
6. Using PyTorch for Data Transformations with CatBoost
PyTorch is highly flexible for performing custom transformations and preprocessing before passing data to CatBoost. Here, we define a custom transformation (e.g., feature scaling) and apply it to the dataset before passing it to CatBoost.
# Define a PyTorch transformation (e.g., normalization)
class NormalizeTransform:
    def __call__(self, sample):
        features, target = sample
        # Standardize the sample: zero mean, unit variance across its own features
        features = (features - torch.mean(features)) / torch.std(features)
        return features, target
# Apply the transformation during dataset creation
train_dataset_transformed = DiabetesDataset(X, y, transform=NormalizeTransform())
train_loader_transformed = DataLoader(train_dataset_transformed, batch_size=BATCH_SIZE, shuffle=True)
# Convert the transformed data into NumPy arrays for CatBoost
X_train_transformed, y_train_transformed = [], []
for batch_features, batch_targets in train_loader_transformed:
    X_train_transformed.extend(batch_features.numpy())
    y_train_transformed.extend(batch_targets.numpy())
X_train_transformed = np.array(X_train_transformed)
y_train_transformed = np.array(y_train_transformed)
# Train the CatBoost model with the transformed data
catboost_model.fit(X_train_transformed, y_train_transformed)
This workflow demonstrates how PyTorch transformations can be applied before passing the processed data to CatBoost.
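One caveat: NormalizeTransform standardizes each sample across its own features, which mixes units (e.g., Glucose and BMI). If per-feature (column-wise) scaling is what you want, one sketch is to precompute column statistics on the full tensor before building the dataset:
# Sketch: per-feature (column-wise) standardization instead of per-sample
X_tensor = torch.tensor(X, dtype=torch.float32)
col_mean = X_tensor.mean(dim=0)
col_std = X_tensor.std(dim=0)
X_scaled = ((X_tensor - col_mean) / col_std).numpy()
train_dataset_columnwise = DiabetesDataset(X_scaled, y)
Note that gradient-boosted trees such as CatBoost split on feature thresholds and are largely insensitive to monotonic scaling, so transformations like this matter more for demonstrating the pipeline mechanics than for CatBoost's accuracy.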
7. Hyperparameter Tuning with PyTorch and CatBoost
To optimize the CatBoost model, we can perform hyperparameter tuning using GridSearchCV while leveraging PyTorch for data processing.
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
    'iterations': [100, 200],
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.2]
}
# Initialize the CatBoost classifier
catboost_grid = CatBoostClassifier(random_state=42, verbose=0)
# Perform GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=catboost_grid, param_grid=param_grid, cv=3, scoring='accuracy')
# Fit the model
grid_search.fit(X_train_transformed, y_train_transformed)
# Output the best parameters and accuracy score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Accuracy: {grid_search.best_score_:.2f}")
Output:
- Best Parameters: Displays the optimal hyperparameters found by GridSearchCV.
- Best Accuracy: Shows the best accuracy score from cross-validation.
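Because GridSearchCV refits the best configuration on the full data by default, grid_search.best_estimator_ is ready to use for prediction, for example:
# Use the refitted best model for prediction (training accuracy, so optimistic)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_train_transformed)
print(f"Training accuracy of best model: {accuracy_score(y_train_transformed, y_pred_best):.2f}")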
8. Feature Importance in CatBoost
CatBoost provides insights into feature importance, allowing us to understand which features contribute the most to the model's predictions. We can visualize feature importance using matplotlib.
# Plot feature importance
import matplotlib.pyplot as plt
feature_importances = catboost_model.get_feature_importance()
plt.barh(column_names[:-1], feature_importances)
plt.xlabel("Feature Importance")
plt.ylabel("Feature Names")
plt.title("Feature Importance in CatBoost")
plt.show()
This plot will display the relative importance of each feature used in the model.
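If you prefer a table over a plot, get_feature_importance also accepts a prettified flag that returns a sorted pandas DataFrame (note that the features appear as numeric indices here, since the model was trained on a bare NumPy array):
# Feature importances as a sorted DataFrame
importance_df = catboost_model.get_feature_importance(prettified=True)
print(importance_df)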
9. Conclusion
In this article, we demonstrated how to integrate CatBoost with PyTorch for a binary classification task using the Pima Indians Diabetes dataset. The steps covered include:
- Preprocessing data with PyTorch’s Dataset and DataLoader.
- Converting PyTorch data into NumPy format for training CatBoost.
- Training and evaluating the CatBoost model.
- Performing custom transformations using PyTorch before passing the data to CatBoost.
- Hyperparameter tuning using GridSearchCV for optimal model performance.
- Visualizing feature importance using CatBoost’s built-in feature importance method.
This approach shows how PyTorch can be used for preprocessing and managing data in complex machine learning pipelines while leveraging the power of CatBoost for gradient boosting.