CatBoost Practical Example Using Scikit-Learn

In this article, we will apply CatBoost to a classification task through its scikit-learn-compatible API. We will use the Pima Indians Diabetes dataset to show how to train a CatBoost model and evaluate its performance.


1. Dataset Overview

The Pima Indians Diabetes dataset is a binary classification dataset containing medical measurements used to predict whether a patient has diabetes (Outcome = 1) or not (Outcome = 0). The eight features are:

  • Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.

2. Importing Libraries

We will import CatBoostClassifier from the catboost library, along with numpy and pandas for data handling and scikit-learn utilities for splitting and evaluating the model.

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from catboost import CatBoostClassifier
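
If the catboost package is not already installed in your environment, it can be added from PyPI with a single command:

pip install catboost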

3. Loading the Dataset

We will load the dataset using pandas, then split it into features (X) and target (y). After that, we will perform a train-test split to evaluate the model.

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Read the dataset into a pandas DataFrame
data = pd.read_csv(url, names=column_names)

# Split data into features (X) and target (y)
X = data.iloc[:, :-1].values # All columns except the last one (features)
y = data.iloc[:, -1].values # The last column (target variable)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
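
The dataset is moderately imbalanced (about 65% of patients have Outcome = 0), so as an optional variant you can stratify the split to preserve the class ratio in both subsets:

# Optional variant: keep the Outcome class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)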

4. Training the CatBoost Model

Next, we will initialize the CatBoost classifier using the CatBoostClassifier class and fit it on the training data. Since the dataset contains only numerical features, no categorical features need to be declared (see the note after the training code).

# Initialize the CatBoost classifier
catboost_model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, random_state=42, verbose=0)

# Train the CatBoost model
catboost_model.fit(X_train, y_train)
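
For reference, if the feature matrix did contain categorical columns, CatBoost would encode them natively once their positions were passed through the cat_features argument of fit. The snippet below is purely illustrative; the column indices are hypothetical and do not apply to this dataset:

# Illustration only (not applicable here): if columns 0 and 2 held
# categorical values, CatBoost could encode them natively:
#
#     catboost_model.fit(X_train, y_train, cat_features=[0, 2])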

In this example:

  • iterations=100: Specifies the number of boosting iterations (trees).
  • learning_rate=0.1: Controls the step size at each iteration.
  • depth=6: Determines the maximum depth of each tree.
  • verbose=0: Suppresses the per-iteration training log.
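
As an optional refinement (a sketch, not required for this example), CatBoost's fit method also accepts an eval_set and an early_stopping_rounds argument, which stops boosting once the validation metric stops improving. Here we carve a validation set out of the training data:

# Hold out a validation set from the training data for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Allow more iterations, but stop if the validation metric stagnates for 20 rounds
model_es = CatBoostClassifier(iterations=500, learning_rate=0.1, depth=6,
                              random_state=42, verbose=0)
model_es.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=20)
print(f"Best iteration: {model_es.get_best_iteration()}")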

5. Model Evaluation

After training the model, we will evaluate it on the test set and calculate the accuracy score and confusion matrix to understand the model’s performance.

# Make predictions on the test set
y_pred = catboost_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

Output:

  • Accuracy: The fraction of test samples classified correctly; higher is better.
  • Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.
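
Because accuracy can be optimistic on imbalanced data, it helps to inspect additional metrics. As a small optional addition, the following uses scikit-learn's classification_report and the ROC AUC computed from predicted probabilities:

from sklearn.metrics import classification_report, roc_auc_score

# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred))

# ROC AUC is computed from the predicted probability of the positive class
y_proba = catboost_model.predict_proba(X_test)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.2f}")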

6. Hyperparameter Tuning with GridSearchCV

To further optimize the model’s performance, we can use GridSearchCV to perform hyperparameter tuning on key parameters such as iterations, depth, and learning_rate.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'iterations': [100, 200],
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Initialize the CatBoost classifier
catboost_grid = CatBoostClassifier(random_state=42, verbose=0)

# Perform GridSearchCV to find the best parameters
grid_search = GridSearchCV(estimator=catboost_grid, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Output the best parameters and best accuracy score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Accuracy: {grid_search.best_score_:.2f}")

Output:

  • Best Parameters: Displays the optimal hyperparameter values that result in the best accuracy.
  • Best Accuracy: Shows the best accuracy score achieved during cross-validation.
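
Because GridSearchCV refits the best configuration on the full training set by default (refit=True), the tuned model can then be evaluated on the held-out test set:

# Evaluate the refitted best model on the test set
best_model = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"Test Accuracy (tuned model): {test_accuracy:.2f}")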

7. Feature Importance in CatBoost

CatBoost provides insights into feature importance, which helps us understand which features contribute the most to the model’s predictions.

import matplotlib.pyplot as plt

# Compute and plot feature importance from the trained model
feature_importances = catboost_model.get_feature_importance()
plt.barh(column_names[:-1], feature_importances)
plt.xlabel("Feature Importance")
plt.ylabel("Feature Names")
plt.title("Feature Importance in CatBoost")
plt.show()

This will generate a bar chart showing the importance of each feature in making predictions. Features with higher importance scores have a greater impact on the model.
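
As a minor optional tweak, sorting the bars by importance makes the chart easier to read:

# Optional: sort bars by importance for readability
order = np.argsort(feature_importances)
plt.barh(np.array(column_names[:-1])[order], feature_importances[order])
plt.xlabel("Feature Importance")
plt.title("Feature Importance in CatBoost (sorted)")
plt.show()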


8. Conclusion

In this article, we demonstrated how to apply CatBoost through its scikit-learn-compatible API for a binary classification task. We covered:

  • Loading and preprocessing the Pima Indians Diabetes dataset.
  • Training and evaluating the CatBoost model using the accuracy score and the confusion matrix.
  • Hyperparameter tuning with GridSearchCV to optimize model performance.
  • Visualizing feature importance to understand which features contribute the most to the predictions.

CatBoost is an excellent choice for handling categorical data, and although this example focused on a purely numerical dataset, its built-in categorical feature handling and fast training make it a powerful algorithm for many machine learning tasks.