CatBoost with TensorFlow Example

In this article, we will show how to integrate CatBoost with TensorFlow using the Pima Indians Diabetes dataset. TensorFlow can be used for data pipeline management, while CatBoost handles the model training and prediction.


1. Dataset Overview

The Pima Indians Diabetes dataset consists of various medical predictor features (such as Pregnancies, Glucose, and BMI) and a binary target variable (Outcome) indicating whether a patient has diabetes (1) or not (0).


2. Importing Libraries

We will import TensorFlow to handle data pipelines and CatBoost to train the model. Although CatBoost isn’t natively integrated with TensorFlow, it can still be used effectively by converting TensorFlow data to NumPy arrays.

# Import necessary libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix
from catboost import CatBoostClassifier
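
As a quick illustration of the conversion mentioned above, the sketch below builds a small tensor from made-up values and turns it back into a NumPy array. It assumes TensorFlow 2.x with eager execution (the default), where tensors expose a .numpy() method.

```python
import numpy as np
import tensorflow as tf

# Hypothetical values, used only to show the round trip
tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Eager tensors expose .numpy(), which returns a NumPy ndarray
array = tensor.numpy()

print(type(array).__name__, array.shape)
```

The resulting array can be passed to CatBoost like any other NumPy array.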

3. Loading and Preprocessing the Dataset with TensorFlow

We will load the dataset using pandas and convert it into a TensorFlow Dataset object for better data management and batching.

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Read the dataset into a pandas DataFrame
data = pd.read_csv(url, names=column_names)

# Split data into features (X) and target (y)
X = data.iloc[:, :-1].values # All columns except the last one (features)
y = data.iloc[:, -1].values # The last column (target variable)

Next, we convert the data into a TensorFlow Dataset object and batch it for efficient data processing.

# Create a TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices((X, y))

# Shuffle and batch the dataset
BATCH_SIZE = 32
dataset = dataset.shuffle(buffer_size=len(data)).batch(BATCH_SIZE)
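
To see what shuffling and batching produce, here is a minimal sketch on synthetic data (random numbers standing in for the real features, so it runs without downloading anything). The names X_demo, y_demo, and demo_ds are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the real data: 100 rows, 8 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.random((100, 8))
y_demo = rng.integers(0, 2, size=100)

demo_ds = tf.data.Dataset.from_tensor_slices((X_demo, y_demo))
demo_ds = demo_ds.shuffle(buffer_size=len(X_demo)).batch(32)

# Collect the number of rows in each batch
batch_sizes = [int(batch_x.shape[0]) for batch_x, _ in demo_ds]
print(batch_sizes)
```

With 100 rows and a batch size of 32, the last batch holds the 4 leftover rows. By default the shuffle is redone on every pass over the dataset, so the row order differs between iterations.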

4. Defining and Training the CatBoost Model

CatBoost cannot consume a TensorFlow Dataset object directly, but it does accept NumPy arrays, so we need to extract the data from the TensorFlow dataset and convert it into NumPy format for model training.

# Convert TensorFlow dataset back to NumPy arrays for CatBoost
X_train, y_train = [], []
for batch_x, batch_y in dataset:
    X_train.extend(batch_x.numpy())
    y_train.extend(batch_y.numpy())

X_train = np.array(X_train)
y_train = np.array(y_train)

# Initialize the CatBoost classifier
catboost_model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, random_state=42, verbose=0)

# Train the CatBoost model
catboost_model.fit(X_train, y_train)

In this example:

  • iterations=100: Specifies the number of boosting iterations.
  • learning_rate=0.1: Controls the step size of the boosting process.
  • depth=6: Sets the depth of each tree (CatBoost grows symmetric trees by default).

5. Model Evaluation

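After training, we will evaluate the CatBoost model by making predictions on the same data it was trained on and calculating the accuracy score and confusion matrix. Because no rows were held out, these numbers measure how well the model fits the training data and will typically overstate its performance on unseen patients.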

# Make predictions on the training set
y_pred = catboost_model.predict(X_train)

# Evaluate the model
accuracy = accuracy_score(y_train, y_pred)
conf_matrix = confusion_matrix(y_train, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

Output:

  • Accuracy: The fraction of samples classified correctly (here, measured on the training data).
  • Confusion Matrix: Provides insights into the number of true positives, false positives, true negatives, and false negatives.
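
To make the confusion matrix easier to read, the sketch below unpacks its four cells on a handful of made-up labels and predictions (y_true and y_hat are hypothetical, not model output). For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, chosen only to illustrate the layout
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] into four scalars
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

demo_accuracy = (tp + tn) / (tp + tn + fp + fn)
demo_precision = tp / (tp + fp)  # share of predicted positives that are correct
demo_recall = tp / (tp + fn)     # share of actual positives that are found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={demo_accuracy:.2f}, Precision={demo_precision:.2f}, Recall={demo_recall:.2f}")
```

Precision and recall are often more informative than accuracy alone when the two classes are not equally frequent.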

6. Hyperparameter Tuning with TensorFlow Data

The NumPy arrays extracted from the TensorFlow pipeline can also be passed to scikit-learn's GridSearchCV for hyperparameter tuning. Here, we tune iterations, depth, and learning_rate to find the best combination for the CatBoost model.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'iterations': [100, 200],
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Initialize the CatBoost classifier
catboost_model = CatBoostClassifier(random_state=42, verbose=0)

# Perform GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=catboost_model, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit the model using the TensorFlow preprocessed data
grid_search.fit(X_train, y_train)

# Output the best parameters and best accuracy score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Accuracy: {grid_search.best_score_:.2f}")

Output:

  • Best Parameters: Displays the optimal parameters for CatBoost.
  • Best Accuracy: Shows the best accuracy score achieved using cross-validation.
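
Before launching a grid search, it can help to know how many models it will train. The sketch below counts the combinations in the grid above with scikit-learn's ParameterGrid and multiplies by the number of folds. The variable names cv_folds, n_candidates, and n_fits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'iterations': [100, 200],
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.2],
}
cv_folds = 3

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With GridSearchCV's default refit=True, one more fit follows on the full data using the best parameters, so runtime grows quickly as values are added to the grid.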

7. Feature Importance in CatBoost

CatBoost provides insights into which features contribute most to the model's predictions. We can visualize this by plotting feature importance using matplotlib.

# Plot feature importance
import matplotlib.pyplot as plt

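# GridSearchCV fits clones, so catboost_model itself is still unfitted here;
# read the importances from the refitted best estimator instead
feature_importances = grid_search.best_estimator_.get_feature_importance()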
plt.barh(column_names[:-1], feature_importances)
plt.xlabel("Feature Importance")
plt.ylabel("Feature Names")
plt.title("Feature Importance in CatBoost")
plt.show()

This chart shows the importance of each feature in making predictions. Features with higher importance scores have a greater impact on the outcome.
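
A ranked list is often easier to read than an unsorted chart. The sketch below sorts feature names by score. The importance values are made up for the illustration and are not output from the model trained above.

```python
import numpy as np

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Hypothetical importance scores, NOT real model output
importances = np.array([5.0, 30.0, 8.0, 6.0, 7.0, 20.0, 10.0, 14.0])

# argsort gives ascending order; reverse it to rank from most to least important
order = np.argsort(importances)[::-1]
ranked = [(feature_names[i], float(importances[i])) for i in order]

for name, score in ranked:
    print(f"{name:<26} {score:5.1f}")
```

In practice, the array returned by get_feature_importance() would take the place of the made-up scores.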


8. Using TensorFlow's Callbacks with CatBoost (Advanced)

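CatBoost does not integrate with TensorFlow's callbacks. Callbacks such as ModelCheckpoint and EarlyStopping are designed to be driven by a Keras model's fit() loop: on_epoch_end returns None, so it cannot signal a break, and ModelCheckpoint has no Keras model to save when CatBoost does the training. The same behavior can be reproduced with a short manual loop, where TensorFlow still handles data management and CatBoost is used for model training. Note that CatBoost's fit() retrains from scratch on every call, so each "epoch" below trains a new model with a larger iteration budget.

best_accuracy = 0.0
patience, wait = 5, 0

# Manually loop through CatBoost training with checkpoint and early-stopping logic
for epoch in range(10):
    # fit() starts from scratch, so grow the iteration budget each epoch
    catboost_model = CatBoostClassifier(iterations=10 * (epoch + 1), learning_rate=0.1,
                                        depth=6, random_state=42, verbose=0)
    catboost_model.fit(X_train, y_train)

    # Evaluate model accuracy after each epoch
    accuracy = accuracy_score(y_train, catboost_model.predict(X_train))
    print(f"Epoch {epoch+1}, Accuracy: {accuracy:.2f}")

    # Checkpoint: save only when accuracy improves (like save_best_only=True)
    if accuracy > best_accuracy:
        best_accuracy, wait = accuracy, 0
        catboost_model.save_model('catboost_model.cbm')
    else:
        # Early stopping: stop after `patience` epochs without improvement
        wait += 1
        if wait >= patience:
            break

This method gives you checkpointing and early stopping in the style of TensorFlow's callback system while using CatBoost for training.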


9. Conclusion

In this article, we demonstrated how to integrate CatBoost with TensorFlow using the Pima Indians Diabetes dataset. The workflow included:

  • Preprocessing data using TensorFlow's Dataset API.
  • Converting TensorFlow data back into NumPy format for CatBoost model training.
  • Training and evaluating the CatBoost model using TensorFlow-managed data.
  • Performing hyperparameter tuning with GridSearchCV.
  • Exploring feature importance to understand which features contribute the most to the predictions.

This approach allows you to combine TensorFlow's powerful data pipeline management with CatBoost's advanced tree-based boosting algorithms.