XGBoost Practical Example Using Scikit-Learn

In this article, we will walk through a practical example of implementing XGBoost using the scikit-learn API. We will apply XGBoost to a classification task using the UCI Pima Indians Diabetes dataset, a well-known dataset for binary classification.


1. Dataset Overview

The Pima Indians Diabetes dataset contains several medical predictor variables and a binary target variable indicating whether or not a patient has diabetes. The dataset includes the following features:

  • Pregnancies: Number of times pregnant.
  • Glucose: Plasma glucose concentration.
  • BloodPressure: Diastolic blood pressure (mm Hg).
  • SkinThickness: Triceps skinfold thickness (mm).
  • Insulin: 2-hour serum insulin (μU/ml).
  • BMI: Body mass index (weight in kg/(height in m)^2).
  • DiabetesPedigreeFunction: Diabetes pedigree function (genetic risk factor).
  • Age: Age in years.

The target variable (Outcome) is binary, where 1 indicates that the patient has diabetes and 0 indicates no diabetes.


2. Importing Libraries

First, we need to import the necessary libraries, including XGBoost and scikit-learn for model building and evaluation.

# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier

3. Loading the Dataset

We will load the Pima Indians Diabetes dataset from a CSV file.

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Read the dataset into a pandas DataFrame
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())
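It is also worth checking how the two outcome classes are balanced before modeling:

# Check how many patients fall into each class (0 = no diabetes, 1 = diabetes)
print(data['Outcome'].value_counts())

The dataset contains noticeably more non-diabetic than diabetic patients, which is worth keeping in mind when interpreting accuracy later on.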

4. Data Preprocessing

Before building the model, we need to split the dataset into features (X) and target (y) and perform a train-test split to evaluate the model.

# Split data into features (X) and target (y)
X = data.iloc[:, :-1] # All columns except the last one
y = data.iloc[:, -1] # The last column (Outcome)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
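Because the classes are somewhat imbalanced, one optional refinement (not used in the rest of this article) is to stratify the split so that the training and test sets preserve the original class proportions:

# Optional: a stratified split that preserves the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)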

5. Training the XGBoost Model

We will now initialize the XGBoost classifier using the XGBClassifier class from xgboost and train the model using the training data.

# Initialize the XGBoost classifier
model = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)

In this example:

  • learning_rate=0.1: Shrinks each tree's contribution, controlling the step size of each boosting iteration.
  • n_estimators=100: Specifies the number of boosting rounds (trees).
  • max_depth=3: Limits the maximum depth of the individual trees, keeping each one simple (a quick cross-validation check of these settings is sketched below).
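Before turning to the held-out test set, a quick way to sanity-check these settings is k-fold cross-validation on the training data. A minimal sketch using scikit-learn's cross_val_score (the exact scores will vary with your data split and library versions):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the configuration above, computed on the training set
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"Cross-validated accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")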

6. Model Evaluation

After training, we evaluate the model using the test set and compute the accuracy score and confusion matrix.

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

Interpreting the output:

  • Accuracy: The fraction of test-set samples the model classifies correctly.
  • Confusion Matrix: A breakdown of true negatives, false positives, false negatives, and true positives, showing exactly where the model's errors occur (see the sketch below for how to unpack it).
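For a binary problem, scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], so the individual counts can be unpacked directly:

# Unpack the confusion matrix: rows are actual classes, columns are predicted classes
tn, fp, fn, tp = conf_matrix.ravel()
print(f"True negatives: {tn}, False positives: {fp}")
print(f"False negatives: {fn}, True positives: {tp}")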

7. Hyperparameter Tuning with GridSearchCV

To further improve the performance of the XGBoost model, we can use GridSearchCV to tune the hyperparameters. Here, we will search for the best combination of max_depth, learning_rate, and n_estimators.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}

# Initialize the XGBoost classifier
xgb = XGBClassifier(random_state=42)

# Perform GridSearchCV to find the best parameters
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best accuracy
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.2f}")

Interpreting the output:

  • Best Parameters: The hyperparameter combination that achieved the highest mean cross-validated accuracy.
  • Best Accuracy: The mean cross-validated accuracy achieved with those parameters. Note that this score is measured on the training folds, not the test set (see the sketch below for a test-set evaluation of the tuned model).
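Because GridSearchCV refits the best configuration on the full training set by default, grid_search.best_estimator_ can be evaluated on the held-out test set directly:

# Evaluate the best model found by the grid search on the test set
best_model = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"Test accuracy of the tuned model: {test_accuracy:.2f}")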

8. Feature Importance

XGBoost provides insight into the importance of each feature in making predictions. We can visualize feature importance using the plot_importance function.

import matplotlib.pyplot as plt
from xgboost import plot_importance

# Plot feature importance on an appropriately sized figure
fig, ax = plt.subplots(figsize=(10, 6))
plot_importance(model, ax=ax)
plt.show()

This will display a bar chart of the relative importance of each feature (by default, plot_importance ranks features by "weight", i.e. how many times each feature is used to split the data). Features with higher importance scores contribute more to the model’s predictions.
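If you need the scores as numbers rather than a plot, the fitted scikit-learn wrapper also exposes them via feature_importances_. A minimal sketch (the exact metric behind these scores depends on the classifier's importance_type setting and your XGBoost version):

# Pair each feature name with its importance score and sort in descending order
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))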


9. Conclusion

In this article, we demonstrated how to implement XGBoost using Python's scikit-learn API for a binary classification task. We covered the following steps:

  • Loading and preprocessing the Pima Indians Diabetes dataset.
  • Training the XGBoost model and evaluating its performance.
  • Hyperparameter tuning using GridSearchCV to find the best parameters.
  • Visualizing feature importance to understand which features are most impactful.

XGBoost is a powerful and flexible algorithm, especially well-suited for tasks involving large datasets and complex relationships between variables. By leveraging its built-in support for hyperparameter tuning and feature importance, you can achieve high performance in a variety of machine learning tasks.