
LightGBM Practical Example with scikit-learn

In this example, we will implement LightGBM using the scikit-learn interface to predict house prices. We will use the California Housing Dataset from sklearn.datasets, which includes features such as median income, population, and other factors affecting housing prices.

We will follow these steps:

  • Load and preprocess the data.
  • Train a LightGBM model.
  • Evaluate the model's performance using metrics like Mean Absolute Error (MAE) and R-Squared (R²).

1. Install LightGBM

Before starting, ensure that LightGBM is installed. You can install it via pip:

pip install lightgbm
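To verify the installation, and to check which version you have (the training code below uses the callback-based early-stopping API of recent LightGBM releases), you can print the package version:

import lightgbm as lgb

# Confirm LightGBM is importable and check its version
print(lgb.__version__)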

2. Load the Dataset

We will use the California Housing Dataset, which is available in sklearn.datasets. This dataset contains information about various factors influencing house prices in different regions of California.

import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Load the dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print feature names for reference
print(data.feature_names)
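The print statement should output the dataset's eight feature names, which are worth keeping in mind for the feature-importance plot later on:

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']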

3. Train the LightGBM Model

We will use LightGBM’s LGBMRegressor, which is compatible with scikit-learn’s API. We’ll also set important hyperparameters such as n_estimators (number of trees) and learning_rate.

# Initialize the LightGBM regressor
model = lgb.LGBMRegressor(
    n_estimators=1000,   # Number of boosting rounds
    learning_rate=0.05,  # Learning rate for shrinkage
    max_depth=7,         # Maximum depth of trees to avoid overfitting
    random_state=42
)

# Train the model with early stopping on the evaluation set
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)]
)

In this example:

  • n_estimators=1000: The model will train up to 1000 trees unless early stopping triggers.
  • learning_rate=0.05: This controls how much each tree influences the final prediction.
  • max_depth=7: Limits the depth of trees to prevent overfitting.

The lgb.early_stopping callback stops training if the validation score doesn't improve for 10 consecutive rounds. (In LightGBM 4.x, this callback replaces the old early_stopping_rounds and verbose arguments to fit().) Note that we pass the test set as the evaluation set here for simplicity; in practice, a separate validation split is preferable so the test set stays untouched until the final evaluation.
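When early stopping triggers, you can check how many boosting rounds were actually kept. A minimal check using the attributes the scikit-learn wrapper exposes after fitting with an early-stopping callback:

# Number of boosting rounds kept after early stopping
print(f"Best iteration: {model.best_iteration_}")

# Best validation scores, keyed by eval-set name and metric
print(f"Best score: {model.best_score_}")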


4. Model Evaluation

Now that we have trained the model, we can evaluate its performance using the Mean Absolute Error (MAE) and R-Squared (R²) metrics on the test data.

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae:.2f}")
print(f"R-Squared: {r2:.2f}")

Interpretation:

  • MAE: The Mean Absolute Error measures the average magnitude of errors in a set of predictions, without considering their direction. A lower MAE means a better fit. Since the target here is the median house value in units of $100,000, an MAE of 0.3 would correspond to an average error of roughly $30,000.
  • R-Squared (R²): The R-Squared value indicates how well the model explains the variance in the target variable. An R² value close to 1 means the model fits the data well.

5. Feature Importance

LightGBM makes it easy to see which features contribute the most to the model’s predictions. Let’s plot the importance of each feature.

import matplotlib.pyplot as plt

# Plot feature importance
lgb.plot_importance(model, max_num_features=10)
plt.show()
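If you prefer the raw numbers over a plot, the same information is exposed through the scikit-learn-style feature_importances_ attribute. A small sketch that pairs each importance with its feature name:

# Pair each feature name with its importance score and sort descending
for name, score in sorted(
    zip(data.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(f"{name}: {score}")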

Example Output:

  • The plot will show the most important features for predicting housing prices, such as median income or house age.

6. Hyperparameter Tuning (Optional)

To improve the model’s performance, you can tune the hyperparameters using GridSearchCV or RandomizedSearchCV. Here’s an example using GridSearchCV (a RandomizedSearchCV variant is sketched after the explanation below):

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
'learning_rate': [0.01, 0.05, 0.1],
'n_estimators': [500, 1000],
'max_depth': [5, 7, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=lgb.LGBMRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='r2'
)

# Train the model with grid search
grid_search.fit(X_train, y_train)

# Print the best parameters and the best R-Squared score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best R-Squared Score: {grid_search.best_score_}")

Explanation:

  • GridSearchCV performs an exhaustive search over the parameter grid, fitting and scoring the model on every combination of parameters: here 3 × 2 × 3 = 18 combinations, each cross-validated over 5 folds, for 90 model fits in total.
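For larger search spaces, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying them all. A minimal sketch reusing the same grid (the n_iter value here is an illustrative choice, not from the original example):

from sklearn.model_selection import RandomizedSearchCV

# Sample 10 random combinations instead of exhaustively trying all 18
random_search = RandomizedSearchCV(
    estimator=lgb.LGBMRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=10,
    cv=5,
    scoring='r2',
    random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")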

Summary

In this example, we successfully implemented LightGBM using scikit-learn to predict house prices. We walked through:

  1. Loading and preprocessing the dataset.
  2. Training the model using LGBMRegressor.
  3. Evaluating the model’s performance with MAE and R-Squared.
  4. Visualizing feature importance.

LightGBM provides a fast and efficient way to handle large datasets and complex models, and with its native integration with scikit-learn, it’s easy to include in machine learning pipelines.
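As a quick illustration of that pipeline integration, here is a minimal sketch that drops LGBMRegressor into a standard scikit-learn Pipeline (the scaler step is purely illustrative, since tree-based models do not require feature scaling):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# LGBMRegressor works like any other scikit-learn estimator in a Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # illustrative only; trees don't need scaling
    ('model', lgb.LGBMRegressor(n_estimators=500, random_state=42))
])

pipeline.fit(X_train, y_train)
print(f"Pipeline R-Squared: {pipeline.score(X_test, y_test):.2f}")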

In the next sections, we’ll explore integrating LightGBM with TensorFlow and PyTorch.