Practical Example in scikit-learn
In this example, we will apply linear regression using the scikit-learn library to predict house prices from features such as median income, average number of rooms, and geographic location.
We will use the California Housing dataset, which describes California census block groups with variables such as median house value, population, and median income. scikit-learn provides a loader for it (fetch_california_housing, which downloads the data on first use), and it is a common replacement for the Boston Housing dataset, which was deprecated and later removed from scikit-learn.
Objective:
We aim to build a model that predicts house prices based on input features, and evaluate the model's performance using different metrics.
Steps in This Practical Example:
- Load and Explore the Dataset: Understand the features and targets.
- Split the Dataset: Separate data into training and testing sets.
- Train the Linear Regression Model: Fit a linear regression model to the data.
- Evaluate the Model: Use performance metrics such as R² and RMSE.
- Make Predictions: Predict the house price for a new set of input features.
Step 1: Load and Explore the Dataset
Let's start by loading the California Housing dataset and taking a quick look at the features and target variables.
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
# Load the California Housing dataset
california_housing = fetch_california_housing()
# Convert to a Pandas DataFrame for better readability
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = pd.Series(california_housing.target, name='Price')
# Display first few rows
print(X.head())
print(f"Target values (Price): {y.head()}")
Feature Explanation:
- MedInc: Median income in block group.
- HouseAge: Median house age in block group.
- AveRooms: Average number of rooms per household.
- AveBedrms: Average number of bedrooms per household.
- Population: Total population in block group.
- AveOccup: Average number of household members.
- Latitude: Latitude coordinate of the block group.
- Longitude: Longitude coordinate of the block group.
Step 2: Split the Dataset
We will split the dataset into training and testing sets, which will allow us to train the model on one portion and evaluate its performance on another.
# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Check the shape of the training and testing sets
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
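To make the 80/20 split concrete, here is a minimal pure-Python sketch of a shuffled split. It is not scikit-learn's implementation: the helper name simple_train_test_split is hypothetical, and the rows it selects will differ from those chosen by train_test_split even with the same seed. It only illustrates how the row counts come out for a dataset of 20,640 rows, which is the size of the California Housing dataset.

```python
import math
import random


def simple_train_test_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a shuffled split.

    Hypothetical helper for illustration only; not scikit-learn's algorithm.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # reproducible shuffle
    n_test = math.ceil(n_rows * test_size)  # round the test set size up
    return indices[n_test:], indices[:n_test]


# The California Housing dataset has 20,640 rows.
train_idx, test_idx = simple_train_test_split(20640, test_size=0.2)
print(len(train_idx), len(test_idx))  # 16512 4128
```

Shuffling before slicing matters because rows in a dataset are often ordered (for example, by geography), and an unshuffled split could leave the test set unrepresentative.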
Step 3: Train the Linear Regression Model
Next, we'll initialize the LinearRegression model from scikit-learn and train it using the training set.
from sklearn.linear_model import LinearRegression
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Output the model's learned coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
Interpretation:
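- The intercept is the predicted price when all features are zero. That point lies far outside the data (for example, a latitude and longitude of zero), so the intercept has little practical meaning on its own.
- Each coefficient is the change in predicted price (in units of $100,000) for a one-unit increase in the corresponding feature, with all other features held constant.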
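The interpretation above amounts to a simple formula: a prediction is the intercept plus the sum of each coefficient multiplied by its feature value. The sketch below uses made-up numbers for two features (they are not the coefficients this model learns, and linear_prediction is a hypothetical helper) to show that raising one feature by one unit changes the prediction by exactly that feature's coefficient.

```python
def linear_prediction(intercept, coefficients, features):
    """Compute intercept + sum(coefficient * feature value)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical values, chosen only for illustration.
intercept = 0.5
coefficients = [0.25, -0.125]  # one coefficient per feature

base = linear_prediction(intercept, coefficients, [4.0, 2.0])
bumped = linear_prediction(intercept, coefficients, [5.0, 2.0])  # first feature + 1

print(base)           # 1.25
print(bumped - base)  # 0.25, the first coefficient
```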
Step 4: Evaluate the Model
Once trained, we need to evaluate the model's performance using the test set. The R² score (coefficient of determination) and the Root Mean Squared Error (RMSE) are commonly used metrics.
from sklearn.metrics import r2_score, mean_squared_error
# Predict house prices on the test set
y_pred = model.predict(X_test)
# Calculate R² score
r2 = r2_score(y_test, y_pred)
print(f"R² score: {r2}")
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse}")
Model Evaluation Metrics:
- R² Score: The proportion of variance in the dependent variable that is predictable from the independent variables. A value close to 1 indicates a good fit.
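- RMSE: The square root of the mean squared difference between predicted and actual values, expressed in the same units as the target (here, $100,000s). Because errors are squared before averaging, large errors weigh more heavily. The lower the RMSE, the better the model performance.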
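To show what these two metrics compute, here is a minimal pure-Python sketch of the standard formulas: R² = 1 - SS_res / SS_tot, and RMSE = the square root of the mean squared error. The function names and the four sample values are illustrative and are not taken from the housing data; in practice, r2_score and mean_squared_error do this work.

```python
import math


def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse_manual(y_true, y_pred):
    """RMSE = square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Small illustrative sample (not from the housing data).
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

print(round(r2_manual(y_true_demo, y_pred_demo), 3))    # 0.949
print(round(rmse_manual(y_true_demo, y_pred_demo), 3))  # 0.612
```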
Step 5: Make Predictions on New Data
Now, let's use our trained model to predict house prices for a new set of input features.
# New data for prediction (just an example, you can input any custom data)
new_data = pd.DataFrame({
'MedInc': [8.3252], # Median income of the block group
'HouseAge': [41.0], # Median house age
'AveRooms': [6.9841], # Average number of rooms per household
'AveBedrms': [1.0238], # Average number of bedrooms per household
'Population': [322.0], # Block group population
'AveOccup': [2.5556], # Average number of occupants per household
'Latitude': [37.88], # Latitude
'Longitude': [-122.23] # Longitude
})
# Predict the house price for this new data
predicted_price = model.predict(new_data)
print(f"Predicted House Price: {predicted_price[0]:.2f} (in 100,000s)")
The new data can describe any block group with these eight features, and the model returns the predicted median house value for it, in units of $100,000.
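Because the prediction is expressed in units of $100,000, it can help to convert it to dollars before displaying it. The helper below is a hypothetical convenience function, and the value 2.5 is an arbitrary stand-in rather than an actual model output.

```python
def to_dollars(price_in_100k):
    """Convert a value in units of $100,000 to whole dollars."""
    return round(price_in_100k * 100_000)


example_prediction = 2.5  # arbitrary stand-in, not a real model output
print(f"${to_dollars(example_prediction):,}")  # $250,000
```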
Summary and Key Takeaways:
- We successfully trained a linear regression model using scikit-learn and evaluated it using the California Housing dataset.
- We used the R² score and RMSE to measure the performance of the model.
- The model can now predict house prices based on a variety of input features such as income, house age, and location.
- This example can easily be adapted to other datasets and prediction tasks by following the same steps.
In the next section, we will explore how to implement linear regression using other libraries such as TensorFlow and PyTorch to compare performance and flexibility across different machine learning frameworks.