Linear Regression with TensorFlow

In this tutorial, we will implement linear regression using TensorFlow. This will give us more flexibility and control over the model training process compared to higher-level libraries like scikit-learn.

We will again use the California Housing dataset for predicting house prices, as it contains multiple useful features for training a regression model. The steps outlined below will help you understand how to manually set up, train, and evaluate a linear regression model using TensorFlow's low-level API.

Objective:

We aim to build a TensorFlow linear regression model to predict house prices based on several features like median income, house age, population, and location.


Steps in This Practical Example:

  1. Load and Explore the Dataset: Understand the dataset and prepare it for training.
  2. Create the Model: Define the linear regression model using TensorFlow.
  3. Define the Loss Function and Optimizer: Choose how the error is measured and minimized.
  4. Train the Model: Train the model on the dataset.
  5. Evaluate the Model: Calculate the loss and measure performance.
  6. Make Predictions: Use the trained model to predict new house prices.

Step 1: Load and Explore the Dataset

First, we'll load the California Housing dataset with scikit-learn's fetch_california_housing, split it into training and test sets, and normalize the features. The data is converted to TensorFlow tensors later, just before training.

import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Load California Housing dataset
california_housing = fetch_california_housing()

# Convert to Pandas DataFrame for easier exploration
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = pd.Series(california_housing.target, name='Price')

# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features using training-set statistics only, so no test information leaks into training
mean = X_train.mean()
std = X_train.std()
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std

Why Normalization?

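We standardize each input feature to zero mean and unit variance, using the mean and standard deviation of the training set. The raw features differ in scale by orders of magnitude (compare Population with MedInc), and putting them on a common scale lets gradient descent converge reliably with a single learning rate.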
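As a minimal sketch of the same idea outside pandas, the snippet below standardizes a small made-up feature matrix with NumPy. The toy values are assumptions chosen for illustration and are not taken from the dataset.

```python
import numpy as np

# Toy feature matrix (hypothetical values): column 0 is on a scale of ~1,
# column 1 on a scale of ~1000.
X_train = np.array([[1.0, 1000.0],
                    [2.0, 3000.0],
                    [3.0, 2000.0],
                    [4.0, 4000.0]])
X_test = np.array([[2.5, 2500.0]])

# Statistics come from the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0, ddof=1)  # ddof=1 matches the pandas default used above

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std  # reuse the training statistics

print(X_train_norm.mean(axis=0))         # approximately 0 for every column
print(X_train_norm.std(axis=0, ddof=1))  # approximately 1 for every column
```

After this step both columns contribute on a comparable scale, regardless of their original units.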


Step 2: Create the TensorFlow Linear Regression Model

Next, we'll define our linear regression model. With TensorFlow's low-level API, we declare the weights (W) and the bias (b) ourselves as tf.Variable objects, so that TensorFlow can track and update them during training.

# Set seeds for reproducibility (the training loop shuffles with NumPy)
tf.random.set_seed(42)
np.random.seed(42)

# Initialize weights and bias for the linear regression model
W = tf.Variable(tf.random.normal([X_train.shape[1], 1]), dtype=tf.float32) # Shape: (num_features, 1)
b = tf.Variable(tf.zeros([1]), dtype=tf.float32)

# Define the linear regression model
def linear_regression(X):
    return tf.matmul(X, W) + b

Explanation:

  • W: The weights, initialized as random values, with one weight for each feature.
  • b: The bias term, initialized to zero.
  • linear_regression(X): This function defines the prediction as the matrix product of the input features and the weights, plus the bias term. For a batch of shape (n, num_features) it returns one prediction per row, with shape (n, 1).
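The shapes involved can be checked with a short NumPy sketch. It mirrors the TensorFlow function above but is not the tutorial's code; the random inputs and the batch size of 4 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_samples, num_features = 4, 8
X = rng.normal(size=(num_samples, num_features))  # (4, 8) batch of inputs
W = rng.normal(size=(num_features, 1))            # (8, 1) one weight per feature
b = np.zeros(1)                                   # (1,) broadcast across all rows

y_hat = X @ W + b                                 # (4, 1) one prediction per sample

# The same result, computed row by row as a dot product plus the bias.
y_hat_manual = np.array([[X[i] @ W[:, 0] + b[0]] for i in range(num_samples)])

print(y_hat.shape)
print(np.allclose(y_hat, y_hat_manual))
```

Each row of the matrix product is the dot product of one sample with the weight vector, which is why the model needs exactly one weight per feature.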

Step 3: Define Loss Function and Optimizer

We'll use Mean Squared Error (MSE) as our loss function and Stochastic Gradient Descent (SGD) as the optimizer to minimize the error.

# Define the loss function (Mean Squared Error)
def mean_squared_error(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

# Define the optimizer (Stochastic Gradient Descent)
optimizer = tf.optimizers.SGD(learning_rate=0.01)

Explanation:

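  • MSE: The average of the squared differences between the predicted and actual values. It is a standard loss function for regression and penalizes large errors more heavily than small ones.
  • SGD: An optimizer that moves the weights and bias a small step, scaled by the learning rate (here 0.01), in the direction opposite to the gradient of the loss with respect to them.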
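To make the update rule concrete, here is a hand-worked sketch in NumPy of one gradient descent step on the MSE loss. It uses the analytic gradients of MSE for a linear model rather than TensorFlow's automatic differentiation, and the three data points (following y = 2x) are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical toy data following y = 2x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])

W = np.zeros((1, 1))
b = np.zeros(1)
learning_rate = 0.01

y_pred = X @ W + b
loss_before = mse(y, y_pred)

# Analytic gradients of MSE for the model y_pred = X @ W + b.
n = len(X)
grad_W = (2.0 / n) * X.T @ (y_pred - y)          # shape (1, 1)
grad_b = (2.0 / n) * np.sum(y_pred - y, axis=0)  # shape (1,)

# One gradient descent step: move against the gradient.
W = W - learning_rate * grad_W
b = b - learning_rate * grad_b

loss_after = mse(y, X @ W + b)
print(loss_before, loss_after)  # the loss decreases after the step
```

Both gradients are negative here because the initial predictions are too low, so the update increases W and b and the loss drops.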

Step 4: Train the Model

We will now train the model over several epochs using gradient descent to minimize the loss. During each epoch, the weights and bias will be updated to reduce the error.

# Training parameters
epochs = 500
batch_size = 32
steps_per_epoch = len(X_train_norm) // batch_size

# Function to perform one step of training
def train_step(X_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = linear_regression(X_batch)
        loss = mean_squared_error(y_batch, predictions)
    gradients = tape.gradient(loss, [W, b])
    optimizer.apply_gradients(zip(gradients, [W, b]))
    return loss

# Convert data to TensorFlow tensors
X_train_tensor = tf.convert_to_tensor(X_train_norm, dtype=tf.float32)
y_train_tensor = tf.convert_to_tensor(y_train.values.reshape(-1, 1), dtype=tf.float32)

# Training loop
for epoch in range(epochs):
    # Shuffle the data and create batches
    indices = np.random.permutation(len(X_train_tensor))
    X_train_shuffled = tf.gather(X_train_tensor, indices)
    y_train_shuffled = tf.gather(y_train_tensor, indices)

    for i in range(steps_per_epoch):
        X_batch = X_train_shuffled[i * batch_size: (i + 1) * batch_size]
        y_batch = y_train_shuffled[i * batch_size: (i + 1) * batch_size]
        loss = train_step(X_batch, y_batch)

    # Every 50 epochs, report the loss of the last mini-batch
    if epoch % 50 == 0:
        print(f"Epoch {epoch}, Loss: {loss.numpy():.4f}")

Explanation:

  • GradientTape: Records operations for automatic differentiation to compute the gradients of the loss with respect to the model parameters.
  • apply_gradients: Updates the weights and bias using the calculated gradients.
  • Epochs: The number of complete passes through the training data.
  • Batch Size: Number of samples per gradient update. We use batch_size = 32.
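The shuffling and slicing logic can be isolated in a small generator. The NumPy sketch below is a variation on the loop above, not the tutorial's code: the helper name iterate_minibatches is hypothetical, and unlike the integer division used for steps_per_epoch, it also yields a final partial batch when the number of samples is not a multiple of the batch size.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs, including a final partial batch."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        # The same indices select from X and y, so rows stay paired.
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(42)
X = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10, dtype=np.float32).reshape(10, 1)  # 10 targets

batches = list(iterate_minibatches(X, y, batch_size=4, rng=rng))
batch_sizes = [len(X_batch) for X_batch, _ in batches]
print(batch_sizes)  # [4, 4, 2]
```

Indexing features and targets with the same permutation is what keeps each sample aligned with its label after shuffling, which is the role tf.gather plays in the training loop.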

Step 5: Evaluate the Model

After training, we evaluate the model using the test set. We'll calculate both the R² score and Root Mean Squared Error (RMSE) to see how well the model generalizes.

from sklearn.metrics import r2_score

# Convert test data to tensors
X_test_tensor = tf.convert_to_tensor(X_test_norm, dtype=tf.float32)
y_test_tensor = tf.convert_to_tensor(y_test.values.reshape(-1, 1), dtype=tf.float32)

# Predict house prices using the trained model
y_pred_tensor = linear_regression(X_test_tensor)
y_pred = y_pred_tensor.numpy()

# Calculate R² score
r2 = r2_score(y_test, y_pred)
print(f"R² score: {r2}")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test_tensor, y_pred_tensor).numpy())
print(f"Root Mean Squared Error (RMSE): {rmse}")

Explanation:

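  • R² Score: The proportion of the variance in the target variable that the model explains. A score of 1.0 is a perfect fit, and 0 is what a model that always predicts the mean would achieve.
  • RMSE: The square root of the mean squared error. It gives the typical magnitude of prediction errors in the same units as the target variable (here, house prices in units of $100,000).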
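Both metrics are short formulas, and a from-scratch NumPy sketch shows what they compute. The helper names rmse and r2 and the four toy values are assumptions for illustration; in the tutorial itself, r2_score comes from scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Square root of the mean squared error.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))

# Always predicting the mean of the targets gives an R² of 0.
baseline = np.full_like(y_true, y_true.mean())
print(r2(y_true, baseline))
```

Here the residual sum of squares is 0.5 and the total sum of squares is 5.0, so R² is 0.9 and RMSE is the square root of 0.125.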

Step 6: Make Predictions on New Data

We can now use the trained model to predict house prices for new data.

# Example of new data for prediction
new_data = pd.DataFrame({
    'MedInc': [8.3252],
    'HouseAge': [41.0],
    'AveRooms': [6.9841],
    'AveBedrms': [1.0238],
    'Population': [322.0],
    'AveOccup': [2.5556],
    'Latitude': [37.88],
    'Longitude': [-122.23]
})

# Normalize new data based on training set statistics
new_data_norm = (new_data - mean) / std

# Convert to tensor and predict
new_data_tensor = tf.convert_to_tensor(new_data_norm, dtype=tf.float32)
predicted_price = linear_regression(new_data_tensor).numpy()

print(f"Predicted House Price: {predicted_price[0][0]:.2f} (in units of $100,000)")

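This allows you to use the trained model for real-world predictions. Any new input must be normalized with the training-set mean and standard deviation, exactly as the training data was.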


Summary and Key Takeaways:

  • We implemented linear regression from scratch using TensorFlow's low-level API.
  • The model was trained using the California Housing dataset, and we used MSE as the loss function and SGD as the optimizer.
  • We evaluated the model's performance using both R² score and RMSE.
  • Finally, we used the model to predict house prices on new data.

In the next section, we will explore how to implement linear regression using PyTorch to compare its flexibility and ease of use with TensorFlow.