Logistic Regression with Scikit-Learn

In this practical example, we will use Logistic Regression from the scikit-learn library to predict whether a person has diabetes, based on health-related variables from the Pima Indians Diabetes dataset. This is a standard binary classification problem: the model outputs one of two classes (diabetic or not diabetic).


Steps Covered:

  1. Loading and exploring the dataset.
  2. Splitting the data into training and testing sets.
  3. Training a logistic regression model.
  4. Evaluating the model’s performance.
  5. Making predictions on new data.

1. Load and Explore the Dataset

We'll begin by loading the Pima Indians Diabetes dataset using pandas and performing a quick exploration of the data to understand its structure.

# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Load the data into a pandas DataFrame
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())

Dataset Information:

  • Pregnancies: Number of times pregnant.
  • Glucose: Plasma glucose concentration (2 hours after a glucose tolerance test).
  • BloodPressure: Diastolic blood pressure (mm Hg).
  • SkinThickness: Triceps skinfold thickness (mm).
  • Insulin: 2-Hour serum insulin (mu U/ml).
  • BMI: Body mass index (weight in kg / (height in m)^2).
  • DiabetesPedigreeFunction: Diabetes pedigree function (a measure of genetic influence).
  • Age: Age (years).
  • Outcome: Binary outcome (1 = diabetic, 0 = not diabetic).
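
Beyond data.head(), a quick exploration usually checks for missing values and class balance. The sketch below is only illustrative: it uses a small hand-made DataFrame (hypothetical values, same column names as above) so that it runs without a download, and it assumes that a 0 in columns such as Glucose or BMI stands for a missing measurement rather than a real reading. The same two checks can be run on the full `data` DataFrame.

```python
import pandas as pd

# Hypothetical miniature sample using the tutorial's column names
sample = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'BloodPressure': [72, 66, 64, 0],
    'SkinThickness': [35, 29, 0, 23],
    'Insulin': [0, 0, 0, 94],
    'BMI': [33.6, 26.6, 23.3, 0.0],
    'Outcome': [1, 0, 1, 0],
})

# Assumption: a value of 0 in these columns marks a missing measurement
suspect_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count how many zeros each suspect column contains
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)

# Share of each class in the target column
class_balance = sample['Outcome'].value_counts(normalize=True)
print(class_balance)
```

If many zeros turn up, they may need to be imputed or otherwise handled before training; the rest of this tutorial uses the data as loaded.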

2. Split the Dataset into Training and Testing Sets

Next, we’ll split the dataset into training and testing sets to evaluate the model's performance on unseen data.

# Separate features and target variable
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split the data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Explanation:

  • We use train_test_split to randomly split the data into training and testing sets.
  • 80% of the data is used for training, and 20% is reserved for testing.
  • Setting random_state=42 fixes the random shuffle, so the same split is produced on every run.
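
A purely random split can leave the two classes in slightly different proportions in the training and testing sets. One possible refinement, not used in the code above, is the stratify argument of train_test_split. The sketch below uses hypothetical toy arrays (X_demo, y_demo) to show that stratifying keeps the positive rate the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 100 samples, 30% positive
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo asks for the class proportions to be preserved in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate (train): {y_tr.mean():.2f}")
print(f"Positive rate (test): {y_te.mean():.2f}")
```

For the diabetes data, the equivalent call would pass stratify=y alongside the existing arguments.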

3. Train the Logistic Regression Model

Now, we’ll train the Logistic Regression model using the training set.

from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model
model = LogisticRegression(max_iter=1000)

# Train the model on the training data
model.fit(X_train, y_train)

# Display the learned coefficients and intercept
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Explanation:

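  • We initialize the LogisticRegression model with max_iter=1000, which raises the solver's iteration limit above the default of 100. With unscaled features such as these, the default limit is often too low for the solver to converge.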
  • The model learns the coefficients $\beta_1, \beta_2, \dots, \beta_n$ and the intercept $\beta_0$ from the training data.
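
To see how the learned parameters are used, recall that logistic regression models the probability of the positive class as $P(y = 1 \mid x) = 1 / (1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)})$. The sketch below checks this on synthetic data from make_classification (the names X_demo, y_demo, and clf are illustrative and separate from the diabetes model): applying the sigmoid to the linear combination by hand should reproduce predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear combination: beta_0 + beta_1 * x_1 + ... + beta_n * x_n
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid maps the linear score to a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_proba = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```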

4. Evaluate the Model

After training, we need to evaluate how well the model performs on the test set. We will calculate accuracy and examine the confusion matrix, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict the labels for the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

Evaluation Metrics:

  • Accuracy: The percentage of correctly predicted labels.
  • Confusion Matrix: Shows the true positives, true negatives, false positives, and false negatives.
  • Precision: The ratio of true positives to all predicted positives.
  • Recall: The ratio of true positives to all actual positives.
  • F1-Score: The harmonic mean of precision and recall.
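
The definitions above can be checked by hand. The following sketch uses a small set of hypothetical labels (y_true and y_hat are made up for illustration), unpacks the confusion matrix into its four cells, and computes precision, recall, and F1 from them, then compares the results with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions for ten samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # true positives over all predicted positives
recall = tp / (tp + fn)     # true positives over all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.2f} (sklearn: {precision_score(y_true, y_hat):.2f})")
print(f"Recall: {recall:.2f} (sklearn: {recall_score(y_true, y_hat):.2f})")
print(f"F1: {f1:.2f} (sklearn: {f1_score(y_true, y_hat):.2f})")
```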

Interpreting the Metrics:

  • Accuracy: Measures overall performance, but may be misleading in imbalanced datasets.
  • Confusion Matrix: Gives a clearer picture of where the model is making errors (e.g., false positives and false negatives).
  • Precision, Recall, F1-Score: These metrics are especially useful in cases of class imbalance.
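
The caveat about accuracy on imbalanced data can be made concrete with a deliberately useless classifier. In the sketch below (hypothetical labels, 90% negative), a "model" that always predicts the majority class reaches high accuracy while finding none of the positive cases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_true_imb = np.array([0] * 90 + [1] * 10)

# A trivial "classifier" that always predicts the majority class (0)
y_majority = np.zeros_like(y_true_imb)

acc = accuracy_score(y_true_imb, y_majority)
rec = recall_score(y_true_imb, y_majority)

print(f"Accuracy: {acc:.2f}")
print(f"Recall for the positive class: {rec:.2f}")
```

This is why the confusion matrix and the per-class metrics are worth reading alongside accuracy.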

5. Make Predictions on New Data

Finally, let’s use the trained logistic regression model to make predictions on new, unseen data.

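# New data (example of a single data point), wrapped in a DataFrame so that
# the column names match the features the model was trained on
new_data = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50]], columns=X_train.columns)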

# Predict whether the person has diabetes or not
predicted_outcome = model.predict(new_data)

# Predict the probability that the person has diabetes
predicted_proba = model.predict_proba(new_data)

print(f"Predicted Outcome: {'Diabetic' if predicted_outcome[0] == 1 else 'Not Diabetic'}")
print(f"Probability of being Diabetic: {predicted_proba[0][1] * 100:.2f}%")

Explanation:

  • We create an example new data point to predict whether the person is diabetic.
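  • The predict method returns the predicted class (0 or 1), and predict_proba gives the probability of each class. The columns follow the order of model.classes_ (here 0, then 1), so index 1 is the probability of being diabetic.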
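
For a binary model, predict amounts to thresholding the class-1 probability at 0.5. The sketch below illustrates this on synthetic data (X_demo, y_demo, and clf are illustrative names, not the diabetes model) and shows how a different cutoff could be applied by hand. The value 0.3 is an arbitrary assumption, chosen only to show that a lower threshold flags more cases as positive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)  # shape (n_samples, 2); each row sums to 1
default_pred = clf.predict(X_demo)

# Reproduce predict by thresholding the class-1 probability at 0.5
manual_pred = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(default_pred, manual_pred))

# Assumed alternative cutoff: a lower threshold trades precision for recall
sensitive_pred = (proba[:, 1] >= 0.3).astype(int)
print(f"Positives at 0.5: {default_pred.sum()}, at 0.3: {sensitive_pred.sum()}")
```

In a screening setting, where missing a diabetic patient may cost more than a false alarm, a lower threshold might be preferable; the right value depends on the application.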

Summary

In this section, we walked through the process of using Logistic Regression with scikit-learn to classify diabetes in the Pima Indians Diabetes dataset. We covered:

  • Loading and exploring the dataset.
  • Splitting the data into training and testing sets.
  • Training a logistic regression model.
  • Evaluating the model using accuracy, confusion matrix, and classification report.
  • Making predictions on new data.

This tutorial shows that a few lines of scikit-learn are enough to train and evaluate a logistic regression baseline for a binary classification problem. In the next sections, we will explore how to implement logistic regression using other machine learning libraries such as TensorFlow and PyTorch.