Support Vector Machines with Scikit-Learn
In this article, we will walk through a practical example of implementing Support Vector Machines (SVM) using scikit-learn. We will apply SVM for classification on a popular dataset, using different kernels, and evaluate the model’s performance.
Steps Covered:
- Loading and preparing the dataset.
- Training a linear SVM model.
- Using different kernels (RBF kernel).
- Evaluating the model’s performance.
- Hyperparameter tuning with GridSearchCV.
1. Load and Prepare the Dataset
For this example, we will use the Iris dataset, a well-known dataset for classification tasks, which is available directly from scikit-learn.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
# Convert to pandas DataFrame for easier manipulation
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['target'] = iris.target
# Display the first few rows of the dataset
print(data.head())
Dataset Information:
- The Iris dataset consists of 150 samples: 50 from each of three species of Iris flowers.
- The features include:
  - Sepal length and sepal width.
  - Petal length and petal width.
- The target variable represents the species (0: Setosa, 1: Versicolor, 2: Virginica).
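The class balance described above can be checked directly from the loaded data. The snippet below is a small sanity check rather than part of the modeling pipeline; the variable name `counts` is only illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples carry each class label (0, 1, 2)
counts = np.bincount(iris.target)

for name, count in zip(iris.target_names, counts):
    print(f"{name}: {count} samples")

# Overall shape: (number of samples, number of features)
print("Shape of feature matrix:", iris.data.shape)
```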
We will focus on classifying two of the species, so we will reduce the dataset to a binary classification problem.
# Select only two classes for binary classification (Setosa and Versicolor)
data = data[data['target'] != 2]
# Split the data into features and target
X = data.drop('target', axis=1)
y = data['target']
# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features (SVM is sensitive to feature scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Explanation:
- We reduced the dataset to a binary classification problem by selecting only the classes 0 (Setosa) and 1 (Versicolor).
- The data was split into training and testing sets, and the features were standardized using StandardScaler to improve the performance of the SVM model.
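To see what the standardization step does, the sketch below repeats the preparation with plain NumPy arrays and inspects the result. Because the scaler is fitted on the training split only, the training features end up with mean 0 and standard deviation 1, while the test features are only approximately centered. The names `train_means`, `train_stds`, and `test_means` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the binary problem (Setosa vs. Versicolor) with NumPy arrays
iris = datasets.load_iris()
mask = iris.target != 2
X, y = iris.data[mask], iris.target[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("Train means:", np.round(train_means, 3))
print("Train stds: ", np.round(train_stds, 3))
print("Test means: ", np.round(test_means, 3))
```

Fitting the scaler on the training split alone keeps information about the test set out of the preprocessing step.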
2. Train a Linear SVM Model
Next, we will train an SVM model with a linear kernel using the SVC class from scikit-learn.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Initialize the linear SVM model
model_linear = SVC(kernel='linear', C=1.0, random_state=42)
# Train the model on the training data
model_linear.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred = model_linear.predict(X_test_scaled)
# Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy (Linear Kernel): {accuracy * 100:.2f}%")
Explanation:
- We initialized an SVM model with a linear kernel.
- The regularization parameter C is set to 1.0. Smaller values allow a wider margin at the cost of more training errors, while larger values penalize misclassified training points more heavily.
- We trained the model on the scaled training data and evaluated its accuracy on the test set.
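The effect of C can be made concrete with a short experiment. The sketch below, which assumes the same data preparation as above, trains a linear SVC for a few arbitrarily chosen values of C and reports the geometric margin width, computed as 2 / ||w|| from the fitted `coef_`, together with the number of support vectors. It also checks that, for a linear kernel, the decision function is simply w · x + b. The dictionaries `margins` and `matches` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same preparation as in the article: two classes, 80/20 split, scaling
iris = datasets.load_iris()
mask = iris.target != 2
X, y = iris.data[mask], iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

margins = {}  # C -> margin width 2 / ||w||
matches = {}  # C -> whether w.x + b reproduces decision_function

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel='linear', C=C).fit(X_train_scaled, y_train)

    # For a linear kernel in the binary case, coef_ has shape (1, n_features)
    w = model.coef_[0]
    b = model.intercept_[0]
    margins[C] = 2.0 / np.linalg.norm(w)

    # The decision function of a linear SVM is the affine map w.x + b
    manual_scores = X_test_scaled @ w + b
    matches[C] = np.allclose(manual_scores,
                             model.decision_function(X_test_scaled))

    print(f"C={C:>6}: margin width={margins[C]:.3f}, "
          f"support vectors={model.n_support_.sum()}, "
          f"test accuracy={model.score(X_test_scaled, y_test):.2f}")
```

Smaller values of C should produce a wider margin, typically with more support vectors, which is the trade-off the regularization parameter controls.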
3. Train an SVM Model with an RBF Kernel
Next, we will use an RBF kernel to capture potential non-linear relationships in the data.
# Initialize the SVM model with an RBF kernel
model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
# Train the model on the training data
model_rbf.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred_rbf = model_rbf.predict(X_test_scaled)
# Evaluate the model performance
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)
print(f"Test Accuracy (RBF Kernel): {accuracy_rbf * 100:.2f}%")
Explanation:
- The RBF kernel is a commonly used kernel for non-linear data.
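- The gamma parameter controls how far the influence of a single training example reaches: small values give a smoother decision boundary, large values a more local one. Setting it to 'scale' computes gamma automatically as 1 / (n_features * X.var()), so it depends on both the number of features and the variance of the training data.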
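As a quick check of what gamma='scale' means, the sketch below computes 1 / (n_features * X.var()) by hand and trains a second model with that explicit value. It assumes the same data preparation as above; `gamma_manual` and `same_scores` are illustrative names. Since the training features are standardized, their overall variance is 1, so with four features the value should come out at 0.25.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same preparation as in the article: two classes, 80/20 split, scaling
iris = datasets.load_iris()
mask = iris.target != 2
X, y = iris.data[mask], iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Manual version of gamma='scale': 1 / (n_features * variance of all entries)
n_features = X_train_scaled.shape[1]
gamma_manual = 1.0 / (n_features * X_train_scaled.var())
print(f"Manually computed gamma: {gamma_manual:.4f}")

model_scale = SVC(kernel='rbf', C=1.0, gamma='scale')
model_scale.fit(X_train_scaled, y_train)

model_manual = SVC(kernel='rbf', C=1.0, gamma=gamma_manual)
model_manual.fit(X_train_scaled, y_train)

# Both models should produce the same decision scores on the test set
same_scores = np.allclose(
    model_scale.decision_function(X_test_scaled),
    model_manual.decision_function(X_test_scaled),
)
print("Same decision scores:", same_scores)
```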
4. Model Evaluation
In addition to accuracy, we can use other metrics like precision, recall, and the confusion matrix to evaluate the model's performance.
from sklearn.metrics import classification_report, confusion_matrix
# Generate the classification report for the RBF kernel model
print("Classification Report (RBF Kernel):")
print(classification_report(y_test, y_pred_rbf))
# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_rbf)
print("Confusion Matrix (RBF Kernel):")
print(conf_matrix)
Explanation:
- The classification report provides metrics like precision, recall, F1-score, and support for each class.
- The confusion matrix helps in understanding the number of true positives, true negatives, false positives, and false negatives.
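To connect the confusion matrix to the metrics in the classification report, the sketch below unpacks a binary confusion matrix into its four cells and recomputes precision and recall by hand. The label arrays are hypothetical, chosen to contain a few errors so the arithmetic is visible; they are not results from the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (not output of the SVM models)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels [0, 1], rows are true classes and columns are predictions,
# so ravel() yields the cells in the order: tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Precision: share of predicted positives that are truly positive
precision_manual = tp / (tp + fp)
# Recall: share of true positives that were found
recall_manual = tp / (tp + fn)

print(f"Precision: manual={precision_manual:.3f}, "
      f"sklearn={precision_score(y_true, y_hat):.3f}")
print(f"Recall:    manual={recall_manual:.3f}, "
      f"sklearn={recall_score(y_true, y_hat):.3f}")
```

The same unpacking can be applied to conf_matrix from the RBF model, with class 1 (Versicolor) treated as the positive class.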