K-Nearest Neighbors (KNN) Practical Example in PyTorch
In this article, we will implement K-Nearest Neighbors (KNN) from scratch using PyTorch for a classification task on the Iris dataset. Like TensorFlow, PyTorch has no built-in KNN classifier, so we will compute the distances between data points ourselves and predict each class from the labels of the nearest neighbors.
The Iris dataset is a simple dataset consisting of 150 samples of iris flowers, with three classes (Setosa, Versicolor, Virginica) and four features (sepal length, sepal width, petal length, and petal width). Our goal is to classify the flowers based on these features.
1. Steps in the Example
- Load and preprocess the Iris dataset.
- Implement the Euclidean distance function in PyTorch.
- Implement the KNN classifier using PyTorch.
- Evaluate the model on the test set.
2. Loading and Preprocessing the Dataset
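We will use scikit-learn to load the Iris dataset, split it into training and test sets, and standardize the features. KNN is distance-based, so a feature with a larger numeric range would otherwise dominate the distance; standardizing puts all four features on a comparable scale. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into preprocessing.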
import torch
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert the data to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)
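To make the effect of scaling concrete, here is a minimal sketch on made-up data (the tensors `train` and `query` and the helper `nearest_index` are hypothetical and not part of the Iris pipeline). It assumes one feature in the thousands and another in single digits, and shows that the nearest neighbor can change once both features are standardized with the training mean and population standard deviation, which is what StandardScaler computes.

```python
import torch

# Hypothetical toy data: feature 0 is in the thousands, feature 1 is in single digits
train = torch.tensor([[1000.0, 1.0],
                      [1100.0, 9.0]])
query = torch.tensor([1040.0, 8.5])

def nearest_index(points, q):
    # Index of the row in `points` with the smallest Euclidean distance to q
    return torch.argmin(torch.sqrt(torch.sum((points - q) ** 2, dim=1))).item()

# On raw values, feature 0 dominates the distance
raw_nearest = nearest_index(train, query)

# Standardize with training statistics only (population std, as StandardScaler does)
mean = train.mean(dim=0)
std = train.std(dim=0, unbiased=False)
scaled_nearest = nearest_index((train - mean) / std, (query - mean) / std)

print(raw_nearest, scaled_nearest)  # 0 1
```

On the raw values the query is closest to the first training point; after standardization it is closest to the second, because the small-range feature now contributes on equal terms.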
3. Defining the KNN Algorithm in PyTorch
In KNN, the Euclidean distance between a test point and all training points is calculated, and the K nearest neighbors are identified. We will implement these core parts of the algorithm in PyTorch.
Euclidean Distance Function
To calculate the Euclidean distance between a test point x and a training point y, each with n features, we can use the following formula:
$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
Here’s how we can implement it using PyTorch:
# Define the function to calculate Euclidean distance
def euclidean_distance(X_train, X_test_point):
    # Subtract the test point from each training point (broadcast over rows),
    # square the differences, and sum over the feature dimension
    distances = torch.sum((X_train - X_test_point) ** 2, dim=1)
    return torch.sqrt(distances)
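As a quick sanity check, the sketch below evaluates the function on a hand-checkable case (the 3-4-5 right triangle) and compares it with PyTorch's built-in `torch.cdist` on random data. The demo tensors are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import torch

def euclidean_distance(X_train, X_test_point):
    # Same computation as in the article, repeated so this snippet is self-contained
    return torch.sqrt(torch.sum((X_train - X_test_point) ** 2, dim=1))

# Hand-checkable case: the distance from (0, 0) to (3, 4) is 5
simple = euclidean_distance(torch.tensor([[3.0, 4.0]]), torch.tensor([0.0, 0.0]))
print(simple)  # tensor([5.])

# Compare against torch.cdist on random (made-up) data
torch.manual_seed(0)
X_train_demo = torch.randn(6, 4)
x_query = torch.randn(4)

manual = euclidean_distance(X_train_demo, x_query)
builtin = torch.cdist(x_query.unsqueeze(0), X_train_demo).squeeze(0)
matches = torch.allclose(manual, builtin, atol=1e-5)
print(matches)
```

`torch.cdist` expects two 2D (or batched) tensors, which is why the single query point is given a leading dimension with `unsqueeze(0)`.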
Implementing the KNN Classifier
We will now implement the KNN classifier using PyTorch. For each test point, the steps are:
- Compute the Euclidean distance to all training points.
- Identify the K nearest neighbors.
- Perform a majority vote among the nearest neighbors to predict the class.
# Define the KNN classifier function
def knn(X_train, y_train, X_test, K):
    y_pred = []
    for X_test_point in X_test:
        # Compute distances from the test point to all training points
        distances = euclidean_distance(X_train, X_test_point)
        # Get the indices of the K nearest neighbors
        nearest_neighbors = torch.argsort(distances)[:K]
        # Get the labels of the K nearest neighbors
        nearest_labels = y_train[nearest_neighbors]
        # Perform a majority vote to predict the label
        predicted_label = torch.mode(nearest_labels).values.item()
        y_pred.append(predicted_label)
    return torch.tensor(y_pred)
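The loop above handles one test point at a time. As an alternative sketch, not a replacement for the article's function, the same three steps can be written without a Python loop using `torch.cdist`, `torch.topk`, and `torch.mode`. The name `knn_vectorized` and the toy data are assumptions made for illustration, and predictions may differ from the loop version when neighbor labels are tied.

```python
import torch

def knn_vectorized(X_train, y_train, X_test, K):
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dists = torch.cdist(X_test, X_train)
    # Indices of the K smallest distances in each row
    _, idx = torch.topk(dists, K, dim=1, largest=False)
    # Labels of those neighbors, shape (n_test, K)
    neighbor_labels = y_train[idx]
    # Majority vote per test point
    return torch.mode(neighbor_labels, dim=1).values

# Hypothetical toy data: two well-separated clusters
X_toy = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                      [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = torch.tensor([0, 0, 0, 1, 1, 1])
X_query = torch.tensor([[0.2, 0.3], [5.5, 5.2]])

preds = knn_vectorized(X_toy, y_toy, X_query, K=3)
print(preds)  # tensor([0, 1])
```

The vectorized form holds the full distance matrix in memory, which is fine for a dataset of Iris's size but may need chunking for large test sets.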
4. Training and Testing the Model
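KNN has no separate training step: the training set is simply stored and consulted at prediction time. With the KNN classifier defined, we can now make predictions on the test set and evaluate the accuracy of the model.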
# Set the value of K
K = 5
# Make predictions on the test set
y_pred = knn(X_train, y_train, X_test, K)
# Evaluate the accuracy
accuracy = torch.sum(y_pred == y_test).item() / len(y_test)
print(f"KNN Model Accuracy: {accuracy * 100:.2f}%")
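Accuracy is a single number and does not show which classes get confused with each other. One way to see that is a confusion matrix; the sketch below builds one with `torch.bincount`. The helper name `confusion_matrix` and the demo label tensors are hypothetical. With the article's variables you would pass `y_test` and `y_pred` instead.

```python
import torch

def confusion_matrix(y_true, y_pred, num_classes):
    # Encode each (true, predicted) pair as a single index, then count occurrences
    pair_index = y_true * num_classes + y_pred
    counts = torch.bincount(pair_index, minlength=num_classes ** 2)
    # Rows are true classes, columns are predicted classes
    return counts.reshape(num_classes, num_classes)

# Made-up labels for illustration only
y_true_demo = torch.tensor([0, 0, 1, 1, 2, 2])
y_pred_demo = torch.tensor([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true_demo, y_pred_demo, num_classes=3)
print(cm)
# tensor([[2, 0, 0],
#         [0, 1, 1],
#         [0, 0, 2]])

# Accuracy is the diagonal sum divided by the total count
accuracy_demo = cm.diag().sum().item() / cm.sum().item()
```

In this made-up example one class-1 sample is predicted as class 2, which shows up as the off-diagonal entry in row 1.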
5. Interpreting the Results
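- The accuracy score is the fraction of test samples whose predicted class matches the true class. Setosa is well separated from the other two classes, while Versicolor and Virginica overlap slightly, so you should expect a high accuracy score with most errors, if any, between those two.
- You can experiment with different values of K (such as K=3 or K=7) to observe the effect on the accuracy. A small K follows individual neighbors closely and is more sensitive to noise, while a large K averages over more points and smooths the decision boundary.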
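The experiment with different values of K can be sketched as a simple loop. This snippet repeats the preprocessing so it runs on its own and uses a compact, vectorized predictor (`knn_predict`, a hypothetical name) rather than the loop-based function from the article. The K values tried are an arbitrary choice.

```python
import torch
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def knn_predict(X_train, y_train, X_test, K):
    # Distances, K nearest indices, then majority vote per test point
    dists = torch.cdist(X_test, X_train)
    _, idx = torch.topk(dists, K, dim=1, largest=False)
    return torch.mode(y_train[idx], dim=1).values

# Same split and scaling as in the article
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_tr = torch.tensor(scaler.fit_transform(X_tr), dtype=torch.float32)
X_te = torch.tensor(scaler.transform(X_te), dtype=torch.float32)
y_tr = torch.tensor(y_tr, dtype=torch.long)
y_te = torch.tensor(y_te, dtype=torch.long)

# Record test accuracy for each candidate K
results = {}
for k in (1, 3, 5, 7, 9):
    preds = knn_predict(X_tr, y_tr, X_te, k)
    results[k] = (preds == y_te).float().mean().item()
    print(f"K={k}: accuracy {results[k] * 100:.2f}%")
```

Choosing K by looking at test accuracy makes the reported score optimistic; a separate validation split or cross-validation on the training data is the more careful way to pick it.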
6. Visualizing Decision Boundaries (Optional)
For better intuition, you can visualize the decision boundaries of the KNN classifier. We’ll use only two features (e.g., sepal length and sepal width) to plot a 2D decision boundary.
import matplotlib.pyplot as plt
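def plot_decision_boundary(X_train, y_train, K, step=0.05):
    # X_train is expected to hold only the two features being plotted
    # Create a mesh grid covering the data with a margin of 1 on each side
    x_min, x_max = X_train[:, 0].min().item() - 1, X_train[:, 0].max().item() + 1
    y_min, y_max = X_train[:, 1].min().item() - 1, X_train[:, 1].max().item() + 1
    # A step of 0.01 would give hundreds of thousands of grid points, which is
    # very slow with the per-point loop in knn(); 0.05 keeps the plot responsive
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
    # Flatten the grid and make predictions
    grid_points = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
    Z = knn(X_train[:, :2], y_train, grid_points, K)
    Z = Z.numpy().reshape(xx.shape)
    # Plot the decision boundary and overlay the training points
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X_train[:, 0].numpy(), X_train[:, 1].numpy(), c=y_train.numpy(), s=20, edgecolor='k')
    plt.title(f"KNN Decision Boundary (K={K})")
    plt.show()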
# Plot decision boundary using the first two features of the Iris dataset
plot_decision_boundary(X_train[:, :2], y_train, K=5)
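Predicting every grid point in a Python loop is the slow part of this plot. As a sketch of one way around that, the grid can be classified in chunks with vectorized tensor operations. The function `predict_grid`, its `step` and `chunk` parameters, and the toy data are assumptions for illustration; the returned arrays have the layout that `plt.contourf(xx, yy, Z)` expects once converted with `.numpy()`.

```python
import torch

def predict_grid(X_train_2d, y_train, K, step=0.1, chunk=4096):
    # Grid bounds with a margin of 1 around the training data
    x_min = X_train_2d[:, 0].min().item() - 1
    x_max = X_train_2d[:, 0].max().item() + 1
    y_min = X_train_2d[:, 1].min().item() - 1
    y_max = X_train_2d[:, 1].max().item() + 1
    xs = torch.arange(x_min, x_max, step)
    ys = torch.arange(y_min, y_max, step)
    # Rows index y and columns index x, matching np.meshgrid(xs, ys)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=1)

    preds = []
    # Classify the grid in chunks to bound the size of the distance matrix
    for start in range(0, grid.shape[0], chunk):
        block = grid[start:start + chunk]
        dists = torch.cdist(block, X_train_2d)
        _, idx = torch.topk(dists, K, dim=1, largest=False)
        preds.append(torch.mode(y_train[idx], dim=1).values)
    Z = torch.cat(preds).reshape(xx.shape)
    return xx, yy, Z

# Hypothetical toy data: two well-separated clusters
X_toy = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                      [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = torch.tensor([0, 0, 0, 1, 1, 1])

xx, yy, Z = predict_grid(X_toy, y_toy, K=3, step=0.5)
print(Z.shape)
```

The chunk size trades memory for speed: each chunk allocates a distance matrix of `chunk` rows by the number of training points.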
Summary
In this article, we built a K-Nearest Neighbors (KNN) classifier from scratch using PyTorch for the Iris dataset. We covered:
- Loading and preprocessing the Iris dataset.
- Implementing the Euclidean distance and KNN classifier in PyTorch.
- Running the KNN classifier on the test set and evaluating its accuracy.
- Visualizing the decision boundary (optional).
This hands-on approach in PyTorch allows for a deeper understanding of the KNN algorithm. In the next section, we will explore more advanced topics and comparisons with other machine learning algorithms.