K-Nearest Neighbors (KNN) Practical Example in PyTorch

In this article, we implement K-Nearest Neighbors (KNN) from scratch in PyTorch and use it for a classification task on the Iris dataset. PyTorch, like TensorFlow, does not have a built-in KNN classifier, so we compute the distances between data points ourselves and predict each class from the nearest neighbors.

The Iris dataset is a simple dataset consisting of 150 samples of iris flowers, with three classes (Setosa, Versicolor, Virginica) and four features (sepal length, sepal width, petal length, and petal width). Our goal is to classify the flowers based on these features.

1. Steps in the Example

Load and preprocess the Iris dataset.
Implement the Euclidean distance function in PyTorch.
Implement the KNN classifier using PyTorch.
Evaluate the model on the test set.

2. Loading and Preprocessing the Dataset

We will use scikit-learn to load the Iris dataset, split it into training and test sets, and standardize the features. Standardization matters because KNN relies on distances: without it, a feature with a larger numeric range would dominate the distance and therefore the prediction.

import torch
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert the data to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)
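
To illustrate why standardization matters for a distance-based method, here is a minimal sketch on hypothetical toy data (not the Iris dataset). The values and the helper name `nearest_other` are assumptions made for illustration. Feature 0 spans thousands while feature 1 spans single digits, so before scaling the nearest neighbor is decided almost entirely by feature 0.

```python
import torch

# Hypothetical toy data: feature 0 ranges over thousands, feature 1 over single digits
X = torch.tensor([[5000.0, 5.0],    # index 0: the query point
                  [5200.0, 5.1],    # index 1: close in both features relative to their ranges
                  [5010.0, 9.0],    # index 2: very close in feature 0, far in feature 1
                  [0.0, 0.0],
                  [10000.0, 10.0]])

def nearest_other(points, q_index):
    # Euclidean distance from the query row to every row
    d = torch.sqrt(torch.sum((points - points[q_index]) ** 2, dim=1))
    # Mask out the query itself so it is not returned as its own neighbor
    d[q_index] = float("inf")
    return torch.argmin(d).item()

raw_nearest = nearest_other(X, 0)            # 2: feature 0 dominates the raw distance

# Standardize each column to zero mean and unit variance, as StandardScaler does
X_scaled = (X - X.mean(dim=0)) / X.std(dim=0, unbiased=False)
scaled_nearest = nearest_other(X_scaled, 0)  # 1: both features now contribute

print(raw_nearest, scaled_nearest)
```

After scaling, the neighbor that is close in both features wins, which is usually the behavior you want from KNN.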

3. Defining the KNN Algorithm in PyTorch

In KNN, the Euclidean distance between a test point and all training points is calculated, and the K nearest neighbors are identified. We will implement these core parts of the algorithm in PyTorch.

Euclidean Distance Function

To calculate the Euclidean distance between the test point and each training point, we can use the following formula:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Here’s how we can implement it using PyTorch:

# Define the function to calculate Euclidean distance
def euclidean_distance(X_train, X_test_point):
    # Broadcasting subtracts the test point from every training row;
    # summing the squared differences over the feature dimension gives squared distances
    distances = torch.sum((X_train - X_test_point) ** 2, dim=1)
    return torch.sqrt(distances)
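
As a quick sanity check, the sketch below applies the same distance computation to a few hypothetical points whose distances are easy to verify by hand (a 3-4-5 triangle) and compares the result with PyTorch's built-in `torch.cdist`. The demo tensors are assumptions made for illustration.

```python
import torch

def euclidean_distance(X_train, X_test_point):
    # Broadcasting subtracts the single test point from every training row
    return torch.sqrt(torch.sum((X_train - X_test_point) ** 2, dim=1))

# Hypothetical 2-D points chosen so the distances can be checked by hand
X_demo = torch.tensor([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
point = torch.tensor([0.0, 0.0])

manual = euclidean_distance(X_demo, point)            # tensor([ 0.,  5., 10.])
# torch.cdist expects 2-D inputs, so the point is given a leading batch dimension
builtin = torch.cdist(point.unsqueeze(0), X_demo)[0]

print(manual, builtin)
```

Both approaches should agree up to floating-point error; `torch.cdist` is convenient when distances for many test points are needed at once.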

Implementing the KNN Classifier

We will now implement the KNN classifier using PyTorch. For each test point, the steps are:

Compute the Euclidean distance to all training points.
Identify the K nearest neighbors.
Perform a majority vote among the nearest neighbors to predict the class.

# Define the KNN classifier function
def knn(X_train, y_train, X_test, K):
    y_pred = []
    
    for X_test_point in X_test:
        # Compute distances from the test point to all training points
        distances = euclidean_distance(X_train, X_test_point)
        
        # Get the indices of the K nearest neighbors
        nearest_neighbors = torch.argsort(distances)[:K]
        
        # Get the labels of the K nearest neighbors
        nearest_labels = y_train[nearest_neighbors]
        
        # Perform a majority vote to predict the label
        predicted_label = torch.mode(nearest_labels).values.item()
        y_pred.append(predicted_label)
    
    return torch.tensor(y_pred)
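
The three steps above can also be written without a Python loop. The following is one possible vectorized sketch, not the implementation used in the rest of this article: it assumes `torch.cdist` for the pairwise distances, `torch.topk` with `largest=False` for the K smallest distances, and `torch.mode` along each row for the vote. The function name `knn_vectorized` and the toy data are hypothetical.

```python
import torch

def knn_vectorized(X_train, y_train, X_test, K):
    # Pairwise Euclidean distances, shape (num_test, num_train)
    distances = torch.cdist(X_test, X_train)
    # Indices of the K smallest distances in each row
    _, neighbor_idx = torch.topk(distances, K, dim=1, largest=False)
    # Labels of those neighbors, shape (num_test, K)
    neighbor_labels = y_train[neighbor_idx]
    # Most frequent label in each row
    return torch.mode(neighbor_labels, dim=1).values

# Hypothetical toy data: two well-separated clusters
X_toy = torch.tensor([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                      [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y_toy = torch.tensor([0, 0, 0, 1, 1, 1])
queries = torch.tensor([[0.05, 0.05], [4.9, 5.0]])

preds = knn_vectorized(X_toy, y_toy, queries, K=3)
print(preds)  # tensor([0, 1])
```

The trade-off is memory: the full distance matrix is held at once, which is fine for a dataset the size of Iris but may not be for very large training sets.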

4. Training and Testing the Model

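With the KNN classifier defined, we can now make predictions on the test set and evaluate the accuracy of the model. Note that KNN has no separate training phase: the training data is simply stored and consulted at prediction time.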

# Set the value of K
K = 5

# Make predictions on the test set
y_pred = knn(X_train, y_train, X_test, K)

# Evaluate the accuracy
accuracy = torch.sum(y_pred == y_test).item() / len(y_test)
print(f"KNN Model Accuracy: {accuracy * 100:.2f}%")
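
A single accuracy number does not show which classes get confused with each other. As a hedged sketch, a confusion matrix can be built with `torch.bincount`; the helper name `confusion_matrix` and the demo labels below are assumptions for illustration, not results from the Iris run.

```python
import torch

def confusion_matrix(y_true, y_pred, num_classes):
    # Encode each (true, predicted) pair as one integer, then count occurrences
    flat = y_true * num_classes + y_pred
    counts = torch.bincount(flat, minlength=num_classes ** 2)
    # Rows are true classes, columns are predicted classes
    return counts.reshape(num_classes, num_classes)

# Hypothetical labels: one sample of class 1 is misclassified as class 2
y_true_demo = torch.tensor([0, 0, 1, 1, 2, 2])
y_pred_demo = torch.tensor([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true_demo, y_pred_demo, num_classes=3)
# Correct predictions lie on the diagonal
accuracy_demo = cm.diag().sum().item() / cm.sum().item()

print(cm)
print(f"Accuracy: {accuracy_demo:.2f}")
```

With the tensors from this article, the call would be `confusion_matrix(y_test, y_pred, num_classes=3)`.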

5. Interpreting the Results

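The accuracy is the fraction of test samples whose predicted class matches the true class. Since the Iris dataset is relatively simple, you should expect a high accuracy score.
You can experiment with different values of K (such as K=3 or K=7) to observe the effect on the accuracy. Small values of K make predictions more sensitive to individual noisy points, while large values smooth the decision boundary.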
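
One way to run such an experiment is a simple loop over candidate values of K. The sketch below is self-contained and therefore uses synthetic stand-in data rather than Iris; the compact helper `knn_predict`, the two-cluster data, and the every-fourth-point split are all assumptions made for illustration.

```python
import torch

def knn_predict(X_train, y_train, X_test, K):
    # Same idea as the classifier above, written compactly with torch.cdist
    idx = torch.cdist(X_test, X_train).topk(K, dim=1, largest=False).indices
    return torch.mode(y_train[idx], dim=1).values

# Synthetic stand-in data (an assumption, not the Iris dataset)
torch.manual_seed(0)
X_a = torch.randn(40, 2)          # class 0, centered at the origin
X_b = torch.randn(40, 2) + 3.0    # class 1, shifted by 3 in both features
X_all = torch.cat([X_a, X_b])
y_all = torch.cat([torch.zeros(40, dtype=torch.long),
                   torch.ones(40, dtype=torch.long)])

# Hold out every fourth point for testing
test_mask = torch.arange(80) % 4 == 0
X_tr, y_tr = X_all[~test_mask], y_all[~test_mask]
X_te, y_te = X_all[test_mask], y_all[test_mask]

accuracies = {}
for k in (1, 3, 5, 7):
    preds = knn_predict(X_tr, y_tr, X_te, k)
    accuracies[k] = (preds == y_te).float().mean().item()
    print(f"K={k}: accuracy={accuracies[k]:.2f}")
```

To repeat this on Iris, replace the synthetic tensors with `X_train`, `y_train`, `X_test`, and `y_test` from the preprocessing step.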

6. Visualizing Decision Boundaries (Optional)

For better intuition, you can visualize the decision boundaries of the KNN classifier. We’ll use only the first two features (sepal length and sepal width, both standardized) so that the boundary can be plotted in 2D.

import matplotlib.pyplot as plt

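def plot_decision_boundary(X_train, y_train, K):
    # X_train is expected to contain exactly two feature columns
    # Create a mesh grid that covers the data with a margin of 1 on each side
    x_min, x_max = X_train[:, 0].min().item() - 1, X_train[:, 0].max().item() + 1
    y_min, y_max = X_train[:, 1].min().item() - 1, X_train[:, 1].max().item() + 1
    # A step of 0.05 keeps the grid small enough for the loop-based knn function
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))

    # Flatten the grid and make predictions
    grid_points = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
    Z = knn(X_train, y_train, grid_points, K)
    Z = Z.numpy().reshape(xx.shape)

    # Plot the decision boundary and overlay the training points
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X_train[:, 0].numpy(), X_train[:, 1].numpy(), c=y_train.numpy(), s=20, edgecolor='k')
    plt.xlabel("Sepal length (standardized)")
    plt.ylabel("Sepal width (standardized)")
    plt.title(f"KNN Decision Boundary (K={K})")
    plt.show()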

# Plot decision boundary using the first two features of the Iris dataset
plot_decision_boundary(X_train[:, :2], y_train, K=5)

Summary

In this article, we built a K-Nearest Neighbors (KNN) classifier from scratch using PyTorch for the Iris dataset. We covered:

Loading and preprocessing the Iris dataset.
Implementing the Euclidean distance and KNN classifier in PyTorch.
Training and evaluating the KNN model.
Visualizing the decision boundary (optional).

This hands-on approach in PyTorch allows for a deeper understanding of the KNN algorithm. In the next section, we will explore more advanced topics and comparisons with other machine learning algorithms.
