K-Nearest Neighbors (KNN) Practical Example in TensorFlow
In this article, we will implement K-Nearest Neighbors (KNN) from scratch using TensorFlow for a classification task on the Iris dataset. Unlike scikit-learn, which ships KNN as part of the library, TensorFlow has no built-in KNN classifier, so we'll compute the distances and predict the classes manually.
The Iris dataset consists of 150 samples of iris flowers, each described by four features (sepal length, sepal width, petal length, and petal width) and classified into three species: Setosa, Versicolor, and Virginica.
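If you'd like to peek at the raw dataset before any preprocessing, scikit-learn exposes it directly. This quick inspection is optional and just confirms the shapes and names described above:
import sklearn.datasets

# Optional: inspect the raw dataset before any preprocessing
iris = sklearn.datasets.load_iris()
print(iris.data.shape)      # (150, 4) -- 150 samples, 4 features
print(iris.feature_names)   # sepal/petal length and width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']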
1. Steps in the Example
- Load and preprocess the Iris dataset.
- Implement a function to compute the Euclidean distance between points.
- Implement the KNN classification algorithm using TensorFlow.
- Evaluate the model on the test set.
2. Loading and Preprocessing the Dataset
First, we need to load the Iris dataset and split it into training and testing sets. We'll also normalize the features to ensure that the distance calculations are not biased by features with larger scales.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
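As a quick optional sanity check, the scaled training features should now have approximately zero mean and unit variance per feature, which is exactly what prevents any single feature from dominating the distance calculation:
# Verify the standardization: per-feature mean ~0 and std ~1 on the training set
print(X_train.shape, X_test.shape)    # (120, 4) (30, 4)
print(X_train.mean(axis=0).round(3))  # close to [0. 0. 0. 0.]
print(X_train.std(axis=0).round(3))   # close to [1. 1. 1. 1.]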
3. Defining the KNN Algorithm in TensorFlow
We’ll implement the KNN algorithm from scratch using TensorFlow. The core of KNN is calculating the distance between the new data point and all points in the training set. We will use Euclidean distance for this purpose and find the K nearest neighbors.
Euclidean Distance Function
We can calculate the Euclidean distance between two points x and y using the following formula:
d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
where x and y are feature vectors with n features (n = 4 for the Iris dataset).
Here’s how we can implement the distance calculation in TensorFlow:
# Define the function to calculate Euclidean distance
def euclidean_distance(X_train, X_test_point):
    # Sum of squared differences between the test point and every training point
    distances = tf.reduce_sum(tf.square(X_train - X_test_point), axis=1)
    # Take the square root to get the Euclidean distance to each training point
    return tf.sqrt(distances)
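As an aside, the same distances can be computed for many test points at once via broadcasting, avoiding a Python loop. This optional sketch produces the same result as calling euclidean_distance once per test point:
import tensorflow as tf

def pairwise_euclidean_distances(X_train, X_test):
    # Shapes: X_test (m, d) -> (m, 1, d); X_train (n, d) -> (1, n, d)
    diff = tf.expand_dims(X_test, 1) - tf.expand_dims(X_train, 0)
    # Result has shape (m, n): distance from each test point to each training point
    return tf.sqrt(tf.reduce_sum(tf.square(diff), axis=2))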
Implementing the KNN Classifier
Now we can implement the KNN classifier. For each test point, we will:
- Compute the Euclidean distance to all training points.
- Find the K nearest neighbors.
- Perform a majority vote among the nearest neighbors to predict the class.
# Define the KNN classifier function
def knn(X_train, y_train, X_test, K):
    y_pred = []
    for X_test_point in X_test:
        # Compute distances from the test point to all training points
        distances = euclidean_distance(X_train, X_test_point)
        # Get the indices of the K nearest neighbors
        nearest_neighbors = tf.argsort(distances)[:K]
        # Get the labels of the K nearest neighbors
        nearest_labels = tf.gather(y_train, nearest_neighbors)
        # Perform a majority vote: pick the label with the highest count
        unique_labels, _, counts = tf.unique_with_counts(nearest_labels)
        predicted_label = tf.gather(unique_labels, tf.argmax(counts))
        y_pred.append(predicted_label.numpy())
    return np.array(y_pred)
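For larger datasets, the per-point Python loop above becomes a bottleneck. Here is an optional, fully vectorized sketch that classifies all test points in one pass; it assumes the labels are small non-negative integers (true for Iris) so that tf.math.bincount can tally the votes per row (the axis argument requires TensorFlow 2.4 or newer):
def knn_vectorized(X_train, y_train, X_test, K):
    # Pairwise squared distances, shape (num_test, num_train); the square
    # root is unnecessary for ranking neighbors because sqrt is monotonic
    diff = tf.expand_dims(X_test, 1) - tf.expand_dims(X_train, 0)
    sq_distances = tf.reduce_sum(tf.square(diff), axis=2)
    # Indices of the K smallest distances per test point
    # (top_k on the negated distances is a standard "bottom-k" trick)
    _, nearest = tf.math.top_k(-sq_distances, k=K)
    # Neighbor labels, shape (num_test, K)
    nearest_labels = tf.gather(y_train, nearest)
    # Count label occurrences per row and take the most frequent one
    counts = tf.math.bincount(tf.cast(nearest_labels, tf.int32), axis=-1)
    return tf.argmax(counts, axis=1).numpy()
Note that ties between equally frequent labels are broken here in favor of the smaller class index, which is one of several reasonable conventions.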
4. Training and Testing the Model
Now that we have the KNN classifier function defined, we can use it to make predictions on the test set and evaluate the model’s performance.
# Set the value of K
K = 5
# Make predictions on the test set
y_pred = knn(X_train, y_train, X_test, K)
# Evaluate the accuracy
accuracy = np.mean(y_pred == y_test)
print(f"KNN Model Accuracy: {accuracy * 100:.2f}%")
5. Interpreting the Results
- The accuracy score provides a measure of how well the model is performing on the test set. Since the Iris dataset is fairly simple, we should expect a high accuracy score.
- Feel free to experiment with different values of K (such as K=3 or K=7) to see how they affect the model's performance; a short sweep is sketched right after this list.
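A minimal way to run that experiment is to loop over a few candidate values of K and compare test accuracies, reusing the knn function defined above:
# Compare accuracy across a few values of K
for k in (1, 3, 5, 7, 9):
    preds = knn(X_train, y_train, X_test, k)
    print(f"K={k}: accuracy {np.mean(preds == y_test) * 100:.2f}%")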
6. Visualizing the Decision Boundary (Optional)
For visualization purposes, we can reduce the features of the dataset to just two (e.g., sepal length and sepal width) and plot the decision boundary of the KNN classifier. This helps illustrate how the algorithm divides the feature space.
import matplotlib.pyplot as plt
def plot_decision_boundary(X_train, y_train, K):
    # Create a mesh grid over the two features; a 0.05 step (rather than a
    # finer one) keeps the loop-based knn function tractable on the grid
    x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
    y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
    # Flatten the grid and classify every grid point
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    Z = knn(X_train, y_train, grid_points, K)
    Z = Z.reshape(xx.shape)
    # Plot the decision regions and overlay the training points
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=20, edgecolor='k')
    plt.title(f"KNN Decision Boundary (K={K})")
    plt.show()
# Plot decision boundary using the first two features of the Iris dataset
plot_decision_boundary(X_train[:, :2], y_train, K=5)
Summary
In this article, we built a K-Nearest Neighbors (KNN) classifier from scratch using TensorFlow for the Iris dataset. We walked through:
- Loading and preprocessing the data.
- Implementing the Euclidean distance function and the KNN algorithm.
- Training the KNN model and making predictions.
- Evaluating the model’s performance.
Although TensorFlow doesn't have a built-in KNN function, implementing KNN from scratch gives you a deeper understanding of how the algorithm works. In the next section, we will explore how to implement KNN using PyTorch.