K-Nearest Neighbors (KNN) Practical Example in scikit-learn
In this article, we will walk through a K-Nearest Neighbors (KNN) example using the popular scikit-learn library. We’ll be using the Iris dataset to demonstrate how KNN can be applied to a classification task.
The Iris dataset contains 150 samples of iris flowers, with three classes (Setosa, Versicolor, and Virginica) and four features (sepal length, sepal width, petal length, and petal width). Our goal is to classify the flowers into one of the three species based on the input features.
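As a quick sanity check, you can confirm these dimensions and names directly on the loaded dataset object (a minimal sketch; the full loading code appears in section 2):
# Quick look at the dataset's shape and label/feature names
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)  # sepal/petal length and width, in cm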
1. Steps in the Example
- Load the Iris dataset.
- Split the data into training and testing sets.
- Scale the features to ensure uniform distance calculations.
- Train the KNN model using the training data.
- Evaluate the model on the test data.
2. Loading and Splitting the Dataset
We first need to import the necessary libraries and load the Iris dataset.
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
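With only 150 samples, a purely random split can leave the class proportions slightly unbalanced between the two sets. If you want each split to preserve the 50/50/50 class ratio, train_test_split accepts a stratify argument; a minimal variation:
# Optional: stratified split that preserves the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)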
3. Feature Scaling
Since KNN is a distance-based algorithm, it’s important to scale the features so that they contribute equally to the distance calculation. We’ll use StandardScaler to standardize the features by removing the mean and scaling to unit variance. Note that the scaler is fit on the training data only and then applied to both sets, so no information from the test set leaks into preprocessing.
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data and transform both training and test sets
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
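You can verify the effect of the transformation: after standardization, each training feature should have a mean of approximately 0 and a standard deviation of approximately 1.
# Sanity check: per-feature means ~0 and standard deviations ~1 on the training set
print(X_train.mean(axis=0).round(2))  # close to [0. 0. 0. 0.]
print(X_train.std(axis=0).round(2))   # close to [1. 1. 1. 1.]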
4. Training the KNN Model
Now that the data is preprocessed, we can initialize and train the KNeighborsClassifier from scikit-learn. We’ll choose K=5, meaning that each prediction is a majority vote among the 5 nearest training points.
# Initialize the KNeighborsClassifier with K=5
knn = KNeighborsClassifier(n_neighbors=5)
# Train the KNN model using the training data
knn.fit(X_train, y_train)
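K is not the classifier's only knob. KNeighborsClassifier also lets you weight neighbors by distance and choose the distance metric. The variant below is a sketch of those options, not a recommendation; the defaults (uniform weights, Euclidean distance) are a sensible starting point.
# Optional: weight closer neighbors more heavily and make the metric explicit
knn_weighted = KNeighborsClassifier(
    n_neighbors=5,
    weights="distance",  # closer neighbors get a larger vote
    metric="minkowski",  # with p=2, this is the Euclidean distance (the default)
    p=2,
)
knn_weighted.fit(X_train, y_train)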
5. Evaluating the Model
Once the model is trained, we can make predictions on the test data and evaluate the accuracy of the model using accuracy_score.
# Make predictions on the test data
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Model Accuracy: {accuracy * 100:.2f}%")
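Accuracy alone can hide per-class behavior. For a class-by-class breakdown, scikit-learn's classification_report and confusion_matrix are a natural next step:
# Optional: per-class precision/recall and the confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))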
6. Interpreting the Results
- The accuracy score indicates how well the model generalizes to unseen data. The Iris dataset is small and its classes are well separated, so you should expect a high score.
- You can experiment with different values of K to see how it affects the model's performance. For example, try K=3 or K=7 and compare the resulting accuracies, as in the sketch below.
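A minimal sketch of that experiment, looping over a few values of K and reporting the test accuracy for each:
# Compare test accuracy for a few values of K
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"K={k}: {acc * 100:.2f}%")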
7. Visualizing Decision Boundaries (Optional)
For further exploration, you can visualize the decision boundaries of the KNN classifier. Plotting a boundary requires restricting the model to two features, which may not capture the full structure of the dataset. Below, we train a separate classifier on the first two (scaled) features, sepal length and sepal width, and plot its decision regions.
import matplotlib.pyplot as plt
import numpy as np

# Define a function to visualize decision boundaries
def plot_decision_boundaries(X, y, model, title):
    # Create a mesh grid covering the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))
    # Predict the class for each point in the mesh
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the decision regions
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.Paired)
    # Scatter plot of the data points, using the same colormap as the regions
    plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k', cmap=plt.cm.Paired)
    plt.title(title)
    plt.show()

# Visualize the decision boundaries using two features: sepal length and sepal width
X_vis = X_train[:, :2]  # Use only the first two (scaled) features for visualization
knn_vis = KNeighborsClassifier(n_neighbors=5)
knn_vis.fit(X_vis, y_train)
plot_decision_boundaries(X_vis, y_train, knn_vis, title="KNN Decision Boundary (K=5)")
Summary
In this article, we demonstrated how to implement K-Nearest Neighbors (KNN) using scikit-learn for a classification task on the Iris dataset. We covered:
- Loading and splitting the dataset.
- Scaling the features to ensure uniform distance calculations.
- Training the KNN model using the training data.
- Evaluating the model’s performance using the accuracy score.
Feel free to experiment with different values of K and explore other datasets to further your understanding of how KNN works in practice. In the next section, we will explore how to implement KNN using TensorFlow.