Implementation of Agglomerative Hierarchical Clustering in Scikit-Learn
Agglomerative Hierarchical Clustering is a popular clustering method that builds a hierarchy of clusters by iteratively merging the closest pairs of clusters. In this article, we’ll guide you through implementing Agglomerative Hierarchical Clustering using Scikit-Learn, a powerful machine learning library in Python.
1. Introduction to Scikit-Learn's Agglomerative Clustering
Scikit-Learn provides an easy-to-use implementation of Agglomerative Clustering through the AgglomerativeClustering class. This class allows you to specify the number of clusters, the linkage criterion, and the distance metric, making it highly flexible and suitable for various types of data.
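As a minimal sketch of what instantiation looks like (assuming scikit-learn 1.2 or newer, where the metric parameter replaced the older affinity; the specific parameter values here are purely illustrative):
from sklearn.cluster import AgglomerativeClustering
# n_clusters: number of clusters to find
# linkage: merge criterion ('ward', 'complete', 'average', or 'single')
# metric: distance between samples; note that 'ward' linkage requires 'euclidean'
model = AgglomerativeClustering(n_clusters=3, linkage='average', metric='manhattan')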
2. Step-by-Step Guide to Implementing Agglomerative Clustering
2.1 Importing Necessary Libraries
First, let's import the necessary libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage
2.2 Generating a Synthetic Dataset
For this example, we'll use a synthetic dataset generated by the make_blobs function, which creates isotropic Gaussian blobs for clustering.
# Generate synthetic data
X, y = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)
# Plot the data
plt.scatter(X[:, 0], X[:, 1], c='blue', marker='o', edgecolor='k')
plt.title("Generated Data")
plt.show()
2.3 Performing Agglomerative Clustering
Now, we'll perform Agglomerative Clustering on the dataset. We'll use the ward linkage method, which at each step merges the pair of clusters that yields the smallest increase in total within-cluster variance.
# Initialize the Agglomerative Clustering model
# (the old affinity parameter was removed in scikit-learn 1.4; ward linkage
# uses the Euclidean metric, which is the default, so no metric is needed here)
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
# Fit the model to the data and obtain the cluster labels
y_pred = agg_clustering.fit_predict(X)
# Plot the clustering result
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', marker='o', edgecolor='k')
plt.title("Agglomerative Clustering with Ward's Linkage")
plt.show()
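The fitted model also stores the assignments in its labels_ attribute (identical to the array returned by fit_predict). As a quick sanity check, you can count how many samples fall in each cluster:
# Count the samples assigned to each cluster; with three well-separated
# blobs of 50 points each, every cluster should contain roughly 50 samples
print(np.bincount(agg_clustering.labels_))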
2.4 Visualizing the Dendrogram
To better understand the hierarchical structure of the clusters, we can visualize a dendrogram. Although Scikit-Learn does not directly support dendrogram creation with its AgglomerativeClustering class, we can use SciPy's dendrogram and linkage functions to create one.
# Create a linkage matrix using SciPy's linkage function
linked = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram")
plt.show()
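Alternatively, you can build the dendrogram from the fitted Scikit-Learn model itself. If the model is created with compute_distances=True, its children_ and distances_ attributes contain everything needed to assemble a SciPy-style linkage matrix. The helper below is a sketch of this approach (the name plot_dendrogram and the leaf-counting logic are ours, adapted from an idea in the scikit-learn documentation):
def plot_dendrogram(model, **kwargs):
    # Count the samples under each internal node of the merge tree
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node (an original sample)
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    # Columns: child 1, child 2, merge distance, samples in merged cluster
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    dendrogram(linkage_matrix, **kwargs)
# compute_distances=True makes the model record the merge distances
model = AgglomerativeClustering(n_clusters=3, linkage='ward', compute_distances=True)
model.fit(X)
plot_dendrogram(model)
plt.title("Dendrogram from the fitted model")
plt.show()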
2.5 Exploring Different Linkage Criteria
You can explore different linkage criteria, such as single, complete, and average, by adjusting the linkage parameter of the AgglomerativeClustering model.
# Complete linkage example (metric replaces the removed affinity parameter)
agg_clustering_complete = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='complete')
y_pred_complete = agg_clustering_complete.fit_predict(X)
# Plot the clustering result for complete linkage
plt.scatter(X[:, 0], X[:, 1], c=y_pred_complete, cmap='viridis', marker='o', edgecolor='k')
plt.title("Agglomerative Clustering with Complete Linkage")
plt.show()
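To compare several criteria at a glance, a small loop over the linkage options works well. This is a minimal sketch reusing the same dataset:
# Fit and plot single, complete, and average linkage side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, method in zip(axes, ['single', 'complete', 'average']):
    labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k')
    ax.set_title(f"{method} linkage")
plt.show()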
2.6 Choosing the Number of Clusters
To choose the number of clusters, you can visually inspect the dendrogram and decide where to cut it: every merge below the cut height is kept, so the number of vertical lines the cut crosses equals the number of clusters. A common heuristic is to cut through the tallest vertical span that no horizontal merge line crosses, since a large gap between successive merge heights indicates well-separated clusters.
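Scikit-Learn can also apply such a cut directly: pass n_clusters=None together with a distance_threshold, and merging stops once the linkage distance exceeds the threshold. The threshold value below is an assumption picked by eye for this particular dataset; read yours off your own dendrogram:
# Cut the hierarchy at a chosen height instead of fixing the cluster count
# (distance_threshold=10 is a guess for this dataset, not a general default)
agg_cut = AgglomerativeClustering(n_clusters=None, distance_threshold=10, linkage='ward')
labels_cut = agg_cut.fit_predict(X)
print("Number of clusters found:", agg_cut.n_clusters_)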
3. Conclusion
Agglomerative Hierarchical Clustering in Scikit-Learn is a versatile and powerful tool for grouping data into clusters. By experimenting with different linkage criteria and distance metrics, you can tailor the algorithm to best suit your specific dataset and clustering goals.
In this article, we've covered the essential steps to implement Agglomerative Clustering using Scikit-Learn. You can now apply these techniques to your own datasets to uncover hidden patterns and structures.
Next, we’ll explore how to implement Agglomerative Hierarchical Clustering using PyTorch.