Implementation of Agglomerative Hierarchical Clustering in Scikit-Learn

Agglomerative Hierarchical Clustering builds a hierarchy of clusters from the bottom up: each sample starts as its own cluster, and the two closest clusters are merged repeatedly until a single cluster (or the requested number of clusters) remains. In this article, we’ll guide you through implementing Agglomerative Hierarchical Clustering using Scikit-Learn, a widely used machine learning library for Python.

1. Introduction to Scikit-Learn's Agglomerative Clustering

Scikit-Learn implements Agglomerative Clustering in the AgglomerativeClustering class. Its main parameters are n_clusters (the number of clusters to find), linkage (the criterion used to measure the distance between two clusters), and the distance metric between samples (the metric parameter in current releases; older releases called it affinity).

2. Step-by-Step Guide to Implementing Agglomerative Clustering

2.1 Importing Necessary Libraries

First, let's import the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

2.2 Generating a Synthetic Dataset

For this example, we'll use a synthetic dataset generated by the make_blobs function, which creates Gaussian blobs for clustering.

# Generate synthetic data
X, y = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# Plot the data
plt.scatter(X[:, 0], X[:, 1], c='blue', marker='o', edgecolor='k')
plt.title("Generated Data")
plt.show()

2.3 Performing Agglomerative Clustering

Now, we'll perform Agglomerative Clustering on the dataset. We’ll use the ward linkage method, which at each step merges the pair of clusters that produces the smallest increase in total within-cluster variance.
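To make Ward's criterion concrete, here is a minimal sketch of the quantity it minimizes at each merge. It assumes the standard formulation of the merge cost, |A||B| / (|A| + |B|) times the squared distance between the two centroids, and checks it against a direct computation on random data. The helper names sse and ward_cost are illustrative and not part of Scikit-Learn.

```python
import numpy as np


def sse(points):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return float(((points - points.mean(axis=0)) ** 2).sum())


def ward_cost(a, b):
    """Increase in total within-cluster SSE caused by merging clusters a and b."""
    na, nb = len(a), len(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    return na * nb / (na + nb) * float(diff @ diff)


# Two small illustrative clusters (random data, not the article's dataset)
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, size=(20, 2))
b = rng.normal(loc=3.0, size=(30, 2))

# Direct computation: SSE after the merge minus SSE before the merge
direct = sse(np.vstack([a, b])) - sse(a) - sse(b)

print(f"Ward merge cost:     {ward_cost(a, b):.6f}")
print(f"Direct SSE increase: {direct:.6f}")
```

The two printed values should agree up to floating-point error, which is why Ward linkage can pick the cheapest merge from cluster sizes and centroids alone.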

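# Initialize the Agglomerative Clustering model.
# Ward linkage requires Euclidean distance, which is the default metric, so it
# is not passed explicitly (the old `affinity` argument was renamed to `metric`
# in scikit-learn 1.2 and removed in 1.4).
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')

# Fit the model to the data and get a cluster label for each sample
y_pred = agg_clustering.fit_predict(X)

# Plot the clustering result
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', marker='o', edgecolor='k')
plt.title("Agglomerative Clustering with Ward's Linkage")
plt.show()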
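Because the data here is synthetic, the result can also be checked numerically rather than only by eye. The sketch below is an optional sanity check, not a required step: it prints the size of each cluster and compares the predicted labels with the labels returned by make_blobs using the adjusted Rand index. On real data there are usually no true labels, so this comparison is an assumption specific to this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same synthetic dataset as in the article
X, y = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Number of samples assigned to each cluster
cluster_sizes = np.bincount(labels)

# Agreement with the generating labels; 1.0 means identical partitions
# (the index ignores how the cluster ids happen to be numbered)
ari = adjusted_rand_score(y, labels)

print("Cluster sizes:", cluster_sizes)
print(f"Adjusted Rand index vs. generated labels: {ari:.3f}")
```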

2.4 Visualizing the Dendrogram

To better understand the hierarchical structure of the clusters, we can visualize a dendrogram. Although Scikit-Learn does not directly support dendrogram creation with its AgglomerativeClustering class, we can use SciPy's dendrogram and linkage functions to create one.

# Create a linkage matrix using SciPy's linkage function
linked = linkage(X, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram (Ward's Linkage)")
plt.show()
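If you would rather draw the dendrogram from a fitted Scikit-Learn model than recompute the hierarchy with SciPy, one possible approach is to rebuild a SciPy-style linkage matrix from the model's children_ and distances_ attributes. The sketch below assumes the model is fitted with n_clusters=None and distance_threshold=0 so that the full tree and its merge distances are computed. The helper name linkage_matrix_from_model is illustrative, not a Scikit-Learn function.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs


def linkage_matrix_from_model(model):
    """Build a SciPy-style linkage matrix from a fitted AgglomerativeClustering."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node: a single original sample
            else:
                # internal node: reuse the size of the earlier merge
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    # Columns: left child, right child, merge distance, samples in new cluster
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)


X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# distance_threshold=0 with n_clusters=None builds the full tree
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0, linkage="ward")
model.fit(X)

linkage_matrix = linkage_matrix_from_model(model)

# no_plot=True only computes the layout; drop it to draw with matplotlib
tree = dendrogram(linkage_matrix, no_plot=True)
print("Linkage matrix shape:", linkage_matrix.shape)
```

Passing linkage_matrix to dendrogram without no_plot=True, followed by plt.show(), draws the tree in the same way as the SciPy version above.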

2.5 Exploring Different Linkage Criteria

You can explore different linkage criteria, such as single, complete, and average, by adjusting the linkage parameter in the AgglomerativeClustering model.

# Complete linkage example.
# The distance defaults to Euclidean; to set it explicitly, use `metric=`
# (scikit-learn >= 1.2) instead of the removed `affinity=` argument.
agg_clustering_complete = AgglomerativeClustering(n_clusters=3, linkage='complete')
y_pred_complete = agg_clustering_complete.fit_predict(X)

# Plot the clustering result for complete linkage
plt.scatter(X[:, 0], X[:, 1], c=y_pred_complete, cmap='viridis', marker='o', edgecolor='k')
plt.title("Agglomerative Clustering with Complete Linkage")
plt.show()
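To compare several criteria at once, a short loop is usually enough. The sketch below fits one model per linkage option on the same data and records the cluster sizes, which gives a quick first impression of how the criteria differ. The results dictionary is only an illustrative way to collect the output.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Same synthetic dataset as in the article
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

results = {}
for method in ("ward", "complete", "average", "single"):
    # The default Euclidean distance is used for every criterion
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X)
    results[method] = np.bincount(labels)  # samples per cluster
    print(f"{method:>8}: cluster sizes = {results[method]}")
```

On well-separated blobs like these, the criteria tend to agree; differences usually show up on elongated, noisy, or unevenly sized clusters.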

2.6 Choosing the Number of Clusters

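To choose the number of clusters, you can visually inspect the dendrogram and decide where to cut it. A horizontal cut at a given height crosses one vertical line per cluster, so the height of the cut determines the number of clusters. Long vertical lines correspond to merges between well-separated groups, so cutting through the longest ones is a common heuristic.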
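Cutting the dendrogram can also be done in code. The sketch below picks a cut height halfway between the last two merges that should stay separate and applies it in two ways: with SciPy's fcluster on the linkage matrix, and with the distance_threshold parameter of AgglomerativeClustering. The choice of three clusters and the midpoint rule for the threshold are assumptions made for this example, not a general recipe.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Same synthetic dataset as in the article
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

linked = linkage(X, method="ward")

# Column 2 holds the merge heights; with Ward linkage they are non-decreasing
heights = linked[:, 2]

# heights[-3] is the merge that leaves 3 clusters, heights[-2] the one that
# leaves 2, so any cut strictly between them yields 3 clusters
threshold = (heights[-3] + heights[-2]) / 2

# Option 1: cut the SciPy linkage matrix at that height
labels_scipy = fcluster(linked, t=threshold, criterion="distance")
n_scipy = len(np.unique(labels_scipy))

# Option 2: let Scikit-Learn stop merging at the same height
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=threshold, linkage="ward"
).fit(X)
n_sklearn = model.n_clusters_

print(f"Cut height: {threshold:.2f}")
print("Clusters from fcluster:", n_scipy)
print("Clusters from AgglomerativeClustering:", n_sklearn)
```

If the number of clusters is already known, fcluster(linked, t=3, criterion="maxclust") gives the same kind of flat clustering without choosing a height by hand.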

3. Conclusion

Agglomerative Hierarchical Clustering in Scikit-Learn is a versatile and powerful tool for grouping data into clusters. By experimenting with different linkage criteria and distance metrics, you can tailor the algorithm to best suit your specific dataset and clustering goals.

In this article, we've covered the essential steps to implement Agglomerative Clustering using Scikit-Learn. You can now apply these techniques to your own datasets to uncover hidden patterns and structures.

Next, we’ll explore how to implement Agglomerative Hierarchical Clustering using PyTorch.