Theory of Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is a powerful and intuitive method of clustering that builds a hierarchy of clusters by iteratively merging the closest pairs of clusters. This bottom-up approach is useful for identifying structures within data that may not be apparent with other clustering algorithms. In this article, we’ll explore the mathematical foundation, different types of linkage criteria, and a detailed example of how the algorithm works.

1. Mathematical Foundations

1.1 Distance Measures

The core of Agglomerative Hierarchical Clustering lies in the distance or dissimilarity between data points. Commonly used distance measures include:

Euclidean Distance: The straight-line distance between two points in Euclidean space.

d(x_i, x_j) = \sqrt{\sum_{k=1}^p (x_{ik} - x_{jk})^2}

Manhattan Distance: The sum of the absolute differences of their coordinates.

d(x_i, x_j) = \sum_{k=1}^p |x_{ik} - x_{jk}|

Cosine Similarity: A measure of similarity between two vectors.

\text{Cosine}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \|x_j\|}

1.2 Linkage Criteria

The choice of linkage criterion determines how the distance between clusters is calculated during the merging process. Common linkage methods include:

1.2.1 Single Linkage

Single linkage, also known as minimum linkage, defines the distance between two clusters as the minimum distance between any single point in the first cluster and any single point in the second cluster.

D(A, B) = \min_{x \in A, y \in B} d(x, y)

This method is prone to the chaining effect, where clusters may be elongated and less compact.

1.2.2 Complete Linkage

Complete linkage, or maximum linkage, defines the distance between two clusters as the maximum distance between any single point in the first cluster and any single point in the second cluster.

D(A, B) = \max_{x \in A, y \in B} d(x, y)

This method tends to create more compact clusters but may be sensitive to outliers.

1.2.3 Average Linkage

Average linkage, or mean linkage, defines the distance between two clusters as the average distance between all pairs of points, where one point is from the first cluster and the other is from the second cluster.

D(A, B) = \frac{1}{|A| \cdot |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)

This method strikes a balance between single and complete linkage and is often used in practice.

1.2.4 Ward’s Method

Ward’s method minimizes the variance within each cluster. The distance between two clusters is defined as the increase in the sum of squared deviations from the mean when the clusters are merged.

D(A, B) = \frac{|A| + |B|}{T} \cdot \| \mu_A - \mu_B \|^2

Where:

$|A|$ and $|B|$ are the sizes of clusters $A$ and $B$ ,
$T$ is the total number of data points,
$\mu_A$ and $\mu_B$ are the centroids of clusters $A$ and $B$ .

Ward’s method tends to produce clusters of approximately equal size.

2. The Algorithm in Detail

2.1 Initialization

Each data point starts as its own cluster. If you have $n$ data points, you initially have $n$ clusters.

2.2 Iterative Merging

At each step, the algorithm merges the two closest clusters based on the chosen linkage criterion. This process is repeated until only one cluster remains or until a stopping criterion (such as a predefined number of clusters) is met.

2.3 Dendrogram

The result of the Agglomerative Hierarchical Clustering process is a dendrogram, a tree-like diagram that shows the arrangement of the clusters produced by the algorithm. The height of each node in the dendrogram represents the distance or dissimilarity between the clusters being merged.

2.4 Cutting the Dendrogram

The final clusters can be obtained by “cutting” the dendrogram at a specific height. This height represents the distance threshold for clustering, allowing you to define the number of clusters based on the level at which you cut the dendrogram.

3. Practical Example

3.1 Example Data

Consider a simple dataset with six 2D points:

D = \{(1, 2), (2, 3), (6, 8), (7, 8), (8, 9), (10, 10)\}

3.2 Step-by-Step Clustering Process

Initialization: Start with each point as its own cluster:
$\{(1, 2)\}, \{(2, 3)\}, \{(6, 8)\}, \{(7, 8)\}, \{(8, 9)\}, \{(10, 10)\}$
First Merge (Single Linkage Example):
- Calculate the pairwise distances between clusters.
- Merge the two closest clusters:
  $D(\{(1, 2)\}, \{(2, 3)\}) = \sqrt{(1-2)^2 + (2-3)^2} = \sqrt{2} \approx 1.41$
  The closest pair is $\{(1, 2)\}$ and $\{(2, 3)\}$ .
- New clusters:
  $\{(1, 2), (2, 3)\}, \{(6, 8)\}, \{(7, 8)\}, \{(8, 9)\}, \{(10, 10)\}$
Subsequent Merges:
- Repeat the process for the remaining clusters. For example:
  $D(\{(6, 8)\}, \{(7, 8)\}) = \sqrt{(6-7)^2 + (8-8)^2} = 1$
  Merge to form:
  $\{(1, 2), (2, 3)\}, \{(6, 8), (7, 8)\}, \{(8, 9)\}, \{(10, 10)\}$
- Continue merging until all points are in a single cluster or the desired number of clusters is reached.

3.3 Dendrogram

This process is visualized as a dendrogram, which shows how the clusters are merged at each stage and helps to determine the optimal number of clusters.

4. Choosing the Right Linkage Criterion

The choice of linkage criterion can significantly affect the clustering results. Here’s a summary of when to use each:

Single Linkage: Good for chaining effects but can produce elongated clusters.
Complete Linkage: Produces more compact clusters, sensitive to outliers.
Average Linkage: A balance between single and complete linkage, suitable for many applications.
Ward’s Method: Ideal for minimizing variance within clusters, producing clusters of similar size.

5. Conclusion

Agglomerative Hierarchical Clustering is a flexible and powerful clustering method, particularly useful when the relationships between data points are important. Its ability to create a hierarchy of clusters makes it unique among clustering algorithms. However, the choice of linkage criteria and distance measures can significantly impact the results, so these decisions should be made carefully based on the specific characteristics of the dataset.

1. Mathematical Foundations​

1.1 Distance Measures​

1.2 Linkage Criteria​

1.2.1 Single Linkage​

1.2.2 Complete Linkage​

1.2.3 Average Linkage​

1.2.4 Ward’s Method​

2. The Algorithm in Detail​

2.1 Initialization​

2.2 Iterative Merging​

2.3 Dendrogram​

2.4 Cutting the Dendrogram​

3. Practical Example​

3.1 Example Data​

3.2 Step-by-Step Clustering Process​

3.3 Dendrogram​

4. Choosing the Right Linkage Criterion​

5. Conclusion​