DBSCAN Theory
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a fundamental clustering algorithm in unsupervised machine learning. It identifies clusters based on the density of points in the data space, making it particularly effective for detecting clusters of arbitrary shapes and sizes, as well as for handling noise and outliers. In this article, we delve into the theoretical aspects of DBSCAN, including its mathematical formulation, algorithmic steps, and key properties.
1. Mathematical Foundations of DBSCAN
DBSCAN is rooted in the concept of density, where clusters are defined as regions of high point density separated by regions of lower point density. The algorithm uses two critical parameters:
- Epsilon (ε): The maximum distance between two points for them to be considered part of the same neighborhood.
- MinPts: The minimum number of points required within a neighborhood of radius ε for a point to be considered a core point.
1.1 Neighborhood and Core Points
The ε-neighborhood of a point p is defined as the set of all points within a distance ε of p:
N_ε(p) = { q ∈ D : dist(p, q) ≤ ε }
Where:
- N_ε(p) is the ε-neighborhood of point p.
- dist(p, q) is the distance between points p and q.
- D is the dataset.
A point p is classified as a core point if its ε-neighborhood contains at least MinPts points:
|N_ε(p)| ≥ MinPts
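To make these definitions concrete, here is a minimal Python sketch of the ε-neighborhood and the core-point test; the array X and the values of eps and min_pts are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Illustrative 2D dataset and parameters (assumed values).
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [5.0, 5.0]])
eps, min_pts = 0.5, 3

def eps_neighborhood(X, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

neighborhood = eps_neighborhood(X, 0, eps)
is_core = len(neighborhood) >= min_pts   # core-point condition |N_eps(p)| >= MinPts
print(neighborhood, is_core)             # e.g. [0 1 2] True
```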
1.2 Border and Noise Points
Points that are not core points but fall within the ε-neighborhood of a core point are called border points. These points are part of a cluster but do not have sufficient density around them to be core points.
Points that are neither core points nor border points are classified as noise points (or outliers). These points do not belong to any cluster.
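Building on the same definitions, the sketch below labels every point of a small assumed dataset as core, border, or noise; the function name classify_points and the data are illustrative.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' per the DBSCAN definitions."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = []
    for i in range(n):
        if core[i]:
            labels.append("core")
        elif any(core[j] for j in neighborhoods[i]):
            labels.append("border")   # within eps of at least one core point
        else:
            labels.append("noise")    # neither core nor reachable from a core point
    return labels

# Assumed data: a dense group, one nearby point, and one far-away point.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.6, 0.0], [3.0, 3.0]])
print(classify_points(X, eps=0.5, min_pts=3))
# -> ['core', 'core', 'core', 'border', 'noise']
```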
2. Algorithmic Steps of DBSCAN
The DBSCAN algorithm can be broken down into the following steps:
Step 1: Initialize and Select a Point
- Begin with an unvisited point p from the dataset D.
- Mark p as visited.
Step 2: Retrieve the ε-Neighborhood
- Retrieve the ε-neighborhood N_ε(p) of the point p.
Step 3: Check Core Point Criteria
- If |N_ε(p)| ≥ MinPts, p is a core point. A new cluster is created, and p and all points in N_ε(p) are added to this cluster.
- If |N_ε(p)| < MinPts, p is marked as noise. However, p may later be reclassified as a border point if it falls within the ε-neighborhood of a core point.
Step 4: Expand the Cluster
- For each point q in N_ε(p):
- If q is unvisited, mark it as visited and retrieve its ε-neighborhood N_ε(q).
- If |N_ε(q)| ≥ MinPts, q is also a core point, and the points in N_ε(q) are added to the set of points to be examined for this cluster.
- If q is not already assigned to a cluster, add it to the current cluster.
Step 5: Repeat
- Repeat the process for all unvisited points in the dataset until all points have been processed.
Step 6: Result
- The result is a set of clusters, with each cluster containing core points and possibly border points. Noise points are not assigned to any cluster.
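The steps above translate almost directly into code. Below is a compact, from-scratch sketch of the procedure; the function and variable names are my own, and the implementation favors readability over efficiency.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)               # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    cluster_id = -1

    for p in range(n):                                    # Step 1: pick an unvisited point
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(np.where(dists[p] <= eps)[0])    # Step 2: eps-neighborhood of p
        if len(neighbors) < min_pts:                      # Step 3: not a core point (noise for now)
            continue
        cluster_id += 1                                   # Step 3: core point -> start a new cluster
        labels[p] = cluster_id
        i = 0
        while i < len(neighbors):                         # Step 4: expand the cluster
            q = neighbors[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = np.where(dists[q] <= eps)[0]
                if len(q_neighbors) >= min_pts:           # q is also a core point
                    neighbors.extend(j for j in q_neighbors if j not in neighbors)
            if labels[q] == -1:                           # q not yet in any cluster
                labels[q] = cluster_id
            i += 1
    return labels                                         # Steps 5-6: every point processed
```

On a small test array, running dbscan(X, eps, min_pts) should reproduce the behaviour described above, with border points inheriting the cluster of the core point that reached them; scikit-learn's DBSCAN (with eps and min_samples) offers an optimized equivalent.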
3. Properties of DBSCAN
DBSCAN has several important properties that make it a powerful clustering algorithm:
3.1 Handling of Noise and Outliers
One of the key advantages of DBSCAN is its ability to naturally identify and exclude noise points. This is achieved by the density-based approach, where points that do not meet the density criterion (i.e., too few neighbors within their ε-neighborhood) are classified as noise.
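In practice this behaviour is exposed directly by library implementations; for example, scikit-learn's DBSCAN assigns the label -1 to noise points. The snippet below, with assumed data and parameters, filters them out.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8],   # a dense group
              [10.0, 10.0]])                          # an isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

noise_mask = labels == -1                 # -1 marks noise in scikit-learn
print("noise points:", X[noise_mask])
```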
3.2 Clustering Arbitrary Shapes
DBSCAN does not make any assumptions about the shape of clusters. It can identify clusters of arbitrary shape because membership is driven by local point density rather than by distance to a cluster centroid, as in K-Means. This makes DBSCAN particularly effective for datasets where clusters are non-convex or irregularly shaped.
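A common way to see this property is the two-moons toy dataset, whose clusters are crescent-shaped and non-convex. The sketch below uses scikit-learn's make_moons; the values of eps and min_samples are assumptions that tend to work for this data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving crescents with a little coordinate noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # typically {0, 1}: one label per crescent
```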
3.3 Automatic Determination of the Number of Clusters
Unlike algorithms such as K-Means, where the number of clusters must be specified a priori, DBSCAN automatically determines the number of clusters based on the density of points in the dataset. The number of clusters is an emergent property of the data and the chosen ε and MinPts parameters.
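With scikit-learn's labelling convention (integers for clusters, -1 for noise), the number of clusters can be read directly from the fitted labels; the data in this sketch is random and purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)   # illustrative random 2D data
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

# Number of clusters found, excluding the noise label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```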
3.4 Sensitivity to Parameters
The effectiveness of DBSCAN depends on the appropriate selection of ε and MinPts. If ε is too small, many points will be classified as noise. If ε is too large, clusters may merge, leading to fewer clusters than expected. Similarly, the choice of MinPts affects the minimum density required for a cluster to form.
- Choosing ε: A common method to select an appropriate ε is to plot the k-distance graph (typically with k set to MinPts) and choose the value of ε at the point of maximum curvature (the "elbow" point); a sketch of this procedure follows this list.
- Choosing MinPts: A rule of thumb is to set MinPts to at least the dimensionality of the data plus one. However, this can vary depending on the application and dataset characteristics.
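The k-distance graph mentioned above can be produced by computing, for each point, the radius needed to enclose its k nearest points and plotting those radii in ascending order; the elbow of the curve suggests a value for ε. The sketch below uses scikit-learn's NearestNeighbors, with k set equal to MinPts as an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k):
    """Plot each point's k-distance (radius enclosing its k nearest points), sorted ascending."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nbrs.kneighbors(X)     # column 0 is each point's distance to itself (0.0)
    k_dists = np.sort(distances[:, -1])   # radius needed to enclose k points (self included)
    plt.plot(k_dists)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"radius enclosing {k} points")
    plt.show()
    return k_dists

# Example: with MinPts = 4, look for the elbow in the 4-distance curve.
X = np.random.RandomState(0).rand(300, 2)
k_distance_plot(X, k=4)
```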
4. Mathematical Example of DBSCAN
Consider a simple 2D dataset with points distributed in two clusters and some noise, and suppose DBSCAN is applied with suitable parameters ε and MinPts. The walk-through below uses generic point labels p1, ..., p5 to illustrate the mechanics.
Select Point p1:
- Retrieve the ε-neighborhood N_ε(p1).
- |N_ε(p1)| ≥ MinPts, so p1 is a core point, and a new cluster is formed.
Expand Cluster:
- Include the points of N_ε(p1) in the cluster.
- Next, consider a neighboring point p2.
- |N_ε(p2)| ≥ MinPts, so p2 is also a core point, and N_ε(p2) is added to the cluster.
Move to the Next Unvisited Point p3:
- |N_ε(p3)| ≥ MinPts, so p3 is a core point, and a new cluster is formed.
Expand Cluster:
- Include the points of N_ε(p3) in the cluster.
- Consider a neighboring point p4.
- p4 falls within the ε-neighborhood of a core point, so it remains in the cluster.
Move to the Last Unvisited Point p5:
- |N_ε(p5)| < MinPts, and p5 is not within the ε-neighborhood of any core point, so p5 is classified as noise.
Result:
- Cluster 1: the points reached from p1 and p2.
- Cluster 2: the points reached from p3 and p4.
- Noise: p5.
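Since the walk-through above is schematic, the sketch below runs DBSCAN end-to-end on a small assumed dataset with the same overall structure: two dense groups and one outlier. The coordinates and parameter values are illustrative, not taken from the original example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Assumed data: two dense groups plus one isolated point (the noise candidate).
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.2],    # group 1
              [5.0, 5.0], [5.1, 5.2], [4.9, 4.8],    # group 2
              [9.0, 1.0]])                            # outlier

labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]: two clusters and one noise point
```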
5. Conclusion
DBSCAN is a powerful clustering algorithm that identifies clusters based on density, making it particularly useful for datasets with non-convex or irregularly shaped clusters. Its handling of noise and automatic determination of the number of clusters are significant advantages. However, careful tuning of its parameters ε and MinPts is crucial for good performance. Understanding the theoretical underpinnings of DBSCAN provides a strong foundation for applying this algorithm effectively in various domains.