Implementation of DBSCAN in Scikit-learn
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that is particularly effective at discovering clusters of varying shapes and identifying outliers. In this article, we will walk through the implementation of DBSCAN using Scikit-learn, a powerful Python library for machine learning.
1. Introduction to DBSCAN in Scikit-learn
Scikit-learn provides an implementation of DBSCAN through its sklearn.cluster.DBSCAN class. Its two key parameters are eps (the maximum distance between two samples for one to be considered as in the neighborhood of the other) and min_samples (the number of samples in a neighborhood, counting the point itself, for a point to be considered a core point).
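To make the two parameters concrete, here is a minimal sketch on a tiny hand-made dataset. The four points and the values eps=0.15 and min_samples=3 are hypothetical, chosen only for illustration. It assumes the default Euclidean metric, counts the samples within eps of each point by hand, and compares the result with the core samples reported by DBSCAN.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: three points in a row plus one far-away point.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]])
eps, min_samples = 0.15, 3

# Pairwise Euclidean distances, shape (n_samples, n_samples).
dists = np.linalg.norm(X_toy[:, None, :] - X_toy[None, :, :], axis=-1)

# Neighborhood size of each point; the point itself is included (distance 0).
neighbor_counts = (dists <= eps).sum(axis=1)
manual_core = np.flatnonzero(neighbor_counts >= min_samples)

db_toy = DBSCAN(eps=eps, min_samples=min_samples).fit(X_toy)

print("neighbors within eps:", neighbor_counts)
print("core points (manual):", manual_core)
print("core points (DBSCAN):", db_toy.core_sample_indices_)
print("labels:", db_toy.labels_)
```

Only the middle point of the row has three samples within eps, so it is the sole core point. Its two neighbors join its cluster as border points, and the far-away point is labeled -1 (noise).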
2. Step-by-Step Implementation
2.1 Importing Required Libraries
First, let's import the necessary libraries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
2.2 Generating Sample Data
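We will use the make_blobs function to generate a sample dataset of 750 points drawn from three Gaussian blobs. All three blobs share the same standard deviation (cluster_std=0.4), so they have similar densities.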
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
# Standardize features by removing the mean and scaling to unit variance
X = StandardScaler().fit_transform(X)
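Because eps is a distance threshold expressed in feature units, the scale of the features determines which points count as neighbors. The following sketch, which regenerates the same data under the hypothetical names X_raw and X_scaled, simply checks what StandardScaler does to the per-feature mean and standard deviation.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X_raw, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X_scaled = StandardScaler().fit_transform(X_raw)

# Per-feature statistics before and after scaling.
print("raw    mean:", X_raw.mean(axis=0), " std:", X_raw.std(axis=0))
print("scaled mean:", X_scaled.mean(axis=0), " std:", X_scaled.std(axis=0))
```

After scaling, each feature has a mean close to 0 and a standard deviation close to 1, so a single eps value treats both features on the same footing.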
2.3 Applying DBSCAN
Now, we apply the DBSCAN algorithm to the standardized data.
# Apply DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_
eps=0.3: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
min_samples=10: The number of samples in a neighborhood for a point to be considered a core point.
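The values eps=0.3 and min_samples=10 are given here without justification. One common heuristic for judging eps, which is not part of the walkthrough above, is to look at the distribution of each point's distance to its min_samples-th nearest sample (often called a k-distance plot). The sketch below computes those distances with NearestNeighbors and prints a few percentiles; the names nn and k_dist are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

min_samples = 10

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), so the last column is the radius needed to enclose
# min_samples samples, counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")

# To inspect the curve visually: plt.plot(k_dist) with matplotlib.
```

A point is a core point when its k-distance is at most eps, so an eps near the point where the sorted curve starts to rise sharply keeps most dense points as core points while leaving isolated points as noise. This is a rule of thumb rather than a guarantee.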
2.4 Identifying Core, Border, and Noise Points
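The labels_ attribute of the DBSCAN object gives us the cluster labels assigned to each point. Points labeled -1 are classified as noise. The core_sample_indices_ attribute holds the indices of the core points; points that belong to a cluster but are not core points are border points.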
# Identify core, border, and noise points
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(f'Estimated number of clusters: {n_clusters_}')
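The mask built above separates core points from everything else. As a small sketch of how the three categories could be told apart, the code below repeats the earlier steps so that it runs on its own, then derives noise and border masks (core_mask, noise_mask, and border_mask are illustrative names) and counts each group.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

# Core points: indices reported by the fitted estimator.
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points: label -1. Border points: in a cluster but not core.
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

print("core points:  ", core_mask.sum())
print("border points:", border_mask.sum())
print("noise points: ", noise_mask.sum())
```

Every point falls into exactly one of the three groups, so the three counts add up to the number of samples.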
2.5 Visualizing the Results
Finally, let's visualize the clustering result. We'll color each point according to its assigned cluster, and noise points will be colored black.
# Unique labels
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
# Plot
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    # Core points of this cluster: large markers
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    # Non-core points (border points, or noise when k == -1): small markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title(f'Estimated number of clusters: {n_clusters_}')
plt.show()
2.6 Result Analysis
The plot should show the clusters identified by DBSCAN. Points in the same cluster share a color, large markers are core points, small markers are border points, and noise points are drawn in black.
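The plot can be complemented with a few numbers. The sketch below is one possible way to do that, not part of the original walkthrough: it keeps the true blob labels that make_blobs returns (under the hypothetical name y_true), prints the size of each group found by DBSCAN, and computes the adjusted Rand index with sklearn.metrics.adjusted_rand_score.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, y_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_

# Size of each group; label -1 collects the noise points.
cluster_ids, sizes = np.unique(labels, return_counts=True)
for cid, size in zip(cluster_ids, sizes):
    name = "noise" if cid == -1 else f"cluster {cid}"
    print(f"{name}: {size} points")

# Agreement with the generating blobs; noise is treated as its own group.
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

An adjusted Rand index close to 1 means the DBSCAN grouping closely matches the blobs that generated the data. This kind of check is only possible when ground-truth labels are available, as they are for synthetic data.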
3. Conclusion
In this article, we implemented the DBSCAN clustering algorithm using Scikit-learn. We generated a sample dataset, applied DBSCAN, and visualized the results. DBSCAN can find clusters of varying shapes and labels outliers as noise instead of forcing them into a cluster, which makes it a useful option for unsupervised learning tasks where the number of clusters is not known in advance.