DBSCAN: What Is It? When to Use It? How to Use It
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised learning method used in model building and machine learning algorithms. Before we go any further, we need to define what an “unsupervised” learning method is. Unsupervised learning methods are those in which there is no clear objective or outcome we are trying to predict; instead, we cluster the data together based on the similarity of observations. To help clarify, let’s take Netflix as an example. Based on shows you have watched in the past, Netflix will recommend shows for you to watch next. Anyone who has ever been on Netflix has seen the screen below with recommendations (yes, this image is taken directly from my Netflix account, and if you have never watched Shameless before, I suggest you get on that ASAP).
Because I watched ‘Shameless’, Netflix recommends several other similar shows to watch. But where does Netflix gather those recommendations from? Since it is trying to predict which show I will watch next, Netflix has nothing definitive to base the prediction on (no clear objective or labeled outcome). Instead, Netflix looks at other users who have also watched ‘Shameless’ in the past, and at what those users watched in addition to ‘Shameless’. By doing so, Netflix clusters its users together based on similarity of interests. This is exactly how unsupervised learning works: clustering observations together based on similarity, in the hope of drawing accurate conclusions from the clusters.
Back to DBSCAN. DBSCAN is a clustering method used in machine learning to separate clusters of high density from clusters of low density. Because DBSCAN is a density-based clustering algorithm, it does a great job of finding areas of the data with a high density of observations, versus areas that are sparse. DBSCAN can also sort data into clusters of varying shapes, another strong advantage. DBSCAN works as follows:
- Treats the dataset as points in n-dimensional space (one dimension per feature)
- For each point in the dataset, DBSCAN forms an n-dimensional shape (a neighborhood of radius epsilon) around that point, and then counts how many data points fall within that shape.
- If enough points fall within that shape, DBSCAN counts it as a cluster. DBSCAN then iteratively expands the cluster by going through each individual point within the cluster and counting the number of other data points nearby. Take the graphic below for an example:
Going through the aforementioned process step by step, DBSCAN starts by treating the data as points in n-dimensional space. It then starts at a random point (in this case, let’s assume it was one of the red points) and counts how many other points lie nearby, within epsilon. DBSCAN continues this process until no more points can be added to the cluster, and then it looks to form a second cluster.
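To make that expansion loop concrete, here is a minimal, unoptimized sketch of the idea in plain Python. This is my own simplified illustration, not scikit-learn’s actual implementation, and the function names are mine:
import numpy as np

def region_query(X, i, eps):
    # indices of every point within eps of point i (the point itself included)
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

def simple_dbscan(X, eps, min_samples):
    labels = np.full(len(X), -1)   # -1 = noise, until a cluster claims the point
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != -1:
            continue               # already assigned to a cluster
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_samples:
            continue               # not a core point; stays noise (for now)
        seeds = list(neighbors)    # start a new cluster and expand it
        for j in seeds:
            if labels[j] == -1:
                labels[j] = cluster_id
                new_neighbors = region_query(X, j, eps)
                if len(new_neighbors) >= min_samples:   # j is a core point too
                    seeds.extend(k for k in new_neighbors if labels[k] == -1)
        cluster_id += 1
    return labels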
As you may have noticed from the graphic, there are a couple of parameters that we need to give DBSCAN before it does its work. The two parameters we need to specify are as follows:
- min_samples: What is the minimum number of data points needed to form a single cluster?
- eps (epsilon): How far away can one point be from the next point within the same cluster?
Referring back to the graphic, epsilon (eps) is the radius used to test the distance between data points. If a point falls within the epsilon distance of another point, those two points will be placed in the same cluster.
Furthermore, the minimum number of points (min_samples) is set to 4 in this scenario. When going through each data point, as long as DBSCAN finds at least 4 points within epsilon distance of one another, a cluster is formed.
IMPORTANT: For a point to be considered a “core” point, it must have at least the minimum number of points within epsilon distance of it, and the point itself is included in that count. View the scikit-learn documentation and look at the min_samples parameter in particular.
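As a quick toy illustration of that counting rule (the points and parameter values here are my own made-up example):
import numpy as np

# five 2D points; eps and min_samples chosen purely for illustration
points = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [5, 5]])
eps, min_samples = 1.5, 4

# count neighbors within eps of the first point, including the point itself
dists = np.linalg.norm(points - points[0], axis=1)
print(np.sum(dists <= eps) >= min_samples)  # True -> points[0] is a core point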
You will also notice that the blue point in the graphic is not contained within any cluster. DBSCAN does NOT necessarily categorize every data point, which makes it terrific at handling outliers in the dataset. Let’s examine the graphic below:
The left image depicts a more traditional clustering method, such as K-Means, which assumes clusters are convex and roughly spherical. The right image shows how DBSCAN can follow clusters of arbitrary shape in order to group similar points. We also notice in the right image that the points along the outer edge of the dataset are not assigned to any cluster, suggesting they are outliers in the data.
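You can reproduce that contrast yourself. The sketch below, my own illustration using scikit-learn’s make_moons, builds two interleaving half-moons, a shape K-Means handles poorly because the clusters are not convex (the eps and min_samples values here are just reasonable guesses for this data):
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# two interleaving half-moons: non-convex clusters
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means cuts the moons apart with a straight boundary
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

# DBSCAN follows each moon's curved shape and marks stray points as -1
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

print(set(kmeans_labels), set(dbscan_labels))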
Advantages of DBSCAN:
- Is great at separating clusters of high density versus clusters of low density within a given dataset.
- Is great at handling outliers within the dataset.
Disadvantages of DBSCAN:
- While DBSCAN is great at separating high-density clusters from low-density noise, it struggles with clusters of varying density, because a single eps and min_samples setting cannot suit both very dense and very sparse clusters at the same time.
- Struggles with high-dimensional data. I know, throughout this article I have said that DBSCAN is great at contorting around clusters of different shapes. However, DBSCAN can only go so far: distance measures become less meaningful as the number of dimensions grows, so given data with too many dimensions, DBSCAN suffers.
Below I have included how to implement DBSCAN in Python, after which I explain the metrics for evaluating your DBSCAN model.
DBSCAN Implementation in Python
1. Assigning the data as our X values
# import the scaler used below
from sklearn.preprocessing import StandardScaler

# setting up data to cluster
X = data

# scaling and standardizing the data
X = StandardScaler().fit_transform(X)
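(Note: data here stands in for whatever dataset you are clustering, so it needs to exist before this step. If you just want something synthetic to follow along with, one option, purely for illustration, is scikit-learn’s make_blobs:)
from sklearn.datasets import make_blobs

# synthetic stand-in for `data`: 750 points drawn around three centers
data, _ = make_blobs(n_samples=750, centers=3, random_state=0)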
2. Instantiating our DBSCAN model. In the code below, eps = 3 is the epsilon (radius) and min_samples = 4 is the minimum number of points needed to constitute a cluster.
# import DBSCAN from scikit-learn
from sklearn.cluster import DBSCAN

# instantiating DBSCAN
dbscan = DBSCAN(eps=3, min_samples=4)

# fitting the model
model = dbscan.fit(X)
3. Storing the labels formed by DBSCAN
# cluster label for each point; -1 marks noise/outliers
labels = model.labels_
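Since the label -1 marks points DBSCAN treated as noise, a quick count of outliers can be handy:
# count the points DBSCAN labeled as noise
n_noise = list(labels).count(-1)
print(n_noise)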
4. Identifying which points make up our “core points”
import numpy as np
from sklearn import metrics

# identify core samples
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbscan.core_sample_indices_] = True
print(core_samples)
5. Calculating the number of clusters
# declare the number of clusters, ignoring the noise label (-1) if present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
6. Computing the Silhouette Score
print("Silhoette Coefficient: %0.3f" % metrics.silhouette_score(X, labels)
Metrics for Measuring DBSCAN’s Performance:
Silhouette Score: The silhouette score is calculated using the mean intra-cluster distance between points AND the mean nearest-cluster distance. For instance, a cluster whose points are very close to each other (high density) AND which is far away from the next nearest cluster (suggesting the cluster is very distinct from its closest neighbor) will have a strong silhouette score. Silhouette scores range from -1 to 1, with -1 being the worst possible score and 1 the best. Scores near 0 suggest overlapping clusters.
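Concretely, for a single point the score is s = (b − a) / max(a, b), where a is the mean distance from that point to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster; the overall silhouette score is the mean of s across all points.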
Inertia: Inertia measures the within-cluster sum of squares (the sum of squared distances of each point from its cluster center). Inertia is used to measure how internally coherent clusters are; the lower the inertia, the better. HOWEVER, it is important to note that inertia relies heavily on the assumption that the clusters are convex and roughly spherical. DBSCAN does not necessarily divide data into spherical clusters, so inertia is not a good metric for evaluating DBSCAN models (which is why I did not include inertia in the code above). Inertia is more often used with other clustering methods, such as K-Means clustering.
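For reference, this is how you would read inertia off a K-Means model in scikit-learn, where it is exposed as the inertia_ attribute (the n_clusters value here is just an illustrative choice):
from sklearn.cluster import KMeans

# inertia_ is the sum of squared distances from each sample to its nearest cluster center
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.inertia_)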