DBSCAN: What Is It? When to Use It? How to Use It?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised learning method used in model building and machine learning algorithms. Before we go any further, we need to define what an “unsupervised” learning method is. In unsupervised learning, there is no clear objective or outcome we are seeking to predict; instead, we cluster the data together based on the similarity of observations. To help clarify, let’s take Netflix as an example. Based on shows you have watched in the past, Netflix will recommend shows for you to watch next. Anyone who has ever been on Netflix has seen a screen of recommendations (yes, the screenshot was taken directly from my Netflix account, and if you have never watched Shameless before, I suggest you get on that ASAP).


Because I watched ‘Shameless’, Netflix recommends several other similar shows to watch. But where is Netflix gathering those recommendations from? Since it is trying to predict which show I will watch next, Netflix has no clear, definitive objective to base its predictions on. Instead, Netflix looks at other users who have also watched ‘Shameless’ in the past, and looks at what those users watched in addition to ‘Shameless’. By doing so, Netflix is clustering its users together based on similarity of interests. This is exactly how unsupervised learning works: clustering observations together based on similarity, with the hope of drawing accurate conclusions from the clusters.

Back to DBSCAN. DBSCAN is a clustering method used in machine learning to separate clusters of high density from clusters of low density. Because DBSCAN is a density-based clustering algorithm, it does a great job of finding areas of the data that are dense with observations, versus areas that are sparse. DBSCAN can also sort data into clusters of varying shapes, another strong advantage. DBSCAN works as follows:


Going through the process step by step, DBSCAN starts at a random, unvisited point (in this case let's assume it was one of the red points) and counts how many other points lie within epsilon distance of it. If enough neighbors are found, the point joins a cluster, and DBSCAN repeats the count for each of those neighbors, growing the cluster outward. DBSCAN continues this process until no new points can be added to the cluster, and then it moves to an unvisited point and looks to form a second cluster.
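The process above can be sketched in plain Python. This is a simplified, unoptimized illustration of the idea, not scikit-learn's actual implementation; the function name `dbscan_sketch` and the sample coordinates are made up for this example:

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Simplified DBSCAN: label each point -1 (noise) or with a cluster id."""
    n = len(X)
    labels = np.full(n, -1)            # -1 means noise / not yet clustered
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):
        # indices of all points within eps of point i (includes i itself)
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(neighbors(i))
        if len(seeds) < min_samples:
            continue                   # not a core point; stays noise for now
        # grow a new cluster outward from this core point
        labels[i] = cluster_id
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id     # border or core point joins cluster
            if not visited[j]:
                visited[j] = True
                j_neighbors = neighbors(j)
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)  # j is also core: keep expanding
        cluster_id += 1
    return labels

# two tight groups plus one isolated outlier
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
                   [50.0, 50.0]])
# first four points form one cluster, next four another, outlier stays -1
print(dbscan_sketch(points, eps=1.5, min_samples=4))
```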

As you may have noticed from the graphic, there are a couple of parameters and specifications we need to give DBSCAN before it does its work. The two parameters we need to specify are as follows:

What is the minimum number of data points needed to form a single cluster? (the min_samples parameter)

How far away can one point be from the next point within the same cluster? (the epsilon, or eps, parameter)


Referring back to the graphic, the epsilon is the radius given to test the distance between data points. If a point falls within the epsilon distance of another point, those two points will be in the same cluster.

Furthermore, the minimum number of points needed is set to 4 in this scenario. When going through each data point, as long as DBSCAN finds at least 4 points within epsilon distance of a given point, a cluster is formed around it.

IMPORTANT: In order for a point to be considered a “core” point, it must have at least the minimum number of points within epsilon distance of it; the point itself is included in the count. View the scikit-learn documentation and look at the min_samples parameter in particular.
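To make the counting concrete, here is a small sketch (made-up coordinates) that checks whether one point qualifies as a core point; note that the point itself is included in the neighbor count:

```python
import numpy as np

eps = 1.0
min_samples = 4

# made-up 2-D points: the first four are tightly packed, the last is isolated
points = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [5.0, 5.0]])

# distance from point 0 to every point (its distance to itself is 0)
dists = np.linalg.norm(points - points[0], axis=1)

# neighbors within eps -- the point itself is included in the count
neighbour_count = int(np.sum(dists <= eps))
print(neighbour_count)                    # 4
print(neighbour_count >= min_samples)     # True: point 0 is a core point
```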

You will also notice that the blue point in the graphic is not contained within any cluster. DBSCAN does NOT necessarily categorize every data point, and is therefore terrific at handling outliers in the dataset. Let’s examine the graphic below:


The left image depicts a more traditional clustering method, such as K-Means, which struggles with clusters that are not convex in shape. The right image shows how DBSCAN can contort itself to clusters of different shapes and sizes in order to find similar groups of points. We also notice in the right image that the points along the outer edge of the dataset are not classified, suggesting they are outliers amongst the data.
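This contrast can be reproduced with scikit-learn's make_moons dataset (a sketch with assumed parameter values eps=0.2 and min_samples=5; the article's figure may use different data):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# two interleaving half-circles: non-convex clusters K-Means cannot separate
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means splits the moons with a straight boundary
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the density and recovers each moon as its own cluster
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# number of clusters DBSCAN found (the -1 label marks noise, not a cluster)
print(len(set(dbscan_labels) - {-1}))
```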

Advantages of DBSCAN:

- It does not require the number of clusters to be specified ahead of time (unlike K-Means).
- It can find clusters of arbitrary, non-convex shapes.
- It is robust to outliers, which are simply left unclustered as noise.

Disadvantages of DBSCAN:

- It struggles when clusters have very different densities, since a single epsilon cannot suit them all.
- Its results are sensitive to the choice of epsilon and min_samples.
- Distance-based neighborhoods become less meaningful in very high-dimensional data.

Below I have included how to implement DBSCAN in Python, after which I explain the metrics used to evaluate your DBSCAN model.

DBSCAN Implementation in Python

1. Assigning the data as our X values

# scaling puts all features on a comparable footing before clustering
from sklearn.preprocessing import StandardScaler

# setting up the data to cluster
X = data
# scaling and standardizing the data
X = StandardScaler().fit_transform(X)

2. Instantiating our DBSCAN model. In the code below, epsilon = 3 and min_samples = 4, the minimum number of points needed to constitute a cluster.

from sklearn.cluster import DBSCAN

# instantiating DBSCAN
dbscan = DBSCAN(eps=3, min_samples=4)
# fitting the model
model = dbscan.fit(X)

3. Storing the labels formed by DBSCAN

# cluster label for each point; noise/outliers are labeled -1
labels = model.labels_

4. Identifying which points make up our “core points”

import numpy as np
from sklearn import metrics
# identify core samples
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbscan.core_sample_indices_] = True
print(core_samples)

5. Calculating the number of clusters

# calculate the number of clusters, excluding the noise label (-1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)

6. Computing the Silhouette Score

print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

Metrics for Measuring DBSCAN’s Performance:

Silhouette Score: The silhouette score is calculated using the mean intra-cluster distance between points AND the mean nearest-cluster distance. For instance, a cluster with a lot of data points very close to each other (high density) that is also far away from the next nearest cluster (suggesting the cluster is very distinct from its closest neighbor) will have a strong silhouette score. Silhouette scores range from -1 to 1, with -1 being the worst score possible and 1 being the best; scores near 0 suggest overlapping clusters.
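As a quick sanity check (made-up data, not from the article), two tight, well-separated groups score close to 1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# two dense groups far apart: high intra-cluster density, large separation
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# mean intra-cluster distance is tiny relative to the nearest-cluster
# distance, so the score lands close to 1
score = silhouette_score(X, labels)
print(round(score, 3))
```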

Inertia: Inertia measures the internal (within-cluster) sum of squares: the sum of squared distances of each point to its cluster’s centroid. Inertia is used to measure how internally coherent clusters are; the lower the inertia score, the better. HOWEVER, it is important to note that inertia heavily relies on the assumption that the clusters are convex (roughly spherical). DBSCAN does not necessarily divide data into spherical clusters, so inertia is not a good metric for evaluating DBSCAN models (which is why I did not include inertia in the code above). Inertia is more often used in other clustering methods, such as K-Means clustering.
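For contrast, inertia can be computed with scikit-learn's KMeans via its inertia_ attribute (sklearn's DBSCAN exposes no such attribute, consistent with the point above); a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

# two tight, well-separated groups (made-up points)
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])

# inertia_ is the sum of squared distances of samples to their
# nearest cluster centre; small here because each group is tight
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)
```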

University of Maryland graduate. Currently a Data Scientist at Pinpoint.
