This blog post on the DBSCAN algorithm is part of the article series Understanding AI Algorithms. DBSCAN is a clustering algorithm.
The density-based spatial clustering of applications with noise (DBSCAN) algorithm clusters data by finding groups of observations that lie close together, i.e., regions of high density rather than spread-out points.
This makes DBSCAN a good fit when the clusters cannot be assumed to be round: unlike k-means, it can follow clusters of arbitrary shape. That flexibility lets an analysis surface groupings and similarities in the data that would otherwise go unnoticed.
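To make the contrast concrete, here is a minimal sketch comparing k-means and DBSCAN on two half-moon shaped clusters, assuming scikit-learn is available; the synthetic data and parameter values are illustrative choices, not part of this article.

```python
# Sketch: DBSCAN vs. k-means on non-spherical data (assumes scikit-learn is installed).
# The half-moon dataset and parameter values are illustrative, not from the article.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# k-means forces two roughly round clusters and tends to split each moon in half.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the density of the points and can recover each moon as one cluster.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("k-means cluster labels:", set(kmeans_labels))
print("DBSCAN cluster labels:", set(dbscan_labels))  # -1 marks noise, if any
```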
DBSCAN requires two parameters. The first, epsilon, is the maximum distance two observations can be apart and still count as neighbors. The second, MinPoints, is the minimum number of neighboring observations a point needs in order to anchor a cluster.
So, if you set MinPoints to five, an observation that lies within epsilon of only three other observations cannot anchor a cluster on its own. As you might notice, DBSCAN doesn’t require that we tell the algorithm how many clusters to find.
Not having to choose the number of clusters up front simplifies the task of extracting new information from data and reduces the risk of missing important or hidden patterns.
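As a small illustration of the two parameters, the sketch below runs scikit-learn's DBSCAN implementation (where epsilon and MinPoints appear as eps and min_samples) on a tiny hand-made dataset; the data points and parameter values are made up for the example.

```python
# Sketch of the two DBSCAN parameters, using scikit-learn's names eps and min_samples.
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated observation (toy data).
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense group A
    [8.0, 8.0], [8.1, 8.1], [7.9, 8.0], [8.0, 7.9],   # dense group B
    [4.5, 4.5],                                        # isolated point
])

# eps: maximum distance for two points to count as neighbors.
# min_samples: minimum number of neighbors a point needs to anchor a cluster.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # [0 0 0 0 1 1 1 1 -1]: two clusters found, -1 flags the isolated point as noise
```

Note that the number of clusters (two here) is discovered from the density of the data, not passed in as a parameter.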
This may sound complicated, but clustering algorithms of this kind have a very practical use in marketing.
For instance, they can be used to recommend products to a cluster of customers based on what they have previously purchased. If one customer in a cluster has purchased product A and product B, and another customer in the same cluster has purchased product A as well, then we could recommend product B to the second customer.
Amazon’s site does something similar when it recommends purchases based on what other people bought.
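As a rough sketch of how such a recommendation could work once customers have been clustered, the snippet below suggests products bought by other customers in the same cluster; the customer data, cluster labels, and the recommend helper are all hypothetical.

```python
# Hypothetical cluster-based recommendation sketch; the customers, purchases,
# cluster assignments, and helper function are invented for illustration.

# Products purchased per customer (toy data).
purchases = {
    "customer_1": {"A", "B"},
    "customer_2": {"A"},
    "customer_3": {"C"},
}

# Cluster assignment per customer, e.g. produced by DBSCAN on customer feature vectors.
clusters = {"customer_1": 0, "customer_2": 0, "customer_3": 1}

def recommend(customer, purchases, clusters):
    """Suggest products bought by other customers in the same cluster."""
    own_items = purchases[customer]
    own_cluster = clusters[customer]
    suggestions = set()
    for other, items in purchases.items():
        if other != customer and clusters[other] == own_cluster:
            suggestions |= items - own_items
    return suggestions

print(recommend("customer_2", purchases, clusters))  # {'B'}
```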
If you want to read all the related articles on the topic of AI algorithms, here is the list of all blog posts in this article series: