Revision

Back to Machine Learning


Definition

DBSCAN (for Density-based spatial clustering of applications with noise) is a density based clustering method that computes for each point in the dataset its neighbours based on a given distance \(\varepsilon\).

\(\varepsilon\) is an hyperparameter of DBSCAN and the other hyperparameter of the model is \(min_{points}\).


Notations

A cluster is thus a set of close core points with distance less than \(\varepsilon\) and non core sample but having a distance less than \(\varepsilon\) to at least one core point. Data points with distance more than \(\varepsilon\) to every core points are considered outliers.


Visual representation


In this diagram, \(min_{points}=4\). Point \(A\) and the other red points are core points, because the area surrounding these points in an \(\varepsilon\) radius contain at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points \(B\) and \(C\) are not core points, but are reachable from \(A\) (via other core points) and thus belong to the cluster as well. Point \(N\) is a noise point that is neither a core point nor directly-reachable. From DBSCAN Wikipedia page.


Pseudo code


Pros and cons

Pros


Cons


KD tree

KD tree is a method to avoid computing the whole distance matrix (ie to compute for each point its distance to every other point). KD tree build a data structure that organises the data into a tree structure.


Pseudo code


Ressources

See: