Definition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method that computes, for each point in the dataset, its neighbours within a given distance \(\varepsilon\).
DBSCAN has two hyperparameters: the radius \(\varepsilon\) and the minimum number of neighbours \(min_{points}\).
Notations
- Two data points \(p\) and \(q\) are considered neighbours if \(D(p, q) \lt \varepsilon\), where \(D\) is a given distance measure (Euclidean distance, for example),
- A point \(p\) is a core point if it has at least \(min_{points}\) neighbours, i.e. \(min_{points}\) points within a distance \(\varepsilon\) of it (including \(p\) itself),
- A point \(q\) is directly reachable from \(p\) if \(q\) is within distance \(\varepsilon\) of core point \(p\). Points are only ever said to be directly reachable from core points,
- A point \(q\) is reachable from \(p\) if there is a path \(p_1, \ldots, p_n\) with \(p_1 = p\) and \(p_n = q\), where each \(p_{i+1}\) is directly reachable from \(p_i\). Note that this implies that the initial point and all points on the path must be core points, with the possible exception of \(q\).
A cluster is thus a set of core points that are all reachable from one another, together with the non-core points that lie within distance \(\varepsilon\) of at least one of those core points.
Data points at distance more than \(\varepsilon\) from every core point are considered outliers.
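To make these definitions concrete, here is a minimal NumPy sketch (the toy dataset and the values of \(\varepsilon\) and \(min_{points}\) are illustrative):

```python
import numpy as np

# Toy 2-D dataset: four close points and one far away (illustrative values)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0]])
eps = 0.5         # the radius epsilon
min_points = 4    # minimum number of neighbours, including the point itself

# Pairwise Euclidean distances: D[i, j] = ||X[i] - X[j]||
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# neighbours[i, j] is True when points i and j are neighbours
# (every point is its own neighbour, since D[i, i] = 0 < eps)
neighbours = D < eps

# Core points have at least min_points neighbours
is_core = neighbours.sum(axis=1) >= min_points
print(is_core)    # [ True  True  True  True False]
```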
Visual representation
In this diagram, \(min_{points}=4\). Point \(A\) and the other red points are core points, because the area surrounding each of them within an \(\varepsilon\) radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points \(B\) and \(C\) are not core points, but are reachable from \(A\) (via other core points) and thus belong to the cluster as well. Point \(N\) is a noise point: it is neither a core point nor directly reachable from one. From the DBSCAN Wikipedia page.
Pseudo code
- Associate all points to cluster -1 (outliers),
- Initialize the current cluster number \(C\) to 0,
- For each point \(p\) in the dataset:
  - if \(cluster(p) \neq -1\), skip \(p\) (it has already been assigned),
  - \(N := \text{neighbours of } p\) (i.e. points at a distance less than \(\varepsilon\) from \(p\)),
  - if \(\vert N \vert \lt min_{points}\), skip \(p\) (it stays in cluster -1),
  - otherwise \(p\) is a core point: \(C := C + 1\) and \(cluster(p) := C\),
  - for each point \(p'\) in \(N\), including the neighbourhoods appended below:
    - if \(cluster(p') = -1\), \(cluster(p') := C\),
    - if \(p'\) is itself a core point, append its neighbours to \(N\) so the cluster keeps expanding.
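A minimal Python implementation of the pseudo code above (the `dbscan` function and its brute-force neighbour search are a sketch, not a library implementation):

```python
from collections import deque
import numpy as np

def dbscan(X, eps, min_points):
    """Return one cluster label per point; -1 marks outliers."""
    n = len(X)
    labels = np.full(n, -1)  # every point starts as an outlier
    # Brute-force neighbourhoods (a KD tree would avoid this, see below)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.flatnonzero(D[i] < eps) for i in range(n)]
    C = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbours[p]) < min_points:
            continue  # p is already assigned, or is not a core point
        C += 1
        labels[p] = C
        queue = deque(neighbours[p])  # expand the cluster outwards from p
        while queue:
            q = queue.popleft()
            if labels[q] != -1:
                continue  # already assigned (possibly to this cluster)
            labels[q] = C  # q is reachable from p
            if len(neighbours[q]) >= min_points:
                queue.extend(neighbours[q])  # q is core: keep expanding
    return labels
```

On the toy dataset from the Notations section, `dbscan(X, eps=0.5, min_points=4)` labels the four close points 1 and leaves the isolated point at -1.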
Pros and cons
Pros
- Can find clusters of arbitrary, non-convex shapes,
- Average complexity \(O(n \log n)\) when neighbour queries use a spatial index such as a KD tree (see below); \(O(n^2)\) otherwise,
- Automatically selects the number of clusters.
Cons
- Sensitive to hyperparameters (particularly \(\varepsilon\))
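To illustrate the first and third pros, scikit-learn's `DBSCAN` separates two half-moon shaped clusters without being given the number of clusters (the hyperparameter values below are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex clusters that k-means handles poorly
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# eps and min_samples are the two hyperparameters discussed above
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(labels))  # typically {0, 1}: two clusters, plus -1 for any noise
```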
KD tree
A KD tree is a data structure that avoids computing the whole distance matrix (i.e. computing, for each point, its distance to every other point).
It organises the points into a binary tree so that a neighbour query only has to explore a small part of the dataset.
Pseudo code
- Pick a feature \(x\) (at random, for example),
- Split the dataset at the median of \(x\) (half of the points have a value of \(x\) greater than the median),
- Repeat recursively on each half until every branch holds at most a predetermined number of points, as in the sketch below.
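In practice the tree is rarely built by hand; `sklearn.neighbors.KDTree`, for instance, implements this idea, with `leaf_size` playing the role of the predetermined number of points per branch (illustrative values below):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 2))  # illustrative random 2-D dataset

# leaf_size: stop splitting once a branch holds roughly this many points
tree = KDTree(X, leaf_size=30)

# All neighbours within eps of each point, without the full distance matrix
eps = 0.05
neighbours = tree.query_radius(X, r=eps)
print(len(neighbours[0]))  # number of neighbours of the first point
```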
Resources
See: