Revision

Back to Machine Learning


Definition

OPTICS (for Ordering points to identify the clustering structure is a density based clustering) is a clustering method similar to DBSCAN.

The main difference between DBSCAN and OPTICS is that OPTICS does not require an hyperparameter \(\varepsilon\). It still requires an hyperparameter \(min_{points}\). It also used a optional \(max_\varepsilon\) hyperparameter.


Notation


\[\text{core-distance}_{\varepsilon, min_{points}}(p) = \begin{cases} Undefined && \text{if } \vert N_{\varepsilon}(p) \vert \lt min_{points}\\ min_{distance}(p) && \text{otherwise} \end{cases}\]


\[\text{reachability-distance}_{\varepsilon, min_{points}}(p, q) = \begin{cases} Undefined && \text{if } \vert N_{\varepsilon}(q) \vert \lt min_{points}\\ \max(\text{core-distance}_{\varepsilon, min_{points}}(q), \; D(p,q)) && \text{otherwise} \end{cases}\]


Note that the \(\text{reachability-distance}\) is not symmetric.


Idea of the algorithm

Using the hyperparameter \(min_{points}\) and the optional hyperparameter \(max_\varepsilon\), OPTICS will create for each point \(p\) a reachability-distance matrix.

If \(max_\varepsilon\) is not defined by the user its default value is \(\infty\). In this setup, every point will have a core-distance and reachability-distance to every points defined. Otherwise some points will have \(undefined\) distances.

Once the reachability-distance matrix is defined, a reachability plot is created. The reachability plot has in \(x\)-axis the points and in \(y\)-axis their reachability distance to the closest point.

The reachability plot is ordered and created incrementaly: for each current point the algorithm selects the next point as the point with minimum reachability-distance to the current point (initialisation is random).


Reachability plot

Here is a visualisation of reachability plot:



Using a reachability-plot (a special kind of dendrogram), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points as processed by OPTICS on the x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.

The image above illustrates this concept. In its upper left area, a synthetic example data set is shown. The upper right part visualizes the spanning tree produced by OPTICS, and the lower part shows the reachability plot as computed by OPTICS. Colors in this plot are labels, and not computed by the algorithm; but it is well visible how the valleys in the plot correspond to the clusters in above data set. The yellow points in this image are considered noise, and no valley is found in their reachability plot. They are usually not assigned to clusters, except the omnipresent “all data” cluster in a hierarchical result. From OPTICS Wikipedia page.


Selection of clusters

Extracting clusters from this plot can be done manually by selecting a range on the x-axis after visual inspection, by selecting a threshold on the y-axis (the result is then similar to a DBSCAN clustering result with the same \(\varepsilon\) and \(min_{points}\) parameters; here a value of 0.1 may yield good results), or by different algorithms that try to detect the valleys by steepness, knee detection, or local maxima. From OPTICS Wikipedia page.

Algorithms for cluster extraction are Xi and DBSCAN (see OPTICS on Scikit-Learn).


Pros and cons

Pros


Resources

See: