Definition
See CS229 course on K-means.
Algorithm
Based on EM algorithm.
K-means++
K-means is really sensitive to initialisation. Bad initialisation can lead to poor results.
Initialisation in K-means is done randomly.
K-means++ just modified a bit the initialisation. It works as follows:
- Choose one center uniformly at random among the data points,
- For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
- Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
- Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
K-means++ part was extract from the Wikipedia page on K-means++.
Choosing K
- Elbow-point
- Validation set
- Clustering quality metrics
Pros and Cons
Pros
Cons
- Can’t handle complex structure shapes
Resources
See CS229 course on K-means.