Revision

Definition

See CS229 course on K-means.

Algorithm

Based on EM algorithm.

K-means++

K-means is really sensitive to initialisation. Bad initialisation can lead to poor results. Initialisation in K-means is done randomly.

K-means++ just modified a bit the initialisation. It works as follows:

Choose one center uniformly at random among the data points,
For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
Repeat Steps 2 and 3 until k centers have been chosen.

Now that the initial centers have been chosen, proceed using standard k-means clustering.

K-means++ part was extract from the Wikipedia page on K-means++.

Choosing K

Elbow-point
Validation set
Clustering quality metrics

Pros and Cons

Pros

Easy to understand

Cons

Can’t handle complex structure shapes

Resources

See CS229 course on K-means.