Clustering: K-means Clustering, in Theory

This is part 2 of our series on Clustering, previous segments are available in our archives.

Perhaps the best place to start with the k-means clustering algorithm is to break down its name, as it helps understand what the algorithm is doing. Let’s start at the beginning; “k” refers to the number of clusters that will be created by the algorithm. So, even though the algorithm is unsupervised in how it creates the clusters, it does take the number of clusters (k) you want to create as an input. The term “means” in k-means refers to how each data point is joined to a cluster – each data point is assigned to the cluster with the closest mean. Putting it all together, k-means clustering gives you “k” clusters of data points, where each data point is assigned to the cluster its closest to.

The algorithm starts by choosing “k” points as the initial central values (often called centroids) [1]. Next, every point in the data is assigned to the central value it is closest to. Now every point is assigned a cluster, but we need to check if the initial guesses of central values are the best ones (very unlikely!). The way to check that is to compute the new central value of each cluster – if all of the recomputed central values are the same as the original ones, then you are at the best solution and the algorithm can stop. Otherwise, the algorithm tries again by reassigning points to the newly computed central values. This will continue until the recomputed central values don’t change.

The result of the algorithm is “k” clusters, where each of the data points you have is assigned uniquely to one, and only one, cluster. That is why this is called a “partitioned”, or unnested, algorithm – every point is only in a single cluster.

Today was all theory – tomorrow, let’s see k-means in practice.

 

[1] One of the challenges with k-means is determining where to start. There are a number of different algorithms just to solve this alone, for example, choosing a random subset of values and taking the mean of those. Computer programs that compute k-means should be able to do this initialization for you.