Clustering: K-means Clustering, in Practice
Yesterday, I talked about the theory of k-means; today, let’s put it into practice using some sample customer sales data for the theoretical online table company we’ve talked about previously. Suppose we have collected data on our recent sales that we want to cluster into customer personas: age (years), average table size purchased (square inches), number of purchases per year, and amount per purchase (dollars). Plotting the data, we see that our customers might fall into a few interesting groupings.
In this case, it looks like the youngest and oldest customers are generally buying smaller, less expensive tables in lower volumes, while middle-aged customers are buying the larger models, sometimes in higher volumes.
Now let’s run the k-means algorithm on this data for a few different values of k (2, 3, and 4) to see what the algorithm produces.
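To make that step concrete, here is a minimal Python sketch of running k-means for several values of k. It is not the R code behind the plots in this post: the `kmeans` function, the `customers` list, and its (age, table size) values are all made up for illustration, and the features are left unscaled, so table size dominates the distance calculation.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical customers as (age in years, average table size); not the post's data.
customers = [
    (22, 400), (25, 450), (24, 420),     # younger buyers, smaller tables
    (45, 2160), (48, 2160), (50, 2160),  # middle-aged buyers, large tables
    (70, 500), (72, 480), (68, 460),     # older buyers, smaller tables
]

for k in (2, 3, 4):
    labels, centroids = kmeans(customers, k)
    print(k, labels)
```

Because the starting centroids are random, a different `seed` can produce a different split for the same k, which is one reason to rerun the algorithm a few times before reading too much into any one result.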
In all cases, the buyers of the 2160 cm^2 tables are in their own cluster, but the rest of the customers are a little more commingled depending on their characteristics. When k is equal to 2, the clusters look reasonable, but the customers buying smaller tables could likely be differentiated further. When k is equal to 3 and 4, these customers get split up into smaller segments.
So, what is the right value of k to choose? There aren’t great algorithmic approaches to answering this question, but a common practice is to run the k-means algorithm for different values of k and measure how much the error is reduced by adding more clusters. The tradeoff is that adding clusters reduces the error, but it also increases the risk of overfitting the data (in the extreme case, you end up with each data point in its own cluster!).
In our example, there is a massive drop in the error between k = 2 and k = 3, so we should feel pretty confident that there are at least 3 clusters. There is another drop between 3 and 4 clusters, but it is much smaller than the first. Further increases in k don’t seem to reduce the error by much, so in this case, I’d consider creating either 3 or 4 customer personas.
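The error-versus-k comparison can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not the R code used for the plot: `kmeans_sse`, `best_sse`, the `restarts` parameter, and the `customers` values are hypothetical, and the error is the sum of squared distances from each point to its cluster mean.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, seed=0, iterations=100):
    """Run a basic k-means and return its sum of squared errors (SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as starting centroids
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Recompute each centroid as the mean of its members (empty clusters stay put).
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(d) / len(members) for d in zip(*members)) if members else centroids[c])
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def best_sse(points, k, restarts=10):
    """Keep the lowest SSE over several random starts to soften bad initializations."""
    return min(kmeans_sse(points, k, seed=s) for s in range(restarts))

# Hypothetical customers as (age in years, average table size); not the post's data.
customers = [
    (22, 400), (25, 450), (24, 420),
    (45, 2160), (48, 2160), (50, 2160),
    (70, 500), (72, 480), (68, 460),
]

for k in range(1, 7):
    print(k, round(best_sse(customers, k), 1))
```

Reading the printed errors from top to bottom is the same exercise as reading the plot: look for the value of k after which the error stops dropping sharply. Setting k to the number of data points drives the error to zero, which is the overfitting extreme described above.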
If you are interested in seeing the R code I used to run the k-means algorithm and create these plots, everything is available on our Data Driven Daily GitHub page.
Tomorrow we will go back to theory and discuss a different type of clustering algorithm, agglomerative hierarchical clustering.
 For numeric data like that shown here, the error is usually measured as the sum of squared errors: the sum of the squared distances between each point and its cluster’s central value.
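Written out, and assuming the distance is Euclidean and the central value is the cluster mean (the symbols are my own notation: C_j is the set of points in cluster j and mu_j is its mean), the error looks like this:

```latex
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```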