# Clustering: Agglomerative Hierarchical Clustering, in Theory

This is part 4 of a 5-part series on Clustering.

I spent the past few days talking about a partitional clustering algorithm, k-means. Today, I’ll talk about agglomerative hierarchical clustering algorithms. As with k-means, let’s break down the name of the algorithm to get a better idea of what it does. Start with the word “hierarchical”. This refers to the overall type of the clustering algorithm – in this case, it means that the algorithm creates clusters by continually nesting data points. The word “agglomerative” describes the type of hierarchical clustering we are doing. There are two basic approaches to hierarchical clustering: agglomerative and divisive. Agglomerative clustering, the more common approach, builds the hierarchy from the bottom up. In other words, each data point starts as its own cluster, and clusters are joined together to create larger ones. Divisive clustering builds the hierarchy from the top down. In other words, all data points start in a single cluster, which is then repeatedly split apart to create smaller clusters.

The algorithm works by repeatedly merging the two closest clusters until only one cluster remains. How “closest” is measured between two clusters depends on the chosen linkage criterion – for example, the distance between their two nearest members (single linkage) or their two farthest members (complete linkage). The result of agglomerative hierarchical clustering is a record of how the clusters were merged together each step of the way.
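The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration using single linkage (the distance between two clusters is the distance between their closest members); the function and variable names are my own, not from any library, and a real implementation would use a smarter data structure than this brute-force search.

```python
# Minimal sketch of agglomerative hierarchical clustering (single linkage).
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points):
    # Step 1: every data point starts as its own cluster.
    clusters = [[p] for p in points]
    merges = []  # record of which clusters were joined, step by step
    while len(clusters) > 1:
        # Step 2: find the pair of clusters whose closest members
        # are nearest to each other (single linkage).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                euclidean(a, b)
                for a in clusters[pair[0]]
                for b in clusters[pair[1]]
            ),
        )
        merges.append((clusters[i], clusters[j]))
        # Step 3: merge them; previous merges are never undone.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return merges

merges = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)])
# The two tight pairs merge first; the final step joins everything.
```

With n data points the loop always performs exactly n - 1 merges, which is why the output is a full history of the nesting rather than a single flat partition.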

The biggest differences between k-means and agglomerative hierarchical clustering come from their core approaches to solving the problem. In k-means clustering, data points can move between clusters as the algorithm updates its central values in each iteration; the agglomerative hierarchical clustering algorithm, by contrast, never undoes a previous merge. Another difference is that k-means, because it starts from guesses for its initial central values, can give different answers each time you run it with the same value of k. Agglomerative hierarchical clustering, on the other hand, will always produce the same result, because the distances between the data points do not change.

Tomorrow, I’ll show you an example of agglomerative hierarchical clustering in action!