Yesterday we reviewed how to use Principal Component Analysis for dimension reduction, and in doing so identified the most important dimensions in some data about traffic to a sample website. However, I'm sure that, as an astute reader, you noticed the only dimensions the PCA actually analyzed were the numeric columns! If you look at the data again, you'll notice that not all of our data is numeric:
The Age and Source columns are text values and were completely ignored in our PCA! That is because PCA works on numeric values and not categorical (text) values. To analyze the dimensions of non-numeric data we’ll need a different tool…
Latent Class Analysis (LCA) is just such a tool, designed to find the latent relationships (classes) between the dimensions of your data. We’ve once again added an example of how to run an LCA using R to our Github repository so you can follow along.
You provide the model with the number of classes you want as output, and the LCA will analyze the categorical columns in the data, looking for similarities. With our example data and an input of three classes, you can visualize one of the resulting classes like this:
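If you want to tinker beyond the repository example, here is a minimal sketch of the same kind of analysis in R using the poLCA package. The data frame name `traffic` is an assumption standing in for our example data, and poLCA expects the categorical columns to be coded as factors or as integers starting at 1:

```r
# A minimal sketch, assuming poLCA is installed (install.packages("poLCA"))
# and a hypothetical data frame `traffic` whose Age and Source columns
# are categorical codes starting at 1.
library(poLCA)

# The formula lists the categorical (manifest) columns to analyze;
# `~ 1` means we fit the latent classes with no extra covariates.
f <- cbind(Age, Source) ~ 1

# Ask for three latent classes, as in the example above.
model <- poLCA(f, data = traffic, nclass = 3)

# model$P holds each class's share of the entries, and model$probs
# holds, per class, the probability of each Age/Source value --
# the numbers behind the chart above.
plot(model)
```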
The title tells us that Class 3 makes up 52.7% of the entries in our data (over half). The chart displays which values of our categories (Age and Source) contributed to the creation of this class: in this case Source 7 (Facebook) and Age groups 3-6 (ages 25-64). By looking at all three classes in the full output (available here) you can see which dimensions are driving the creation of each latent class, which indicates they are the most interesting dimensions. In other words, Class 3 mostly consists of people between the ages of 25 and 64 who arrived via Facebook.
In Review: Dimension reduction is a critical tool for figuring out where to start looking for drivers in a large amount of high-dimensional data. We’ve covered two techniques (PCA and LCA) which work in different circumstances on different kinds of data. Combined with the component analysis techniques we discussed last week, you have everything you need to get to the bottom of big changes in your metrics quickly.
Next Week: Churn! Understanding churn and why it’s so hard to calculate is one of the most requested topics, so we’ll dive in and arm you with the knowledge to tackle it. Also, I will make some jokes about churning butter.
If that sounds familiar, it’s because it is! LCA is very similar to the clustering techniques we covered previously, many of which can be used in a similar fashion. In those cases the relationships were called clusters, not classes.