Metric Component Analysis
This is part 1 of a 9 part series on Metric Component Analysis.
Quick, what was your total revenue last month? I’m sure you were able to answer that question; understanding your Key Performance Indicators (KPIs) is one of the first steps in using data to make better decisions. Metrics like total revenue, revenue growth and gross margins are the basic tools of running a business.
However, few companies can list the primary drivers of those metrics. Which customer segments contributed the most to revenue growth? Which products had the highest gross margins in the past two months? What segments are most likely to reduce total revenue this month? The answers to these questions are the drivers of your metrics, but finding those answers can feel like trying to solve a mystery.
Understanding the drivers of your metrics is known as component analysis. The components for your metrics are all of the different dimensions and customer segments you can use to break down the metric, such as Revenue by Country and Gross Margin by Product. Even a moderately sized business may have dozens of dimensions and hence hundreds of components, which can make component analysis difficult. Luckily, we can help you improve your detective skills and find the answers more quickly!
This topic is complex, so it will take two weeks to cover everything. We’ll start this week by understanding how to break down metrics into their components, which we’ll then analyze next week to find drivers.
Tomorrow we will get started by trying to solve The Case of the Missing Revenue!
 Principal Component Analysis, which you may have heard of already, refers to a specific statistical technique for automatically extracting the primary drivers of a given data set. We’ll cover that next week.
Metric Component Analysis: Sum Metrics
This is part 2 of a 9 part series on Metric Component Analysis.
The Case of the Missing Revenue
Before you can jump in and determine the drivers of any given metric, you need to understand how that metric is calculated. Why, you ask? Because how a metric is calculated determines what kinds of changes will cause it to change in different ways. We’ll cover a few examples of calculated metrics in the coming days, starting today with a sum metric via The Case of the Missing Revenue!
Total Revenue is a sum metric, calculated by adding up all the revenue you earn from your business. Our case begins with the following evidence:
Something happened on January 18 that led to a drop in Total Revenue! We need to get to the bottom of this problem, but where do we start?
Sum metrics are the simplest form of metric, as they are simply a summation of components.
We can take any dimension, and the sum of revenue across that dimension’s values will equal Total Revenue. For example, if we add the revenue from each country we will get the Total Revenue. Similarly, if we add the revenue for each product we will get the same Total Revenue. This means that any change in the total metric is simply the sum of the changes in its components (along a given dimension).
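As a minimal sketch of this property, the Python snippet below uses a handful of made-up order records (the countries, products and amounts are purely illustrative, not the data behind the charts in this post). It checks that the totals agree whichever dimension we sum over, and then finds the component whose change accounts for most of the drop.

```python
from collections import defaultdict

# Column positions in each hypothetical order record.
DAY, COUNTRY, PRODUCT, REVENUE = range(4)

# Made-up data for illustration only.
orders = [
    ("Jan 17", "United States", "A", 500.0),
    ("Jan 17", "United States", "B", 300.0),
    ("Jan 17", "France", "A", 200.0),
    ("Jan 17", "Germany", "B", 250.0),
    ("Jan 18", "United States", "A", 250.0),
    ("Jan 18", "United States", "B", 150.0),
    ("Jan 18", "France", "A", 210.0),
    ("Jan 18", "Germany", "B", 240.0),
]

def revenue_by(day, dimension):
    """Sum revenue for one day, grouped by the given dimension column."""
    totals = defaultdict(float)
    for row in orders:
        if row[DAY] == day:
            totals[row[dimension]] += row[REVENUE]
    return dict(totals)

# The same total falls out of any dimension we choose to sum over.
total_jan17 = sum(row[REVENUE] for row in orders if row[DAY] == "Jan 17")
by_country = revenue_by("Jan 17", COUNTRY)
by_product = revenue_by("Jan 17", PRODUCT)

# The change in the total is the sum of the changes in its components.
before = revenue_by("Jan 17", COUNTRY)
after = revenue_by("Jan 18", COUNTRY)
changes = {c: after.get(c, 0.0) - before.get(c, 0.0)
           for c in before.keys() | after.keys()}
biggest_drop = min(changes, key=changes.get)

print(by_country, by_product, total_jan17)
print(changes, biggest_drop)
```

Sorting `changes` by value gives the same ranking you would read off a stacked area chart, which is handy once a dimension has too many values to chart.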
If we chart Total Revenue by country in a stacked area chart, we can check if there are any obvious changes:
Using this view, we can tell that the drop in Total Revenue on January 18 was caused by a drop in revenue in the United States (orange). This stacked chart also helps us understand how the various components combine to create the top level metric, represented by the top of the area (the top of France’s green area).
Case solved! Okay, I admit this was a really easy case, but sum metrics are very simple. Tomorrow we’ll get more advanced when we examine mean metrics in The Case of the Lower Conversions!
 Assuming you know which dimensions to look at. We’ll cover that next week!
Metric Component Analysis: Mean Metrics
This is part 3 of a 9 part series on Metric Component Analysis.
The Case of the Lower Conversions
Today we’ll look at how to break down a more complex calculated metric, the mean, in The Case of the Lower Conversions! Our case starts with a drop in our conversion rate from visit to purchase, which is the percentage of visits that result in a purchase. The drop is obvious in the following evidence:
As you can see, the conversion rate drops on February 4th from around 50% to almost 45%. That’s a drop of 5 percentage points (roughly 10% in relative terms) in just one day – and a drop that persisted for the rest of the week! We need to solve this case, but where should we start?
When we examined sum metrics, we were able to find the answer by charting all the components for a given dimension, so let’s try that again here. Let’s plot the conversion rate for each product we sell as a line chart:
Hmmm… that is not as useful as the chart yesterday. I see that Product A’s conversion rate increased on the 4th, not dropped, while all of the other products dropped. What is going on?
Mean metrics are calculated in a different way than the sum metrics we discussed yesterday. In this particular case we are using a weighted mean. While the sum metric adds each individual component, a weighted mean adds a proportionate, or weighted, amount of each component – for example, the proportion of the total population:
Hence, charting the conversion rate for each product isn’t enough information – we also need to know how large a share of the population each product represents. So to investigate this case, we need to see what percentage of transactions each product was involved in:
| Product | Percent of Transactions |
A massive change in a component that is only a very small part of the total population will barely move the weighted mean, and neither will a very small change in a massive part of the population. For example, a big change in Product A likely won’t change the overall metric since it is such a small part of the population. We need to look for significant changes in significant populations.
To make this clear, let’s only consider the products with the largest (Product F) and smallest (Product A) populations in our chart above:
As you can see, Product A’s conversion rate actually increased when the overall metric decreased! However, since its population was so small it did not move the overall metric. Product F is a large part of the population, so its dip in conversion rate on February 4th pulled down the overall metric. Product F is the culprit!
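To make the weighting concrete, here is a small Python sketch with invented numbers for just two products. It assumes each product’s weight is its share of visits and that those shares stay the same on both days. These are illustrative assumptions, not the figures behind the charts above.

```python
# Hypothetical (visits, purchases) per product on each day.
before = {"Product A": (100, 40), "Product F": (900, 460)}
after = {"Product A": (100, 60), "Product F": (900, 390)}

def weighted_mean_rate(day):
    """Overall conversion rate as a population-weighted mean of product rates."""
    total_visits = sum(visits for visits, _ in day.values())
    return sum((visits / total_visits) * (purchases / visits)
               for visits, purchases in day.values())

def contributions(day_before, day_after):
    """Each product's share of the population times the change in its rate.

    Assumes the population shares are unchanged between the two days.
    """
    total_visits = sum(visits for visits, _ in day_before.values())
    result = {}
    for name, (visits, purchases) in day_before.items():
        share = visits / total_visits
        visits_after, purchases_after = day_after[name]
        rate_change = purchases_after / visits_after - purchases / visits
        result[name] = share * rate_change
    return result

overall_before = weighted_mean_rate(before)
overall_after = weighted_mean_rate(after)
contrib = contributions(before, after)

print(overall_before, overall_after)
print(contrib)
```

In this made-up data Product A’s rate rises by 20 percentage points but contributes only about +2 points to the overall metric, while Product F’s smaller dip contributes about -7 points, which is why the overall rate falls.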
Case closed! This was a little harder, but we are just warming up. Tomorrow we’ll cover how to break down another type of metric in The Case of the Disappearing Users!
 A stacked area chart does not make much sense in this case because adding conversion rates does not have the same meaning as adding revenue.
Metric Component Analysis: Median Metrics
This is part 4 of a 9 part series on Metric Component Analysis.
The Case of the Disappearing Users
Today we’ll investigate median metrics, which are more robust than mean metrics as they are less sensitive to extreme values in the data they are summarizing (see our post on means and medians for more information). However, they are complex in other ways, as will become clear when we start solving this case. We’ll start with the following evidence, which is a chart of the Median Session Length (in seconds) of users on our website:
Clearly something happened on March 4th that is causing users’ median time on our site to drop by almost 25%, from around 160s to under 120s. Where shall we start? When we looked at mean metrics, the most useful view was of the largest population component (and smallest), which we can view below for the Source dimension (where the users came from). In this case, our largest source was Google and our smallest was Twitter:
Hmmm… that is not as useful as it was for mean metrics. In fact, neither of those moved in ways even remotely like the overall metric! This is because medians are not calculated in the same way, and in fact aren’t really calculated at all. The median is simply the middle value after sorting a set of values:
Since medians are only concerned with the order of values, the top or bottom values can change significantly without affecting the median, as long as they stay on the same side of the middle value. For example, Google might account for a large portion of users, but if their values are all in the top quartile, changes that keep them above the middle will not alter which value sits in the middle, so the median is unaffected. With medians, we need to focus on the composition of our users first and the metrics second.
Let’s look at how the population of sources (share of users) has changed around the time of the drop in Median Session Length:
| Source | Share as of March 3rd | Share as of March 4th |
Ah, this gives us more insight. Between March 3rd and 4th there was a big shift in the share of users between Facebook and Pinterest. Let’s look at those two sources compared to the overall metric (which is in faint blue so you can see the overlap):
Ah-ha! It looks like the Median Session Length shifted from the Facebook metric to the Pinterest metric on March 4th because Pinterest users became the median users instead of Facebook users. The population analysis was critical since median metrics are so dependent on populations.
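The same effect can be reproduced with a few invented session lengths. In the Python sketch below (all values are hypothetical), inflating the already-long Google sessions leaves the overall median untouched, while shifting users from Facebook to Pinterest moves it sharply even though each source’s own median barely changes.

```python
from statistics import median

# Hypothetical session lengths in seconds, grouped by source.
day_before = {
    "Google": [300, 310, 320],
    "Facebook": [150, 158, 160, 170],
    "Pinterest": [110, 120],
}

def overall_median(day):
    """Median over every session, regardless of source."""
    return median(v for sessions in day.values() for v in sessions)

# Scenario 1: the top values change a lot, but the ordering around the middle does not.
google_doubled = dict(day_before, Google=[600, 620, 640])

# Scenario 2: the mix of users shifts from Facebook to Pinterest.
population_shift = {
    "Google": [300, 310, 320],
    "Facebook": [160],
    "Pinterest": [100, 105, 110, 115, 118],
}

median_before = overall_median(day_before)
median_google_doubled = overall_median(google_doubled)
median_after_shift = overall_median(population_shift)

print(median_before, median_google_doubled, median_after_shift)
for source in day_before:
    print(source, median(day_before[source]), median(population_shift[source]))
```

Comparing each source’s share of sessions before and after, as in the table above, is what points to the shift. The per-source medians alone would not.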
Case closed! These kinds of mysteries can be fun, but only if you find the right answer. Finding that right answer gets harder as our metrics get more complex, so tomorrow we’ll cover how to do this kind of breakdown with even more complex metrics.
 If the data has an even number of values, then the median is computed as the mean of the two middle values.
Metric Component Analysis: Complex Metrics
This is part 5 of a 9 part series on Metric Component Analysis.
This week we’ve covered how to break down various types of metrics, including sums, means and medians. You will, of course, have some metrics that are more complex and are harder to break down into their components.
For example, churn is a difficult metric to calculate because there are many different factors involved. A typical calculation of churn might look like the following:
Unsurprisingly, many companies struggle to break their churn down into components and hence have trouble doing further analysis on what is driving churn. While I cannot give you any single method to break down these metrics, as it depends on how you calculate them, there are some common lessons you can apply.
- Think in Components. Everything in your business is built of components, including your metrics. Use your intuition to help identify the components that should comprise your complex metrics and then determine how to break them down into those components. If you know what you are looking for, it’ll be easier to find.
- Represent Metrics as Series. You’ve probably noticed that this week I’ve used a series representation for each metric calculation. This is not a coincidence: if you can transform a metric formula into a mathematical series, it becomes easier to break it down into components.
- Track Populations. If you only track metric values, but not the populations represented by different customer segments, it may be hard or impossible to break down your metrics into components. As we have seen this week, sometimes the population is the most important factor you’ll have in your breakdown.
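As a reminder of what such series look like, the sum and weighted mean metrics described earlier in the week can be written as follows. The symbols are my own notation for this sketch: m_i is the value of component i, n_i is the population of component i, and N is the total population.

```latex
\text{Sum metric:}\qquad M = \sum_{i} m_i

\text{Weighted mean metric:}\qquad M = \sum_{i} \frac{n_i}{N}\, m_i,
\qquad N = \sum_{i} n_i
```

In both forms each component appears as its own term, which is what makes the breakdown possible. The weighted mean also shows why the populations n_i need to be tracked alongside the metric values.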
Easy, right? Nothing to it.
Hey, wait a second…
Okay, maybe we aren’t finished yet. This week we’ve only covered how to break down metrics into components. It’s been quite convenient that I always knew exactly what dimensions to use when breaking down our metric to find the right answers! In practice, the hardest part of understanding metric drivers is identifying those dimensions.
Next week we will dive into methods to do exactly that. They can get complex, but all of our detective work from this week should have you ready to tackle harder cases!
 Churn calculations are so complex we’ll cover a variety of them in the coming weeks. However, if you don’t want to wait I recommend this blog post about different ways to calculate churn.
Metric Component Analysis: Conducting Analysis
This is part 6 of a 9 part series on Metric Component Analysis.
Last week we explored how breaking down metrics into components made it easy to find the drivers when our metrics changed. However, in every case I magically knew exactly which dimensions to look at! In the real world, the hardest part of solving these mysteries is finding the right dimensions to analyze.
If you think about any metric in your business there are dozens of different dimensions you might analyze. For example, you might be able to analyze Revenue by Country, Product, Customer, Channel, etc. Most metrics will have far too many dimensions for you simply to look at them all.
Luckily, there are a number of powerful statistical tools you can use to analyze your data and identify the most important dimensions. They have fancy names like Principal Component Analysis and Latent Class Analysis, but they all focus on taking high dimensional data and extracting only the most important dimensions. We will not go into heavy detail but I will make sure you understand how they work so you can make an educated decision about which tools you should use.
This is a short week, so we will highlight two of the most important techniques.
Tomorrow we’ll start with a high level description of how dimension reduction works.
Metric Component Analysis: Dimension Reduction
This is part 7 of a 9 part series on Metric Component Analysis.
One of the biggest challenges of Component Analysis is that metrics can have so many components! For example, Revenue in the United States is a single metric, but if you break it down by State you now have 50 component metrics (Revenue per US State). If you have 10 products, you can again break Revenue into 10 component metrics (Revenue per Product). If you break Revenue down by both US State and Product, you end up with 500 component metrics!
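The arithmetic is easy to check in a few lines of Python. The state and product names below are placeholders, and the extra Channel dimension with four values is a hypothetical addition to show how quickly the count grows.

```python
from itertools import product

# Placeholder names; only the counts matter here.
states = [f"State {i}" for i in range(1, 51)]
products = [f"Product {i}" for i in range(1, 11)]

# One component metric per (state, product) pair.
components = list(product(states, products))
print(len(components))  # 500

# Hypothetical third dimension: every extra dimension multiplies the count.
channels = ["Web", "Mobile", "Retail", "Partner"]
components_with_channel = list(product(states, products, channels))
print(len(components_with_channel))  # 2000
```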
The process of reducing all of those options down to the few most important dimensions is called dimension reduction. Doing so can involve some complex mathematics, but let’s start with a general explanation of how those approaches work. Normally you would be working with dozens of dimensions, but I can’t chart those easily so we’ll use a simple example of only two dimensions.
Let’s take an example where we have Revenue for a large number of Products and a large number of Countries. While there are a lot of combinations of Product and Country, we can chart them in two dimensions such as the following:
It’s not obvious from this chart whether Product or Country is more interesting when finding patterns in Revenue. In fact, it’s hard to read anything from this chart at all! However, we can map this two-dimensional chart onto each of its one-dimensional sub-dimensions as follows:
This allows us to view each dimension independently. For the Country dimension, there are no clear patterns as everything is fairly evenly distributed:
But the Product dimension shows two clear clusters developing:
If we were doing an investigation, we would start looking at the Product dimension first, as these two clusters might indicate an interesting behavior.
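One crude way to automate this eyeballing is to project the points onto each axis and look for a large empty gap. The Python sketch below does that with invented coordinates. The `largest_gap_ratio` helper is a hypothetical heuristic of my own for illustration and is far simpler than the statistical tools covered in the following posts.

```python
def largest_gap_ratio(values):
    """Largest gap between neighbouring sorted values, as a fraction of the range.

    A value near 1 suggests two separated clusters; a small value suggests
    the points are spread fairly evenly along this dimension.
    """
    ordered = sorted(values)
    span = ordered[-1] - ordered[0]
    if span == 0:
        return 0.0
    return max(b - a for a, b in zip(ordered, ordered[1:])) / span

# Made-up points: (position on the Product axis, position on the Country axis).
points = [
    (1.0, 2.0), (1.2, 8.0), (0.9, 5.0), (1.1, 3.5),
    (5.0, 6.5), (5.2, 1.0), (4.9, 9.0), (5.1, 4.2),
]

# Projecting onto one axis just means ignoring the other coordinate.
product_axis = [x for x, _ in points]
country_axis = [y for _, y in points]

product_score = largest_gap_ratio(product_axis)
country_score = largest_gap_ratio(country_axis)
print(product_score, country_score)
```

With these points the Product projection has one large gap and the Country projection does not, so Product is the dimension to investigate first.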
Obviously, this is an extremely simplistic example! While we did this in two dimensions, it’s the same process you would use on data with dozens of dimensions, where you map them onto small sets of dimensions looking for the interesting characteristics or behaviors.
By breaking down the higher dimension data into sub-dimensions, we are doing dimension reduction (well-named, huh?)! With dozens of dimensions it’s not possible to do this manually, so you will use some complex mathematical tools to do it for you. All you need to understand is how they work and when to use them.
The first one we will cover is called Principal Component Analysis and we’ll cover it tomorrow!
 If you want to know how complex this can get, check out our series on Combinatorics!
Metric Component Analysis: Principal Component Analysis
This is part 8 of a 9 part series on Metric Component Analysis.
One of the most common approaches to dimension reduction is a technique known as Principal Component Analysis (PCA). Given a very large amount of data, PCA will extract the principal components, which are new dimensions, built as combinations of the original ones, that capture the most variation in the data (derived from the statistical correlations between the original dimensions). The actual mathematics of PCA are complex, but I can walk you through the process here. You can follow along with the code and data in our GitHub repository.
We will start with a large amount of data on the sources of traffic to our website. Each record includes a lot of dimensions about our inbound traffic:
As you can see from the raw data file, there are 11 dimensions in our data, which is too many for us to analyze manually. Instead, we will use the statistical analysis tool R to execute a PCA on the numeric data to see which dimensions are the principal components. The results come in two forms.
The first output is a dimensional clustering of all the entries in our original data, which you can see below:
What the PCA has done is map all of our raw entries into a new space – meaning that neither the x- nor the y-axis represents any single column of the original data in our table – based on the relationships between the dimensions in the data. While this might be hard to interpret at first, it is clear that there is a significant clustering of most entries in the lower right. They seem to be clustered in a horizontal line and a smaller vertical line, which indicates that there is more than one dimensional relationship shaping the data. Also of note is that entries #49 and #57 are significant outliers from the other entries and are worth investigating on their own.
Still, this alone is not enough to be useful. Luckily, the PCA gives us even more information! The second output is a breakdown of the dimensions themselves and their relationships:
The arrows in this chart represent the dimensions from our original data, e.g. “Sessions”. The directions they point are related to the clustering in the first chart, so you can see the horizontal and vertical clusters were driven by the relationships between certain dimensions.
In this chart, the closer the arrows representing any two dimensions, the more related those dimensions are. As we can see, the dimensions “Pages / Session” and “Avg. Session Duration” are highly related (arrows to the left), which we would expect, as the more pages you visit per session the longer you spend on the website. Also, “% New Sessions” and “Bounce Rate” are highly related (arrows on the right), which we should investigate to understand why New Users would Bounce more frequently than other users.
You can even see negative relationships when the arrows are pointing in different directions. For example, “Avg. Session Duration” is negatively correlated with “Bounce Rate” (their arrows point in opposite directions) which makes sense since users that bounce will spend less time on the website.
The most interesting finding from this is that the New Users and Sessions dimensions are unrelated to all of the other dimensions! That gives us another interesting area for investigation.
So, starting with a large amount of raw data with a lot of dimensions we were able to use PCA to reduce those dimensions and extract some interesting areas for investigation:
- Why are entries #49 and #57 such outliers?
- Why are New Users and Sessions unrelated to the other dimensions?
- Why are New Sessions leading to higher Bounce Rates?
While our sample data here only has a few dozen entries, PCA really shines when you have hundreds or thousands of entries and it is not possible to examine the data using other techniques. I encourage you to try it on some of your data; often you will learn new things you didn’t expect!
If you would like to learn more about Principal Component Analysis, there is a great explanation and derivation available here. Tomorrow we’ll look at another approach to dimension reduction called Latent Class Analysis.
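If you want a feel for the mechanics without R, the following is a rough, standard-library-only Python sketch. It is not the code from our repository: it finds only the first principal component, uses power iteration on the correlation matrix rather than the full decomposition a real PCA routine performs, and runs on three invented columns whose names are borrowed from the chart above.

```python
import math

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
    return [(x - mean) / sd for x in column]

def correlation_matrix(columns):
    """Pearson correlation between every pair of columns."""
    z = [standardize(col) for col in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_principal_component(corr, n_iter=200):
    """Power iteration: returns (loadings, share of total variance explained)."""
    size = len(corr)
    # Uneven starting vector so it is unlikely to be orthogonal to the answer.
    v = [1.0 / (i + 1) for i in range(size)]
    for _ in range(n_iter):
        w = [sum(corr[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * corr[i][j] * v[j]
                     for i in range(size) for j in range(size))
    # The eigenvalues of a correlation matrix sum to the number of columns.
    return v, eigenvalue / size

# Invented values for illustration only.
names = ["Pages / Session", "Avg. Session Duration", "Bounce Rate"]
columns = [
    [2.0, 3.0, 4.0, 5.0, 6.0],
    [60.0, 95.0, 118.0, 160.0, 175.0],
    [0.70, 0.62, 0.50, 0.41, 0.35],
]

loadings, explained = first_principal_component(correlation_matrix(columns))
for name, loading in zip(names, loadings):
    print(f"{name}: {loading:+.2f}")
print(f"share of variance explained: {explained:.2f}")
```

Loadings with the same sign play the role of arrows pointing the same way in the chart, and opposite signs correspond to arrows pointing in opposite directions. For real work, use an established PCA implementation rather than this sketch.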
 There are other techniques, like Multiple Correspondence Analysis, to do a similar analysis for categorical variables.
Metric Component Analysis: Latent Class Analysis
This is part 9 of a 9 part series on Metric Component Analysis.
Yesterday we reviewed how to use Principal Component Analysis for dimension reduction, and in doing so identified the most important dimensions in some data about traffic to a sample website. However, as an astute reader I’m sure you noticed that the only dimensions the PCA actually analyzed were the numeric columns! If you look at the data again, you’ll notice that not all our data was numeric:
The Age and Source columns are text values and were completely ignored in our PCA! That is because PCA works on numeric values and not categorical (text) values. To analyze the dimensions of non-numeric data we’ll need a different tool…
Latent Class Analysis (LCA) is just such a tool, designed to find the latent relationships (classes) between dimensions of your data. We’ve once again added an example of how to run an LCA using R to our GitHub repository so you can follow along.
You provide as input to the model the number of classes you want as output, and the LCA will analyze the categorical columns in the data, looking for similarities. With our example data and an input of three classes, you can visualize the results of one of the resulting classes like this:
The title tells us that Class 3 makes up 52.7% of the entries in our data (over half). The chart is displaying which values for our categories (Age and Source) contributed to the creation of this class, in this case Source 7 (Facebook) and Age groups 3-6 (Ages 25-64). In other words, Class 3 mostly consists of people between the ages of 25 and 64 who use Facebook. By looking at all classes in the full output (available here) you can see which dimensions are driving the creation of each latent class, which indicates they are the most interesting dimensions.
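To show roughly what happens under the hood, here is a bare-bones Python sketch of a latent class model fitted with the EM algorithm. It is not the R code from our repository. The records are invented, the `fit_lca` function is a hypothetical helper written for this post, and a real LCA package adds safeguards such as multiple random restarts and model-fit statistics.

```python
import random

def fit_lca(records, n_classes, n_iter=50, seed=0):
    """EM for a latent class model over categorical columns.

    Each class has its own probability table per column, and columns are
    treated as independent within a class.
    """
    rng = random.Random(seed)
    n_cols = len(records[0])
    levels = [sorted({rec[j] for rec in records}) for j in range(n_cols)]

    # Start from a random soft assignment of every record to the classes.
    resp = []
    for _ in records:
        w = [rng.random() for _ in range(n_classes)]
        resp.append([x / sum(w) for x in w])

    for _ in range(n_iter):
        # M-step: class shares and per-class value probabilities.
        shares = [sum(r[k] for r in resp) / len(records) for k in range(n_classes)]
        probs = []
        for k in range(n_classes):
            tables = []
            for j in range(n_cols):
                counts = {v: 1e-6 for v in levels[j]}  # tiny smoothing avoids zeros
                for rec, r in zip(records, resp):
                    counts[rec[j]] += r[k]
                total = sum(counts.values())
                tables.append({v: c / total for v, c in counts.items()})
            probs.append(tables)
        # E-step: probability that each record belongs to each class.
        resp = []
        for rec in records:
            w = []
            for k in range(n_classes):
                p = shares[k]
                for j, value in enumerate(rec):
                    p *= probs[k][j][value]
                w.append(p)
            resp.append([x / sum(w) for x in w])
    return shares, probs, resp

# Invented (age group, source) records for illustration only.
records = (
    [("25-34", "Facebook")] * 5 + [("35-44", "Facebook")] * 4
    + [("18-24", "Twitter")] * 4 + [("18-24", "Google")] * 3
)

shares, probs, resp = fit_lca(records, n_classes=2)
print("class shares:", [round(s, 3) for s in shares])
for k, tables in enumerate(probs):
    print("class", k, [max(t, key=t.get) for t in tables])
```

The class shares correspond to the percentage in the chart title, and the per-class probability tables correspond to the bars showing which Age and Source values make up each class.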
In Review: Dimension reduction is a critical tool in figuring out where to start looking for drivers in a large amount of high dimension data. We’ve covered two techniques (PCA and LCA) which work in different circumstances on different kinds of data. Combined with the component analysis techniques we discussed last week you have everything you need to get to the bottom of big changes in your metrics quickly.
Next Week: Churn! Understanding churn and why it’s so hard to calculate is one of the most requested topics, so we’ll dive in and arm you with the knowledge to tackle it. Also, I will make some jokes about churning butter.
 If that sounds familiar, it’s because it is! LCA is very similar to the clustering techniques we covered previously, many of which can be used in a similar fashion. In those cases the relationships were called clusters, not classes.