One of the most common approaches to dimension reduction is a technique known as Principal Component Analysis (PCA). Given a very large amount of data, PCA extracts the principal components: new dimensions, built as combinations of the original ones, that capture the most variation in the data and so best describe the relationships in it (as determined by statistical correlation). The actual mathematics of PCA are complex, but I can walk you through the process here. You can follow along with the code and data in our GitHub repository.
We will start with a large amount of data on the sources of traffic to our website. Each record includes a lot of dimensions about our inbound traffic:
As you can see from the raw data file, our data has 11 dimensions, which is too many for us to analyze manually. Instead, we will use the statistical analysis tool R to run a PCA on the numeric data and see which dimensions drive the principal components. The results come in two forms.
The first output is a dimensional clustering of all the entries in our original data, which you can see below:
What PCA has done is map all of our raw entries into a new space, based on the relationships between the dimensions in the data. This means that neither the x-axis nor the y-axis represents any single column of our original table. While this might be hard to interpret at first, it is clear that most entries cluster in the lower right. They seem to fall along a horizontal line and a smaller vertical line, which indicates that more than one dimensional relationship is shaping the data. Also of note: entries #49 and #57 are significant outliers from the other entries and are worth investigating on their own.
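To make the idea of mapping entries into a new space concrete, here is a minimal Python sketch. The analysis in this post was run in R, and this is not the code from our repository. It standardizes a small made-up table, finds the first two principal components of its correlation matrix by power iteration, projects each entry onto them, and flags entries that sit far from the rest. The column meanings, the values, the helper names, and the "three times the median distance" cutoff are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def standardize(rows):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of rows that are already standardized."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_components(matrix, k=2, iters=500, seed=0):
    """Leading k eigenvectors via power iteration with deflation."""
    m = [row[:] for row in matrix]
    p = len(m)
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # The Rayleigh quotient gives the eigenvalue for the converged vector.
        mv = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
        eigenvalue = sum(a * b for a, b in zip(v, mv))
        components.append(v)
        # Deflate: remove this component so the next pass finds the next one.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
             for i in range(p)]
    return components


def project(z, components):
    """Coordinates of each entry along each principal component."""
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in z]


# Made-up entries: pages per session, avg. session duration (s), bounce rate (%).
rows = [
    [2.1, 95, 62], [2.4, 110, 58], [1.9, 80, 66], [2.8, 130, 52],
    [2.2, 100, 61], [2.6, 120, 55], [2.0, 90, 64], [2.5, 115, 57],
    [9.5, 610, 8],  # deliberately extreme entry
]

z = standardize(rows)
components = top_components(correlation_matrix(z))
scores = project(z, components)  # one (PC1, PC2) point per entry

# Standardized data is centered, so distance from the origin measures
# how far an entry sits from the bulk of the data in the new space.
distances = [math.hypot(pc1, pc2) for pc1, pc2 in scores]
cutoff = 3 * statistics.median(distances)  # illustrative threshold only
outliers = [i for i, d in enumerate(distances) if d > cutoff]
print(outliers)  # [8]: only the deliberately extreme last row is flagged
```

In practice you would lean on a library routine rather than hand-rolled power iteration; in R, `prcomp` is the usual starting point.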
Still, this alone is not enough to be useful. Luckily, the PCA gives us even more information! The second output is a breakdown of the dimensions themselves and their relationships:
The arrows in this chart represent the dimensions from our original data, e.g. “Sessions”. The directions they point are related to the clustering in the first chart, so you can see that the horizontal and vertical clusters were driven by the relationships between certain dimensions.
In this chart, the closer the arrows representing any two dimensions are, the more related those dimensions are. As we can see, the dimensions “Pages / Session” and “Avg. Session Duration” are highly related (arrows to the left), which we would expect: the more pages you visit per session, the longer you spend on the website. Also, “% New Sessions” and “Bounce Rate” are highly related (arrows on the right), which we should investigate to understand why new users would bounce more frequently than other users.
You can even see negative relationships when the arrows point in opposite directions. For example, “Avg. Session Duration” is negatively correlated with “Bounce Rate” (their arrows point in opposite directions), which makes sense since users who bounce spend less time on the website.
The most interesting finding here is that the New Users and Sessions dimensions are unrelated to all of the other dimensions! That gives us another interesting area for investigation.
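One way to sanity-check what the arrows suggest is to compute the pairwise correlations directly. The sketch below is in Python rather than R, and the numbers are invented for illustration (they are not taken from our data file); the `pearson` helper is likewise just a hand-written version of the standard Pearson correlation coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented values for eight hypothetical traffic sources.
traffic = {
    "Pages / Session": [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.0, 2.5],
    "Avg. Session Duration": [95, 110, 80, 130, 100, 120, 90, 115],
    "Bounce Rate": [62, 58, 66, 52, 61, 55, 64, 57],
    "Sessions": [300, 150, 200, 220, 120, 180, 260, 330],
}

pairs = [
    ("Pages / Session", "Avg. Session Duration"),  # arrows close together
    ("Avg. Session Duration", "Bounce Rate"),      # arrows in opposite directions
    ("Sessions", "Pages / Session"),               # unrelated dimensions
]

results = {}
for a, b in pairs:
    results[(a, b)] = pearson(traffic[a], traffic[b])
    print(f"{a} vs {b}: {results[(a, b)]:+.2f}")
```

With these made-up numbers the first pair comes out strongly positive, the second strongly negative, and the third close to zero, which mirrors the three situations described above.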
So, starting with a large amount of raw data with a lot of dimensions, we were able to use PCA to reduce those dimensions and extract some interesting areas for investigation:
- Why are entries #49 and #57 such outliers?
- Why are New Users and Sessions unrelated to the other dimensions?
- Why are New Sessions associated with higher Bounce Rates?
While our sample data here has only a few dozen entries, PCA really shines when you have hundreds or thousands of entries and it is not possible to examine the data using other techniques. I encourage you to try it on some of your data; often you will learn new things you didn’t expect!
If you would like to learn more about Principal Component Analysis, there is a great explanation and derivation available here. Tomorrow we’ll look at another approach to dimension reduction called Latent Class Analysis.
 There are other techniques, like Multiple Correspondence Analysis, to do a similar analysis for categorical variables.
Quote of the Day: “The last three letters of principal spell ‘pal’” – Screech on Saved By the Bell (TV)