Data Silos: Correlations

This is part 5 of our series on Data Silos. Previous segments are available on our archives page. Have you found this series useful? If so, share the Data Driven Daily with someone else!


We’ve covered correlations previously in a number of different ways, including finding insights. If you remember, a correlation measures how closely two metrics move together (if you don’t remember, see our series on Simple Statistics). That’s great, but how can correlations help us join data from different data sources and break down data silos?

When it’s not possible to join two data sources together using either unique identifiers or fingerprinting, you can instead think of them as two different sources of metrics. As long as the metrics from each source are measured over the same time intervals (daily counts, for example), we can look for correlation relationships between them and use those relationships to infer joins in the source data.
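To make the idea concrete, here is a minimal sketch of measuring the correlation between one metric from each system. It assumes the Pearson correlation coefficient is the measure in use and that both metrics are daily counts covering the same seven days; the function name `pearson` and the numbers in `emails_sent` and `support_tickets` are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

# Hypothetical daily counts, one value per day over the same seven days.
emails_sent = [100, 120, 400, 380, 150, 110, 105]   # from the email system
support_tickets = [10, 12, 35, 40, 18, 11, 9]       # from the support system

r = pearson(emails_sent, support_tickets)
print(f"correlation: {r:.2f}")
```

Note that the two lists share nothing except the days they cover. No customer ID or fingerprint is needed, only a common time axis.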

For example, let us assume that we want to understand how our email marketing campaigns affect our customer support traffic but have neither unique identifiers for a join nor reliable fingerprints to use. Consider the chart below, which plots two recent email marketing campaigns and two types of customer support requests:


If we measure the correlation coefficients between all of these metrics, we find that Email Campaign 1 (blue line) and Support Request: Discount Error (purple line) are strongly correlated, with a coefficient of 0.85. A correlation alone does not tell us why, but something about that campaign appears to be related to the rise in those requests, and we can delve into the individual records in our customer support and email systems to try to figure out the cause. There is no clear relationship between Email Campaign 2 and either type of support request, as their correlation coefficients are around 0.5.

Of course, it’s also obvious from looking at the chart, but you might have hundreds or thousands of metrics from each system. Measuring the correlation is a faster way to infer joins than visually comparing each pair of metrics!
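When there are many metrics, the comparison can be automated. The sketch below checks every pairing of a metric from one system with a metric from the other and keeps only the strong ones. It is an illustration under assumptions, not a definitive method: the metric names, the daily counts, the helper `rank_pairs`, and the 0.8 cutoff are all hypothetical, and the coefficient is again assumed to be Pearson's.

```python
from itertools import product
from math import sqrt

def pearson(xs, ys):
    """Pearson coefficient, or None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)

def rank_pairs(source_a, source_b, threshold=0.8):
    """Return (name_a, name_b, r) for cross-source pairs with |r| >= threshold,
    strongest first. The threshold is an assumed cutoff; tune it for your data."""
    strong = []
    for (name_a, a), (name_b, b) in product(source_a.items(), source_b.items()):
        r = pearson(a, b)
        if r is not None and abs(r) >= threshold:
            strong.append((name_a, name_b, r))
    return sorted(strong, key=lambda pair: abs(pair[2]), reverse=True)

# Hypothetical daily counts over the same seven days in each system.
email_metrics = {
    "campaign_1": [0, 50, 300, 280, 90, 20, 5],
    "campaign_2": [200, 10, 5, 0, 210, 15, 0],
}
support_metrics = {
    "discount_error": [2, 6, 30, 33, 12, 4, 1],
    "login_problem": [8, 9, 7, 10, 8, 9, 8],
}

strong_pairs = rank_pairs(email_metrics, support_metrics)
for name_a, name_b, r in strong_pairs:
    print(f"{name_a} <-> {name_b}: {r:.2f}")
```

The work grows with the number of metrics in one system multiplied by the number in the other, which is still far quicker than inspecting each pair of lines on a chart.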

This approach will let you answer high-level questions, such as “Which marketing campaigns resulted in more customer support requests?” It will not help you figure out which specific users received those emails or opened those support requests, so it is only useful as a high-level tool. The good news is that once you find a correlated relationship, you can pull up the individual records from each system and look for patterns that might be hiding in them.

In Review: Data silos are everywhere and they can prevent you from doing important analysis by breaking your data up into multiple different places. You have many tools at your disposal to bring that data back together again, ranging from unique identifier joins to fingerprinting to correlation analysis. Often the best insights come from crossing between data silos, so mastering these techniques can produce big benefits.

Quote of the Day: “Not till we are lost, in other words not till we have lost the world, do we begin to find ourselves, and realize where we are and the infinite extent of our relations.” ― Henry David Thoreau