How to Find Insights Using Data Exploration
This is part 1 of a 4-part series on Data Exploration.
“Why did revenue drop by 5%?” Even in the most data-driven organizations, sudden shifts in your metrics can catch you by surprise. These incidents often trigger “fire drills”: emergency projects in which members of your team drop everything to hunt for answers.
In our experience, many of these fire drills waste a lot of time and fail to produce meaningful results. The root cause is a lack of expertise in exploring data, which is unsurprising: few people get regular practice at it, and you rarely know exactly what you are looking for in the first place!
Luckily, there are many best practices we can share with you to help you get ahead and avoid wasting your time. Data exploration can be thought of in two parts:
- Methods – How do you explore your data to maximize your chances of finding what you want?
- Insights – What exactly are you looking for and how do you know when you’ve found it?
It will take us two weeks to get through both, and this week we will start with the Methods you should use when exploring your data.
Tomorrow we’ll get started with how to decide where to begin your exploration.
Data Exploration: Where to Start Looking for Insights
This is part 2 of a 4-part series on Data Exploration.
Every data exploration starts somewhere, and picking a good starting point is critical for saving time. Even a moderately complex business will have dozens of metrics with hundreds of dimensions, giving you thousands of different numbers to look at. There isn’t enough time to look at them all!
Where you start your exploration will depend a lot on why you are exploring your data in the first place.
1) If you are investigating a business incident…
… I hope whatever happened isn’t too bad. Luckily, you should have a smoking gun in the form of the metric that flagged the problem in the first place.
The first step is to do an impact analysis on that metric to understand all of the upstream changes that might affect it. This will help you reduce the number of places you need to look to only those that you know might have led to the change.
With your relationship model in hand (from your impact analysis), next you’ll want to do a component analysis of each of the parts of the model. This will help you figure out which segments and dimensions of those metrics are the most important and, hence, the places to start.
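As a concrete sketch, a component analysis can start as simply as ranking each segment of a metric by how much of the total change it explains. The region names and numbers below are hypothetical, purely for illustration:

```python
# Hypothetical segment totals for one metric in two periods.
# (Segment names and values are illustrative, not from real data.)
last_week = {"US": 100.0, "EU": 80.0, "APAC": 40.0}
this_week = {"US": 100.0, "EU": 62.0, "APAC": 42.0}

def component_changes(before, after):
    """Per-segment change and its share of the total change,
    largest absolute movers first -- the places to start looking."""
    total_delta = sum(after.values()) - sum(before.values())
    changes = {}
    for segment in before:
        delta = after[segment] - before[segment]
        share = delta / total_delta if total_delta else 0.0
        changes[segment] = (delta, share)
    return dict(sorted(changes.items(), key=lambda kv: -abs(kv[1][0])))

ranked = component_changes(last_week, this_week)
```

Here EU explains more than 100% of the drop (APAC actually grew), which immediately tells you where to look first.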
2) If you are looking for new opportunities or hidden problems…
… you are doing a much less directed search of your data, and I applaud you! Too few companies have anyone who explores their data on a regular basis for hidden opportunities and problems, which can be the most valuable insights in your data.
It turns out the best approach here is not much different from the one for investigating an incident. Instead of starting with a specific metric, you start with all of the key metrics for your business and investigate their components. It will take more time, but if there are important insights in your data, chances are they are somehow related to those key metrics.
This will, of course, mean that you are likely going to do many times more work than you do investigating an incident, but the good news is that you hopefully don’t have a deadline. Better yet, you might avoid having incidents altogether if you find problems early.
Now that we know where to start, tomorrow we’ll cover how to find insights!
If any of this seems familiar, it’s because we covered Root Cause Analysis just a few months ago.
Data Exploration: Finding Insights Deep in Your Data
This is part 3 of a 4-part series on Data Exploration.
One simple truth makes data exploration easier: there is a root cause of every change. Our goal in data exploration is to find those causes that underlie the changes we care about.
Another simple truth that helps us: if everything remains the same, nothing changes. This requires us to establish what is “normal” in our data, giving us a framework from which to look for new and irregular changes. Combining these truths gives us a straightforward algorithm to find and triage our insights.
1. Establish what is “normal”
While I’m sure your intuition is very good, you need a mathematical model of what is normal for each metric to be able to objectively identify changes. There are many ways to do this, ranging from simple averages to more advanced techniques like linear regression.
Whatever you choose, you should be sure to look across many different time ranges to ensure you account for seasonality.
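As a minimal sketch, and assuming weekly seasonality is the pattern that matters for your metric, “normal” can be modeled as a per-weekday average and spread (the data structure here is illustrative, not a prescribed approach):

```python
from statistics import mean, stdev

def weekday_baseline(history):
    """Model "normal" as a per-weekday mean and spread, so weekly
    seasonality (e.g. slow Sundays) isn't flagged as a change.
    `history` is a list of (weekday_index, value) pairs with at
    least two observations per weekday."""
    by_day = {}
    for day, value in history:
        by_day.setdefault(day, []).append(value)
    return {day: (mean(vals), stdev(vals)) for day, vals in by_day.items()}
```

The same idea extends to monthly or annual seasonality by bucketing on a different time component.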
2. Detect what is not normal
Once you have a model that describes “normal”, it should be straightforward to identify everything that is not normal. Depending on your definition of “normal” this may result in a large number of possible findings. Adjusting your model to have the right sensitivity will take some trial and error, but is worth the effort.
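Assuming a simple mean-and-spread model of “normal”, the detection step can be sketched as flagging values too many standard deviations from the mean; the threshold is the sensitivity knob mentioned above (values here are illustrative):

```python
from statistics import mean, stdev

def find_anomalies(history, recent, threshold=3.0):
    """Flag recent values more than `threshold` standard deviations
    from the historical mean. Lowering `threshold` yields more
    findings -- and more noise -- so tune it by trial and error."""
    mu, sigma = mean(history), stdev(history)
    return [v for v in recent if sigma and abs(v - mu) / sigma > threshold]
```

A more sophisticated model of “normal” (say, the seasonal baseline) slots in the same way: compare each observation to its own expected value and spread.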
3. Group them together
Not all findings will be independent. If you see shifts in revenue across all 50 US states, there is probably a single change affecting the entire country. One of the best ways to group findings is through clustering, although you can also group them using your knowledge of your business and common traits.
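Real clustering libraries work well here, but even a greedy grouping by similar relative change captures the idea; the tolerance and segment names below are illustrative assumptions:

```python
def group_findings(findings, tolerance=0.05):
    """Greedy grouping: a segment joins the first group whose
    founding member has a relative change within `tolerance` of its
    own; otherwise it founds a new group. A cheap stand-in for real
    clustering. `findings` maps segment name -> relative change."""
    groups = []
    for name, rel_change in sorted(findings.items(), key=lambda kv: kv[1]):
        for group in groups:
            if abs(rel_change - group[0][1]) <= tolerance:
                group.append((name, rel_change))
                break
        else:
            groups.append([(name, rel_change)])
    return groups
```

Three states dropping about 10% collapse into one finding, leaving the genuinely separate movement to stand on its own.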
After detecting and grouping, you will likely still have a large set of findings. You will need to evaluate which are the most interesting insights and pull those out, as it’s not possible to share everything you found with everyone at your company. If possible, keep a record of all the insights you find, as it might save you time in the future.
Tomorrow we’ll talk about how to make sure the insights you found are true insights and not just mirages, when we cover validation.
Data Exploration: Validating Insights After Finding Them
This is part 4 of a 4-part series on Data Exploration.
The best part of data exploration is when you find insights hiding in your data. It’s a lot like finding a treasure chest buried in the sand, and the rush of discovery can be a powerful emotion.
Unfortunately, many people get carried away by that emotion and report on insights without first validating that they are real. The validation step is critical to avoid misleading people about what the data is truly saying. Even the experts are vulnerable to this kind of oversight: a group that attempted to reproduce the results of 100 published studies in psychological science was only able to reproduce 39 of them.
How do you avoid this trap? Let us count the ways:
1. Look for similar insights in the past
If you find a shift in coupon redemptions that you think is the cause of a drop in revenue, look for similar shifts in the past. Has it happened before? If so, did it result in a drop in revenue every time? If not, it might be a red herring, or maybe just a part of the answer. Many insights you find in your data happen more often than you think; you just didn’t know to look for them in the past.
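That historical search can be sketched in a few lines, assuming you have the metric as a simple time series (the numbers and tolerance below are illustrative):

```python
def similar_past_shifts(series, shift, tolerance=0.25):
    """Return the indices where the period-over-period change was
    within `tolerance` (relative) of the shift under investigation."""
    matches = []
    for i in range(1, len(series)):
        delta = series[i] - series[i - 1]
        if shift and abs(delta - shift) <= abs(shift) * tolerance:
            matches.append(i)
    return matches
```

Once you have the matching periods, check whether revenue (or whatever downstream metric you care about) also moved at each of them; if it didn’t, your candidate cause is suspect.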
2. Consider the Null Hypothesis
The null hypothesis is the assumption that there is no relationship between the things you are measuring. Your job is to disprove the null hypothesis, finding evidence that the changes you observe couldn’t have happened by mere chance. Also ensure that whatever insights you find are not the result of data quality problems or analysis errors. You would be surprised at how often a simple Excel formula can lead you to vastly incorrect insights.
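One standard way to test against the null hypothesis is a permutation test: shuffle the group labels many times and see how often a difference at least as large as the one you observed appears by chance. This is a generic statistical sketch, not a method prescribed by this series:

```python
import random

def permutation_p_value(before, after, trials=10_000, seed=42):
    """Estimate how often a difference in means at least as large as
    the observed one arises when labels are shuffled -- i.e., under
    the null hypothesis. A small result means the change is unlikely
    to be chance."""
    rng = random.Random(seed)
    observed = abs(sum(after) / len(after) - sum(before) / len(before))
    pooled = before + after
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        a, b = pooled[:len(before)], pooled[len(before):]
        if abs(sum(b) / len(b) - sum(a) / len(a)) >= observed:
            hits += 1
    return hits / trials
```

A p-value below your chosen threshold (commonly 0.05) is evidence against the null hypothesis; a large one means the “insight” may be noise.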
3. Use your judgement
If you find an insight that indicates that changing the color of the chairs in your conference room led to a drop in revenue, you would be right to be skeptical. The data itself cannot tell you the difference between correlation and causality, so you will need to decide for yourself if an insight truly represents a cause and effect relationship.
Next Week: Even with the methods we’ve reviewed this week, you need to know what you are looking for in order to succeed. We will cover the common forms of Data Insights which are going to be the building blocks of the findings you will produce from your exploration.
This is known as the Reproducibility Project and was started due to a crisis in the reproducibility of studies in many fields. The pressures of academic research have created a severe incentive to skip validation, since a two-year-long data exploration that reveals no validated insights can be a career-ending event.
One of the most influential studies in economics was later found to have significant errors in the Excel spreadsheet used to analyze the data. It’s unlikely any of your errors will cost billions of dollars, but you never know.