When doing a data audit, the first step is choosing the criteria that you will use to evaluate your data. Your criteria will be the checklist you complete during your audit to ensure you don’t have any problems (or identify them if you do). Since no data is perfect, in practice this means you need to decide how much error you are willing to accept in your data.
For example, let us say we work for a mobile gaming company that collects usage data from the phones of customers playing our game. That usage data contains the action the user takes and a timestamp of when they take it, so that we can analyze behavior. This should be simple, but unfortunately a surprising number of mobile phones have bad clocks, which means the time they report for an action is not the time it actually happened! If even just 2% of devices have bad clocks, then roughly 2% of the timestamps we collect will be unreliable.
In this case, we have a few options:
- We can accept 2% of our data being bad, and hope that when we calculate aggregate statistics, the bad data does not significantly skew our metrics.
- We can attempt to identify that 2% and remove it from our dataset. Unfortunately, that means our dataset will be incomplete and potentially biased.
- We can attempt to correct the 2% of bad data. While adjusting the data would fix the original problem, doing so possibly introduces new forms of error or bias into the data.
These options assume that we can actually identify the 2% of our data that is bad in the first place, which is not always the case. Identifying and understanding these systemic issues is critical when choosing your criteria.
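To make the three options concrete, here is a minimal Python sketch. It assumes something the example above does not state: that each event also carries the time our server received it, which gives us a reference to compare the phone's clock against. The `Event` fields, the `MAX_SKEW` tolerance of one hour, and the function names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    user_id: str
    action: str
    device_ts: datetime  # timestamp reported by the phone
    server_ts: datetime  # assumption: time our server received the event


# Assumed tolerance; the right value depends on the metrics you calculate.
MAX_SKEW = timedelta(hours=1)


def is_suspect(event, max_skew=MAX_SKEW):
    """Flag an event whose device clock disagrees with the server clock."""
    return abs(event.server_ts - event.device_ts) > max_skew


def suspect_fraction(events):
    """Option 1: measure how much bad data we would be accepting."""
    if not events:
        return 0.0
    return sum(is_suspect(e) for e in events) / len(events)


def drop_suspect(events):
    """Option 2: remove suspect events (the dataset becomes incomplete)."""
    return [e for e in events if not is_suspect(e)]


def correct_suspect(events):
    """Option 3: overwrite suspect device times with the server time.

    This introduces its own error, since the server time includes
    network and batching delay.
    """
    return [
        replace(e, device_ts=e.server_ts) if is_suspect(e) else e
        for e in events
    ]
```

Note that this check can only catch skew larger than the tolerance, and it would wrongly flag legitimate events that were recorded offline and uploaded later, which is one example of how a correction can introduce new bias.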
We’ve now seen some of the tradeoffs you need to make when choosing your audit criteria. In general, good criteria will follow a few rules to help you navigate these tradeoffs:
- Your data is as accurate as the metrics you calculate from it require. If you only calculate daily metrics, then it is fine if you cannot rely on the hours and minutes of a timestamp.
- Acceptable data errors are clearly defined and understood. Since you cannot fix every issue, the next best thing is to clearly communicate exactly what you won’t fix and why. This can be as important as the metrics you report on your data.
- Your scope is independent of limitations in your data. If you do an audit, but avoid certain parts of your data because they are too difficult or too dirty, then your audit isn’t really an audit.
With those rules in mind, you should create a criteria checklist for all of your data sources based on what you expect of them. The detail and content will be up to you, but the more thorough you are, the faster and easier it will be to complete the audit (and repeat it in the future). Bonus points if you automate it!
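As one possible way to automate such a checklist, the sketch below represents each criterion as a named check plus the fraction of bad records you have decided to accept. The `Criterion` class, the `run_audit` function, and the idea of records as plain dictionaries are hypothetical choices for illustration, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Criterion:
    name: str                      # what the checklist item is called
    is_bad: Callable[[dict], bool]  # returns True when a record fails
    max_bad_fraction: float        # how much error we agreed to accept


def run_audit(records: Iterable[dict],
              criteria: List[Criterion]) -> Dict[str, dict]:
    """Evaluate every criterion against every record and report results."""
    records = list(records)
    results = {}
    for criterion in criteria:
        bad = sum(1 for r in records if criterion.is_bad(r))
        # An empty data source has no bad records, so it trivially passes.
        fraction = bad / len(records) if records else 0.0
        results[criterion.name] = {
            "bad_fraction": fraction,
            "passed": fraction <= criterion.max_bad_fraction,
        }
    return results
```

A criterion for the timestamp example might look like `Criterion("timestamp present", lambda r: r.get("timestamp") is None, 0.0)`. Keeping the accepted error rate next to each check also documents exactly what you decided not to fix.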
With your criteria in hand, tomorrow we’ll discuss how to create a baseline for your data, which will tell you what you should expect.
It is actually shocking how many mobile phones have bad clocks. Whether the result of being dropped, shocked, or poorly manufactured, up to 2% of mobile devices in some tests show some form of significant clock skew.
Quote of the Day: “Men acquire a particular quality by constantly acting a particular way.” ― Aristotle