Data Audits: Identifying Problems with a Baseline
Establishing the Baseline
You won’t be able to identify problems in your data until you establish what your data should look like. Establishing a baseline is, in principle, very similar to how we built an expected range for data in our review of Anomaly Detection: you create a confidence interval that the data should fall within to be considered “expected”.
As a reminder, the easiest way to do this is by adding a trendline (green line) to your data (blue line), making that the basis for a range of expected values (green area):
There are many ways to fit your data, from moving averages to more advanced techniques like regressions. For advice on this, see our series on Simple Statistics.
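As a rough illustration of the moving-average approach, here is a minimal Python sketch. It assumes a trailing 7-point window and a band of two standard deviations around the moving average. The function name `rolling_baseline`, the window size, the multiplier `k`, and the sample numbers are all made up for this example, and the band is a simple rule of thumb rather than a formal confidence interval.

```python
from statistics import mean, stdev

def rolling_baseline(values, window=7, k=2.0):
    """Return one (center, lower, upper) band per point.

    Each band is built from the `window` points before the current one,
    so the current value never influences its own expected range.
    Points without a full window of history get None.
    """
    bands = []
    for i in range(len(values)):
        if i < window:
            bands.append(None)
            continue
        history = values[i - window:i]
        center = mean(history)
        spread = stdev(history)
        bands.append((center, center - k * spread, center + k * spread))
    return bands

# Invented sample data: daily row counts with a spike on the last day.
daily_rows = [100, 102, 98, 101, 99, 103, 97, 100, 250]
bands = rolling_baseline(daily_rows, window=7)

for value, band in zip(daily_rows, bands):
    if band is not None:
        center, lower, upper = band
        status = "expected" if lower <= value <= upper else "unexpected"
        print(value, round(lower, 1), round(upper, 1), status)
```

With this sample, the value of 100 falls inside its band while the spike of 250 falls outside it. A wider window gives a smoother baseline that reacts more slowly to real changes in the data.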
The big difference with data quality analysis is that you need to establish more than just a single expected model. For example, if your data lives in a SQL database you may need the following baselines:
- Number of rows of data added every day
- Number of different users performing actions
- Number of different user actions being tracked
Each of these needs to be baselined individually, since a failure in one dimension can hide behind normal values in another. For example, if the number of rows added each day is within its normal range, you might think your data is in good shape. However, if a bug sets every user’s location to the same default value, you have a serious data quality problem that the row count will never reveal. Only a separate baseline, such as the number of distinct locations per day, would catch it.
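To make this concrete, here is a small Python sketch that computes several daily metrics from a SQL table using the standard-library `sqlite3` module. The `events` table, its columns, and the sample rows are hypothetical and exist only for illustration. Your own schema will differ.

```python
import sqlite3

# In-memory database with a hypothetical events table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (day TEXT, user_id TEXT, action TEXT, location TEXT)"
)

# Invented sample data: on the second day every location is "default".
rows = [
    ("2024-01-01", "u1", "login", "NYC"),
    ("2024-01-01", "u2", "click", "LON"),
    ("2024-01-01", "u3", "login", "SFO"),
    ("2024-01-02", "u1", "login", "default"),
    ("2024-01-02", "u2", "click", "default"),
    ("2024-01-02", "u3", "login", "default"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# One metric per dimension, so each can get its own baseline.
query = """
SELECT day,
       COUNT(*)                 AS row_count,
       COUNT(DISTINCT user_id)  AS users,
       COUNT(DISTINCT action)   AS actions,
       COUNT(DISTINCT location) AS locations
FROM events
GROUP BY day
ORDER BY day
"""
daily_metrics = conn.execute(query).fetchall()
for row in daily_metrics:
    print(row)
```

In this sample, the row, user, and action counts are identical on both days. Only the distinct-location count drops, from 3 to 1, which is why that dimension needs its own baseline.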
How many dimensions you need to baseline to be confident your data is in good shape depends on your data. The more baselines you have, the more likely you are to find problems, but the more effort your audit requires. Most audits include baselines for these categories:
- Data Volume: How much data do you expect to collect?
- Data Diversity: How many different data values do you expect?
- Data Consistency: How closely do dimensions of your data need to align?
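The three categories above can be sketched as one summary per day of data. This Python example is only one possible reading: it treats volume as the record count, diversity as the number of distinct values in a field, and consistency as the ratio of rows to distinct users, on the assumption that this ratio should stay roughly stable from day to day. The function name `category_metrics`, the field names, and the sample records are hypothetical.

```python
def category_metrics(records):
    """Summarize one day's records (a list of dicts) into audit metrics."""
    users = {r["user_id"] for r in records}
    actions = {r["action"] for r in records}
    volume = len(records)
    return {
        # Data Volume: how much data arrived.
        "volume": volume,
        # Data Diversity: how many distinct values each field holds.
        "diversity_users": len(users),
        "diversity_actions": len(actions),
        # Data Consistency (assumed measure): rows per distinct user.
        "rows_per_user": volume / len(users) if users else 0.0,
    }

# Invented sample data for a single day.
day = [
    {"user_id": "u1", "action": "login"},
    {"user_id": "u1", "action": "click"},
    {"user_id": "u2", "action": "login"},
    {"user_id": "u3", "action": "click"},
]
metrics = category_metrics(day)
print(metrics)
```

Each value in the returned dictionary is a candidate for its own baseline, tracked day by day.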
As always, beware of creating baselines from bad data. If your data already contains problems and you use it to create baselines, your baselines will be as bad as your data!
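Here is a short Python sketch of why this matters. It builds the same kind of expected range twice, once from clean history and once from history that includes a single bad day, then checks a new value against both. The helper name `expected_range`, the two-standard-deviation band, and all the numbers are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def expected_range(history, k=2.0):
    """Return (lower, upper) as mean +/- k standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread

# Invented daily row counts; the contaminated history has one bad day.
clean_history = [100, 102, 98, 101, 99, 103, 97]
bad_history = clean_history[:-1] + [400]

new_value = 160  # clearly unusual compared with the clean history

clean_low, clean_high = expected_range(clean_history)
bad_low, bad_high = expected_range(bad_history)

flagged_by_clean = not (clean_low <= new_value <= clean_high)
flagged_by_bad = not (bad_low <= new_value <= bad_high)

print("flagged by clean baseline:", flagged_by_clean)
print("flagged by contaminated baseline:", flagged_by_bad)
```

The single bad day shifts the average and inflates the spread so much that the contaminated baseline accepts a value the clean baseline rejects.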
Tomorrow we’ll cover how to identify data quality problems using your baselines.
Quote of the Day: “It is quality rather than quantity that matters.” ― Seneca