Data Audit: Detecting Low-Quality Data
Identifying Quality Problems
Now that you have a baseline, the next step in your data audit is to see what doesn’t match. As we mentioned yesterday, finding problems is much like Anomaly Detection, in that you simply need to identify when the data falls outside of your expected range (orange dot):
However, your data may fall outside of your expected range for reasons that have nothing to do with data quality! Telling true data quality problems apart from simple anomalies is challenging, but it is something you will need to do. To help, let’s review the different ways that data quality problems might manifest and how you would detect each one:
Missing data is the easiest data quality problem to detect, since it appears as a big deviation from your data volume baseline. Most companies create thresholds for deviations in data volume that, if exceeded, represent something well outside the normal course of business and likely a data quality problem. For example, if revenue dips 5% below your expectations, that might be a business problem, but if it dips 80% below the expectation, it’s hopefully a data quality problem. If your data is cyclical, fixed thresholds can be problematic, so more advanced tools (like Outlier) might be required.
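As a rough illustration of the volume thresholds described above, here is a minimal Python sketch. The function name, the labels, and the default cutoffs are assumptions made for this example; the 5% and 80% values are borrowed from the revenue example and are not recommended settings. You would tune them to your own baseline.

```python
def classify_volume_drop(observed, expected,
                         business_cutoff=0.05, quality_cutoff=0.80):
    """Label a drop in data volume relative to the baseline.

    The cutoffs are illustrative assumptions, not recommendations.
    """
    if expected <= 0:
        raise ValueError("expected baseline must be positive")

    # Fraction by which the observed value falls below the baseline.
    drop = (expected - observed) / expected

    if drop >= quality_cutoff:
        return "data_quality"   # so large that data is probably missing
    if drop >= business_cutoff:
        return "business"       # plausible in the normal course of business
    return "normal"


# Hypothetical daily revenue figures compared with a baseline of 100.
for observed in (99, 90, 10):
    print(observed, classify_volume_drop(observed, expected=100))
```

A fixed cutoff like this assumes a fairly flat baseline. For cyclical data, the expected value would itself need to vary by day or season.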
Corrupted data is hard to find, as it is typically hiding in plain sight. Perhaps data fields are set incorrectly or placed in the wrong location, or perhaps customers are confused and enter the wrong data. Whatever the cause, the result is data that doesn’t fit your expectations for what it should contain. You can detect some of this by doing simple content-type matching, but it is usually easier to find by comparing the data to your Data Diversity baseline (see yesterday). As with missing data, extremely large deviations from your expected data diversity can indicate data quality problems, while small deviations might be business or product problems.
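Both checks mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the email pattern is deliberately loose, the helper names are hypothetical, and the diversity baseline is assumed to be the share of each distinct value in a field, which may differ from how you built your own baseline.

```python
import re
from collections import Counter

# Assumption: a loose pattern for an email field, for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def mismatch_rate(values, pattern):
    """Fraction of values that do not match the expected content pattern."""
    if not values:
        return 0.0
    bad = sum(1 for v in values if not pattern.match(str(v)))
    return bad / len(values)


def diversity_shift(baseline_values, current_values):
    """Total variation distance between two value distributions.

    Returns 0.0 when the shares of each value are identical and 1.0 when
    the two samples have no values in common.
    """
    base, cur = Counter(baseline_values), Counter(current_values)
    base_total, cur_total = sum(base.values()), sum(cur.values())
    if base_total == 0 or cur_total == 0:
        raise ValueError("both samples must be non-empty")
    keys = set(base) | set(cur)
    return 0.5 * sum(abs(base[k] / base_total - cur[k] / cur_total)
                     for k in keys)


emails = ["a@b.com", "555-1234", "c@d.org", "n/a"]
print(mismatch_rate(emails, EMAIL_PATTERN))
print(diversity_shift(["US", "US", "CA", "CA"], ["US", "CA", "CA", "CA"]))
```

A shift near 1.0 would suggest corrupted or misplaced data, while a small shift is more likely a business or product change. Where to draw that line is a judgment call for your data.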
Inconsistent data is the hardest problem to detect, since it requires you to match different data sources together. For example, if your payment provider reports that you made $4k on a given day, but the total of transactions in your revenue database is only $3.6k, you have inconsistent data. The good news is that data inconsistency problems typically show up in many different metrics between data sources, so you usually only need to check a few metrics to find them. The bad news is that fixing these problems is challenging because they exist in so many locations. To deal with inconsistent data, identify the key metrics for each data source and look for deviations between sources.
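One way to sketch this comparison of key metrics is shown below. The function name, the metric names, and the 1% tolerance are assumptions made for illustration; the revenue figures reuse the $4k versus $3.6k example.

```python
def find_inconsistencies(source_a, source_b, tolerance=0.01):
    """Return the shared metrics whose relative difference exceeds tolerance.

    source_a and source_b map metric names to numeric values for the same
    period. The 1% default tolerance is an illustrative assumption.
    """
    issues = {}
    for metric in sorted(set(source_a) & set(source_b)):
        a, b = source_a[metric], source_b[metric]
        scale = max(abs(a), abs(b))
        # Relative difference, measured against the larger of the two values.
        relative_diff = 0.0 if scale == 0 else abs(a - b) / scale
        if relative_diff > tolerance:
            issues[metric] = relative_diff
    return issues


# Hypothetical key metrics for one day from two sources.
payment_provider = {"daily_revenue": 4000.0, "transactions": 120}
revenue_database = {"daily_revenue": 3600.0, "transactions": 120}

print(find_inconsistencies(payment_provider, revenue_database))
```

Only metrics present in both sources are compared here. A metric that exists in one source but not the other may be worth flagging separately.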
There will be cases where your data audit reveals business problems rather than data quality problems, which is still a success! For problems where the actual cause is hard to discern, perhaps because of a scarcity of data, you will need to use your judgement to decide the likely cause and whether it is worth the effort to get to the root cause (see our series on Root Cause Analysis).
Tomorrow we’ll go to the last step in your data audit, which is verification of the definitions you use for your data.
Quote of the Day: “Quality, it seems, is a necessary, but insufficient attribute for success.” ― Derek Thompson