Dirty Data: Dirt Detectors
So far this week we’ve covered common types of dirty data, including data with duplicates, missing values, and inconsistencies. There are, of course, many more ways your data can get dirty! Instead of enumerating every type of dirty data, it’s better to have a general strategy for checking your data. One great way to do this is Cross-Checking.
Conceptually, cross-checking is very simple. You use two or more different sources of data to ensure they all report the same values. For example, you might have data on the number of purchases from your website in both your analytics system (tracking the user actions) and also in your payments system (tracking monetary transactions). By comparing these two metrics, which are really tracking the same thing, you can tell if there are dirty data issues.
In practice, cross-checking is harder than it sounds because of one simple question: if two of your data sources disagree, which one is correct? There is no easy answer; you will need to fall back on some of the techniques we’ve discussed earlier this week. The good news is that you know there is a problem! Hence, cross-checking is useful for detecting data issues, but it won’t solve them for you.
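To make the idea concrete, here is a minimal sketch of a cross-check between two sources. The daily purchase counts below are hypothetical stand-ins for numbers you would pull from your analytics and payments systems, and the 1% tolerance is an illustrative choice, not a rule:

```python
def cross_check(analytics_counts, payments_counts, tolerance=0.01):
    """Flag days where the two sources disagree by more than `tolerance`
    (expressed as a fraction of the larger value)."""
    mismatches = []
    for day in sorted(set(analytics_counts) | set(payments_counts)):
        a = analytics_counts.get(day, 0)
        p = payments_counts.get(day, 0)
        larger = max(a, p, 1)  # guard against division by zero
        if abs(a - p) / larger > tolerance:
            mismatches.append((day, a, p))
    return mismatches

# Hypothetical daily purchase counts from the two systems.
analytics = {"2016-12-01": 100, "2016-12-02": 205, "2016-12-03": 150}
payments  = {"2016-12-01": 100, "2016-12-02": 180, "2016-12-03": 151}

print(cross_check(analytics, payments))  # only Dec 2 is flagged
```

Note that a small tolerance matters in practice: the two systems will rarely match exactly (a refund here, a tracking blip there), so you want to flag meaningful divergence, not every one-count difference.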
There are a number of other techniques for detecting and correcting dirty data which I didn’t have time to cover this week, but I encourage you to read about them:
- Data Audits are periodic, manual reviews of your data to verify assumptions and consistency across your data stores.
- Normalization is a statistical technique that can help you adjust your metrics and statistics if there is a small but consistent error in your data.
- Smoothing is a numerical technique that reduces noise in data to make it easier to see trends.
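Of these, smoothing is the easiest to sketch in a few lines. Below is a simple trailing moving average, one common way to reduce noise so trends are easier to see; the window size and sample series are illustrative, not prescriptive:

```python
def moving_average(values, window=3):
    """Smooth a series by averaging each point with its trailing
    `window`-sized neighborhood (shorter at the start of the series)."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# A noisy series with one spike; the smoothed version damps the spike
# so the underlying flat trend is easier to see.
noisy = [10, 12, 9, 14, 11, 30, 12, 13]
print(moving_average(noisy))
```

A wider window smooths more aggressively but also lags further behind real changes in the data, so the window size is a trade-off you tune per metric.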
Next Week: It’s the end of 2016, which means everyone is thinking about 2017 planning. We’ll review how data can improve your business planning process with our coverage of Data Driven Planning!