When you think of dirty data, you probably think of inaccurate or malformed data. The opposite is just as dangerous: many data teams are hindered by data that is missing entirely! If you rely on certain fields for your analysis but the values are missing from a significant portion of your records, the analysis can be hard or impossible to do.
Imagine trying to do a geographic analysis of your customers if 10% of them have no address on record. Worse, imagine that those 10% are missing because an error in your lead form discards addresses in California. Your analysis would be biased, showing no leads in the Golden State.
How do we find and measure missing values?
Missing values are especially dangerous in SQL systems because many of the queries you use to generate metrics or perform larger analyses will silently ignore them. The records themselves are valid, so the missing data is invisible to you: the query runs without complaint even though the values you expect are simply not there. The result is hidden bias.
The good news is that missing values are easy to find if you look for them proactively. Because COUNT(field_name) counts only non-NULL values, the following query tells you how many records have a value (num_Valid) and how many are missing one (num_Missing):
SELECT COUNT(*) - COUNT(field_name) AS num_Missing, COUNT(field_name) AS num_Valid FROM your_table;
Whether the missing-values ratio is acceptable for a given field is a judgement call. It can be effective to benchmark the ratio across all of your fields to understand which are sparse (many missing values) and which are dense (few missing values); that, in turn, shapes what kinds of analysis you can perform on a given data set.
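To sketch what that benchmark might look like, here is a small Python example that runs the num_Missing/num_Valid query against every column of a table. The table name, columns, and in-memory SQLite database are illustrative assumptions, not from the original post:

```python
import sqlite3

# Stand-in database with some deliberately missing (NULL) values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, state TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", "CA", None), ("Bob", None, "12 Elm St"),
     ("Cal", "NY", "9 Oak Ave"), ("Dee", "CA", None)],
)

def missing_ratios(conn, table):
    """Return {column: fraction of rows where that column is NULL}."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    ratios = {}
    for col in cols:
        # COUNT(col) skips NULLs, so total - valid is the missing count.
        valid = conn.execute(f"SELECT COUNT({col}) FROM {table}").fetchone()[0]
        ratios[col] = (total - valid) / total
    return ratios

print(missing_ratios(conn, "customers"))
# {'name': 0.0, 'state': 0.25, 'address': 0.5}
```

A field like address at 50% missing would probably rule out the geographic analysis above, while name at 0% is safe to rely on.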
Okay, what do I do if I’m missing too many values?
Unfortunately, unlike duplicates, you cannot simply remove missing values: they are already gone! You have only a few options for dealing with them:
- Data Augmentation. Often you can determine the correct value for a missing field from the data that is present. For example, if a record has the state value “California” but no country value, you know the country is the “United States” and can fill it in. Applying these kinds of semantic rules and data matching is called data augmentation, and there are products and services that will do it for you.
- Interpolating Values. Missing values in numeric data are very frustrating, since there is typically no way to go back in time and recover the missing days, weeks, or months. If the gaps are a significant issue for you, you can attempt to interpolate values using moving averages or linear regressions on the data before and after the gap. This fills the gap with values that would be accurate if the general trend continued, but could be misleading if something abnormal happened during that period.
- Backup, Backup, Backup. Data storage is cheap these days, so back up everything. Not just your master data warehouse: back up all of your data in all of your systems so that you can recreate the warehouse if it turns out you are missing too many values. Nothing beats the raw data for filling in the gaps.
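The data-augmentation idea can be sketched as a simple lookup rule. The rule table and record layout below are hypothetical, just to show the state-implies-country example in code:

```python
# Hypothetical semantic rule: a state value implies a country value.
STATE_TO_COUNTRY = {
    "California": "United States",
    "Texas": "United States",
    "Ontario": "Canada",
}

def augment(record):
    """Fill in a missing country when it can be inferred from the state."""
    if record.get("country") is None and record.get("state") in STATE_TO_COUNTRY:
        record["country"] = STATE_TO_COUNTRY[record["state"]]
    return record

print(augment({"state": "California", "country": None}))
# {'state': 'California', 'country': 'United States'}
```

Real augmentation products apply far richer rule sets and fuzzy matching, but the principle is the same: derive the missing value from data you already have.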
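Interpolation can likewise be sketched in a few lines. This minimal example fills gaps by drawing a straight line between the known values on either side (the moving-average and regression variants mentioned above follow the same pattern with a different fill rule); the sample series is made up:

```python
def interpolate_gaps(values):
    """Linearly interpolate runs of None between known numeric values.

    Leading and trailing Nones are left as-is, since there is no second
    endpoint to draw a line to.
    """
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for k in range(left + 1, right):
            result[k] = result[left] + step * (k - left)
    return result

# A weekly metric with a two-week reporting gap:
print(interpolate_gaps([10, 12, None, None, 18]))
# [10, 12, 14.0, 16.0, 18]
```

Note that the filled values assume the trend held steady through the gap; if a spike or outage occurred during those weeks, the interpolated numbers will quietly hide it.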
While missing data is frustrating and dangerous, there are even more insidious sources of dirty data. We’ll cover data consistency tomorrow!
Quote of the Day: “‘Are you anybody else’s missing piece?’ ‘Not that I know of.’” ― Shel Silverstein