Dirty Data: Data Consistency
So far this week we’ve covered duplicates and missing values in your data. Even more insidious are problems with consistency in your data. Consistency failures happen when the same data means different things depending on when or where the data was found. For example, you might track the creation date of sales leads to understand how your sales team is generating leads over time. However, if some salespeople record the creation date as when they first find the client’s contact information and others record the creation date as the first time they speak to the client you will get vastly skewed results.
How do I measure data consistency?
A very effective way to understand consistency is to measure how many unique values you have for a given data field, a metric I refer to as data entropy. To measure data entropy, you want to measure the number of unique values in a set of records compared to how many records there are. Specifically, you might run the following query:
COUNT(*) AS total_count,
COUNT(DISTINCT(field_name)) AS unique_count
This will return both the total number of records (total_count) and the number of unique values (unique_count) for a given field (field_name). If all values for the given field were unique then total_count and unique_count would be equal, such as with a unique identifier field.
Why is data entropy useful?
This data entropy measure is useful when you have an expectation on the distribution of values for a given field. For example, you know there are only 50 states so a column containing the state should not have more than 50 unique values (for US records). If it does, it means that there is inconsistency in your data (perhaps some records use “CA” and others “California”). Another example would be when you ask potential customers to enter their job title in a sales lead form. If the ratio of total_count to unique_count is too large, it means that customers are entering titles in inconsistent ways and you might want to provide a drop-down list instead of a free form text field.
If you have a lot of data in your tables you might not be able to run the data entropy query/calculation on all the records. In that case you can sample the data in whatever method you think appropriate, such as limiting it to a random selection of 1,000 rows.
So far we’ve focused on specific kinds of dirty data. Tomorrow we’ll discuss how to detect a wider variety of dirty data!
Quote of the Day: “Consistency is the hallmark of the unimaginative.” ― Oscar Wilde