
Dirty Data

This is part 1 of a 5 part series on Dirty Data.

If data is the basis of sound decision making, then we need to be sure the data we use is accurate and reliable. Previously we’ve covered how to handle systemic bias and large-scale errors in your data, the kinds of errors that can result in massive shifts in your metrics. Much more common are the many small errors and sources of noise that result in what we call dirty data.

Many analysts and data scientists spend up to 90% of their time transforming and scrubbing their data to make it ready for use in analysis and decision making. That hints at how enormously difficult this challenge can be, especially if you want to utilize machine learning algorithms that are sensitive to data noise. While you might be able to identify noise in your data by looking at it, most automated systems cannot tell the difference and will make mistakes if the data is dirty.

Often, the cause of dirty data is as important as the dirt itself. Rather than cleaning your data every day, identify the root cause of the noise and fix it; your data will get cleaner over time.

This week we’ll cover some common techniques for identifying dirty data and scrubbing it to remove that dirt and make it more useful.

Many of our examples this week will deal with data stored in a SQL database, but the same techniques can be applied to any database or data warehouse by using the tools and interfaces they make available.

Tomorrow we will get started with finding and fixing data duplication problems!

“Dirty deeds, done dirt cheap” 
– AC/DC

Dirty Data: Duplicate Data

This is part 2 of a 5 part series on Dirty Data.

One of the most common ways your data gets dirty is when a subset of your data records are duplicates. All of the data in those records is technically valid and correct, but the duplicates bring the risk of double counting and of skewing your analysis! For example, it’s common to find duplicate sales leads in your CRM system, which artificially inflate your lead metrics.

Data duplication can happen for a number of reasons:

  • Joining data sources. If you have the same data in two places and attempt to merge them together, you are bound to end up with at least a few duplicates.
  • Faulty software. If your software system automatically loads data but does not de-duplicate the data on load, it can easily add duplicates if the system is re-run in the future.
  • Manual error. If you rely on human beings to enter data or upload data files, they can easily lose track of their progress after taking a break and accidentally reload the same data twice! The more people you have entering data manually, the more this type of error can occur.

It would be easy to spot duplication if EVERY data record were duplicated, simply by looking through a sample of your data. But how do you find a small set of duplicates hidden in a much larger data set? There are a number of tools at your disposal:

  • Unique Identifiers. There are often fields in a given data record which should be unique. For example, contact information records should have a unique mobile phone number for each person (although some people do have two mobile phones!). If two records have the same mobile phone number, there is a chance one is a duplicate. If you are using a SQL database, it is easy to find records that share the same identifier. The following query will show you all values of the field “phone” shared by more than one record:

SELECT phone FROM your_table GROUP BY phone HAVING COUNT(*) > 1;

  • Multiple values. If there is no single field that uniquely identifies data records, you can use multiple fields together (or all of them) to identify the record. If you think that three fields (field_1, field_2, field_3) uniquely identify records, you can check for duplicates easily:
SELECT field_1, field_2, field_3
FROM your_table GROUP BY field_1, field_2, field_3 HAVING COUNT(*) > 1;

This can be very slow if you have a large number of fields, so you can use a hash to speed things up. A hash function combines many different values into a single number, allowing you to compare those numbers instead of all the fields themselves (collisions are possible but rare). Most databases support some kind of hash function.
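
For example, here is a minimal sketch of the hash approach, assuming a database that provides an MD5() function (such as PostgreSQL or MySQL) and the same illustrative field names as above:

-- CONCAT_WS adds a delimiter so that ('ab', 'c') and ('a', 'bc') do not collide.
SELECT MD5(CONCAT_WS('|', field_1, field_2, field_3)) AS row_hash, COUNT(*) AS duplicate_count
FROM your_table
GROUP BY MD5(CONCAT_WS('|', field_1, field_2, field_3))
HAVING COUNT(*) > 1;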

  • Timestamps. If you have no unique identifiers and it is not practical to look for duplicates using multiple fields, you can fall back on looking at timestamps. Timestamps are added to most data records automatically and tell you the exact time (down to the microsecond) that the record was created. If your data is created infrequently (such as customer leads), then comparing timestamps should help identify duplicates, since it is unlikely you added two different records at exactly the same microsecond!
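
For example, assuming your table has an automatically populated creation timestamp column (here called created_at, a placeholder), a minimal duplicate check might look like this:

SELECT created_at, COUNT(*) AS duplicate_count
FROM your_table
GROUP BY created_at
HAVING COUNT(*) > 1;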

Duplication is dangerous because it introduces systematic bias into your data that is hard to detect. Many data teams run software that checks for duplicates on a regular basis, and it is worth setting up if you don’t already do this yourself.

Tomorrow we’ll talk about another insidious form of dirty data: missing values!

“Being a one of a kind means we are automatically the best in the world at what we do.” 

Dirty Data: Missing Values

This is part 3 of a 5 part series on Dirty Data.

When you think of dirty data, you are probably thinking about inaccurate or malformed data. The opposite is also very dangerous, as many data teams are hindered by data that is missing entirely! If you rely on certain data to perform your analysis but those values are missing from a significant portion of your data records, it can be hard or impossible to do your analysis.

Imagine trying to do a geographic analysis of your customers if 10% of them have no address on record. Worse, imagine that those 10% are missing because there is an error in your lead form that discards addresses in California. Your analysis would be biased, showing no leads in the Golden State.

How do we find and measure missing values?

Missing values are especially dangerous when working with SQL systems because they will be silently ignored by many of the queries you use to generate metrics or do larger analysis. The data records themselves will be valid, but the missing values will be invisible to you, and the result is hidden bias.
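
As a small illustration of that hidden bias, aggregate functions such as AVG() silently skip missing (NULL) values. The two averages below, using the same your_table and field_name placeholders as the rest of this post, will disagree whenever values are missing:

-- AVG() ignores NULLs; the second expression treats missing values as zero instead.
SELECT AVG(field_name) AS avg_ignoring_missing,
       SUM(COALESCE(field_name, 0)) * 1.0 / COUNT(*) AS avg_counting_missing_as_zero
FROM your_table;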

The good news is that it’s easy to find missing values if you are proactive and look for them. The following query will tell you the ratio between how many records have values (num_Valid) and how many have missing values (num_Missing):

SELECT COUNT(*) - COUNT(field_name) AS num_Missing, COUNT(field_name) AS num_Valid FROM your_table;

Understanding if the missing values ratio is acceptable for a given field will require your judgement. It can be effective to benchmark the missing values ratio across all of your fields to understand which fields are sparse (many missing values) or dense (few missing values). This can affect what kinds of analysis you perform from a given set of data.
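To benchmark fields against each other, it can help to express the missing values ratio as a percentage and run the same query for each field you care about (same your_table and field_name placeholders as above):

SELECT 100.0 * (COUNT(*) - COUNT(field_name)) / COUNT(*) AS pct_missing
FROM your_table;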

Okay, what do I do if I’m missing too many values?

Unfortunately, unlike duplicates, you cannot simply remove missing values, as they are already missing! You don’t have a lot of options, but here is how you can try to deal with them:

  • Data Augmentation. Many times you can determine the correct value for a missing field from the data that is present. For example, if a data record has the state value “California” but no country value, you know the country is the “United States” and can fill in that value (see the sketch after this list). Using this kind of semantic rule and data matching is called data augmentation, and there are products and services which will do it for you.
  • Interpolating Values. Missing values in numeric data are very frustrating, since there is typically no way to go back in time and recover the missing days, weeks or months. If these gaps are a significant issue for you, you can attempt to interpolate values by using moving averages or linear regressions on the data before and after the gap. This will fill in the gap with values that would be accurate if the general trend continued, but it could be misleading if abnormalities happened during that gap.
  • Backup, Backup, Backup. Data storage is cheap these days, so back up everything. Not just your master data warehouse: back up all of your data in all of your systems, so that you can recreate your data warehouse if it turns out you are missing too many values. There is nothing better than the raw data for filling in the gaps.
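
As a minimal sketch of the state-to-country augmentation rule above (the state and country column names are hypothetical):

-- Fill in a missing country when the state implies it.
UPDATE your_table
SET country = 'United States'
WHERE state = 'California' AND country IS NULL;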

While missing data is frustrating and dangerous, there are even more insidious sources of dirty data. We’ll cover data consistency tomorrow!

“‘Are you anybody else’s missing piece?’ ‘Not that I know of.’”

Dirty Data: Data Consistency

This is part 4 of a 5 part series on Dirty Data.

So far this week we’ve covered duplicates and missing values in your data. Even more insidious are problems with consistency in your data. Consistency failures happen when the same data means different things depending on when or where the data was found. For example, you might track the creation date of sales leads to understand how your sales team is generating leads over time. However, if some salespeople record the creation date as when they first find the client’s contact information and others record it as the first time they speak to the client, you will get vastly skewed results.

How do I measure data consistency?

A very effective way to understand consistency is to measure how many unique values you have for a given data field, a metric I refer to as data entropy. To measure it, compare the number of unique values in a set of records to the total number of records. Specifically, you might run the following query:

SELECT COUNT(*) AS total_count, COUNT(DISTINCT field_name) AS unique_count FROM your_table;

This will return both the total number of records (total_count) and the number of unique values (unique_count) for a given field (field_name). If all values for the given field were unique then total_count and unique_count would be equal, such as with a unique identifier field.

Why is data entropy useful?

This data entropy measure is useful when you have an expectation about the distribution of values for a given field. For example, you know there are only 50 states, so a column containing the state should not have more than 50 unique values (for US records). If it does, it means that there is inconsistency in your data (perhaps some records use “CA” and others “California”). Another example would be when you ask potential customers to enter their job title in a sales lead form. If unique_count is too close to total_count, it means that customers are entering titles in inconsistent ways, and you might want to provide a drop-down list instead of a free-form text field.
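
One simple way to surface this kind of inconsistency is to list every distinct value with its frequency (the state column name is hypothetical); variants such as “CA” and “California” will show up as separate rows:

SELECT state, COUNT(*) AS record_count
FROM your_table
GROUP BY state
ORDER BY record_count DESC;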

If you have a lot of data in your tables, you might not be able to run the data entropy query on all of the records. In that case you can sample the data using whatever method you think appropriate, such as limiting it to a random selection of 1,000 rows.
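
For example, here is a minimal sketch of the entropy query over a random sample of 1,000 rows (ORDER BY RANDOM() is PostgreSQL syntax; MySQL uses RAND() instead):

SELECT COUNT(*) AS total_count, COUNT(DISTINCT field_name) AS unique_count
FROM (SELECT field_name FROM your_table ORDER BY RANDOM() LIMIT 1000) AS sample;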

So far we’ve focused on specific kinds of dirty data. Tomorrow we’ll discuss how to detect a wider variety of dirty data!

“Consistency is the hallmark of the unimaginative.” 

Dirty Data: Dirt Detectors

This is part 5 of a 5 part series on Dirty Data.

So far this week we’ve covered common types of dirty data, including duplicates, missing values and inconsistencies. There are, of course, many more ways your data can get dirty! Instead of enumerating every type of dirty data, it’s good to have a strategy for checking your data more generally. One great way to do this is Cross-Checking.

Conceptually, cross-checking is very simple. You use two or more different sources of data to ensure they all report the same values. For example, you might have data on the number of purchases from your website in both your analytics system (tracking the user actions) and also in your payments system (tracking monetary transactions). By comparing these two metrics, which are really tracking the same thing, you can tell if there are dirty data issues.
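
In SQL, a cross-check can be as simple as computing the same metric from both sources and comparing the results (the table and column names below are hypothetical):

-- If the two counts diverge, at least one source has dirty data.
SELECT
  (SELECT COUNT(*) FROM analytics_purchases WHERE purchase_date = CURRENT_DATE) AS analytics_count,
  (SELECT COUNT(*) FROM payment_transactions WHERE payment_date = CURRENT_DATE) AS payments_count;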

In practice, cross-checking is harder than it sounds because of one simple question: if two of your data sources disagree, which one is correct? There is no easy answer; you will need to fall back on some of the techniques we’ve discussed earlier this week. The good news is that you know there is a problem! Hence, cross-checking is useful for detecting data issues but won’t solve them for you.

There are a number of other techniques to detect and correct dirty data which I didn’t have time to cover this week, but I encourage you to read about them:

  • Data Audits are periodic, manual reviews of your data to verify assumptions and consistency across your data stores.
  • Normalization is a statistical technique that can help you adjust your metrics and statistics if there is a small but consistent error in your data.
  • Smoothing is a numerical technique that reduces noise in data to make it easier to see trends (a small sketch follows this list).
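
For instance, a minimal smoothing sketch using a 7-row moving average (the daily_metrics table and its metric_date and metric_value columns are hypothetical, and window functions require a database such as PostgreSQL or MySQL 8+):

SELECT metric_date,
       metric_value,
       AVG(metric_value) OVER (ORDER BY metric_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS smoothed_value
FROM daily_metrics;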

Next Week: It’s the end of 2016, which means everyone is thinking about 2017 planning. We’ll review how data can improve your business planning process with our coverage of Data Driven Planning!

“Everything was perfectly healthy and normal here in Denial Land.” 

