Dirty Data: Duplicate Data

This is part 2 of our series on Dirty Data, previous segments are available in our archives.

One of the most common ways your data gets dirty is when a subset of your data records are duplicates. All of the data in those records is technically valid and correct, but the duplicates bring with them the risk of double counting and skewing of your analysis! For example, it’s common to find duplicate sales leads in your CRM system, which inflate your lead metrics artificially.

Data duplication can happen for a number of reasons:

  • Joining data sources. If you have the same data in two places and attempt to merge them together, you are bound to end up with at least a few duplicates.
  • Faulty software. If your software system automatically loads data but does not de-duplicate the data on load, it can easily add duplicates if the system is re-run in the future.
  • Manual error. If you rely on human beings to enter data or upload data files, they can easily lose track of their progress after taking a break and accidentally reload the same data twice! The more people you have entering data manually, the more this type of error can occur.

It would be easy to identify duplication if EVERY data record was duplicated, by simply looking through a sample of your data. But how do you find only small sets of duplicates in a much larger set? There are a lot of tools at your disposal:

  • Unique Identifiers. There are often fields in a given data record which should be unique. For example, contact information records should have unique mobile phone number for each person (although some people do have two mobile phones!). If two records have the same mobile phone number, there is a chance it’s a duplicate. If you are using a SQL database, it can be easy to find duplicate values that share the same unique identifier. The following query will show you all values of the field “email” shared by more than one record:

SELECT phone FROM your_table GROUP BY phone HAVING COUNT(*) > 1;

  • Multiple values. If there is no single field that uniquely identifies data records, you can use multiple fields together (or all of them) to identify the record. If you think that three fields (field_1, field_2, field_3) uniquely identify records you can check for duplicates easily:

SELECT field_1,field_2,field_3 FROM your_table GROUP BY field_1,field_2,field_3 HAVING COUNT(*) > 1;

This can be very slow if you have a large number of fields so you can utilize a hash to speed things up. A Hash function takes many different values and combines them into a unique number, allowing to compare numbers for uniqueness instead of all the fields themselves. Most databases support some kind of hash function.

  • Timestamps. If you have no unique identifiers and it is not practical to look for duplicates using multiple fields, you can fall back on looking at timestamps. Timestamps are added to most data records automatically and tell you the exact time (in microseconds) that the record was created. If you have data that is infrequently created (such as customer leads) then comparing timestamps should help identify duplicates since it unlikely you added two different records at exactly the same microsecond!

Duplication is dangerous as it adds in systematic bias to your data that is hard to detect. Many data teams will have software that checks for duplicates on a regular basis and it is worth setting up if you don’t do this already yourself.

Tomorrow we’ll talk about another insidious form of dirty data: missing values!


Quote of the Day: “Being a one of a kind means we are automatically the best in the world at what we do.” ― Victor WIlliamson