Data Silos: Fingerprinting
Joins seem like a great way to consolidate all of your data quickly and easily, but, unfortunately, you won’t always have the luxury of a common unique identifier. Your customer support system may identify customers by their email address, but your payments system might only have their physical address. How do you combine different sources of data with no common identifier?
Fingerprinting is a method of creating a new identifier from the information you do have in each system. Instead of relying on a single unique identifier, you combine a few different pieces of data to create a composite identifier.
For example, returning to our online e-commerce company example, we might want to join the data from our Email Marketing System and our Website Logs to see which users visited our website from an email we sent them. Below are two example records from each:
As you can see, there are no unique identifiers in common across the two different systems. However, we can create a unique identifier (fingerprint) by combining a few different pieces of information together:
- IP Address – This is the network address of a computer, which changes frequently in many cases. A customer might have a different IP address every time they connect, but they will use the same IP address for the entirety of their session.
- Date/Time – This is the date and time that the user connected. Many different users might connect at the same time.
While neither the IP Address nor the Date/Time is unique on its own, together they are usually enough to pick out a single customer!
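To make the idea concrete, here is a minimal Python sketch of joining two sets of records on an IP Address plus Date/Time fingerprint. The records, field names, and the choice to truncate timestamps to the minute are all assumptions made for illustration; your systems will record these values differently.

```python
from datetime import datetime

def fingerprint(ip, timestamp):
    """Combine an IP address and a timestamp (truncated to the minute) into one key."""
    minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
    return f"{ip}|{minute.isoformat()}"

# Hypothetical records: every field name and value here is invented.
email_clicks = [
    {"email": "ann@example.com", "ip": "203.0.113.7", "clicked_at": "2024-03-01T10:15:42"},
    {"email": "bob@example.com", "ip": "198.51.100.23", "clicked_at": "2024-03-01T10:15:05"},
]
web_logs = [
    {"page": "/sale", "ip": "203.0.113.7", "visited_at": "2024-03-01T10:15:44"},
    {"page": "/home", "ip": "192.0.2.99", "visited_at": "2024-03-01T10:15:10"},
]

# Index one side by fingerprint, then look up each record from the other side.
clicks_by_fp = {fingerprint(c["ip"], c["clicked_at"]): c for c in email_clicks}

joined = []
for visit in web_logs:
    match = clicks_by_fp.get(fingerprint(visit["ip"], visit["visited_at"]))
    if match is not None:
        joined.append({"email": match["email"], "page": visit["page"]})

print(joined)
```

Truncating to the minute lets two timestamps that are a couple of seconds apart still match, but it will miss pairs that straddle a minute boundary. A real join would probably need a tolerance window instead.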
Combining different pieces of data together into fingerprints is a very common approach to joining data, but it does have some drawbacks:
- It may not be possible to guarantee that the fingerprint you choose will be unique, so using it to join data can introduce errors. You should expect some duplicates or undercounting.
- Depending on the fingerprint, you may not be able to use it to identify the same user across multiple different interactions. In our example above, using IP address and Date/Time will not help us identify when a customer returns a week later because both their IP Address and Date/Time will be different.
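One way to gauge the first drawback before you join anything is to count how many records share the same fingerprint. The sketch below is only an illustration: the fingerprint strings are invented, and `find_collisions` is a hypothetical helper name.

```python
from collections import Counter

def find_collisions(fingerprints):
    """Return each fingerprint that appears more than once, with its count."""
    counts = Counter(fingerprints)
    return {fp: n for fp, n in counts.items() if n > 1}

# Hypothetical fingerprints built from IP Address and Date/Time.
fps = [
    "203.0.113.7|2024-03-01T10:15:00",
    "198.51.100.23|2024-03-01T10:15:00",
    "203.0.113.7|2024-03-01T10:15:00",  # e.g. two people behind one shared office IP
]

collisions = find_collisions(fps)
print(collisions)
```

If a large share of your fingerprints collide, the join will produce duplicates, and that is a sign you need more data in the fingerprint.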
The best fingerprints combine a wide variety of data to minimize these drawbacks, but no matter what you do, a fingerprint will not be a perfect substitute for a unique identifier.
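As a rough sketch of combining more pieces of data, the Python below normalizes any number of fields and hashes them into one fixed-length key. The function name is hypothetical, and the browser string in the example is an assumed extra field that the systems above may not actually record.

```python
import hashlib

def composite_fingerprint(*fields):
    """Hash several normalized fields into a single fixed-length fingerprint."""
    # Trim and lowercase so cosmetic differences between systems do not break the match.
    normalized = [str(f).strip().lower() for f in fields]
    # Join with a separator that is unlikely to appear in the data,
    # so ("ab", "c") and ("a", "bc") produce different keys.
    joined = "\x1f".join(normalized)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical values: IP address, timestamp truncated to the minute, browser string.
fp_a = composite_fingerprint("203.0.113.7", "2024-03-01T10:15", "Mozilla/5.0")
fp_b = composite_fingerprint(" 203.0.113.7 ", "2024-03-01T10:15", "mozilla/5.0")
print(fp_a == fp_b)
```

Lowercasing every field is a simplification. If a field is case-sensitive in your data, you would want to leave it as-is.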
Unfortunately, even fingerprinting is not possible in all cases, so what else can you try? Tomorrow we’ll talk about that when we cover correlations.
Quote of the Day: “I’ve got a good mind to go out and join a club and beat you over the head with it.” ― Groucho Marx