If data is the basis of sound decision making, then we need to be sure the data we use is accurate and reliable. Previously we covered how to handle systemic bias and large-scale errors in your data, the kinds of errors that can cause massive shifts in your metrics. Far more common, though, are many small errors and noise, which together produce what we call dirty data.
Many analysts and data scientists report spending up to 90% of their time transforming and scrubbing their data to make it ready for analysis and decision making. That hints at how enormously difficult this challenge can be, especially if you want to use machine learning algorithms that are sensitive to noise. While you might be able to spot noise in your data by eye, most automated systems cannot tell the difference and will make mistakes when the data is dirty.
Often, the cause of dirty data is as important as the dirt itself. Rather than cleaning your data every day, identify the root cause of the noise and fix it at the source; your data will get cleaner and cleaner over time.
This week we’ll cover some common techniques for identifying dirty data and scrubbing away that dirt to make the data more useful. Specifically, we will cover:
Many of our examples this week will deal with data stored in a SQL database, but the same techniques can be applied to any database or data warehouse using the tools and interfaces it provides.
Tomorrow we will get started with finding and fixing data duplication problems!
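As a small preview of the duplication topic, here is a minimal sketch of one common SQL pattern for surfacing duplicate rows: group on the column you expect to be unique and keep only groups that occur more than once. The `users` table and its contents are hypothetical, and SQLite stands in for whatever database you actually use:

```python
import sqlite3

# Build a tiny throwaway table with one duplicated email address.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "a@example.com")],
)

# GROUP BY the candidate key, then HAVING filters to groups with
# more than one row -- i.e., the duplicated values.
dupes = conn.execute(
    "SELECT email, COUNT(*) AS n FROM users "
    "GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

print(dupes)  # [('a@example.com', 2)]
```

The same `GROUP BY ... HAVING COUNT(*) > 1` query works against most SQL databases, not just SQLite.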
Quote of the Day: “Dirty deeds, done dirt cheap” – AC/DC