Data Audits: Defining Data to Avoid Confusion
Validating Data Definitions
While it’s easy to think about data in a vacuum, data is only as good as the meanings that we attach to it. If you cannot be confident in those meanings, or in the general understanding of those meanings by the people using data, then you cannot be confident in your data.
Auditing your data definitions is critical, and the first step is to write them down. Sometimes this means defining the values of fields in a SQL database table or defining the metrics you use in your reporting tools. The key to a good definition is being specific, because there are so many ways data can be confused.
For example, take the following timestamp that you might find on a data entry:
This looks pretty straightforward, but there are a number of questions you will want to answer:
- What is the date format (typically timestamps are “year-month-day”)?
- Is the time in the morning or afternoon (typically timestamps are in 24-hour time)?
- What time zone is it from?
- Does it include daylight saving time (depending on location)?
- Is the clock that recorded it reliable?
- What calendar format is being used?
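To make the ambiguity concrete, here is a minimal sketch of how one raw string can resolve to two different moments depending on the definition you assume. The raw value, the two formats, and the UTC offsets are hypothetical illustrations and are not taken from any real data set.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw value; the formats and offsets below are assumed definitions.
RAW = "03/04/2021 05:30"


def parse_with_definition(raw: str, fmt: str, utc_offset_hours: int) -> datetime:
    """Parse `raw` using an explicit format and fixed UTC offset, then convert to UTC."""
    local = datetime.strptime(raw, fmt).replace(
        tzinfo=timezone(timedelta(hours=utc_offset_hours))
    )
    return local.astimezone(timezone.utc)


# Definition A (assumed): month/day/year, recorded at UTC-05:00.
month_first = parse_with_definition(RAW, "%m/%d/%Y %H:%M", -5)
# Definition B (assumed): day/month/year, recorded in UTC.
day_first = parse_with_definition(RAW, "%d/%m/%Y %H:%M", 0)

print(month_first.isoformat())  # 2021-03-04T10:30:00+00:00
print(day_first.isoformat())    # 2021-04-03T05:30:00+00:00
```

The two results are almost a month apart, which is why the format and time zone belong in the written definition rather than in someone's memory.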
If even a simple timestamp can raise so many questions, imagine the questions raised by numbers, metrics and more complex data. You need clear and detailed definitions for the important parts of your data that answer questions such as:
- Where does it come from?
- How was it collected?
- Was it calculated, and if so, how?
- Are there known errors or biases?
- How can you validate that it is correct?
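One way to write such a definition down is as a small structured record, so that a missing answer is easy to spot. The sketch below is only an illustration: the `DataDefinition` class, its field names, and the example metric are all hypothetical, chosen to mirror the questions above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DataDefinition:
    """Hypothetical record with one field per question a definition should answer."""
    name: str
    source: str             # where the data comes from
    collection_method: str  # how it was collected
    calculation: str        # how it was calculated, or "not calculated"
    known_issues: str       # known errors or bias, or "none known"
    validation: str         # how to check that it is correct


def missing_fields(definition: DataDefinition) -> list[str]:
    """Return the names of fields that were left blank."""
    return [
        f.name for f in fields(definition)
        if not getattr(definition, f.name).strip()
    ]


# Example definition for an invented metric; every value here is illustrative.
weekly_active_users = DataDefinition(
    name="weekly_active_users",
    source="application event log",
    collection_method="one event per user session",
    calculation="count of distinct user ids over the past 7 days",
    known_issues="",  # deliberately left blank to show the check
    validation="compare against a manual count for one sample week",
)

print(missing_fields(weekly_active_users))  # ['known_issues']
```

A blank field is a prompt to go and find the answer, not something to fill with a guess.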
While you should have definitions for all of your data, you cannot manually validate that every record adheres to every definition. Instead, you will need to randomly select data to check against your definitions, and check enough samples to have confidence in all of your data. This is often done using automated tests, but it can be done manually if your data set is small and not overly complex.
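As a rough sketch of what such an automated test might look like, the snippet below draws a random sample of records and runs each one through a set of checks derived from the definitions. The `audit_sample` function, the record layout, and the two checks are assumptions made for illustration; a real audit would use your own definitions and a sample size suited to your data.

```python
import random


def audit_sample(records, checks, sample_size, seed=None):
    """Check a random sample of records against named checks.

    Returns the number of records sampled and a list of
    (record, failed_check_name) pairs.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    failures = []
    for record in sample:
        for name, check in checks.items():
            if not check(record):
                failures.append((record, name))
    return len(sample), failures


# Hypothetical records and checks, invented for this example.
RECORDS = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 2, "age": -1, "country": "US"},
    {"id": 3, "age": 51, "country": "usa"},
    {"id": 4, "age": 27, "country": "DE"},
]
CHECKS = {
    "age is between 0 and 120": lambda r: 0 <= r["age"] <= 120,
    "country is a two-letter uppercase code": lambda r: (
        len(r["country"]) == 2 and r["country"].isupper()
    ),
}

checked, failures = audit_sample(RECORDS, CHECKS, sample_size=4, seed=0)
print(f"checked {checked} records, found {len(failures)} failures")
for record, name in failures:
    print(f"record {record['id']} failed: {name}")
```

Passing a fixed `seed` makes a given audit run repeatable, which helps when you need to show someone exactly which records were checked.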
Be sure to have a change process in place when changing the definitions of your data, lest you confuse yourself and your audit by checking against out-of-date definitions!
In Review: Regular data audits are critical to making your data reliable and hence useful. While you don’t want to audit the data constantly, regular audits can catch problems with both data quality and definitions. It is much easier to audit your own data than to try to re-earn the trust of your organization if data issues arise.