You will never have perfect data.
It’s a sad but inevitable truth. Rare enough are the cases where you can find the data you need for your project, let alone data without a variety of errors that could compromise its quality.
The silver lining is that you probably don’t need perfect data to begin with. As long as you can effectively measure and test the quality of what you do have, you can understand your data’s limitations and find new and creative ways to address them.
I often hear people talk about measuring the ‘accuracy’ and ‘completeness’ of their data, and while these are certainly important concepts, I wanted to go beyond that to discuss some of the other aspects of data quality that I think are equally important. After all, what does it mean for your data to be accurate and complete if it isn’t available until after you’ve made a decision you needed it for? So, let me summarize the six measures I consider for data quality - Accuracy, Completeness, Integrity, Precision, Validity and Timeliness - and share some thoughts on how to test for them in practice.
Accuracy
Data is accurate when it can be verified against an accepted version of the truth, a trusted authoritative source. Without an external validation source, it is very difficult to prove your data is accurate.
Data can be inaccurate at capture or become inaccurate due to corruption over time. An example of inaccurate data at capture would be if my address was captured as 132 Rand Street when I actually lived at 123 Rand Street. Both are valid addresses, but only one is my actual home. Over time I might move home, but not have my new address updated in my records.
How can I test it?
Find an external trusted data source to validate against; if one is not available, find a proxy that you can compare against. For example, you might not be able to compare individual transactions against another source, but you may be able to compare them to the aggregate total or end-of-day values.
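As a rough illustration, here’s a minimal sketch in Python using pandas (my choice purely for illustration), assuming your transactions and the authoritative end-of-day totals both live in CSV files with hypothetical ‘date’ and ‘amount’ columns:

```python
import pandas as pd

# Hypothetical files: 'transactions.csv' holds your own records with 'date' and
# 'amount' columns; 'eod_totals.csv' holds the trusted end-of-day total per date,
# also in an 'amount' column.
transactions = pd.read_csv("transactions.csv", parse_dates=["date"])
eod_totals = pd.read_csv("eod_totals.csv", parse_dates=["date"])

# Aggregate your data to the same grain as the trusted source.
daily = transactions.groupby("date", as_index=False)["amount"].sum()

# Any day where the two totals differ is worth investigating.
check = daily.merge(eod_totals, on="date", how="outer", suffixes=("_ours", "_source"))
check["difference"] = check["amount_ours"].fillna(0) - check["amount_source"].fillna(0)
print(check[check["difference"].abs() > 0.01])
```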
Completeness
Data is complete when each entity is counted exactly once. This means that you’re not missing any records or columns in the data that you need, and that you don’t have any extra records or values either. Nothing is being under or over reported.
Completeness issues often occur when data is transmitted between systems and records are either sent more than once or missed altogether. In some cases, data is filtered before being shared with another system, so records are unknowingly lost.
How can I test it?
When testing for completeness you will first need to decide on the things you are trying to measure. This might sound obvious, but consider the case when a customer signs up for an account on your website. If they sign up 5 times, how many records will you hold for that customer? What if they signed up using different email addresses? You will need to think about whether to consider them as individual accounts or as a single person.
If you have a unique ID or reference key, check if you have more than one record for that key to identify duplicate values (a sketch of this check follows this list).
If you have an external source of truth, reconcile your ID/reference key against that source to identify missing values or extra ones.
If you have a checksum or control total, confirm that your dataset has that many records or computes to the same checksum.
For column completeness, identify which columns should not be missing values and analyze any records that are missing data in those columns.
If you don’t have a reference number, consider other things you could use as a proxy to indicate if your data is over or under reported. For example, the volume of transactions per day, the aggregate value of transactions in a month or per user.
If you don’t have a reference number, consider what combination of attributes you could use to identify duplicates in your data. For example, email address, physical address or phone number paired with a name or date of birth.
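Here’s a rough sketch of the duplicate and column-completeness checks above, assuming a hypothetical customer extract with ‘customer_id’, ‘email’, ‘date_of_birth’ and ‘signup_date’ columns:

```python
import pandas as pd

# Hypothetical customer extract with a unique 'customer_id' plus contact details.
customers = pd.read_csv("customers.csv")

# Duplicate check on the unique key.
key_dupes = customers[customers.duplicated(subset=["customer_id"], keep=False)]
print(f"{len(key_dupes)} rows share a customer_id with another row")

# Duplicate check on a combination of attributes when there is no reliable key.
attr_dupes = customers[customers.duplicated(subset=["email", "date_of_birth"], keep=False)]
print(f"{len(attr_dupes)} rows share an email and date of birth with another row")

# Column completeness: records missing values in columns that should always be populated.
required = ["customer_id", "email", "signup_date"]
missing = customers[customers[required].isna().any(axis=1)]
print(f"{len(missing)} rows are missing at least one of {required}")
```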
Integrity
For users to be able to leverage data across multiple systems and reports, it's important that the data is recorded in a consistent manner so that comparisons are meaningful. If two sources disagree on a particular value or calculate the same value in different ways, the resulting inconsistencies make it hard to focus on bigger issues.
Integrity is a significant data quality risk when information flows between systems and could either be miscommunicated or become stale over time.
How can I test it?
Try to obtain a data dictionary for any data provided so that there is a clear explanation of what each column is expected to contain. Compare these definitions across sources.
Where you have two sources, compare attributes between the sources to identify potential omissions or inaccuracies and define which you consider to be the authoritative source of that data.
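A minimal sketch of that cross-source comparison, assuming two hypothetical extracts that share a ‘customer_id’ key and both record a ‘postcode’ attribute:

```python
import pandas as pd

# Hypothetical: the same customers held in a CRM and a billing system, joined on a
# shared key so that attribute values can be compared side by side.
crm = pd.read_csv("crm_customers.csv")
billing = pd.read_csv("billing_customers.csv")

merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))

# Customers where the two systems disagree about the postal code.
mismatched = merged[merged["postcode_crm"] != merged["postcode_billing"]]
print(f"{len(mismatched)} customers have different postcodes in the two systems")
```

Whichever source you declare authoritative, the mismatched rows tell you where the other system needs to be corrected or refreshed.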
Precision
Precision measures how granular a result is. It is an important concept when assessing the reliability of measurements in their greater context and should be communicated using units and a tolerance – for example, height being measured in meters +/- 1cm.
Precision can become a data quality risk if it is not maintained as data flows between systems, or if there is a lack of clarity about the recorded precision (or the reasoning behind a particular level of precision). In low-latency trading, for instance, decisions are made in milliseconds. If the time of a trade is recorded only to the second or the minute, that information would be meaningless for future decisions.
Note that precision is not the same as accuracy; a result can be emphatically precise and completely inaccurate (for example, I can report my height as 47.239476 meters – very precise, and utterly false).
How can I test it?
Examine the values you have - do they cluster around specific values that may indicate they are being rounded to a less granular level?
Be clear about the reasons why you have a particular level of precision for a recorded value. What level of precision do you need for your calculations?
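A small sketch of both checks, assuming a hypothetical trade file with an ‘executed_at’ timestamp and a ‘price’ column:

```python
import pandas as pd

# Hypothetical trade file with an 'executed_at' timestamp and a 'price' column.
trades = pd.read_csv("trades.csv", parse_dates=["executed_at"])

# If every timestamp lands on an exact second, sub-second precision has probably
# been lost somewhere upstream.
on_the_second = (trades["executed_at"].dt.microsecond == 0).mean()
print(f"{on_the_second:.1%} of timestamps fall on an exact second")

# For numeric values, count digits after the decimal point to spot silent rounding.
decimals = (
    trades["price"].astype(str)
    .str.extract(r"\.(\d+)$")[0]  # digits after the decimal point, if any
    .str.len()
    .fillna(0)
)
print(decimals.value_counts().sort_index())
```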
Validity
Validity measures whether a data attribute falls within certain predefined validation criteria. These rules typically reflect business requirements and can be valuable in stopping bad data from entering a system in the first place. For example, birth dates should [probably] never be in the future and a person’s weight should not be a negative number.
If the data in our system don’t obey the rules we expect of them, they may break any information flows or systems that depend on that data.
How can I test it?
Where you’re expecting categorical values, check that your data falls within those categories.
If you require certain fields to be present, check they are populated.
For free text fields, analyze field lengths to look for instances where dummy data has been recorded.
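A brief sketch of these checks, assuming hypothetical patient records with ‘blood_type’, ‘date_of_birth’, ‘weight_kg’ and ‘notes’ columns:

```python
import pandas as pd

# Hypothetical patient records with a categorical field, some business rules
# and a free-text notes column.
records = pd.read_csv("patients.csv", parse_dates=["date_of_birth"])

# Categorical values should fall within the allowed set.
allowed_blood_types = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}
invalid_category = records[~records["blood_type"].isin(allowed_blood_types)]

# Business rules: birth dates should not be in the future, weights should not be negative.
future_births = records[records["date_of_birth"] > pd.Timestamp.today()]
negative_weights = records[records["weight_kg"] < 0]

# Very short free text (e.g. 'x' or '-') often indicates dummy data.
suspect_notes = records[records["notes"].str.len().fillna(0) < 3]

print(len(invalid_category), len(future_births), len(negative_weights), len(suspect_notes))
```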
Timeliness
Timeliness measures how relevant the information is within the timeframe in which you’re using it. Information quality may deteriorate over time and become less meaningful. For example, a business may prepare accounts every month to provide a picture of the company’s finances. If it did this only on an annual basis, it would be harder for the company to make financial decisions come the end of the year.
Depending on the type of decision or action being performed and the nature of the data, there may be risk if data is not being updated or provided in a timely manner.
How can I test it?
Look for expiration dates or last updated dates on key information fields. Things like contact information will inevitably change over time.
Check when a report was last refreshed or prepared.
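A quick sketch of a staleness check, assuming hypothetical contact records with a ‘last_updated’ timestamp and treating anything older than a year as potentially stale:

```python
import pandas as pd

# Hypothetical contact records with a 'last_updated' timestamp.
contacts = pd.read_csv("contacts.csv", parse_dates=["last_updated"])

# Flag anything not refreshed within the last year as potentially stale.
cutoff = pd.Timestamp.today() - pd.DateOffset(years=1)
stale = contacts[contacts["last_updated"] < cutoff]
print(f"{len(stale)} of {len(contacts)} contacts have not been updated in over a year")
```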
Round-up
Accuracy, Completeness, Integrity, Precision, Validity and Timeliness - these are my top six data quality considerations and the concepts I routinely rely on to help test the quality of the data I encounter. If your responsibilities include information management more generally, you may also want to consider things like the privacy and security of your data.
Let me know if you have any other things you watch out for when performing your analysis in the comments below, or on Twitter @eageranalyst.