Data quality checks: Why you need them
Explore how a data quality scorecard ensures high-quality, trustworthy data management.

Everyone's heard the old saying "garbage in, garbage out," which is, how do we say this, a frustrating oversimplification of a really hard problem. You can't tell the difference between high- and low-quality data just by looking through it. You can be a great data engineer and still end up with data quality problems.
Also, who started throwing garbage into the data in the first place? It's not like we built a data connector to GarbageData.io and started loading up our data warehouses with this stuff.
"Garbage in, garbage out," or "GIGO" as I like to call it, is one of those blame-y, demoralizing write-offs people use about data. It lacks nuance and ignores the fact that we have a lot of sophisticated tooling to manage "garbage."
Data quality checks are probably our biggest tool for knowing what data to use, fix, and avoid. They give us a way to actually quantify what we can know about the data, so we can have a more nuanced and accurate conversation about it.
Understanding data quality
To begin with, data quality isn't a "thing" on its own. It's kind of a constellation of metrics regarding the data across eight dimensions, nearly all of which can be quantified with real numbers and calculations:

- Accuracy: The data represents reality
- Completeness: All the required data is present
- Consistency: Data is consistent across different datasets and databases
- Reliability: The data is trustworthy and credible
- Timeliness: Data is up-to-date for its intended use
- Uniqueness: There are no data duplications.
- Usefulness: Data is applicable and relevant to problem-solving and decision-making
- Differences: Users know exactly how and where data differs

These aren't sorted in any particular order; they're all (nearly) equally important. The only dimension that's not quite measurable is "usefulness," though you could argue that adoption rates and actual usage metrics are a reflection of data's usefulness.
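For instance, completeness and uniqueness reduce to simple percentages you can compute directly in SQL. A minimal sketch, using a hypothetical orders table with order_id and customer_email columns:

```sql
-- Completeness: share of rows where customer_email is populated.
-- Uniqueness: distinct order_ids as a share of total rows (100% means no duplicates).
-- ("orders", "order_id", and "customer_email" are illustrative; swap in your own schema.)
SELECT
    COUNT(customer_email) * 100.0 / COUNT(*)    AS completeness_pct,
    COUNT(DISTINCT order_id) * 100.0 / COUNT(*) AS uniqueness_pct
FROM orders;
```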
When all the data quality metrics are pointing in the right direction (high accuracy, high completeness, high consistency, etc.), you know you have high data quality. When they're all pointing in the wrong direction, you know you have low data quality.
But it's certainly not binary. You aren't dealing with either "garbage" or "high-quality data." You're almost certainly somewhere between "unusable data" and "perfect data," which doesn't exist.
You always want high data quality, but not all data quality is created equal. If you're in finance or healthcare, for example, you've got to meet compliance requirements and make sure downstream consumers have the highest-quality information possible. You don't want someone getting the wrong medication because you sent bad data.
The role of data quality checks
In addition to measuring every dimension of data quality, there are other data quality checks that identify and correct common issues as they occur (and, ideally, before). Here are a few:
- Error Detection: Identify and correct inaccuracies such as typos, incorrect data values, or missing information to eliminate the risk of using inaccurate data. Make sure you can know (and prove!) that only accurate data is used.
- Consistency Enforcement: Since you're probably dealing with data from dozens of different data sources, you need to maintain data consistency. This helps to significantly reduce data quality errors from inconsistent entries.
- Completeness Assurance: Make sure all necessary data fields are filled in, preventing incomplete data from skewing analysis and decisions. Also, check for null values to ensure no critical data is missing.
- Duplicate Identification: You donât need the same information twice, so a duplication check can make sure data is only represented once.
- Range and Value Checks: Data should fall within expected ranges or values to prevent anomalies that could distort analysis. Use range checks to quickly verify adherence to set data parameters (see the sketch after this list).
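Here's what a range check might look like in practice; the transactions table, amount column, and the 0 to 100,000 bounds are hypothetical stand-ins for your own business rules:

```sql
-- Flag transaction amounts that fall outside the range the business considers plausible,
-- plus missing amounts. (Table, column, and bounds are illustrative.)
SELECT
    transaction_id,
    amount
FROM transactions
WHERE amount IS NULL
   OR amount < 0
   OR amount > 100000;
```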
At the risk of stating the obvious, data quality checks like these make sure you're not inadvertently doing anything that will undermine your long-term data quality goals. You can't control whether someone types a sixth digit into a five-digit zip code field, but you can at least prevent that unnecessary sixth digit from getting into your warehouse.
Types of data quality checks
Data quality checks typically fall into one of five categories: descriptive, structural, integrity, accuracy, and timeliness. The whole point is to make sure you're only getting accurate, trustworthy data that's formatted correctly, so it can work across many systems and be available when it's needed.
Sounds simple, but it ain't easy. Let's take a look at how these checks might work.
Descriptive checks: Does it represent the real-world value?
Descriptive checks ensure that the data correctly reflects real-world values. This involves verifying that data entries match expected patterns and formats, like checking that email addresses or phone numbers are structured correctly.
This SQL query identifies email addresses that do not follow the standard email format.
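A minimal sketch, assuming a hypothetical users table with user_id and email columns; the LIKE pattern below is a rough stand-in, and most warehouses offer regex functions for a stricter match:

```sql
-- Flag missing emails and emails that lack a basic name@domain.tld shape.
SELECT
    user_id,
    email
FROM users
WHERE email IS NULL
   OR email NOT LIKE '%_@_%._%';
```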
Structural checks: Is it organized and formatted correctly?
Structural checks ensure that data is organized and formatted correctly according to the database schema. This includes verifying that all fields are present and that data types are correct.
This query checks for zip codes that do not meet the standard five-digit format.
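A minimal sketch, assuming a hypothetical customers table with a text zip_code column; REGEXP_LIKE is Snowflake/Oracle-style syntax, so adjust the function name for your warehouse:

```sql
-- Flag zip codes that are not exactly five digits (or are missing entirely).
SELECT
    customer_id,
    zip_code
FROM customers
WHERE zip_code IS NULL
   OR NOT REGEXP_LIKE(zip_code, '^[0-9]{5}$');
```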
Integrity checks: Is the data complete and accurate everywhere?
Integrity checks validate the relationships between different data sets to ensure data completeness and accuracy. This might involve ensuring that all foreign keys match primary keys in another table.
This query identifies orders with customer IDs that do not exist in the customers table.
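A minimal sketch, assuming hypothetical orders and customers tables that share a customer_id key:

```sql
-- Find orders whose customer_id has no matching row in customers (orphaned foreign keys).
SELECT
    o.order_id,
    o.customer_id
FROM orders AS o
LEFT JOIN customers AS c
    ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```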
Accuracy checks: Does the data match a source of truth?
Accuracy checks compare data against a reliable source of truth to ensure correctness. This might involve cross-referencing data with a verified dataset or an external source.
This query identifies user records that do not match the records in a verified users table.
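A minimal sketch, assuming a hypothetical users table and a trusted verified_users reference table keyed on user_id; IS DISTINCT FROM treats NULLs as comparable values, so use a COALESCE-based comparison if your warehouse doesn't support it:

```sql
-- Surface users whose email disagrees with the verified source of truth.
SELECT
    u.user_id,
    u.email AS users_email,
    v.email AS verified_email
FROM users AS u
JOIN verified_users AS v
    ON u.user_id = v.user_id
WHERE u.email IS DISTINCT FROM v.email;
```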
Timeliness checks: Is the data up-to-date?
Timeliness checks ensure that data is current and reflects the latest information. This might involve verifying that data has been updated within a certain timeframe.
This query identifies users who have not logged in within the past 30 days.
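A minimal sketch, assuming a hypothetical users table with a last_login_at timestamp; interval syntax varies a bit by warehouse:

```sql
-- Find users whose last login is more than 30 days old, or who have never logged in.
SELECT
    user_id,
    last_login_at
FROM users
WHERE last_login_at IS NULL
   OR last_login_at < CURRENT_DATE - INTERVAL '30' DAY;
```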
Implementing data quality checks
Implementing effective data quality checks involves a systematic approach to ensure high data integrity and reliability. Here's a step-by-step process:
- Identify critical data elements: Pinpoint the most important data elements that have a significant impact on your company or data team's processes and decisions. Focus on areas where data quality issues could cause the most harm.
- Define acceptable quality thresholds: Establish clear and measurable quality thresholds for each critical data element. Determine what level of accuracy, completeness, and consistency is required for the data to be considered high quality. Incorporate these thresholds into a regular data quality assessment.
- Select data quality tools: Choose appropriate tools and technologies to automate your data quality checks. Look for data quality software that integrates well with your existing data infrastructure and provides robust data validation, cleaning, and data quality monitoring capabilities.
- Integrate data quality checks: Embed data quality checks into your daily data processing workflows. Ensure that these checks run regularly and automatically, so data issues are caught and addressed promptly (see the sketch below for one way to express a threshold as an automated check).
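One common pattern is to express a threshold as a query that returns rows only when the threshold is violated, which makes it easy to wire into an automated pipeline (dbt tests, for example, fail when a test query returns rows). A minimal sketch, assuming a hypothetical orders table and a 99% completeness target for customer_id:

```sql
-- Returns a row (and fails the check) only if customer_id completeness drops below 99%.
-- The table, column, and threshold are illustrative; set thresholds per critical element.
SELECT
    COUNT(customer_id) * 100.0 / COUNT(*) AS customer_id_completeness_pct
FROM orders
HAVING COUNT(customer_id) * 100.0 / COUNT(*) < 99.0;
```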
There are a million ways this process can play out, but at a high level, this is the work you need to do. Continuously monitor your data with automation. We can't stress this enough. You are not going to be able to keep up with your data using manual processes. If it's not automated, you will not have high-quality data. Here are some other articles we recommend reading:
- The difference between Datafold tests and dbt tests
- Datafold catches immutable changes to data
- Unlocking data quality with automated regression testing
- Data quality in data engineering workflows (why invest in automated data testing?)
- What is data validation?
Challenges in data quality checks
If data quality were easy, everyone would have high-quality data. Unfortunately, that ain't the case. Managing data quality is hard work. Companies with lots of automation and best practices in place can still end up with the following challenges:
- Data silos: Basically, "Sorry, you can't access this data even though we're at the same company." You might face organizational boundaries (e.g., different lines of business), policy differences (e.g., "We can't share this data with other teams"), or just plain technological differences (e.g., you're all Snowflake and they're all Databricks).
- Lack of standardization: Everyone wants to build their data house differently. Sounds cool early on, but years down the road it can lead to data that's not easily interoperable. Different teams might use varied formats and definitions for the same data, for example. Effective data transformations are key to harmonizing these varied formats.
- Varying data sources: One product team might have all of its usage data come from Google Analytics and another might use Adobe Analytics. Two totally different platforms capturing similar data in their own ways. Somebody's gotta figure out how to get the right data out of each system.
This stuff is inevitable. The bigger the company, the more likely these challenges are to occur. There are ways of solving for them, but that's another article in and of itself.
The key takeaway here, though, is that data quality can still suffer even if you have all the right automation and checks in place. It will never hurt to implement them, but they're not always feasible or easily manageable. Data quality management is a full-time job.
Datafold catches all unintended changes
Integrating Datafold into your data quality checks offers major benefits beyond simple error detection.
Current testing tools like dbt tests, data "unit tests", or custom SQL tests for edge cases share one property: they only catch expected changes. Datafold, by contrast, compares both versions of the data to detect all changes, whether intended or not. This is only possible through data diffing: value-level comparisons that flag exactly what value changed and where it changed.
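Conceptually, a value-level diff comes down to aligning both versions of a table on a primary key and flagging every value that disagrees. A heavily simplified sketch of the idea, using hypothetical orders_prod and orders_dev tables keyed on order_id (Datafold automates this kind of comparison at scale rather than relying on hand-written queries):

```sql
-- Rows appear only when a key exists on one side but not the other,
-- or when the compared value differs between the two versions.
SELECT
    COALESCE(p.order_id, d.order_id) AS order_id,
    p.order_total                    AS prod_order_total,
    d.order_total                    AS dev_order_total
FROM orders_prod AS p
FULL OUTER JOIN orders_dev AS d
    ON p.order_id = d.order_id
WHERE p.order_id IS NULL
   OR d.order_id IS NULL
   OR p.order_total IS DISTINCT FROM d.order_total;
```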

We make it easy to automate this data validation process and integrate it within your CI pipelines so you can proactively check your data assets for accuracy before use in any critical business decision. The time savings it creates allow your team to focus more on generating insights and creating value.
Discover how Datafold can enhance your data processes and uphold the highest standards of accuracy:
- Sign up for a 14-day free trial of Datafold to explore how data diffing can improve your data quality practices within, and beyond, data teams
- Book time with one of our solutions engineers to learn more about Datafold and how it will integrate with your tech stack