Data Quality Checks: Your Key to Reliable and Accurate Information

Everyone’s heard the old saying “garbage in, garbage out,” which is — how do we say this? — a frustrating oversimplification of a really hard problem to solve. You can’t tell the difference between high- and low-quality data by just looking through it. You can be a great data engineer and still end up with data quality problems.

Also, who started throwing garbage into the data in the first place? It’s not like we built a data connector to GarbageData.io and started loading up our data warehouses with this stuff.

“Garbage in, garbage out,” or “GIGO” as I like to call it, is one of those blame-y, demoralizing write-offs people throw around about data. It lacks nuance and ignores the fact that we have a lot of sophisticated tooling for managing “garbage.”

Data quality checks are probably our biggest tool for knowing what data to use, fix, and avoid. They give us a way to actually quantify what we can know about the data so we can have a more nuanced and accurate conversation about it.

Understanding data quality

To begin with, data quality isn’t a “thing” on its own. It’s kind of a constellation of metrics regarding the data across eight dimensions, nearly all of which can be quantified with real numbers and calculations:


  1. Accuracy: The data represents reality
  2. Completeness: All the required data is present
  3. Consistency: Data is consistent across different datasets and databases
  4. Reliability: The data is trustworthy and credible
  5. Timeliness: Data is up-to-date for its intended use
  6. Uniqueness: There are no data duplications
  7. Usefulness: Data is applicable and relevant to problem-solving and decision-making
  8. Differences: Users know exactly how and where data differs


These aren’t sorted in any particular order; they’re all (nearly) equally important. The only dimension that’s not directly measurable is “usefulness,” though you could argue that adoption rates and actual usage metrics are a reflection of data’s usefulness.
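
To make that concrete, here’s a minimal sketch of how a couple of these dimensions might be quantified in plain SQL; the users table and its email column are hypothetical stand-ins for whatever data you care about:

SELECT
  COUNT(*) AS total_rows,
  100.0 * COUNT(email) / NULLIF(COUNT(*), 0) AS completeness_pct,       -- share of rows where email is populated
  100.0 * COUNT(DISTINCT email) / NULLIF(COUNT(*), 0) AS uniqueness_pct -- share of rows carrying a distinct email value
FROM users;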

When all the data quality metrics are pointing in the right direction (high accuracy, high completeness, high consistency, etc.), you know you have high data quality. When all the metrics are in the wrong direction, you know you have low data quality.

But it’s certainly not binary. You aren’t dealing with “garbage” or “high quality data.” You’re most certainly somewhere between “unusable data” and “perfect data,” which doesn’t exist.

You always want high data quality, but the stakes aren’t the same everywhere. If you’re in finance or healthcare, for example, you’ve got to meet compliance requirements and make sure downstream consumers have the highest-quality information possible. You don’t want someone getting the wrong medication because you sent bad data.

The role of data quality checks

Beyond measuring each dimension of data quality, “data quality checks” also identify and correct common data quality issues as they occur (and, ideally, before they do). Here are a few:

  • Error Detection: Identify and correct inaccuracies such as typos, incorrect data values, or missing information to eliminate the risk of using inaccurate data. Make sure you can know (and prove!) that only accurate data is used. 
  • Consistency Enforcement: Since you’re probably dealing with data from dozens of different data sources, you need to maintain data consistency. This helps to significantly reduce data quality errors from inconsistent entries.
  • Completeness Assurance: Make sure all necessary data fields are filled in, preventing incomplete data from skewing analysis and decisions. Also, check for null values to ensure no critical data is missing.
  • Duplicate Identification: You don’t need the same information twice, so a duplication check can make sure data is only represented once.
  • Range and Value Checks: Data should fall within expected ranges or values to prevent anomalies that could distort analysis. Use range checks to quickly verify adherence to set data parameters (see the sketch after this list).
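
Here’s what such a check might look like in practice, folding a completeness check and a range check into one query; the orders table, its amount column, and the 0 to 10,000 range are hypothetical placeholders for your own schema and business rules:

SELECT *
FROM orders
WHERE amount IS NULL      -- completeness: a required value is missing
   OR amount < 0          -- range: below the expected minimum
   OR amount > 10000;     -- range: above the expected maximum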

At the risk of stating the obvious, data quality checks like these make sure you’re not inadvertently doing anything that will mess with your long-term data quality goals. You can’t control whether someone types a sixth digit in a five-digit zip code field, but you can at least prevent that unnecessary sixth digit from getting into your warehouse.

Types of data quality checks

Data quality checks typically fall into one of five categories: descriptive, structural, integrity, accuracy, and timeliness. The whole point is to make sure you’re only getting accurate, trustworthy data that’s formatted correctly so it can work across many systems and be available when it’s needed.

Sounds simple, but it ain’t easy. Let’s take a look at how these checks might work.

Descriptive checks: Does it represent the real-world value?

Descriptive checks ensure that the data correctly reflects real-world values. This involves verifying that data entries match expected patterns, like checking that email addresses or phone numbers are in the correct format.

This SQL query identifies email addresses that do not follow the standard email format:



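-- Returns emails that don't match a rough name@domain.tld shape; a simple LIKE heuristic, not full validation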
SELECT email
FROM users
WHERE email NOT LIKE '%_@__%.__%';

Structural checks: Is it organized and formatted correctly?

Structural checks ensure that data is organized and formatted correctly according to the database schema. This includes verifying that all fields are present and that data types are correct.

This query checks for zip codes that do not meet the standard five-digit format.


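-- Assumes zip_code is stored as text; storing zips as integers silently drops leading zeros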
SELECT *
FROM users
WHERE LENGTH(zip_code) != 5;

Integrity checks: Is the data complete and accurate everywhere?

Integrity checks validate the relationships between different data sets to ensure data completeness and accuracy. This might involve ensuring that all foreign keys match primary keys in another table.

This query identifies orders with customer IDs that do not exist in the customers table.


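-- Anti-join: keeps only orders whose customer_id has no matching row in customers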
SELECT orders.order_id
FROM orders
LEFT JOIN customers ON orders.customer_id = customers.customer_id
WHERE customers.customer_id IS NULL;

Accuracy checks: Does the data match a source of truth?

Accuracy checks compare data against a reliable source of truth to ensure correctness. This might involve cross-referencing data with a verified dataset or an external source.

This query identifies user records that do not match the records in a verified users table.


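-- EXCEPT returns user rows with no identical counterpart in verified_users (MINUS in some dialects)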
SELECT user_id, first_name, last_name
FROM users
EXCEPT
SELECT user_id, first_name, last_name
FROM verified_users;

Timeliness checks: Is the data up-to-date?

Timeliness checks ensure that data is current and reflects the latest information. This might involve verifying that data has been updated within a certain timeframe.

This query identifies users who have not logged in within the past 30 days.


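-- PostgreSQL-style interval arithmetic; adjust the date math for your own warehouse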
SELECT *
FROM user_activity
WHERE last_login < NOW() - INTERVAL '30 days';

Implementing data quality checks

Implementing effective data quality checks involves a systematic approach to ensure high data integrity and reliability. Here’s a step-by-step process:

  • Identify critical data elements: Pinpoint the most important data elements that have a significant impact on your company or data team’s processes and decisions. Focus on areas where data quality issues could cause the most harm.
  • Define acceptable quality thresholds: Establish clear and measurable quality thresholds for each critical data element. Determine what level of accuracy, completeness, and consistency is required for the data to be considered high quality, and fold those thresholds into a regular data quality assessment (see the sketch after this list).
  • Select data quality tools: Choose appropriate tools and technologies to automate your data quality checks. Look for data quality software that integrates well with your existing data infrastructure and provides robust data validation, cleaning, and data quality monitoring capabilities.
  • Integrate data quality checks: Embed data quality checks into your daily data processing workflows. Ensure that these checks are run regularly and automatically, so that data issues are caught and addressed promptly.
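
As one concrete example, an acceptable-quality threshold could be enforced with a scheduled query like the sketch below; the customers table, its email column, and the 99% cutoff are hypothetical, and the query returns a row only when completeness falls below the threshold, so a scheduler or CI job can treat any result as a failure:

SELECT
  100.0 * COUNT(email) / NULLIF(COUNT(*), 0) AS email_completeness_pct
FROM customers
HAVING 100.0 * COUNT(email) / NULLIF(COUNT(*), 0) < 99.0;  -- only returns a row when completeness dips below 99%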

There are a million ways this process can play out, but at a high level, this is the work you need to do. Continuously monitor your data with automation; we can’t stress this enough. You are not going to be able to keep up with your data using manual processes. If it’s not automated, you will not have high-quality data.

Challenges in data quality checks

If data quality were easy, everyone would have high-quality data. Unfortunately, that ain’t the case. Managing data quality is hard work. Companies with lots of automation and best practices in place can still run into the following challenges:

  • Data silos: Basically: “Sorry, you can’t access this data even though we’re at the same company.” You might face organizational boundaries (e.g. different lines of business), policy differences (e.g. “We can’t share this data with other teams”), or just plain technological differences (e.g. you’re all Snowflake and they’re all Databricks).
  • Lack of standardization: Everyone wants to build their data house differently. Sounds cool early on, but years down the road it can lead to data that’s not easily interoperable. They might use varied formats and definitions for the same data, for example. Effective data transformations are key to harmonizing these varied formats.
  • Varying data sources: One product team might have all of its usage data come from Google Analytics and another might use Adobe Analytics. Two totally different platforms capturing similar data in their own ways. Somebody’s gotta figure out how to get the right data out of each system.

This stuff is inevitable. The bigger the company, the more likely these challenges are to occur. There are ways of solving for them, but that’s another article in and of itself.

The key takeaway here, though, is that data quality can still suffer even if you have all the right automation and checks in place. It will never hurt to implement them, but they’re not always feasible or easily manageable. Data quality management is a very full-time job.

Datafold catches all unintended changes

Integrating Datafold into your data quality checks offers major benefits beyond simple error detection.

Current testing tools like dbt tests, data “unit tests,” or custom SQL tests for edge cases share one property: they only catch expected changes. Datafold, by contrast, compares both versions of the data to detect all changes, whether intended or not. This is only possible through data diffing, which performs value-level comparisons that flag exactly what value changed and where it changed:

It's easy to see where the values have changed, down to the row level
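
Conceptually, a value-level comparison between two versions of a table boils down to something like the hand-rolled sketch below; the users_prod and users_dev tables and the single compared column are hypothetical, the IS DISTINCT FROM operator is PostgreSQL-style null-safe comparison, and this is only a toy illustration of what Datafold automates across every column and row at scale:

SELECT
  COALESCE(p.user_id, d.user_id) AS user_id,
  p.email AS prod_email,
  d.email AS dev_email
FROM users_prod AS p
FULL OUTER JOIN users_dev AS d
  ON p.user_id = d.user_id
WHERE p.user_id IS NULL                    -- row exists only in the dev version
   OR d.user_id IS NULL                    -- row exists only in the prod version
   OR p.email IS DISTINCT FROM d.email;    -- same row, different value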

We make it easy to automate this data validation process and integrate it within your CI pipelines so you can proactively check your data assets for accuracy before use in any critical business decision. The time savings it creates allow your team to focus more on generating insights and creating value.

Discover how Datafold can enhance your data processes and uphold the highest standards of accuracy:


Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
