The difference between Datafold tests and dbt tests
Understand the 3 critical differences between Datafold tests and dbt tests, and learn why combining them is crucial for complete data quality testing coverage.
Opening a pull request to modify a dbt model (or many at once!) can be a nerve-racking process. Even if your CI pipeline runs smoothly and every ✅ dbt ✅ test ✅ passes, passing tests tell you nothing about the data changes your code will introduce.
Are you confident that your code changes won’t introduce errors into your data?
If your dbt tests pass but breaking data changes sneak through, you could end up in crisis mode, with broken dashboards, malfunctioning pipelines, and lost stakeholder trust.
“Wait, whatever do you mean? I thought my dbt tests covered all scenarios that could break my pipelines,” you say.
Reader, would that it were so simple.
dbt tests prevent some data quality issues, but not all. Let’s go over three major differences between Datafold and dbt tests, and clarify why data teams need Datafold in CI in addition to dbt tests: the two techniques provide complementary test coverage and protect against fundamentally different data quality issues.
What is your benchmark for data quality?
Your CI pipeline should be able to verify the accuracy, completeness, and consistency of data whenever your code changes modify your data. Anything less means that you cannot guarantee quality data when you push data changes to production.
1. Datafold finds value-level differences between staging and production
Datafold compares two versions of the data and identifies differences, while dbt tests evaluate one version of the data and test assertions.
With Datafold, you can prevent issues such as:
- Errors in individual data values: Event timestamps or transaction amounts changing when they should be immutable
- Problematic distribution shifts: The distribution of customer ages shifting, even if individual values remain within an acceptable range
- Primary keys and rows dropped: Entire sections of tables missing due to faulty joins or filters
In contrast, dbt tests prevent issues such as:
- Values outside of a range you explicitly set
- Violated PK-FK relationships
- Primary keys that are duplicated or null
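To make the contrast concrete, here is a minimal Python sketch. It is not how Datafold or dbt work internally; the tables, column names, and the helper functions `passes_assertions` and `diff_tables` are all hypothetical and exist only for illustration. It shows an assertion-style check passing on both versions of a table while a keyed comparison of the two versions surfaces a changed value and a dropped row.

```python
# Hypothetical data: two versions of the same table, keyed by primary key.
production = {
    1: {"amount": 100.0, "age": 34},
    2: {"amount": 250.0, "age": 41},
    3: {"amount": 75.5, "age": 29},
}
staging = {
    1: {"amount": 100.0, "age": 34},
    2: {"amount": 205.0, "age": 41},  # amount silently changed
    4: {"amount": 10.0, "age": 52},   # new row; row 3 has been dropped
}


def passes_assertions(table):
    """Assertion-style checks on ONE version of the data.

    Keys must not be null and amounts must fall in an explicit range.
    (Dict keys are unique by construction, so uniqueness is implied here.)
    """
    return all(
        pk is not None and 0 <= row["amount"] <= 1000
        for pk, row in table.items()
    )


def diff_tables(old, new):
    """Compare TWO versions of the data by primary key."""
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = {}
    for pk in sorted(old.keys() & new.keys()):
        # Record (old_value, new_value) for every column that differs.
        cols = {
            col: (old[pk][col], new[pk].get(col))
            for col in old[pk]
            if old[pk][col] != new[pk].get(col)
        }
        if cols:
            changed[pk] = cols
    return {"removed": removed, "added": added, "changed": changed}


result = diff_tables(production, staging)
print(passes_assertions(production), passes_assertions(staging))
print(result)
```

Both versions satisfy the assertions, yet the comparison still reports that row 3 disappeared and that the amount on row 2 changed. That is the kind of gap a diff is meant to close.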
2. Datafold prevents a broad range of downstream issues
Because dbt models are interlinked with each other, data sources, and BI tools, small changes in one file can create a ripple effect and wreak havoc on downstream dependencies. Data quality issues can also affect the underlying data infrastructure and computing resources.
Datafold Cloud’s column-level lineage identifies the downstream tables and dashboards that will be impacted by data changes if the code in the pull request is deployed to production.
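The idea behind lineage-based impact analysis can be sketched as a graph traversal. The example below is only an illustration under stated assumptions: the lineage graph, the column and dashboard names, and the `downstream` helper are hypothetical, and nothing here reflects Datafold's actual implementation or API.

```python
from collections import deque

# Hypothetical lineage graph: each column maps to the assets that read from it.
lineage = {
    "stg_orders.amount": ["fct_orders.revenue"],
    "fct_orders.revenue": ["dash_revenue", "fct_customers.ltv"],
    "fct_customers.ltv": ["dash_retention"],
    "stg_orders.created_at": ["fct_orders.order_date"],
}


def downstream(graph, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream(lineage, "stg_orders.amount")
print(impacted)
```

In this toy graph, a change to `stg_orders.amount` reaches two dashboards through two intermediate columns, while `fct_orders.order_date` is untouched because it depends on a different source column.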
3. Datafold requires no manual test configuration or maintenance
Each dbt test must be manually defined and maintained for every field you want covered. If you have multiple tests per column, configuring all of them can take significant time, and the tests become harder to maintain as your dbt project scales.
Let’s take a look at how Datafold validates your data during your CI workflow in deployment testing.
When you open a pull request with some code changes, your CI pipeline kicks off several jobs, including Data Diff. The Datafold bot leaves a comment identifying modified tables and columns with differing values.
To investigate these discrepancies further, you can click on View details to scrutinize the value-level differences within the Datafold app.
Now that you’ve set up your workflow for Data Diffs, you don’t need to configure any more tests or invest in continued maintenance. This is quite different from dbt tests, which must be continuously updated as your dbt project evolves to ensure complete coverage.
Datafold and dbt tests work really well together
As you can see, Datafold and dbt test coverage is complementary, and each performs distinct and essential tests.
Integrating Datafold into your existing dbt project’s CI pipeline is straightforward. If you’re curious about how Datafold’s data diffing in CI can help your team avoid shipping code that breaks production data, here are a couple of ways to learn more:
- Set up a free CI consultation with Datafold’s CI experts to talk about what CI setup makes sense for your specific data environment and infrastructure.
- If you’re ready to start building CI for data, Datafold Cloud offers a free trial, so you can start connecting your databases today.