October 10, 2024

Best practices for data diffing with a shift-left approach

Learn strategies like Slim Diff, sampling, and automated monitors to improve data quality at scale.

Insung Ko, Elliot Gunn

If you’re like many data teams, ensuring that your data is reliable and consistent is a top priority. But with growing pipelines and more complex workflows, spotting issues before they cause problems in production can feel like a never-ending challenge. 

That’s where a shift-left approach comes in—moving data quality checks earlier in the development process to catch issues sooner. One of the most effective ways to do this is through data diffing.

At Datafold, we’ve worked with many data teams who want to proactively detect data discrepancies before they reach production. Data diffing helps you compare datasets, spot differences, and resolve issues quickly, so your pipelines stay accurate and efficient. 

In this article, we’ll walk through best practices for incorporating data diffing into your CI/CD pipeline, including how to handle large datasets and optimize performance with strategies like Slim Diff and sampling.

Optimize dbt projects with Slim Diff

If you’re using dbt, check that you have Slim CI configured. This allows dbt to build only the models that have changed, along with their downstream models, rather than rebuilding your entire dbt project. Change the command in your PR job from dbt build to dbt build --select state:modified+. Additionally, if you are using dbt Cloud, you can enable Defer to a previous run state on your production jobs, which will leverage your pre-existing state for even more efficient builds. (If you’re looking for a tutorial on how to implement Slim CI with a dbt Core project, we have you covered here.)
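To make that concrete, here’s a minimal sketch of a Slim CI job for a dbt Core project, written as a GitHub Actions workflow. The workflow structure, adapter, and manifest path are assumptions for illustration; the important detail is the state:modified+ selector plus --defer pointing at your production manifest:

# Illustrative Slim CI sketch for dbt Core; paths and setup steps are assumptions.
name: dbt-slim-ci
on: pull_request

jobs:
  build-modified:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt
        run: pip install dbt-snowflake  # swap in your warehouse's adapter
      # Assumes your production deployment publishes manifest.json and it has
      # been downloaded to ./prod-artifacts before this step runs.
      - name: Build only modified models and their downstream dependents
        run: dbt build --select state:modified+ --defer --state ./prod-artifacts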

But you can also optimize performance on the Datafold side with Slim Diff. By default, Datafold performs a data diff on both the modified models in a PR and their downstream models. However, enabling Slim Diff ensures that only models with dbt code changes are diffed, excluding downstream models. This strategy offers significant performance improvements and cost reductions, particularly for projects with deep DAGs (Directed Acyclic Graphs), where downstream data processing can be expensive and time-consuming.

How this works in Datafold

In Datafold, you can enable Slim Diff by navigating to Settings → Integrations → CI → (select your CI tool) → Advanced Settings and checking the Slim Diff box:

When you run a diff for a PR, Datafold will intelligently select only the tables impacted by the code changes, bypassing any downstream models that haven’t been directly affected. This can dramatically reduce both the runtime and the associated costs, especially in dbt projects with many layers of dependencies.

Large datasets?

Many of our customers deal with big data and worry about the performance impact of data diffing. But we have you covered with several strategies to ensure efficient and scalable data diffing.

Efficient hashing algorithm

Datafold leverages a highly efficient hashing algorithm that ensures rapid comparisons of large datasets. This algorithm reduces the computational overhead by creating lightweight fingerprints of your data, which allows for fast and reliable comparisons even with millions of rows. 

The use of hashing enables Datafold to quickly identify differences between datasets without the need to process every individual row, making it ideal for large-scale data comparisons. If you’re curious to learn more, check out our technical deep dive here.

Sampling

Sampling is one of the most effective ways to optimize large dataset comparisons. Instead of comparing every row, you can configure Datafold to compare only a statistically significant subset of your data. In Settings → Configuration → Advanced Settings, you can set parameters like sampling tolerance, confidence, and threshold to fine-tune the sample size.

We ran some experiments on the speedup you can achieve with sampling in different databases:

| Databases | vCPU | RAM (GB) | Rows | Columns | Time (full) | Time (sampled) | Speedup | RDS type | Diffs (full) | Diffs (sampled) | Per-column noise |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Oracle vs Snowflake | 2 | 2 | 1,000,000 | 1 | 0:00:33 | 0:00:27 | 1.22 | db.t3.small | 5399 | 5400 | 0 |
| Oracle vs Snowflake | 8 | 32 | 1,000,000 | 1 | 0:07:23 | 0:00:18 | 24.61 | db.m5.2xlarge | 5422 | 5423 | 0.005 |
| MySQL vs Snowflake | 2 | 8 | 1,000,000 | 1 | 0:00:57 | 0:00:24 | 2.38 | db.m5.large | 5409 | 5413 | 0 |
| MySQL vs Snowflake | 2 | 8 | 1,000,000 | 29 | 0:40:00 | 0:02:14 | 17.91 | db.m5.large | 5412 | 5411 | 0 |

For example, if your customer_transactions table has millions of rows, applying sampling (e.g., 10% of rows) will still provide reliable results while dramatically reducing the computational load. 

How this works in Datafold

In Datafold, you can configure sampling options when you create a new Data Diff or clone an existing one:

You can also set this up in your CI/CD configuration, under the Advanced settings:

Datafold allows you to define thresholds where sampling will be applied and determine the confidence level required to ensure that the sample is statistically representative of your data.
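To make these knobs concrete, here’s an illustrative sketch of the three settings. The key names are hypothetical (in Datafold they are set through the UI, not a YAML file), but the semantics mirror the descriptions above:

# Hypothetical key names for illustration only; Datafold exposes these in the UI.
sampling:
  threshold: 1000000  # only sample once a table exceeds ~1M rows
  confidence: 99      # how confident you want to be that the sample is representative (%)
  tolerance: 0.001    # acceptable rate of differing rows that sampling is allowed to miss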

Exclude specific columns or tables

If there are columns or tables that you would like to exclude from your data diffs that are run in your CI/CD pipeline, you can specify them in the Advanced settings in your CI/CD configuration in Datafold:

You can also exclude columns when you create a new Data Diff, or when you clone an existing one:

Good candidates include temp tables whose values are expected to be 100% different, or columns of metadata or timestamps that don't need to be diffed. This customization helps ensure your diffs are focused only on the critical data changes, further improving efficiency and accuracy.
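If you prefer to keep these exclusions in version control, they can also be declared per model in your dbt YAML. A minimal sketch, assuming an excluded_columns key under Datafold's datadiff meta block (verify the exact key name against the Datafold docs):

models:
  - name: customer_transactions
    meta:
      datafold:
        datadiff:
          # Assumed key name; skips noisy columns that don't need diffing
          excluded_columns:
            - updated_at
            - _etl_loaded_at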

Use SQL filters

If you only need to compare a specific subset of data, such as for a particular city or a recent time period, adding a SQL filter can streamline the diff process. This allows you to focus on the most relevant data without processing unnecessary rows. For instance, applying a filter to only compare data from the last 30 days can reduce processing time significantly.

How this works in Datafold

For example, to filter data for the last 7 days and for user IDs greater than 2350, your dbt YAML configuration would look like this:


models:
  - name: users
    meta:
      datafold:
        datadiff:
          # Diff only the rows matching this SQL predicate
          filter: "user_id > 2350 AND source_timestamp >= current_date() - 7"

Monitors as code

In Datafold, Monitors can be created programmatically via code. This allows teams to define their Data Diff Monitors, Schema Change Monitors, and other data quality checks directly in their CI/CD pipelines or configuration files. This ensures consistency across environments and simplifies version control. By integrating monitors via code, you automate the setup process and reduce manual intervention, which is especially useful for large-scale projects.

How this works in Datafold

Suppose you want to create a Data Diff Monitor for the orders table. You can define the monitor in your configuration as follows:


monitors:
  - name: "Orders Table Data Diff"
    type: "data_diff"
    table: "orders"
    key_columns: ["order_id"]  # primary key used to match rows across environments
    comparison_mode: "row_level"
    thresholds:
      - type: "row_diff"
        value: 0.01 # 1% threshold for differences
    schedule: "daily"  # run the diff on a daily cadence, no PR trigger needed

This YAML configuration can be defined in your code repository, ensuring that you have an audit history and change management workflow for your monitors. Here, the YAML is configured to run a Data Diff on a daily basis without having to trigger it through a PR each time. This automation reduces manual intervention, ensuring data quality is continuously monitored throughout the development cycle.

Automating monitors strategically

When it comes to maintaining data quality in production, one of the most effective strategies is automating your monitors to catch issues before they escalate. Following a shift-left approach, data diffs let you move data validation earlier in your CI/CD pipeline. For anomalies in production data, schema changes in source data, and other data quality issues outside of your CI/CD pipeline, automated monitors ensure that you're covered end-to-end.

We recommend a few ways to create monitors with a scalable, strategic approach.

How this works in Datafold

Use Schema Change Monitors where unexpected changes are most likely, particularly in your raw and source data. When you ingest data from external sources or systems that you don’t fully control, there’s a higher risk of unexpected schema changes. For example, a vendor (e.g., Salesforce) might change the structure of their integration, or updates to an app database can be replicated into your warehouse without prior notice.

Create custom Metric Monitors that align with the unique data quality thresholds of your pipeline. By customizing metrics, you can catch anomalies specific to your data environment and ensure that you're tracking the issues most relevant to your use case. Our ML algorithm will identify anomalies based on your historical data, accounting for periodic trends. We also have options to specify how sensitive you want the algorithm to be to potential outliers, which makes finding the right balance between signal and noise a simple process.

Regularly review and update your Data Test Monitors to reflect changes in your data model. As new fields or schema updates are introduced, ensure your validation rules—like checking for nulls in key columns or verifying correct data formats—are kept up to date to maintain data integrity.
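Tying these recommendations back to the monitors-as-code pattern above, all three monitor types can be sketched in a single config. The field names below follow the same illustrative schema as the earlier example rather than a documented spec:

monitors:
  - name: "Salesforce Accounts Schema Watch"
    type: "schema_change"
    table: "raw.salesforce.accounts"  # source data you don't fully control
    schedule: "hourly"
  - name: "Orders Revenue Anomalies"
    type: "metric"
    table: "orders"
    metric: "sum(revenue)"
    sensitivity: "medium"  # tune the anomaly detection's signal-to-noise balance
    schedule: "daily"
  - name: "Orders Key Integrity"
    type: "data_test"
    table: "orders"
    tests:
      - column: "order_id"
        assert: "not_null"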

Getting started with Datafold

For teams looking to optimize their data pipelines and ensure data quality at scale, Datafold offers a single, unified platform for data testing and monitoring.

If you're interested in learning more about data observability tooling and how Datafold’s data diffing can help your team proactively detect bad code before it breaks production data, request a demo with our team.