How we’re evolving cross-database diffing in Datafold

Leading data teams use Datafold to validate parity between databases and confirm that a migration is successful. As we continue to improve this product experience and evolve how data reconciliation is tested, we are invested in making cross-database validation as efficient and impactful as possible for data teams.

I’m excited to share three major improvements in Datafold that make source-to-target validation faster and more impactful for data teams:

  • Faster cross-database data diffing: Cross-database diffs now run up to 10x faster, reducing time-to-insight and compute costs on your warehouses.
  • Real-time diff results: See data differences live as Datafold identifies them, rather than waiting for the entire diff to complete.
  • Representative samples: Establish difference "thresholds" to stop diffs once a set number of differences has been found per column, saving you compute costs and time.

Data teams performing data reconciliation at scale, whether for a large migration during a modernization effort or for ongoing data replication for analytics purposes, can now validate parity across systems, at any scale, with newly unlocked speed.

Why it matters

Data diffing in CI empowers data engineers to prevent data quality issues from ever entering their production pipelines. And six months ago, we extended that value of “find data quality issues before they hit production” to one of the most taxing and intimidating projects a data team can undergo: data migrations.

We get it. We know how important it is to get the data right during migrations or regular data replication. This is the data that powers the analytics, reporting, and machine learning models of your business. Your stakeholders depend on it.

We know that when undergoing migrations, proving parity of tables across systems is a core requirement to receive stakeholder sign-off. We know that when your team is replicating data between databases and something goes wrong, reporting in your BI tool is greatly impacted or production systems break…leaving stakeholders confused and frustrated.

We understand that for source-to-target validation, teams need the information that matters to them, fast.

Which is why we’re excited to share three considerable improvements to the cross-database diffing experience in Datafold Cloud, all aimed at serving up the critical information your team needs to deliver excellence during a data migration.

Performance improvements

Our engineering team has spent months fine-tuning and adjusting our data diffing algorithm and we’re excited to share that teams can now experience up to 10x faster cross-database diffing.

Whether you’re undergoing a migration or performing ongoing data replication, leverage Datafold’s proprietary and performant cross-database diffing algorithm to validate parity across databases faster than manual testing ever could. Not only does this save data teams valuable time, but it reduces compute load and costs on their warehouses.

Alongside improving performance, we’ve been regularly adding new integrations to support cross-database data diffing. Cross-database diffing in Datafold Cloud is now compatible with 13 databases (with more being added monthly!), enabling your team to run value-level comparisons between the databases that matter to your business.

Real-time diff results

One of the innovations I’m personally most excited to share is real-time diff results. For large data diffs, we understand that you can act on partial information before the entire table diff is complete.

Now, with real-time diff results, the Overview and Values tabs will populate as Datafold finds differences. How does this impact you?

  • If you start seeing real-time value-level differences that you know are wrong, you can stop a diff in its tracks, and identify and fix the problem sooner.
  • Using the Overview tab in Datafold, quickly understand the magnitude of differences. For many teams, we recognize that there is often an error threshold, or an acceptable lack of parity. With real-time diff results, find out sooner whether the diff you’re running is meeting those error expectations, and stop the diff if it’s exceeding them.

No more waiting for a longer-running diff to complete. Simply start seeing differences as we identify them.

The functionality of real-time diff results allows your team to get the information that matters to them faster by providing value-level differences required for investigating data quality issues as they become available.
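To make the idea of acting on partial results concrete, here is a minimal Python sketch. It is not Datafold's implementation or API: the function names `stream_diff` and `run_until_tolerance`, the in-memory lists of rows, and the `max_diff_rate` tolerance are all hypothetical, chosen only to illustrate how running totals can be reported batch by batch and how a diff can be stopped as soon as an acceptable error rate is exceeded.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def stream_diff(
    source: List[Row], target: List[Row], key: str, batch_size: int = 2
) -> Iterator[Tuple[int, int]]:
    """Yield (rows_compared, rows_different) after each batch of source rows.

    Hypothetical helper: only source rows are checked, so rows that exist
    solely in the target are not reported by this simplified sketch.
    """
    target_by_key = {row[key]: row for row in target}
    compared = 0
    different = 0
    for start in range(0, len(source), batch_size):
        for row in source[start:start + batch_size]:
            compared += 1
            # A missing key or any differing value counts as one difference.
            if target_by_key.get(row[key]) != row:
                different += 1
        yield compared, different


def run_until_tolerance(
    progress: Iterable[Tuple[int, int]], max_diff_rate: float
) -> Tuple[Tuple[int, int], bool]:
    """Consume partial results; stop early once the tolerance is exceeded.

    Returns the last (rows_compared, rows_different) seen and whether the
    diff was stopped before all rows were compared.
    """
    last = (0, 0)
    for compared, different in progress:
        last = (compared, different)
        if different / compared > max_diff_rate:
            return last, True
    return last, False


source_rows = [{"id": i, "amount": i * 10} for i in range(1, 7)]
# In the target, id 2 has a different amount and id 3 is missing.
target_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},
    {"id": 4, "amount": 40},
    {"id": 5, "amount": 50},
    {"id": 6, "amount": 60},
]

print(run_until_tolerance(stream_diff(source_rows, target_rows, "id"), 0.25))
# prints ((2, 1), True): stopped after the first batch
```

With a tolerance of 25%, the first batch already shows one difference in two rows, so the run stops there instead of scanning the remaining rows.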

In the video below, real-time diff results are populated in the Values tab and can be "pulled" with the "Update with latest results" button.

Find differences faster with representative samples

With Datafold’s new Per-Column Diff Limit, you can now automatically stop a running data diff once a configurable threshold of differences has been found per column. As with all of these new cross-database diffing improvements, the goal of this feature is to help your team find data quality issues that arise during data reconciliation faster by providing a representative sample of your data differences, while reducing load on your databases.
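The per-column limit can be pictured with a short Python sketch. This is an illustration under stated assumptions, not Datafold's actual algorithm: the function `diff_with_column_limit` is hypothetical, rows are assumed to be already matched by primary key, and the stopping rule (stop checking a column once it reaches the limit, and end the diff once every compared column has reached it) is one plausible reading of the feature.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def diff_with_column_limit(
    pairs: Iterable[Tuple[Row, Row]], columns: List[str], limit: int
) -> Tuple[Dict[str, int], int]:
    """Count differences per column, stopping early at a per-column limit.

    Hypothetical sketch: `pairs` holds (source_row, target_row) tuples that
    are assumed to be matched by primary key already. Returns the
    per-column difference counts and the number of row pairs scanned.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    counts = {col: 0 for col in columns}
    active = set(columns)  # columns that have not reached the limit yet
    rows_scanned = 0
    for src, tgt in pairs:
        if not active:
            break  # every column has a full sample, so stop scanning
        rows_scanned += 1
        for col in list(active):
            if src[col] != tgt[col]:
                counts[col] += 1
                if counts[col] >= limit:
                    active.discard(col)
    return counts, rows_scanned


matched_pairs = [
    ({"amount": 10, "status": "paid"}, {"amount": 11, "status": "open"}),
    ({"amount": 20, "status": "paid"}, {"amount": 21, "status": "paid"}),
    ({"amount": 30, "status": "paid"}, {"amount": 31, "status": "open"}),
    ({"amount": 40, "status": "paid"}, {"amount": 41, "status": "open"}),
    ({"amount": 50, "status": "paid"}, {"amount": 51, "status": "open"}),
]

print(diff_with_column_limit(matched_pairs, ["amount", "status"], limit=2))
# prints ({'amount': 2, 'status': 2}, 3)
```

In this example both columns reach the limit of two differences after three row pairs, so the last two pairs are never compared. That skipped work is where the savings in time and database load would come from.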

Getting started

If you’re interested in learning more about cross-database data diffing for your data migration or replication efforts, there are a few ways to get started:

Happy diffing!

- Gleb Mezhanskiy, CEO of Datafold

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.