Datafold catches unintended changes to immutable data
Understand how Datafold's value-level data diffs detect and prevent unintended changes to immutable data.
As data engineers, we talk a lot about what it takes to optimize our complex data pipelines to efficiently process and transform data at scale. We often focus on improving performance and streamlining workflows to create data pipelines that enable our colleagues in analytics, operations, and strategy to work better together.
But the issue of data quality often gets overlooked because it’s a messy problem. How do you validate, at scale, the accuracy of data produced across all your data pipelines? Popular methods include dbt tests, unit tests, and manual SQL queries to verify ground truth – or, perhaps more commonly than we like to admit, just shipping changes to production without any validation, because it’s too hard and time-consuming to do anything else.
You can set up your pipeline meticulously to ensure everything is extracted, transformed, and loaded efficiently, but if you don’t have a way to compare the two versions of your data during deployment, you have no way of knowing whether your immutable data changed when it shouldn’t have.
It doesn’t have to be this way–and we have a pretty simple solution for how you can proactively catch unintended changes to immutable data before anything gets deployed to production.
When immutable data changes
First, a word on immutable data. What is it? It’s data that, once created, should not be modified, because it’s expected to remain constant over time:
- Names
- Birthdays
- Event timestamps
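As a concrete sketch, here’s a hypothetical `users` table (the table and column names are illustrative, not from any real schema) showing both kinds of columns side by side:

```sql
-- Hypothetical users table: the first three data columns are immutable,
-- the last two are expected to change over time.
CREATE TABLE users (
    user_id    BIGINT PRIMARY KEY,
    full_name  VARCHAR(255),  -- immutable: recorded once at signup
    birthdate  DATE,          -- immutable: a fixed fact about the user
    signup_at  TIMESTAMP,     -- immutable: an event timestamp
    email      VARCHAR(255),  -- mutable: users update it freely
    last_login TIMESTAMP      -- mutable: changes on every login
);
```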
But immutable data can and does change–and that’s not as rare as you might imagine. There are four main reasons why immutable data can inadvertently change:
1. Coding errors: An error in an automated data processing script may overwrite existing timestamps or names with incorrect values, leading to unexpected changes in the data.
2. Data integration failure: When data from multiple sources is merged or synchronized, discrepancies can arise. For example, if conflicting birthdate information is received from different data sources, it could result in an incorrect update to the birthdate field.
3. Data transformation errors: The process of data cleaning or normalization can inadvertently modify immutable data. For instance, if a data transformation rule incorrectly updates name formatting or date representations, it could result in unintended changes to the immutable data (see the sketch after this list).
4. Data migration errors: During data migration, data is transferred between systems or platforms, which increases the likelihood of data integrity issues. We’ve seen how inaccurate mapping or transformation of data during migration can result in unintended changes to immutable data fields.
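To make the third failure mode concrete, here’s a minimal, hypothetical example of a cleaning step gone wrong. `INITCAP` capitalizes the first letter of each word and lowercases the rest, so names with interior capitals are silently rewritten:

```sql
-- Hypothetical cleaning model that mangles an immutable field:
-- INITCAP lowercases interior capitals, so 'Ronald McDonald' quietly
-- becomes 'Ronald Mcdonald' in the transformed table.
SELECT
    user_id,
    INITCAP(TRIM(full_name)) AS full_name,  -- bug: rewrites names like 'McDonald'
    birthdate,
    signup_at
FROM raw_users;
```

Nothing errors out, no test fails, and the pipeline runs green – but an immutable field has changed for every affected row.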
Typical strategies don’t protect your data quality
So there are plenty of opportunities for immutable data to start changing. What are your options for catching this before bad data hits production? The usual candidates are the ones mentioned earlier: dbt tests, unit tests, and manual SQL queries against known-good values.
Whichever you choose, there’s one clear reason why these approaches all fail to guarantee data quality: they don’t compare the two versions of the data to detect unintended changes.
Value-level data diffs guarantee peace of mind
You need value-level data diffs to compare individual data points or records between two datasets to find differences. This is absolutely critical for detecting unintended changes in immutable data because it allows for a granular examination of data at the most fundamental level.
Predefined tests only check for the changes you anticipated and scripted upfront. It’s just not possible for data engineers to provide complete custom test coverage for the hundreds or thousands of models they’re responsible for.
Even if there were only one data model, it would not be possible to write a test for every value.
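For illustration, consider a typical predefined check – something like a dbt singular test (this exact test is hypothetical). It enforces the one rule you thought of in advance, and says nothing about values drifting in ways you didn’t anticipate:

```sql
-- Hypothetical predefined test: fail if any birthdate is NULL.
-- It passes even if every birthdate was silently shifted by one day.
SELECT user_id
FROM users
WHERE birthdate IS NULL;
```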
Through a value-level comparison, data engineers can pinpoint exactly which immutable data fields have been altered, allowing for targeted investigation and remediation. You will be able to tell whether it was a single-character change in a name field or a minute adjustment in a timestamp – changes that might be small, but which could have significant implications for data integrity.
Also, unlike other testing methods that focus on validating data transformations or integrity rules, value-level diffs compare the entire dataset or selected subsets across two versions. This comprehensive comparison ensures that no changes to immutable data go unnoticed, regardless of the scale or complexity of the dataset.
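As a minimal sketch of the idea – assuming both versions of the table share the primary key `user_id` and live in hypothetical `prod` and `dev` schemas – a value-level diff can be expressed as a full outer join that keeps only the rows where something differs (Datafold automates and scales this far beyond a hand-written query):

```sql
-- Minimal hand-rolled value-level diff between two table versions.
-- IS DISTINCT FROM treats NULLs as comparable values.
SELECT
    COALESCE(p.user_id, d.user_id) AS user_id,
    p.full_name AS prod_full_name, d.full_name AS dev_full_name,
    p.birthdate AS prod_birthdate, d.birthdate AS dev_birthdate,
    p.signup_at AS prod_signup_at, d.signup_at AS dev_signup_at
FROM prod.users AS p
FULL OUTER JOIN dev.users AS d
    ON p.user_id = d.user_id
WHERE p.user_id IS NULL                          -- row added in dev
   OR d.user_id IS NULL                          -- row missing from dev
   OR p.full_name IS DISTINCT FROM d.full_name
   OR p.birthdate IS DISTINCT FROM d.birthdate
   OR p.signup_at IS DISTINCT FROM d.signup_at;
```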
Automating value-level data diffs with Datafold Cloud
Hopefully, we’ve now convinced you that value-level data diffs meet an important need that no existing testing tool can.
Setting up value-level diffs to run during each deployment cycle can be tricky, unless you automate the process. Datafold Cloud makes it easy to do so with its integration into your development workflow.
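Mechanically, the gate a CI job needs is simple: count the rows where an immutable value differs between versions, and fail the deployment if the count isn’t zero. A hypothetical sketch of that check (Datafold Cloud wires this into your pull requests for you):

```sql
-- Hypothetical CI gate: any nonzero result means an immutable value changed.
SELECT COUNT(*) AS changed_immutable_rows
FROM prod.users AS p
JOIN dev.users AS d USING (user_id)
WHERE p.full_name IS DISTINCT FROM d.full_name
   OR p.birthdate IS DISTINCT FROM d.birthdate
   OR p.signup_at IS DISTINCT FROM d.signup_at;
-- A CI step fails the build when changed_immutable_rows > 0.
```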
Whenever you open a new pull request with code changes, the Datafold bot comments with diff summary statistics and value-level diffs that you and any reviewers can take in at a glance. Changed values are highlighted, making it easy to see how the data changed at a very granular level.
If you’re curious to learn more about how Datafold’s data diffing in CI can help your team prevent shipping code that breaks production data, here are a couple of ways to learn more:
- Set up a free CI consultation with Datafold’s CI experts to talk about what CI setup makes sense for your specific data environment and infrastructure.
- If you’re ready to start working on building CI for data, we have a free trial experience of Datafold Cloud, so you can start connecting your databases as soon as today.