Datafold catches unintended changes to immutable data
Understand how Datafold's value-level data diffs detect and prevent unintended changes to immutable data.
As data engineers, we talk a lot about what it takes to optimize our complex data pipelines to efficiently process and transform data at scale. We often focus on improving performance and streamlining workflows to create data pipelines that enable our colleagues in analytics, operations, and strategy to work better together.
But the issue of data quality often gets overlooked because it's a messy problem. How do you validate, at scale, the accuracy of data produced across all your data pipelines? Popular methods include dbt tests, unit tests, and manual SQL queries to verify ground truth. Or, perhaps more commonly than we like to admit, we just ship changes to production without any validation because it's too hard and time-consuming to figure out.
You can set up your pipeline meticulously to ensure everything is extracted, transformed, and loaded efficiently, but if you don't have a way to compare the two versions of your data during deployment (before and after your change), you have no way of knowing whether your immutable data changed when it should not have.
It doesn't have to be this way, and we have a pretty simple solution for how you can proactively catch unintended changes to immutable data before anything gets deployed to production.
When immutable data changes
First, a word on immutable data. What is it? It's a type of data that, once created, should not be modified because it's expected to remain constant over time:
- Names
- Birthdays
- Event timestamps
But immutable data can and does change, and that's not as rare as you might imagine. There are four main reasons why immutable data can inadvertently change:
1. Coding errors: An error in an automated data processing script may overwrite existing timestamps or names with incorrect values, leading to unexpected changes in the data.
2. Data integration failure: When data from multiple sources is merged or synchronized, discrepancies can arise. For example, if conflicting birthdate information is received from different data sources, it could result in an incorrect update to the birthdate field.
3. Data transformation errors: The process of data cleaning or normalization can inadvertently modify immutable data. For instance, if a data transformation rule incorrectly updates name formatting or date representations, it could result in unintended changes to the immutable data.
4. Data migration errors: During data migration, data is transferred between systems or platforms, which increases the likelihood of data integrity issues. We've seen how inaccurate mapping or transformation of data during migration can result in unintended changes to immutable data fields.
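To make the transformation case concrete, here is a minimal Python sketch of a cleaning rule that silently rewrites a name. The `normalize_name` function and the sample records are hypothetical and exist only for illustration; the point is that a reasonable-looking rule (trim whitespace, then title-case) can alter a value that was already correct.

```python
def normalize_name(name: str) -> str:
    """Hypothetical cleaning rule: trim whitespace and title-case the name."""
    return name.strip().title()


# Hypothetical source records keyed by user id; every name is already correct.
records = {
    1: "Ada Lovelace",
    2: "Grace McDonald",
    3: "Alan Turing",
}

cleaned = {user_id: normalize_name(name) for user_id, name in records.items()}

# Collect every record whose name no longer matches the source value.
changed = {
    user_id: (records[user_id], cleaned[user_id])
    for user_id in records
    if records[user_id] != cleaned[user_id]
}

print(changed)  # {2: ('Grace McDonald', 'Grace Mcdonald')}
```

Nothing in this pipeline fails: the rule runs, the output looks plausible, and one name is now wrong.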
Typical strategies don't protect your data quality
So there are plenty of opportunities for immutable data to start changing. What are your options for catching this before bad data hits production?
As you may have guessed, there's one clear reason why the approaches mentioned earlier (dbt tests, unit tests, and manual SQL queries) all fail to guarantee data quality: they don't compare the two versions of the data to detect unintended changes.
Value-level data diffs guarantee peace of mind
You need value-level data diffs to compare individual data points or records between two datasets to find differences. This is absolutely critical for detecting unintended changes in immutable data because it allows for a granular examination of data at the most fundamental level.
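As a rough illustration of the idea (not Datafold's implementation), the sketch below compares two versions of a small table keyed by primary key and reports every value that differs. The `value_level_diff` function, the table contents, and the `production`/`staging` names are all assumptions made up for this example.

```python
from typing import Any

Row = dict[str, Any]


def value_level_diff(
    before: dict[int, Row], after: dict[int, Row]
) -> list[tuple[int, str, Any, Any]]:
    """Return (primary_key, column, old_value, new_value) for every changed value.

    Rows or columns present in only one version are ignored in this sketch.
    """
    changes = []
    for key in sorted(before.keys() & after.keys()):
        old_row, new_row = before[key], after[key]
        for column in sorted(old_row.keys() & new_row.keys()):
            if old_row[column] != new_row[column]:
                changes.append((key, column, old_row[column], new_row[column]))
    return changes


# Hypothetical data: the same table built from production code and from a change.
production = {
    1: {"name": "Ada Lovelace", "birthday": "1815-12-10"},
    2: {"name": "Alan Turing", "birthday": "1912-06-23"},
}
staging = {
    1: {"name": "Ada Lovelace", "birthday": "1815-12-10"},
    2: {"name": "Alan Turing", "birthday": "1912-06-22"},
}

changes = value_level_diff(production, staging)
for key, column, old, new in changes:
    print(f"id={key} {column}: {old!r} -> {new!r}")
# id=2 birthday: '1912-06-23' -> '1912-06-22'
```

A real warehouse table would not fit in memory like this, so a production tool has to do the comparison inside or across databases, but the output is the same in spirit: a list of exact rows and columns whose values differ.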
With test-based approaches, there's no way of checking for changes beyond the tests you have defined and scripted up front. It's just not possible for data engineers to provide complete custom test coverage for the hundreds or thousands of models they're responsible for.
Even if there were only one data model, it would not be possible to write a test for every value.
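Here is a small Python sketch of that gap, under made-up assumptions: `passes_predefined_tests` stands in for a typical rule-based suite (unique ids, non-null timestamps, parseable ISO format), and the two row lists are hypothetical production and staging versions of the same table.

```python
from datetime import datetime


def passes_predefined_tests(rows: list[dict]) -> bool:
    """Hypothetical test suite: unique ids, non-null and parseable timestamps."""
    ids = [row["event_id"] for row in rows]
    if len(ids) != len(set(ids)):
        return False
    for row in rows:
        if row["occurred_at"] is None:
            return False
        try:
            datetime.fromisoformat(row["occurred_at"])
        except ValueError:
            return False
    return True


production = [
    {"event_id": 1, "occurred_at": "2024-03-01T09:15:00"},
    {"event_id": 2, "occurred_at": "2024-03-01T17:40:00"},
]
# Same rows after a code change that accidentally shifted one timestamp by an hour.
staging = [
    {"event_id": 1, "occurred_at": "2024-03-01T09:15:00"},
    {"event_id": 2, "occurred_at": "2024-03-01T18:40:00"},
]

tests_pass_on_both = passes_predefined_tests(production) and passes_predefined_tests(staging)
versions_identical = production == staging

print(tests_pass_on_both)  # True
print(versions_identical)  # False
```

Every rule passes on both versions, yet an event timestamp has changed. Only comparing the two versions directly reveals it.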
Through a value-level comparison, data engineers can pinpoint exactly which immutable data fields have been altered, allowing for targeted investigation and remediation. You will be able to figure out if it was just a single character change in a name field or a minute adjustment in a timestamp: changes that might be small, but which could have significant implications for data integrity.
Also, unlike other testing methods that focus on validating data transformations or integrity rules, value-level diffs compare the entire dataset or selected subsets across two versions. This comprehensive comparison ensures that no changes to immutable data go unnoticed, regardless of the scale or complexity of the dataset.
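Comparing a selected subset can be as simple as restricting the diff to the columns you consider immutable. The sketch below does this with Python's built-in `sqlite3` module and a null-safe `IS NOT` comparison. The table names (`prod_users`, `pr_users`), the sample rows, and the `IMMUTABLE_COLUMNS` list are assumptions for illustration, not part of any Datafold API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_users (id INTEGER PRIMARY KEY, name TEXT, birthday TEXT, plan TEXT);
    CREATE TABLE pr_users   (id INTEGER PRIMARY KEY, name TEXT, birthday TEXT, plan TEXT);
    INSERT INTO prod_users VALUES (1, 'Ada Lovelace', '1815-12-10', 'free'),
                                  (2, 'Alan Turing',  '1912-06-23', 'free'),
                                  (3, 'Grace Hopper', '1906-12-09', 'pro');
    INSERT INTO pr_users   VALUES (1, 'Ada Lovelace', '1815-12-10', 'pro'),
                                  (2, 'Alan Turing',  '1912-06-23', 'free'),
                                  (3, 'Grace Hoper',  '1906-12-09', 'pro');
""")

# Hypothetical list of columns that should never change between versions.
IMMUTABLE_COLUMNS = ["name", "birthday"]


def changed_value_counts(columns: list[str]) -> dict[str, int]:
    """Count rows whose value differs between the two tables, per column."""
    counts = {}
    for column in columns:
        # Column names come from the hard-coded list above, never from user input.
        query = (
            f"SELECT COUNT(*) FROM prod_users AS p "
            f"JOIN pr_users AS d ON p.id = d.id "
            f"WHERE p.{column} IS NOT d.{column}"  # IS NOT also treats NULLs correctly
        )
        counts[column] = conn.execute(query).fetchone()[0]
    return counts


counts = changed_value_counts(IMMUTABLE_COLUMNS)
print(counts)  # {'name': 1, 'birthday': 0}
```

The `plan` column also changed, but it is mutable and therefore left out of the check. A deployment step could fail whenever any count for an immutable column is non-zero.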
Automating value-level data diffs with Datafold Cloud
Hopefully, we've now convinced you that value-level data diffs serve an important need that no existing testing tools can meet.
Setting up value-level diffs to run during each deployment cycle can be tricky, unless you automate the process. Datafold Cloud makes it easy to do so with its integration into your development workflow.
Whenever you open a new pull request with some code changes, our Datafold bot comments with diff summary statistics and value-level diffs that you and any reviewers can glance at.
Similarly, the highlighted values make it easy to see how the data changed at a very granular level.
If you're curious about how Datafold's data diffing in CI can help your team prevent shipping code that breaks production data, here are a couple of ways to learn more:
- Set up a free CI consultation with Datafold's CI experts to talk about what CI setup makes sense for your specific data environment and infrastructure.
- If you're ready to start building CI for data, we have a free trial of Datafold Cloud, so you can start connecting your databases today.