Data reconciliation: Technical best practices
This article covers three technical best practices for data reconciliation: selecting validation metrics, managing resources efficiently, and automating data quality testing. Together, they help ensure data integrity during data replication and migration.
Many aspects of data reconciliation are learned through hands-on experience. Unfortunately, this means that best practices are usually developed and circulated only within the best data teams. But what if you’re the founding data engineer tasked with creating the company’s first data infrastructure from the ground up? Or what if you’re the newest hire recruited to help with a data reconciliation project that has gone off schedule?
When you search the web, you’ll find that many resources focus heavily on non-technical advice: defining objectives and scope, stakeholder engagement, documentation and training, compliance and governance, and so on.
Too often, technical best practices are glossed over in vague terms and overly abstract diagrams. So we won’t focus on non-technical best practices here. In this final article of our three-part series on data reconciliation, we’ll wrap things up by turning to three technical best practices: selecting validation metrics, efficient resource management, and automating data quality testing.
To recap, part one looked at data reconciliation use cases, techniques, and challenges, while part two was an in-depth look at five data reconciliation challenges and solutions.
Types of data reconciliation
The best practices you need will differ depending on your specific use case. But whether you’re migrating databases or replicating data, you’ll need to verify that the data in one system matches the data in another.
Data reconciliation during replication testing
Replication testing is all about making sure that when you copy data from one database to another on an ongoing basis (say, replicating from an OLTP database to an OLAP database), everything matches as you expect.
When you photocopy a document, you want it to match the original exactly. But many things can get in the way and produce imperfect copies. With physical paper, the only way to verify the copy is to eyeball it for imperfections.
Things are a little better with databases: you can programmatically validate that the replication was a success. Data reconciliation techniques focus on two things:
- Validation: Tools (such as Datafold) and scripts can check that the data in the source and target databases are synchronized. This might involve checksums, row counts, or more complex data comparisons (a minimal code sketch follows this list).
- Data integrity checks: Ensuring that not only the quantity but also the quality of the data matches. This involves detailed data diffs to catch any data corruption or transformation errors.
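To make this concrete, here’s a minimal Python sketch of the row-count and checksum style of validation described above. It is not Datafold’s implementation: it assumes two DB-API connections (`source_conn` and `target_conn`), a table with a stable primary key, and trusted table and column names supplied by you.

```python
import hashlib

def row_count(conn, table):
    """Count the rows in a table (the cheapest sanity check)."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

def table_checksum(conn, table, key_column):
    """Hash every row in primary-key order to get a deterministic table checksum."""
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    digest = hashlib.sha256()
    for row in cur:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def reconcile_table(source_conn, target_conn, table, key_column):
    """Compare row counts first, then fall back to a full checksum comparison."""
    src_count = row_count(source_conn, table)
    tgt_count = row_count(target_conn, table)
    if src_count != tgt_count:
        return f"Row count mismatch: source={src_count}, target={tgt_count}"
    src_sum = table_checksum(source_conn, table, key_column)
    tgt_sum = table_checksum(target_conn, table, key_column)
    if src_sum != tgt_sum:
        return "Checksum mismatch: contents differ despite equal row counts"
    return "Tables match on row count and checksum"
```

In practice you’d push the hashing into each database (so you aren’t pulling every row over the wire) and normalize types first, since two drivers can represent the same value differently; that’s exactly the kind of work a dedicated diffing tool does for you.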
In data replication, reconciliation is an ongoing activity: replication runs continuously, so you need continuous monitoring to confirm that it keeps functioning correctly over time.
Data reconciliation during data migration
Migration involves moving data from one database environment to another, often involving a transition between different database systems (such as from Oracle to Databricks). Like replication testing, the objective in migration is to ensure that all data moves correctly, maintaining its integrity and functionality in the new environment.
The validation process in migration is comprehensive and often more complex due to the significant changes in database structures. For example:
- Schema matching: It’s crucial to ensure that the database schema (structure of tables, fields, data types, constraints) matches between the source and target.
- Data type conversion checks: Given the potential for different data handling between systems (e.g., different date formats or numeric precision), it’s essential to use data diffs to validate that data types have been correctly converted and formatted. A sketch covering both checks follows this list.
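Here’s a hedged sketch of what an automated schema-matching check might look like. It assumes both systems expose an `information_schema.columns` view and a DB-API driver using `%s` parameters; the `type_map` argument is a hypothetical way to express expected type conversions (e.g., mapping a source `NUMBER` to a target `DECIMAL`).

```python
def fetch_schema(conn, schema, table):
    """Return {column_name: data_type} for a table from information_schema."""
    cur = conn.cursor()
    cur.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        """,
        (schema, table),
    )
    return {name: dtype for name, dtype in cur.fetchall()}

def compare_schemas(source_conn, target_conn, schema, table, type_map=None):
    """Report missing columns and type mismatches between source and target.

    `type_map` translates a source type into the type you expect after
    conversion (e.g., {"NUMBER": "DECIMAL"}); identity mapping by default.
    """
    type_map = type_map or {}
    src = fetch_schema(source_conn, schema, table)
    tgt = fetch_schema(target_conn, schema, table)
    issues = []
    for column, src_type in src.items():
        if column not in tgt:
            issues.append(f"{column}: missing in target")
            continue
        expected = type_map.get(src_type, src_type)
        if tgt[column].lower() != expected.lower():
            issues.append(f"{column}: source {src_type} vs target {tgt[column]}")
    for column in tgt.keys() - src.keys():
        issues.append(f"{column}: present only in target")
    return issues
```

Running a check like this before any row-level comparison catches structural drift early, when it’s cheapest to fix.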
Data reconciliation best practices
Every data engineer should learn these three data reconciliation best practices because they translate to other data engineering workflows too. Each of them is key to ensuring data integrity, reliability, and quality across various data systems and processes.
Validation metrics to use, and to avoid
The right group of metrics will help in two ways: they tell you whether the replication is directionally correct, and they quantify the size of the error if one exists. Here’s our take on which metrics you should use, and which you can discard without worrying. You’ll notice that a few of them are part of the 8 dimensions of data quality, and after the list of metrics to use, we’ve included a short code sketch showing how a few of them can be computed.
Validation metrics to use
Data completeness: Checks whether all expected data has arrived and is accounted for in the target system. It involves counting records and ensuring that all expected fields are populated. This goes beyond simple record counts, involving validation of completeness for each critical field.
Data accuracy: Measures the correctness of data by comparing source data to what is stored in the target system. This can involve field-by-field comparisons, checksums, or hash totals that verify data has not been altered or corrupted during transfer.
Data consistency: Ensures that the data across multiple systems or parts of the database remains consistent. This involves checking relational integrity, foreign key relationships, and cross-references to ensure that the data conforms to predefined rules and constraints.
Data timeliness: Measures the latency between when data is captured in the source system and when it appears in the target system. This is crucial for operations that rely on real-time or near-real-time data processing.
Data conformity: Checks that data adheres to specific formats, standards, or ranges. This could include validating date formats, telephone numbers, postal codes, and other standardized data against a set of rules.
Data uniqueness: Ensures there are no duplicates in your data, especially when consolidating data from multiple sources. This might involve checking key fields that should uniquely identify records, such as user IDs or transaction numbers.
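As promised, here’s a minimal pandas sketch of a few of these metrics. It assumes you’ve loaded comparable source and target extracts into DataFrames with identical columns; the key and timestamp column names are placeholders, not a prescribed schema.

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> pd.Series:
    """Share of non-null values per column (1.0 means fully populated)."""
    return df.notna().mean()

def uniqueness(df: pd.DataFrame, key_columns: list[str]) -> int:
    """Number of rows that duplicate what should be a unique key."""
    return int(df.duplicated(subset=key_columns).sum())

def timeliness(df: pd.DataFrame, captured_at: str, loaded_at: str) -> pd.Timedelta:
    """Worst-case lag between capture in the source and load into the target."""
    return (df[loaded_at] - df[captured_at]).max()

def accuracy(source_df: pd.DataFrame, target_df: pd.DataFrame, key_columns: list[str]) -> float:
    """Share of keyed rows whose non-key values match exactly between source and target."""
    merged = source_df.merge(target_df, on=key_columns, suffixes=("_src", "_tgt"))
    value_columns = [c for c in source_df.columns if c not in key_columns]
    matches = pd.Series(True, index=merged.index)
    for c in value_columns:
        matches &= merged[f"{c}_src"].eq(merged[f"{c}_tgt"]) | (
            merged[f"{c}_src"].isna() & merged[f"{c}_tgt"].isna()
        )
    return float(matches.mean()) if len(merged) else 1.0
```

For large tables you’d compute the same metrics with SQL pushed down to the warehouse rather than in memory, but the definitions stay the same.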
Validation metrics to avoid
Simple record counts: This is too often the first metric that data reconciliation teams use, but it can be highly misleading. While counting the number of records in source and target systems is a basic starting point, it doesn’t catch errors like duplicate records or records with incorrect but non-missing fields.
Volume-based metrics: Just like simple counts, volume metrics (like total sales or total transactions) can overlook underlying data quality issues. They might show correct aggregates even if individual records are incorrect.
Overly broad error checks: Avoid validations that are too general to identify specific issues, such as simply checking if any data is present at all. While they might quickly flag major failures, they aren’t helpful in diagnosing subtle but critical data quality issues.
Static thresholds for dynamic data: Using fixed thresholds for anomaly detection in data that is naturally variable or seasonal can lead to frequent false positives or negatives. Instead, adaptive thresholds based on recent trends or statistical analysis are much better.
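To illustrate the adaptive-threshold idea, here’s a small sketch assuming a pandas Series of daily row counts indexed by date: each day is compared against a trailing window instead of one fixed cutoff.

```python
import pandas as pd

def adaptive_anomalies(daily_counts: pd.Series, window: int = 28, z_cutoff: float = 3.0) -> pd.Series:
    """Flag days whose value deviates from the trailing window by more than
    `z_cutoff` standard deviations, instead of using a fixed threshold."""
    rolling = daily_counts.rolling(window=window, min_periods=window)
    baseline_mean = rolling.mean().shift(1)  # shift so a day can't set its own baseline
    baseline_std = rolling.std().shift(1)
    z_scores = (daily_counts - baseline_mean) / baseline_std
    return z_scores.abs() > z_cutoff
```

The 28-day window and z-score cutoff of 3 are arbitrary starting points; tune both against your own data’s seasonality.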
Efficient resource management
Data reconciliation is hard because it involves a tradeoff between the granularity of the reconciliation process and the resources required to perform it.
What does this mean? Data engineers know that as the level of detail in reconciliation increases, so does the complexity and resource demand. Checking every single data field and ensuring exact matches across systems requires more computational power, storage, and time. This thoroughness can be essential for mission-critical applications where even minor discrepancies can lead to significant issues.
However, higher granularity also means greater resource consumption, which can strain system capacity and inflate costs. On the other hand, lower granularity, such as using aggregate comparisons or sampling, can save resources but may miss subtle errors. This can be sufficient for non-critical systems but may be unacceptable in scenarios where data accuracy is paramount.
So where do you draw the line? There’s no one-size-fits-all answer, but here are some strategies to help you find the right balance:
Sampling with Datafold’s data diffs
When comparing data between databases, there’s often a trade-off between efficiency and detail. Sometimes, you need to quickly check if the overall data seems consistent without diving into every single detail. There’s good news: at Datafold, we’ve solved these challenges (specifically, five of them) using techniques like stochastic checksumming and sampling to compare data efficiently and provide high-level results fast.
This high-level, partial result is fast and efficient, giving you a quick overview without consuming too many resources. Sampling is a great way to balance speed against level of detail.
When sampling is enabled in Datafold, the tool doesn't compare every single row and value in the database. Instead, it selects a random subset of the data to compare, quickly gets a sense of the magnitude of differences between source and target, and can even pull out specific examples.
For example, imagine you're performing replication testing on a large customer database with millions of rows. Instead of comparing every record between the source and the target databases, Datafold can sample 10% of the data. If this sample shows that 2% of the records have discrepancies, you can infer that a similar rate of discrepancy might exist across the entire dataset.
This not only saves time but also provides actionable insights by identifying where inconsistencies are likely to occur. With this information, you can decide whether a deeper, full-scale comparison is necessary or if the sampled results are sufficient for your needs.
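Datafold handles sampling for you, but the underlying statistical idea is simple enough to sketch. The snippet below is not Datafold’s implementation: it samples primary keys, compares only those rows, and extrapolates a discrepancy rate. The `%s` paramstyle and the in-memory key list are simplifications; a production tool would sample inside the database (e.g., with TABLESAMPLE).

```python
import random

def fetch_row(conn, table, key_column, key):
    """Fetch a single row by primary key."""
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table} WHERE {key_column} = %s", (key,))
    return cur.fetchone()

def sample_keys(conn, table, key_column, sample_size):
    """Draw a random sample of primary keys from the source table."""
    cur = conn.cursor()
    cur.execute(f"SELECT {key_column} FROM {table}")
    keys = [row[0] for row in cur.fetchall()]
    return random.sample(keys, min(sample_size, len(keys)))

def estimate_discrepancy_rate(source_conn, target_conn, table, key_column, sample_size=10_000):
    """Compare only the sampled rows and extrapolate the mismatch rate to the full table."""
    keys = sample_keys(source_conn, table, key_column, sample_size)
    mismatched = [
        key for key in keys
        if fetch_row(source_conn, table, key_column, key)
        != fetch_row(target_conn, table, key_column, key)
    ]
    rate = len(mismatched) / len(keys) if keys else 0.0
    return rate, mismatched[:20]  # estimated rate plus a few example keys to inspect
```

If the estimated rate is zero or negligible, you might stop there; if it isn’t, the example keys tell you where to aim a full comparison.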
Automated monitoring alerts
Using automated tools to monitor data consistency and generate alerts for significant anomalies is an efficient approach to maintaining data integrity without the need for constant, exhaustive checks. This method leverages technology to continuously oversee data flows and detect potential issues in real-time or near-real-time.
Anomalies might include things like a sudden spike in transaction volumes, data fields with unexpected null values, or discrepancies between data sources. Advanced algorithms and machine learning models can be employed to identify anomalies, such as unexpected changes in data patterns, deviations from historical trends, or unusual data values.
Alerts can be prioritized based on the severity of the anomaly, helping data teams focus on the most critical issues first.
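Here’s a small sketch of how prioritized alerting might be wired up, using a simple row-count-drift check with arbitrary 1% and 5% thresholds; `get_counts` and `notify` are stand-ins for whatever query and notification channel (Slack, email, PagerDuty) your team already uses.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    table: str
    message: str
    severity: str  # "critical", "warning", or "info"

def check_row_count_drift(table, source_count, target_count):
    """Turn a source/target row-count gap into a prioritized alert (or None)."""
    if source_count == 0:
        return Alert(table, "Source table is empty", "critical")
    drift = abs(source_count - target_count) / source_count
    if drift > 0.05:
        return Alert(table, f"Row counts differ by {drift:.1%}", "critical")
    if drift > 0.01:
        return Alert(table, f"Row counts differ by {drift:.1%}", "warning")
    return None

def run_monitoring_cycle(tables, get_counts, notify):
    """Check every replicated table and send alerts, most severe first.
    `get_counts(table)` returns (source_count, target_count); `notify(alert)`
    pushes the alert to whatever channel your team already watches."""
    alerts = []
    for table in tables:
        source_count, target_count = get_counts(table)
        alert = check_row_count_drift(table, source_count, target_count)
        if alert is not None:
            alerts.append(alert)
    priority = {"critical": 0, "warning": 1, "info": 2}
    for alert in sorted(alerts, key=lambda a: priority[a.severity]):
        notify(alert)
    return alerts
```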
Automate the data quality testing process
It seems pretty obvious that data teams should try to automate the entire data reconciliation process. Yet many teams have not automated these processes, even partially, for reasons that range from technical challenges to organizational and strategic considerations.
It’s tough to automate reconciliation across a variety of data sources, especially if they are different databases, each with its own formats and structures. Integrating automated tools with existing systems and databases can be technically challenging, particularly when legacy systems are involved.
In highly regulated industries, there may be strict requirements for how data is handled and reconciled. Many also find that off-the-shelf automation tools may not fully meet the unique needs of an organization, requiring custom development, which can be costly and time-consuming.
It might be a tall order to automate the entire data reconciliation process, so if you have to focus your efforts in one area, we recommend starting with the data quality testing stage.
Why? Because data quality testing is foundational to the entire data reconciliation process. By automating data quality checks, you can ensure that the data being ingested, transformed, and loaded into your systems meets the required standards of accuracy, consistency, and completeness.
This stage involves validating data against predefined rules and identifying any anomalies or discrepancies early on, which helps in preventing the propagation of errors throughout the data pipeline.
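A minimal way to start is to express your predefined rules as data and run them on a schedule. In the sketch below, each rule is a SQL query that returns the number of violating rows; the table and column names (`orders`, `customers`, and so on) are hypothetical.

```python
RULES = [
    # (rule name, SQL returning the number of violating rows)
    ("orders_reference_existing_customers",
     "SELECT COUNT(*) FROM orders o LEFT JOIN customers c ON o.customer_id = c.id WHERE c.id IS NULL"),
    ("no_null_order_totals",
     "SELECT COUNT(*) FROM orders WHERE total IS NULL"),
    ("order_ids_are_unique",
     "SELECT COUNT(*) - COUNT(DISTINCT id) FROM orders"),
]

def run_quality_tests(conn, rules=RULES):
    """Run each rule and collect failures keyed by rule name."""
    failures = {}
    cur = conn.cursor()
    for name, violation_query in rules:
        cur.execute(violation_query)
        violations = cur.fetchone()[0]
        if violations:
            failures[name] = violations
    return failures
```

Hook a function like this into your orchestrator or CI so that a nonzero result fails the pipeline before bad data propagates downstream.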
One common use case is when you need to replicate data across various geographical regions and cloud providers, and then also continuously verify that the replicated data remains consistent across all locations.
Here, manual solutions aren’t sufficient. Instead, data replication testing requires automated tools like Datafold’s cross-database diffing to streamline the validation process. It lets you continuously run data diffs across different types of tables on a scheduled basis, and alerts you when any discrepancies are found. Inconsistencies are identified at the value level, so you can pinpoint exactly where the two tables don’t match and come up with a targeted fix.
Getting started with data reconciliation
If you’re ready to learn more about how Datafold is changing the way data teams test their data during data reconciliation, there are a few ways to get started:
- Request a personalized demo with our team of data engineering experts. Tell us about your replication or migration pipelines, tech stack, scale, and concerns. We’re here to help you understand whether data diffing is a fit for your data reconciliation efforts.
- For those who are ready to start playing with cross-database diffing today, we offer a free trial of Datafold Cloud, so you can connect your databases and start diffing right away.