October 16, 2024

Data monitors: Best practices for 3 data engineering scenarios

Learn how data monitors like Data Diff and Schema Change Monitors ensure data integrity across data engineering workflows to prevent costly data incidents.

Elliot Gunn
Nick Carchedi

How do you ensure data integrity at every stage of your pipeline? Picking the right set of data monitors isn’t just about ticking boxes; it’s about understanding the unique needs of each stage of your data lifecycle.

In modern data pipelines, much like in software engineering, shift-left practices—where testing and validation happen earlier in the lifecycle—help identify and resolve issues sooner. Whether you’re migrating data or maintaining production data, data monitors will save you time, prevent breakages, and ensure everything runs smoothly.

Let's explore three common scenarios in data engineering and how to select the right combination of data monitors for each.

Validating parity during data migration or replication

When you're migrating data from a legacy system or replicating an application database (such as Postgres) to a cloud warehouse, ensuring parity between the two systems is critical. Mismatches in your data can lead to delayed project timelines, costly rework, and downstream data issues that disrupt business operations. You can’t move forward with confidence without parity. 

Data Diff Monitors were purpose-built for this exact challenge. What sets Data Diffs apart from other tools in the market is their ability to provide granular, value-level comparisons that catch even the smallest differences between datasets. (If you're new to data diffing, see our introductory article on the topic.)

When you create Data Diff monitors, you can continuously check for value-level discrepancies between your source and target systems throughout the data migration or replication process. This precision is invaluable, saving you from the painstaking process of reconciling issues after the migration is complete.
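
To make "value-level discrepancies" concrete, here is a minimal sketch of a keyed comparison between a source and a target table. It is not Datafold's implementation; the function name `diff_rows`, the sample tables, and the in-memory row format are assumptions made for illustration only.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(source: list[Row], target: list[Row], key: str) -> dict[str, list]:
    """Compare two row sets by primary key and report value-level differences.

    Hypothetical helper for illustration; a real diff would run in the
    databases themselves rather than on rows loaded into memory.
    """
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}

    report: dict[str, list] = {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "missing_in_source": sorted(tgt.keys() - src.keys()),
        "value_mismatches": [],
    }
    # For keys present on both sides, compare every column value.
    for pk in sorted(src.keys() & tgt.keys()):
        for col in sorted(src[pk].keys() | tgt[pk].keys()):
            if src[pk].get(col) != tgt[pk].get(col):
                report["value_mismatches"].append(
                    (pk, col, src[pk].get(col), tgt[pk].get(col))
                )
    return report


# Made-up sample data: a legacy table and its migrated copy.
legacy = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": "b@example.com", "amount": 20.0},
    {"id": 3, "email": "c@example.com", "amount": 30.0},
]
warehouse = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": "b@example.com", "amount": 20.5},
    {"id": 4, "email": "d@example.com", "amount": 40.0},
]

report = diff_rows(legacy, warehouse, key="id")
print(report)
```

The report separates rows that exist on only one side from rows that exist on both sides but disagree in a specific column, which is the distinction a row-count check alone cannot make.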

For instance, some data teams use a migration strategy called double-writing, where data is written to both the legacy system and the new system for a set period of time. How do you validate that the data aligns perfectly across systems? Data Diff Monitors make it possible to continuously validate that the data in both systems is identical before fully switching off the legacy system. By identifying discrepancies in real time, you can resolve issues before making the final cutover, giving your team (and other relevant stakeholders) greater confidence in the migration process.

These monitors should be your default tool for ensuring data integrity—whether you're executing a lift-and-shift to the cloud or replicating production databases. With Data Diff Monitors in place, you can resolve discrepancies as they occur, ensuring that your migration is fast, accurate, and error-free.

Best practices 💡

  • When migrating data, lift-and-shift your code and data to the new system and continue writing data to both the legacy and new systems. During this time, use Data Diff Monitors to continuously monitor for discrepancies between the two systems, then deprecate the legacy system once all relevant stakeholders are confident parity has been achieved.
  • When continuously replicating data, set up Data Diff Monitors to run on a schedule that aligns with your replication intervals. This ensures that discrepancies between source and target systems are caught early, preventing data drift and minimizing the risk of misaligned reporting or downstream processes.

Catching upstream data changes

Once raw data lands in your warehouse, whether through a Fivetran ingestion pipeline or an app database replication, any unexpected data changes can quickly cause problems downstream. An altered data type or a dropped field can silently break your transformations, analytics pipelines, machine learning models, or C-suite dashboards.

Data changes shouldn’t catch you by surprise. Proactive monitoring is key. The earlier you catch a change, the faster you can prevent it from cascading into larger downstream issues.

Schema Change Monitors automatically track modifications to the structure of your tables, alerting you the moment an unexpected change happens. They notify you as soon as changes—like new columns, data type alterations, or dropped fields—hit the warehouse. This early detection gives you the chance to react and investigate before any downstream systems break. 
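
Conceptually, schema change detection amounts to comparing two snapshots of a table's structure. The sketch below assumes each snapshot is a simple mapping of column name to type (for example, read from a warehouse's information schema); `detect_schema_changes` and the sample snapshots are hypothetical and are not part of any Datafold API.

```python
def detect_schema_changes(
    previous: dict[str, str], current: dict[str, str]
) -> dict[str, list]:
    """Report added columns, dropped columns, and columns whose type changed."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "dropped": sorted(previous.keys() - current.keys()),
        "type_changed": sorted(
            (col, previous[col], current[col])
            for col in shared
            if previous[col] != current[col]
        ),
    }


# Made-up snapshots of the same table taken on consecutive days.
yesterday = {"id": "integer", "email": "varchar", "signup_date": "date", "plan": "varchar"}
today = {"id": "bigint", "email": "varchar", "signup_date": "date", "referrer": "varchar"}

changes = detect_schema_changes(yesterday, today)
print(changes)
```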

Metric Monitors help you track vital health indicators of your data, like row count and data freshness. These metrics are early indicators of potential data flow issues. Are you receiving the expected volume of data from your sources? Is the data arriving on time? You can also monitor column-specific metrics such as fill rate (the percentage of non-null values) and cardinality (the number of distinct values) for more granular monitoring. 
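
As a rough illustration of the metrics named above, the following sketch computes row count, fill rate, and cardinality for one column, plus a simple freshness check. The helper names, the sample rows, and the two-hour freshness budget are assumptions for the example, not product defaults.

```python
from datetime import datetime, timedelta, timezone


def column_metrics(rows: list[dict], column: str) -> dict:
    """Row count, fill rate (share of non-null values), and cardinality."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "fill_rate": len(non_null) / len(values) if values else 0.0,
        "cardinality": len(set(non_null)),
    }


def is_stale(latest_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the newest record is older than the allowed freshness budget."""
    return now - latest_loaded_at > max_age


# Made-up rows with one null in the monitored column.
rows = [
    {"order_id": 1, "region": "EU"},
    {"order_id": 2, "region": "EU"},
    {"order_id": 3, "region": None},
    {"order_id": 4, "region": "US"},
]
metrics = column_metrics(rows, "region")

# Fixed timestamps keep the example deterministic.
now = datetime(2024, 10, 16, 12, 0, tzinfo=timezone.utc)
latest = datetime(2024, 10, 16, 9, 0, tzinfo=timezone.utc)
stale = is_stale(latest, now, max_age=timedelta(hours=2))

print(metrics, stale)
```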

Data Test Monitors allow you to validate your data against custom business rules and surface records that fail your expectations. For instance, you can ensure primary key columns contain no null values, verify that certain columns conform to predefined ranges or formats, or check referential integrity between tables. This custom validation is crucial to ensuring that only high-quality data enters your warehouse.
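
The three kinds of rules mentioned here (non-null primary keys, format conformance, and referential integrity) can be sketched as a function that returns the failing records along with the reason each one failed. The table shapes, column names, and the deliberately loose email pattern below are illustrative assumptions.

```python
import re

# Intentionally simple pattern for illustration; real email validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def failing_records(orders: list[dict], customer_ids: set) -> list[tuple[dict, str]]:
    """Return (record, reason) pairs for every rule a record violates."""
    failures = []
    for row in orders:
        if row.get("order_id") is None:
            failures.append((row, "null primary key"))
        if not EMAIL_PATTERN.match(row.get("email") or ""):
            failures.append((row, "malformed email"))
        if row.get("customer_id") not in customer_ids:
            failures.append((row, "unknown customer_id"))
    return failures


# Made-up data: one clean row, one null key, one row breaking two rules.
customer_ids = {10, 11}
orders = [
    {"order_id": 1, "customer_id": 10, "email": "a@example.com"},
    {"order_id": None, "customer_id": 10, "email": "b@example.com"},
    {"order_id": 3, "customer_id": 99, "email": "not-an-email"},
]

failures = failing_records(orders, customer_ids)
for record, reason in failures:
    print(reason, record)
```

Returning the offending records, rather than a single pass/fail flag, mirrors the idea of surfacing the rows that fail your expectations so they can be investigated.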

Best practices 💡

  • Prioritize Schema Change Monitors where unexpected changes are most likely to occur—in your raw/source data. Ingesting data from external sources or systems that you don’t fully control introduces the possibility of unexpected schema modifications. For example, a vendor may change the format of data ingested from business tools like Salesforce, or your engineering team may make changes in an application database that would propagate into your warehouse without notice.
  • Create custom Metric Monitors to watch for anomalies in your data, such as late-arriving data, a sudden drop in row count, or distribution shifts in numeric columns. Also track aggregated statistics that should stay within a known acceptable range, such as an insurance claims approval rate.
  • Run Data Tests on a schedule against production data, or as part of your CI/CD workflow against staging data. Automated tests, such as checking for null values in primary key columns or validating that fields like email or phone number match expected formats, let you flag data quality problems early.

But data changes aren’t the only thing to monitor. You should also keep tabs on your data’s health in production to ensure downstream pipelines continue to function smoothly. 

Maintaining production data integrity 

Monitoring your final production data is critical because this is the data consumed by your AI or machine learning models, dashboards, and other systems that power business decisions. Inaccurate or incomplete data at this stage can have far-reaching consequences: skewed analytics, flawed predictions, broken products, and costly financial errors.

We recommend using these monitors to protect your production data:

Data Test Monitors are best suited for ensuring the accuracy and consistency of specific fields in your production data. By setting up validation rules—such as checking formatting, uniqueness, or referential integrity—you can prevent bad data from corrupting your final outputs.

And Metric Monitors are your go-to for tracking key indicators like row count, freshness, and column-level metrics like fill rate and cardinality. Add custom metrics to track things that are unique to your business, for example, the number of sales orders split by region or app downloads by operating system. These monitors help you stay on top of unexpected changes in your data, ensuring your production data is both trustworthy and up-to-date.

Best practices 💡

  • Regularly update your validation rules used in Data Test Monitors to reflect changes in your data model. As your production data evolves, so should your data tests. Review and refine your validation rules to account for schema updates or newly added data sources.
  • Incorporate automated anomaly detection with Metric Monitors to catch unexpected changes in your data that more explicitly defined tests might miss. 
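
To illustrate how anomaly detection differs from an explicit rule, the sketch below flags a metric value that deviates sharply from its own recent history using a z-score. This is one simple technique chosen for the example; it is not a description of how Datafold's anomaly detection works, and the threshold of 3 standard deviations and the sample row counts are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # Not enough history to estimate spread.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Made-up daily row counts for a production table.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]

print(is_anomalous(daily_row_counts, 400))   # sudden drop in volume
print(is_anomalous(daily_row_counts, 1012))  # ordinary day
```

No fixed lower bound was written for row count here; the history itself defines what counts as unusual, which is why this kind of check can catch changes that explicitly defined tests miss.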

Developing a monitoring strategy 

Monitoring every stage of your pipeline is critical to maintaining data integrity. The right monitors can help you catch issues before they escalate, whether you're ensuring data parity in a migration, catching schema changes early, or maintaining the reliability of production data. To further streamline this process, we offer Monitors as Code, which lets you manage Datafold monitors via version-controlled YAML files. By integrating monitors into your infrastructure as code, you can automate and standardize monitoring across environments, ensuring consistency and reducing manual effort.

But with the complexity of many data engineering workflows, it can be challenging to determine what monitors you require for your specific needs. 

If you're unsure about where to start with setting up your monitors, need guidance on selecting the right monitor for specific scenarios, or want to develop a tailored monitoring strategy: