On making your OLAP database your source-of-truth
Data replication between transactional and analytical databases is critical for powering accurate reporting, analytics, and decision-making across your organization. Learn how to guarantee quality in your OLAP database at scale with modern practices and tooling.
Data replication, as we know it, is rarely perfect. Every data team I've ever talked to is replicating data between databases (and between my career as a data person™, the Snowflake and Databricks summits, and conversations with our customers, that's a lot of teams). Most commonly, it's syncing data between some sort of transactional database (typically owned by an engineering team) and the analytical database (owned and managed by the data team). And every data team, in its own way, cares about the accuracy and reliability of this data replication process.
The transactional database behind your business's application, website, subscription center, and so on captures the events, transactions, and moments that power your organization's analytics; OLAP databases like BigQuery, Databricks, and Snowflake enable data teams to build robust models for BI reporting, ML models, and more.
But what happens when the data replicated from Postgres to Snowflake gets messed up because of a broken ETL pipeline, a bad API request, or imperfect data replication tools (we data tooling vendors often need some grace too!)?
A series of things typically happen:
- Downstream data modeling and orchestration breaks (after all, what happens when subscription_created_utc is renamed to subscription_created_at_utc in your 50+ dbt models that reference that column?)
- Incorrect or missing data enters your OLAP database and production environments, prompting the inevitable question: "We normally have about 500 orders at this time of day. What's going on?"
- Data team members spend (wasted) hours troubleshooting the issue with engineering—is Fivetran broken? Or are events on the backend not firing properly?
- Trust between the data team, business, and engineering teams is slowly eroded
When data replication goes wrong, it compromises not only your data but also the relationships and workflows within your organization.
This is why I’m making the argument that data replication between your OLTP and OLAP databases—and ensuring the accuracy of these syncs—is not only vital, but possible with modern data engineering best practices and tooling.
The complexities of ongoing data replication
At this point, what do we know about data replication between OLTP and OLAP databases?
- It’s an essential process that most businesses, whether D2C or B2B, rely on
- It’s often untested, or only lightly tested, so data quality issues are found late, once the data has already landed in the OLAP database
- It becomes increasingly challenging with data scale and complexity
Specifically, where does it often go wrong?
Staying ahead of unexpected backend changes
I would like to start with the pain point I have the greatest grievance with: unexpected schema changes to tables I’m syncing from my transactional to my analytical databases.
(What do you mean you dropped order_id?)
In a perfect world, your engineering team tells you before changes to backend table structures are implemented, so the data team has enough time to update downstream analytical models, tests, etc. But we know the world is not perfect, communication can be challenging (especially on larger or distributed teams), and these issues do slip through from time to time.
Broken data replication/ELT pipelines
The crux of all replication is extracting data from one location and copying it into another. Many data teams use modern ETL tooling (e.g., Fivetran, Stitch, Airbyte) or custom EL (extract-load) scripts to replicate data between systems.
Either way, both SaaS products and human-written code are imperfect and have bad days (e.g., downtime and bugs). When this happens, replication can fall out of sync, values can be corrupted, or rows can be dropped, meaning the data that arrives in your OLAP database is late, inaccurate, or missing entirely.
When data replication between your transactional and analytical databases goes wrong—because it will go wrong—how does your team recover? How does your team identify issues, resolve them, and create infrastructure and processes to mitigate future issues?
The better way to approach data replication testing
If we’re all replicating vital, business-defining data on a regular basis between transactional and analytical databases, what can data teams practically do to ensure this replication is correct?
A few approaches I’ve seen data teams take in the past few years, from least to most scalable and robust:
- Basic, source-level tests
- Source-level data observability
- Cross-database comparisons (what we call a data diff)
Let’s break each of those down.
Level 1: Source-level tests
When the data from Postgres/MongoDB/MySQL first lands in your OLAP database, run a series of key tests (e.g., primary keys are unique and not null, order values are greater than zero).
- Good for catching data quality fundamentals and column/table renames or drops.
- Bad for large table volume (manual tests don’t scale well), value-level comparisons, and quick issue resolution.
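To make this concrete, here's a minimal sketch of what those landing-zone assertions might look like as a standalone script, assuming a warehouse reachable through SQLAlchemy and a hypothetical orders table with order_id and order_value columns (many teams express the same checks as dbt source tests instead):

```python
# Minimal "level 1" assertion tests run right after replicated data lands.
# The warehouse URI, table, and column names below are illustrative only;
# a Snowflake URI like this also needs the snowflake-sqlalchemy dialect installed.
from sqlalchemy import create_engine, text

WAREHOUSE_URI = "snowflake://user:password@account/analytics"  # hypothetical

CHECKS = {
    "order_id is unique": "SELECT count(*) - count(DISTINCT order_id) FROM orders",
    "order_id is not null": "SELECT count(*) FROM orders WHERE order_id IS NULL",
    "order_value > 0": "SELECT count(*) FROM orders WHERE order_value <= 0",
}

engine = create_engine(WAREHOUSE_URI)
with engine.connect() as conn:
    for name, sql in CHECKS.items():
        # Each query counts offending rows; zero means the check passes.
        failures = conn.execute(text(sql)).scalar()
        status = "PASS" if failures == 0 else f"FAIL ({failures} offending rows)"
        print(f"{name}: {status}")
```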
Level 2: Observability
Data monitors at the source level that check for deviations in data volume, column-type changes, or violations of custom data quality requirements.
- Good for catching unexpected row volume differences and column renames or drops.
- Bad for row-level data comparisons.
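As a rough illustration of the volume-monitoring idea, the sketch below compares today's replicated row count against the trailing seven-day average and flags large deviations. The 30% threshold, table name, and timestamp column are assumptions, and a real observability tool tracks many more signals (freshness, schema changes, distribution shifts):

```python
# A toy source-level volume monitor: flag days whose replicated row count
# deviates sharply from the recent daily average. Everything here is a placeholder.
from sqlalchemy import create_engine, text

WAREHOUSE_URI = "snowflake://user:password@account/analytics"  # hypothetical
THRESHOLD = 0.30  # alert if today's volume is off by more than 30%

VOLUME_SQL = text("""
    SELECT
        sum(CASE WHEN created_at >= current_date THEN 1 ELSE 0 END)       AS rows_today,
        sum(CASE WHEN created_at >= current_date - 7
                  AND created_at <  current_date THEN 1 ELSE 0 END) / 7.0 AS rows_daily_avg
    FROM orders
""")

engine = create_engine(WAREHOUSE_URI)
with engine.connect() as conn:
    rows_today, rows_daily_avg = (float(v or 0) for v in conn.execute(VOLUME_SQL).one())

if rows_daily_avg and abs(rows_today - rows_daily_avg) / rows_daily_avg > THRESHOLD:
    print(f"Volume anomaly: {rows_today:.0f} rows today vs ~{rows_daily_avg:.0f}/day last week")
```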
Level 3: Automated cross-database comparisons
Run regular value-level comparisons between your OLTP and OLAP databases to identify discrepancies between your source and replicated data. With Datafold, data teams can run value-level data diffs across transactional and analytical databases on a regular, high-frequency cadence, so incorrect or missing data is found faster and with greater precision.
Your team can run these data diffs for replicated data in a scheduled, programmatic, or ad hoc way. This means no more one-off assertion tests, ever-growing row-count checks, or back-and-forth queries between two systems. With Datafold's scheduled data diffs, you know immediately when something has gone wrong during your replication process.
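For intuition on what a value-level comparison is doing under the hood, here's a deliberately simplified sketch (not Datafold's implementation): compute a per-row hash keyed by the primary key on both sides, then compare the two sets. The connection URIs, table, and columns are placeholders; a production diff would normalize type casts and compare checksummed segments rather than pulling every row:

```python
# A toy cross-database "data diff": compare per-row hashes between the source
# (Postgres) and the replica (Snowflake). md5() and concat_ws() exist in both
# engines, but numeric/timestamp formatting must be normalized in practice or
# identical rows can hash differently across databases.
from sqlalchemy import create_engine, text

SOURCE_URI = "postgresql://user:password@host/appdb"          # hypothetical
REPLICA_URI = "snowflake://user:password@account/analytics"   # hypothetical

HASH_SQL = text("""
    SELECT order_id,
           md5(concat_ws('|', order_id, status, order_value)) AS row_hash
    FROM orders
""")

def fetch_hashes(uri: str) -> dict:
    """Return {primary_key: row_hash} for the orders table in the given database."""
    with create_engine(uri).connect() as conn:
        return {pk: row_hash for pk, row_hash in conn.execute(HASH_SQL)}

source, replica = fetch_hashes(SOURCE_URI), fetch_hashes(REPLICA_URI)

missing = source.keys() - replica.keys()   # rows lost during replication
extra = replica.keys() - source.keys()     # rows that exist only downstream
changed = {pk for pk in source.keys() & replica.keys() if source[pk] != replica[pk]}

print(f"{len(missing)} missing, {len(extra)} unexpected, {len(changed)} mismatched rows")
```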
Conclusion: Your OLAP database is your source-of-truth
Data replication between transactional and analytical databases is critical for powering accurate reporting, analytics, and decision-making across your organization. But replication is never perfect, and when it breaks, the ripple effects can disrupt workflows, delay insights, and erode trust across teams.
The key to solving this problem lies in proactively testing and validating your replication processes. From the simplest dbt tests to more automated and scalable solutions like Datafold, data teams now have the tools to catch data discrepancies faster, resolve issues more efficiently, and ensure the OLAP database remains a reliable source-of-truth.
After all, accurate data isn’t just a technical goal—it’s the foundation for building trust and driving business success.