January 15, 2025

On making your OLAP database your source-of-truth

Data replication between transactional and analytical databases is critical for powering accurate reporting, analytics, and decision-making across your organization. Learn how to guarantee quality in your OLAP database at scale with modern practices and tooling.

Kira Furuichi

Data replication, as we know it, is rarely perfect. Every data team I've ever talked to (and between my career as a data person ™️, Snowflake and Databricks summits, and conversations with our customers, that's a lot of them) is replicating data between databases. Most commonly, it's syncing data between some sort of transactional database (typically owned by an engineering team) and an analytical database (owned and managed by the data team). And every data team, in its own way, cares about the accuracy and reliability of this data replication process.

The transactional database behind your business's application, website, subscription center, etc. is vital for capturing the events, transactions, and moments that power your organization's analytics; OLAP databases like BigQuery, Databricks, and Snowflake enable data teams to build robust models for BI reporting, ML, and more.

But what happens when the data replicated from Postgres to Snowflake gets messed up because of a broken ETL pipeline, bad API request, or imperfect data replication tools (us data tooling vendors often need some grace too!)?

A series of things typically happen:

  • Downstream data modeling and orchestration breaks (after all, what happens when subscription_created_utc is renamed to subscription_created_at_utc in your 50+ dbt models that reference that column?)
  • Incorrect or missing data enters your OLAP database and production environments, prompting the inevitable "We normally have about 500 orders at this time of day. What's going on?"
  • Data team members spend (wasted) hours troubleshooting the issue with engineering: is Fivetran broken, or are events on the backend not firing properly?
  • Trust between the data team, business, and engineering teams is slowly eroded

When data replication goes wrong, it compromises not only your data but also the relationships and workflows within your organization.

This is why I'm making the argument that ensuring the accuracy of data replication between your OLTP and OLAP databases is not only vital, but achievable with modern data engineering best practices and tooling.

The complexities of ongoing data replication

At this point, what do we know about data replication between OLTP and OLAP databases?

  • It's an essential process that most businesses, whether they're D2C or B2B, undergo
  • It's an often untested or only lightly tested process, with data quality issues frequently found late in the lifecycle, once the data is already in the OLAP database
  • It becomes increasingly challenging with data scale and complexity

Specifically, where does it often go wrong?

Staying ahead of unexpected backend changes

I would like to start with the pain point I have the greatest grievance with: unexpected schema changes to tables I'm syncing from my transactional to my analytical databases.

(What do you mean you dropped order_id?)

In a perfect world, your engineering team tells you before changes to backend table structures are implemented, so the data team has enough time to update downstream analytical models, tests, etc. But we know the world is not perfect, communication can be challenging (especially on larger or distributed teams), and these issues do slip through from time to time.
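One lightweight guard I've seen work, alongside better communication, is to compare column metadata on both sides of the sync before downstream models run. Here's a minimal sketch of that idea using SQLAlchemy; the connection URLs, schemas, and the orders table are hypothetical placeholders.

```python
# A minimal schema-drift check: compare column sets between the source table
# and its replicated copy before downstream models run. Connection URLs,
# schemas, and table names below are illustrative placeholders.
from sqlalchemy import create_engine, inspect

def column_names(engine_url: str, schema: str, table: str) -> set[str]:
    """Return the lowercased column names of a table via SQLAlchemy inspection."""
    inspector = inspect(create_engine(engine_url))
    return {col["name"].lower() for col in inspector.get_columns(table, schema=schema)}

source_cols = column_names("postgresql://user:pass@source-db/app", "public", "orders")
target_cols = column_names("snowflake://user:pass@account/analytics", "raw", "orders")

# Columns still present in the replica but gone from the source were likely
# dropped or renamed upstream; columns only in the source haven't replicated yet.
gone_from_source = target_cols - source_cols
not_yet_replicated = source_cols - target_cols

if gone_from_source or not_yet_replicated:
    print(f"Schema drift: dropped/renamed upstream: {sorted(gone_from_source)}; "
          f"missing in target: {sorted(not_yet_replicated)}")
```

Run ahead of your dbt builds, a check like this turns a silently dropped order_id into an explicit alert instead of a wall of broken models.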

Broken data replication/ELT pipelines

The crux of all replication is extracting data from one location and copying it into another. Many data teams choose to use modern ETL tooling (e.g., Fivetran, Stitch, Airbyte) or custom EL (extraction-load) scripts to replicate data between systems.

Either way, both SaaS products and human-written code are imperfect and have bad days (e.g., downtime and bugs). When this happens, replication can fall out of sync, values can be corrupted, or rows can be dropped. Ultimately, this means the data that arrives in your OLAP database can be late, inaccurate, or missing entirely.
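To make that failure mode concrete, the cheapest safety net is a windowed row-count reconciliation between the two systems. The sketch below is a rough illustration assuming SQLAlchemy connections and an updated_at column on both sides; the URLs, table names, and one-hour lag allowance are hypothetical, and the interval syntax may need adjusting per SQL dialect.

```python
# A rough row-count reconciliation between the transactional source and the
# analytical copy. Connection URLs, table names, and columns are placeholders.
from sqlalchemy import create_engine, text

def count_rows(engine_url: str, query: str) -> int:
    with create_engine(engine_url).connect() as conn:
        return conn.execute(text(query)).scalar_one()

# Only count rows older than an hour so normal replication lag doesn't page anyone.
# (Interval syntax may need adjusting for your specific databases.)
window = "updated_at < CURRENT_TIMESTAMP - INTERVAL '1 hour'"
source_count = count_rows("postgresql://user:pass@source-db/app",
                          f"SELECT COUNT(*) FROM public.orders WHERE {window}")
target_count = count_rows("snowflake://user:pass@account/analytics",
                          f"SELECT COUNT(*) FROM raw.orders WHERE {window}")

# Tolerate a tiny relative difference; anything larger suggests dropped or stuck rows.
if abs(source_count - target_count) > 0.001 * max(source_count, 1):
    print(f"Replication gap detected: source={source_count}, target={target_count}")
```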

When data replication between your transactional and analytical databases goes wrong (because it will go wrong), how does your team recover? How does your team identify issues, resolve them, and create infrastructure and processes to mitigate future issues?

The better way to approach data replication testing

If we're all replicating vital, business-defining data on a regular basis between transactional and analytical databases, what can data teams practically do to ensure this replication is correct?

A few approaches I've seen data teams take in the past few years, ordered from least to most scalable and robust:

  1. Basic, source-level tests
  2. Source-level data observability
  3. Cross-database comparisons (what we call a data diff)

Let's break each of those down.

Level 1: Source-level tests

When the data from Postgres/MongoDB/MySQL first lands in your OLAP database, run a series of key tests (e.g., primary keys are unique and not null, order values are greater than zero).

  • Good for catching data quality fundamentals and column/table renames or drops.
  • Bad for a large volume of tables (manual tests don't scale well), value-level comparisons, and quick issue resolution.
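To illustrate what these checks look like outside of dbt, here's a minimal sketch of post-load assertion queries; the connection URL, the raw.orders table, and its columns are hypothetical.

```python
# Basic source-level assertions run right after data lands in the OLAP database.
# The connection URL, table, and columns are illustrative placeholders; each query
# returns the number of rows violating the rule, so any non-zero result is a failure.
from sqlalchemy import create_engine, text

CHECKS = {
    "order_id is unique": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM raw.orders GROUP BY order_id HAVING COUNT(*) > 1
        ) dupes
    """,
    "order_id is not null": "SELECT COUNT(*) FROM raw.orders WHERE order_id IS NULL",
    "order_value is positive": "SELECT COUNT(*) FROM raw.orders WHERE order_value <= 0",
}

with create_engine("snowflake://user:pass@account/analytics").connect() as conn:
    for name, query in CHECKS.items():
        violations = conn.execute(text(query)).scalar_one()
        if violations:
            print(f"FAILED: {name} ({violations} offending rows)")
```

These are roughly what dbt's built-in unique and not_null tests plus a custom range test express, and they show the scaling problem above: every new table or column means more hand-written checks.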

Level 2: Observability

Data monitors at the source level that check for deviations in data volume, column-type changes, or custom data quality requirements.

  • Good for catching unexpected row volume differences and column renames or drops.
  • Bad for row-level data comparisons.
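A bare-bones version of such a monitor, comparing today's load volume against a trailing average, might look like the sketch below; the table, the loaded_date column, the connection URL, and the 30% threshold are all hypothetical choices.

```python
# A simple volume monitor: flag a day whose row count deviates sharply from the
# trailing average. Table, columns, URL, and the 30% threshold are placeholders.
from sqlalchemy import create_engine, text

QUERY = """
    SELECT loaded_date, COUNT(*) AS row_count
    FROM raw.orders
    WHERE loaded_date >= CURRENT_DATE - 8
    GROUP BY loaded_date
    ORDER BY loaded_date
"""

with create_engine("snowflake://user:pass@account/analytics").connect() as conn:
    rows = conn.execute(text(QUERY)).all()

# Assumes at least two days of history: everything but the last day is the baseline.
*history, (today, today_count) = rows
baseline = sum(count for _, count in history) / max(len(history), 1)

if baseline and abs(today_count - baseline) / baseline > 0.30:
    print(f"Volume anomaly on {today}: {today_count} rows vs ~{baseline:.0f} expected")
```

Observability tools generalize this pattern across many tables and metrics, but as the bullets above note, a volume monitor still can't tell you which rows or values differ.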

Level 3: Automated cross-database comparisons

Run regular value-level comparisons between your OLTP and OLAP databases to identify discrepancies between your source and replicated data. With Datafold, data teams can run value-level data diffs across transactional and analytical databases at regular and high cadences, so incorrect or missing data is found faster and with greater precision.

With Datafold, your team can run data diffs for your replicated data across databases on a scheduled, programmatic, or ad hoc basis. This means no more hand-written assertion tests, ever-growing row count checks, or back-and-forth queries between two systems. With Datafold's scheduled data diffs, you know immediately when something has gone wrong during your replication process.
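To show what "value-level comparison" means mechanically (this is a simplified conceptual sketch, not how Datafold implements diffing), the core idea is to align rows on the primary key in both databases and compare their values. The URLs, tables, and column list below are hypothetical, and pulling full rows like this won't scale to large tables.

```python
# A naive value-level diff between a source table and its replicated copy, keyed on
# the primary key. Purely illustrative: URLs, tables, and columns are placeholders,
# and purpose-built diffing avoids pulling full rows like this does.
from sqlalchemy import create_engine, text

COLUMNS = ["order_id", "customer_id", "order_value", "updated_at"]

def rows_by_key(engine_url: str, table: str) -> dict:
    """Fetch rows and key them by the first column (the primary key)."""
    query = f"SELECT {', '.join(COLUMNS)} FROM {table} ORDER BY order_id"
    with create_engine(engine_url).connect() as conn:
        return {row[0]: tuple(row) for row in conn.execute(text(query))}

source = rows_by_key("postgresql://user:pass@source-db/app", "public.orders")
target = rows_by_key("snowflake://user:pass@account/analytics", "raw.orders")

missing_in_target = source.keys() - target.keys()
unexpected_in_target = target.keys() - source.keys()
changed = [pk for pk in source.keys() & target.keys() if source[pk] != target[pk]]

print(f"missing: {len(missing_in_target)}, unexpected: {len(unexpected_in_target)}, "
      f"changed values: {len(changed)}")
```

In practice, cross-database type differences (timestamps, numeric precision) and sheer table size make this naive approach fragile and slow, which is exactly the gap purpose-built diffing is designed to close.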


A value-level data diff between a Postgres and Snowflake table

Conclusion: Your OLAP database is your source-of-truth

Data replication between transactional and analytical databases is critical for powering accurate reporting, analytics, and decision-making across your organization. But replication is never perfect, and when it breaks, the ripple effects can disrupt workflows, delay insights, and erode trust across teams.

The key to solving this problem lies in proactively testing and validating your replication processes. From the simplest dbt tests to more automated and scalable solutions like Datafold, data teams now have the opportunity to catch data discrepancies faster, resolve issues more efficiently, and ensure their OLAP database remains a reliable source-of-truth.

After all, accurate data isn't just a technical goal; it's the foundation for building trust and driving business success.
