Testing data pipelines: The Modern Data Stack challenge
Learn about common challenges and solutions to test data pipelines that spread across multiple layers and tools. In the future, data testing efforts may consolidate on the transformation layer while the orchestration layer simplifies creating testing environments.
Testing data pipelines is the process of building confidence that changes to the pipeline will have the desired effect on the data. Will you ever be 100% certain that changes will have the desired effect and won’t introduce any regression? Probably not.
Balancing testing efforts with productivity can be tricky. If you are a data-informed company (data informs actions), thorough testing can slow you down from adding value. On the other hand, if you are a data-driven company (data automates actions) with multiple contributors, lack of testing can stop you from deploying changes more often and delay the impact of your work.
A modern data stack is composed of multiple layers and tools: storage (Snowflake, Databricks, …), orchestration (Airflow, Dagster, …), integration (Fivetran, Airbyte), transformation (dbt, Spark, …), visualization (Looker, Metabase, …) and activation (Hightouch, Census, …).
How do you test a data pipeline that spreads across multiple layers?
In this article, I try to answer the question above and review some of the evolutions that could make testing data pipelines easier in the future.
How to test data pipelines
To test data pipelines you first need to set up a development environment to be able to inspect data changes as you develop and a staging environment to share with colleagues reviewing your work. Given that the modern data stack is composed of multiple layers, you may set up multiple environments for each tool. Once you have your environments set up, here are some actions you can perform to build up confidence about changes.
Review code changes
Reviewing code changes is a common best practice that has been adopted by most data developers. But this is only possible if you can access the code. Some modern data stack tools do a better job allowing you to version control the state of a data pipeline component. Reviewing code changes that are spread over multiple repositories can be a challenge.
Run the pipeline in development and staging
Running the pipeline requires that you first set up separate environments with data. A common challenge is to configure the pipeline to run fast enough in development and staging. To a minimum, you should be able to run a part of the pipeline that is related to your code changes during development with some sample of production data.
Query and write tests for your data
Each layer of the data pipeline will output new data. To a minimum you should be able to query this data. In some cases, you may save these queries and attach a notebook to a PR for colleagues to interact with your changes. You may as well turn these queries into automated tests in your CI to check for future regressions. If your tests are not tightly coupled to your development environment, you may add tests to your production data test suite.
Review data changes
Reviewing data changes requires that you materialize all data across the development pipeline and compare it to the data in a different environment, such as production. This is essential to validate that unexpected changes in values don’t occur, which are hardly detected with tests. Ideally, you want to be able to know which tables have been added, updated and removed. And for each table that has been updated, see the changes across rows, columns and values.
Testing data storage
This is likely the first layer that you will put into place, whether it’s a database, a data warehouse, a data lake or a lakehouse. All other layers of the modern data stack integrate with the storage layer.
One of the most common challenges of testing data pipelines is being able to run the data pipeline in a development environment with production data, even if the data is obfuscated. In Snowflake you can use zero-copy clones and in BigQuery you can use table clones to copy production data with only new or modified data incurring in storage costs. lakefs provides zero-copy cloning for data lakes that rely on open-source file formats. In the future, I expect more cloud data warehouses to adopt open-source storage formats like Databricks is doing with Delta Lake.
General testing tools like Great Expectations can be used to write tests for your data warehouse, but these are most often applied to testing production environments and are of less help during development. Instead, during development, you will likely write queries, unit tests for complex processing logic, or navigate data differences to validate the changes.
When it comes to testing data in production environments, sometimes referred to as a different category (data monitoring or data observability), there are plenty of tools (Monte Carlo, Metaplane, Soda, Elementary Data) that focus on observing your production data warehouse, but these tools are more focused on detecting errors on production rather than catching errors during the development phase.
Testing data orchestrators
Airflow was the first data orchestrator to become massively popular. While Airflow was originally conceived as an orchestrator of tasks, it quickly added operators to move and transform data with ETL/ELT pipelines. Many teams coded their first data pipelines in Airflow, but as they looked to add more data sources and destinations, Airflow fell short, and other tools like Airbyte and Hightouch were added to the stack to cover the long tail of integrations.
Airflow has also been criticized for being hard to run locally and test. By design, Airflow was not supposed to be aware of data assets. Since then, Airflow original limitations have been partly addressed by Astronomer with the Astro CLI to facilitate managing Airflow environments and data lineage support among other enhancements. In the meantime, other orchestrators like Prefect, Dagster, Kestra and Mage have capitalized on these limitations. For example, Dagster has released several exciting features that simplify testing such as software-defined data assets and branch deployments in Dagster Cloud.
Unbundling Airflow means that data teams are now required to set up more tools and environments across tools to be able to test pipelines end-to-end. Will orchestrators evolve to make it easier to set up and test end-to-end pipelines during development?
Dagster branch deployments can already be used to deploy a staging environment with a subset of your data stack (Dagster and Snowflake), but it is not clear to me how much work is required for Dagster to set up your whole data stack in a staging environment.
Testing data integrations
Data integration tools like Fivetran and Airbyte are commonly used to replicate data from applications, files and databases to data warehouses.
One approach to testing data integrations during development is to use a separate Fivetran or Airbyte instance. You can programmatically create multiple Fivetran environments with the official Fivetran Terraform provider. Airbyte users can leverage the Airbyte CLI to apply Airbyte configuration changes (not infrastructure changes) across environments. On top of that Airbyte connectors are open-source, so you can version control, customize and fix connectors yourself.
Fivetran and Airbyte provide no specific data testing library to validate syncs. Still, you can use data comparison tools like Datafold's cross-database diffing to validate database to data warehouse syncs by reporting all rows that don’t match between a source and destination.
One challenge of testing data integrations is that data quality issues will not only come from changes on the sync configuration or connectors code but also from changes in the source data (such as schema changes) on scheduled production syncs. While most data teams today stage all code changes through pull requests, very few teams fully treat Data as Code and stage data changes before promoting them to production.
Testing data transformations
If there is one modern data transformation tool that has quickly gained adoption, that is dbt. While Spark leaned on the power of Scala (and Python) and cloud computing, dbt relied on extending SQL with Jinja (and macros) and the scalability of modern data warehouses.
If I have to highlight one thing that has contributed to dbt’s adoption, I would say it’s a better developer experience than Spark. You can install dbt with pip and the configuration system makes it easy to use multiple environments. dbt makes it easy to test data transformations with multiple dbt testing libraries available, and you can review data changes in dbt with Datafold's data diffing directly in the CI process.
But dbt has also thought deeply about your CI/CD experiences. For example, Slim CI and the defer feature can significantly speed up rebuilding dbt lineage in your CI process, making it easier to test dbt code changes with production data. At Datafold, we add to these with automated data diff reports for dbt that you can access from the command line, Pull Request and directly within the Datafold UI.
I get the impression that most of the testing efforts today are moving to the transformation layer. This is because the transformation layer has been removing complexity from other layers. For example, the switch from ETL to ELT pipelines move data transformations from the data integration layer to the transformation layer. You may already be testing integrated data with the dbt source freshness command and generic dbt tests to catch data issues early on the pipeline.
Testing other layers of the modern data stack
BI tools (Looker, …) and data activation tools (Hightouch, …) are not only used by data developers, but also business users across the company through a GUI. While many teams have removed the transformation logic out of data integration tools, it’s still common to transform data through the BI and activation layers. This makes testing changes harder for data teams.
But data transformation tools like dbt are also providing an alternative to moving transformations out of BI tools with the semantics layer. By doing all your metrics aggregations on the transformation layer instead of the BI tool, then you only need to test and document data in one tool. Data activation tools like Hightouch integrate with dbt and you can already do most of the data modeling within dbt.
For the transformation layer to gather all data transformations across the company, it still requires significantly lowering the barrier to entry for business users to contribute to a dbt code base. dbt Cloud may be a partial answer to this adoption problem with its browser based IDE. But I still see a big gap between tools that provide a good experience for developers and tools that focus on business users. DataOps tools (Y42…) are working to reconcile these two worlds.
The future of testing data pipelines
If data teams keep moving all their data transformations that previously were spread on the data integration, BI and activation layers to the transformation layer. Then only raw data will ever enter the transformation layer and only ready to be consumed data will exit it. In this case, I see how testing efforts can keep further consolidating on the transformation layer.
As I have argued throughout the article, I see testing data pipelines as closely tied to being able to set up development and staging environments. When it comes to provisioning and configuring other tools, the data orchestration layer is still ahead of the transformation layer. Another overlapping area between the data transformation and orchestration layer is data lineage, which is provided by both.
Data orchestrators already provide features to transform data. One open question is whether dbt will start providing more advanced scheduling, such as scheduling data integrations, and make it less appealing to add a data orchestrator to your stack.
Finally, I believe that we will see efforts towards bridging the gap between testing data engineering pipelines and machine learning pipelines. Does the testing experience have to be different if your model is a simple SQL query or an ML model?
Looking forward to how the next evolution of best practices and tools makes testing easier and a more enjoyable experience. I would love to hear from others experiences testing their data pipelines.