September 7, 2022
Data Quality Best Practices, Data Testing

The Best Data Contract is the Pull Request

Data contracts can help us prevent data quality issues by formalizing interactions and handovers between different systems (and teams) handling data.

Gleb Mezhanskiy

I’ve come across multiple conversations and opinions about data contracts lately and have been thinking about how they would apply to the realities of today’s data teams.

The idea is that data contracts can help us prevent data quality issues by formalizing interactions and handovers between different systems (and teams) handling data.

Tristan suggested a powerful analogy with the software world:

“This is how our world works today in the MDS: for example, Fivetran loads data into a table, dbt reads data from a table. If Fivetran changes the schema of that table, it can easily break the dbt code reading from that table. No contracts, no interfaces, no guarantees. [...]

This works very differently in a well-architected software system. The interfaces between systems are well-defined, and when you change something that breaks something else you can actually see that error at compile-time rather than at runtime. [...]

The importance of catching errors at compile-time cannot be understated—knowing that something is broken when you are writing the code is exactly when you need that information! Finding out that something is broken once it’s been pushed to production causes havoc.”

Why data contracts and why now?

There are three powerful trends that force us to look for ways to keep data from breaking:

Analytical data becomes increasingly complex

An average Datafold customer (a mid-size modern data stack user) has over 15K tables and ~500K columns in their warehouse. At the top end, we see data warehouses with 30M+ tables and over 1B columns. At this scale, metadata becomes big data.

Data undergoes multiple steps of sophisticated transformations with domain-specific business logic. For example, GitLab’s internal analytics repo has over 132K lines of SQL code.

This means that when something breaks, figuring out the origin becomes a huge task.

Analytical data becomes software

It is increasingly used for business automation by feeding into software that runs the business:

  • Online machine learning for search, recommendations, fraud detection, etc.
  • User-facing analytics
  • Data activation in business tools (e.g. Salesforce)

This means errors don't just impact internal dashboards; they can have real financial consequences. We should not let data break in production anymore.

Analytical data becomes more realtime

As a byproduct of becoming an input to software that runs businesses (and not just to human decisions as in BI), analytical data demands lower latency – in other words, data needs to become more and more realtime. Reducing the time from data collection to acting on data leaves little time to identify and react to bugs and makes data quality issues far more costly.

With such trends at play, we can’t rely on humans to ensure data quality, nor can we afford to find out about errors in production, even if those are detected fast. Data bugs are mostly software bugs introduced by people, and data contracts, although designed to be consumed by machines, help prevent human mistakes.

What is a data contract?

A data contract formally describes a programmatic interface to data. Wait, Gleb, isn’t sending SQL to a warehouse programmatic access? It is, but such a query lacks a formal interface and any guarantees. If I run a SELECT * FROM MYTABLE today and tomorrow, how can I be sure I get the same result structure back and that no one has modified the schema of the table or the definitions of its columns?

To provide another analogy, data contracts are to data what APIs are to web services. Say we want to get data from Twitter. One way is to scrape it by downloading and parsing the HTML of Twitter’s webpage. This may work, but our scraper will likely break occasionally – for instance, if Twitter changes the name of a CSS class or the HTML structure. There is no contract between Twitter’s web page and our scraper. However, if we access the same data via Twitter’s API, we know exactly the structure of the response we’re going to get. An API has required inputs, predictable outputs, error codes, SLAs (service level agreements – e.g. uptime), terms of use, and other important properties. Importantly, an API is also versioned, which helps ensure that changes to the API won’t break end users’ applications; to take advantage of those changes, users can gracefully migrate to the new version.

So what might an actual data contract look like? This is a toy example of a data contract for a table containing temperature readings in a data warehouse:
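
Sketched here as a plain Python dictionary (the table name, owner, SLA, and value ranges are illustrative and not a formal specification format), it might contain something like:

    # A hypothetical contract for a table of temperature readings.
    # All names, ranges, and SLA values are illustrative.
    temperature_readings_contract = {
        "table": "analytics.temperature_readings",
        "owner": "data-platform-team",
        "version": "1.0.0",
        "update_sla": {"frequency": "hourly", "max_delay_minutes": 30},
        "columns": {
            "sensor_id": {"type": "string", "nullable": False},
            "measured_at": {"type": "timestamp", "nullable": False},
            "temperature_c": {
                "type": "float",
                "nullable": False,
                "min": -100,  # the range used in the examples below
                "max": 60,
            },
        },
    }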

This data contract is essentially a structured description of a table that can be published somewhere and consumed programmatically. What can we do with it?

Data contract use cases:

Validate data in production

In production, we can use contracts to validate data on the fly at the source: if the table contract says that the temperature is always in the [-100;60] range, we can have this checked as soon as the table is refreshed. A BI dashboard application reading this table doesn’t need to worry about messed-up temperature values since the contract sets the expectation.

Same for update time: if SLAs are defined in the contract, data consumers know when they can expect the data, and data producers will be notified if the table is late so they can take action.
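
As a rough sketch of how such production-time validation could work (the function, thresholds, and the way rows and timestamps are passed in are illustrative, not any particular tool’s API):

    from datetime import datetime, timedelta, timezone

    # Illustrative contract fragment: the value range and freshness SLA
    # mirror the toy temperature contract above.
    CONTRACT = {
        "columns": {"temperature_c": {"min": -100, "max": 60}},
        "update_sla": {"max_delay_minutes": 30},
    }

    def validate_refresh(rows, last_updated_at, contract=CONTRACT):
        """Run contract checks right after the table is refreshed."""
        errors = []
        # 1. Value-range checks defined in the contract.
        for column, rules in contract["columns"].items():
            for row in rows:
                value = row[column]
                if not (rules["min"] <= value <= rules["max"]):
                    errors.append(f"{column}={value} outside [{rules['min']}; {rules['max']}]")
        # 2. Freshness check: the table must have been updated within its SLA.
        max_delay = timedelta(minutes=contract["update_sla"]["max_delay_minutes"])
        if datetime.now(timezone.utc) - last_updated_at > max_delay:
            errors.append("table update is late against its SLA")
        return errors

    # Example: one bad reading and a timely refresh.
    sample = [{"temperature_c": 21.5}, {"temperature_c": 999.0}]
    print(validate_refresh(sample, datetime.now(timezone.utc)))
    # -> ['temperature_c=999.0 outside [-100; 60]']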

Prevent breaking changes in development

In development, data contracts help us prevent breaking changes by validating new versions of data against the contract. For example, we can have a CI job (e.g. in GitHub Actions) check the schema of the table against the contract and raise an error if someone attempts to modify the schema without updating the contract.
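
A minimal sketch of what such a CI check could look like, assuming the live schema has already been fetched into a simple column-to-type mapping (the column names and the way the schema is obtained are hypothetical):

    # Contract columns for the temperature table from the toy example above.
    CONTRACT_COLUMNS = {
        "sensor_id": "string",
        "measured_at": "timestamp",
        "temperature_c": "float",
    }

    def check_schema(actual_columns, contract_columns=CONTRACT_COLUMNS):
        """Fail the CI job if the table schema drifts from the contract."""
        missing = contract_columns.keys() - actual_columns.keys()
        unexpected = actual_columns.keys() - contract_columns.keys()
        retyped = {
            col for col in contract_columns.keys() & actual_columns.keys()
            if contract_columns[col] != actual_columns[col]
        }
        problems = []
        if missing:
            problems.append(f"columns removed: {sorted(missing)}")
        if unexpected:
            problems.append(f"columns added without a contract update: {sorted(unexpected)}")
        if retyped:
            problems.append(f"column types changed: {sorted(retyped)}")
        if problems:
            # A non-zero exit code fails the CI step.
            raise SystemExit("Schema violates the data contract: " + "; ".join(problems))

    # Example: someone renamed temperature_c without touching the contract.
    check_schema({"sensor_id": "string", "measured_at": "timestamp", "temp_celsius": "float"})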

Wait, you may think, what if someone updates the table schema by modifying the contract – how would that keep downstream data uses from breaking? Just like a compiler for a statically typed programming language (or a sophisticated IDE for any language) checks the dependencies between function and variable uses and their definitions, the system that orchestrates our data pipelines would check the required inputs of downstream tasks against the interfaces of their upstream dependencies. For example, Dagster, a competitor to Airflow, allows you to define the inputs and outputs of any task so that they can be verified before the tasks actually run on data – addressing one of Airflow’s chronic pain points.
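
To illustrate the idea generically (this is a sketch of interface checking between tasks, not Dagster’s actual API):

    # Each task declares the columns it produces and the columns it expects,
    # and the orchestrator refuses to run a pipeline whose declarations don't line up.
    def task(produces=None, consumes=None):
        def wrap(fn):
            fn.produces, fn.consumes = produces or {}, consumes or {}
            return fn
        return wrap

    @task(produces={"sensor_id": "string", "temperature_c": "float"})
    def extract_readings():
        ...

    @task(consumes={"sensor_id": "string", "temperature_c": "float"},
          produces={"sensor_id": "string", "temperature_f": "float"})
    def convert_to_fahrenheit():
        ...

    def validate_pipeline(upstream, downstream):
        """Check the interfaces before anything touches real data."""
        for col, dtype in downstream.consumes.items():
            if upstream.produces.get(col) != dtype:
                raise TypeError(
                    f"{downstream.__name__} expects {col}:{dtype}, "
                    f"but {upstream.__name__} produces {upstream.produces.get(col)}"
                )

    validate_pipeline(extract_readings, convert_to_fahrenheit)  # passes; renaming a column upstream would fail here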

Improve discoverability and data understanding

Interestingly, data catalogs (they like to call themselves metadata platforms these days) attempt to aggregate and present similar types of information about data assets as data contracts, without formalizing them. The primary difference between having data contracts in place vs. a catalog is that a catalog is descriptive, whereas contracts are both descriptive and prescriptive, i.e. they define not just how the data looks, but how it must look.

Catalogs are made for humans, and contracts are made primarily for machines (so that they can do the hard work of validation for humans). However, the metadata from data contracts is an excellent source of information for a data catalog to help people discover and understand the data better.

Data contracts in modern data stack

We have seen elements of that in particular layers of the data stack: Iteratively and Avo make it easy to define, evolve, and validate event schemas (collaborating on event schemas in spreadsheets felt like too much even for Excel lovers like me). Dagster brings defined inputs and outputs to data orchestration.

However, these solutions each work in particular layers of the data stack, which means they offer only partial solutions to the problem of end-to-end data reliability. The weak link in the data chain is almost always the interface between different technologies or products.

Let’s imagine what an end-to-end data contract might look like if defined through the entire stack: if our event collection tool were aware that the “promo_code” field in the “user_signup” event is used in dbt model A, which gets pulled into dbt models B and C and eventually into a Hightouch sync to a Mailchimp campaign (while also being renamed to “discount_id” somewhere in the middle), then it could show a warning to a software engineer working on the Identity microservice that fires the “user_signup” event, right in the IDE.

Contracts across the stack?

At the moment it seems unlikely that modern data stack vendors, who mostly speak to each other in SQL, API calls, or pointers to database tables, would adopt a unified data contract interface – unless emerging platform frameworks such as dbt establish one and force it upon everyone.

Widespread, vertical implementations of data contracts seem to exist only in FAANG-level, fully integrated data platforms such as Airbnb’s, which employ internal data tools teams larger than most modern data stack vendors, as well as in some unicorns such as Convoy, which built Chassis (Schemata is a philosophically similar nascent open-source project). It’s unlikely that an average data team would be able to build something like that internally.

The idea of data contracts throughout the stack is extremely powerful but remains largely aspirational for most data teams: the Modern Data Stack shows no signs of converging on a unified cross-tool contract interface, and outside of the MDS everything is possible, but most of us are not Uber.

I believe that over the next year, data contracts will become increasingly widespread. The most promising frameworks will be open source, built by teams at (or from) big data platforms. The adoption of data contracts will start with the most business-critical and time-sensitive data, most likely close to the source (events). And if you are curious how this may unfold, Chris Riccomini draws interesting parallels from the distributed software engineering world.

Handshakes before contracts

If having data contracts everywhere is not attainable anytime soon, what can we do to prevent data from breaking?

We can use data handshakes. In business, handshake agreements are used to agree on the key terms of a deal before formalizing it in scrupulous legal detail. We can apply the same concept to data.

If Silicon Valley runs on handshake deals, data teams can too. 

So what’s a handshake for data?

It’s a pull (or merge, okay, okay, GitLab fans) request.

A pull request is a critical step in the software development process where someone who wants to introduce a change effectively follows a handshake protocol:

  1. Here is my change (description)
  2. Here is the code it modifies (git diff)
  3. It doesn’t break anything (tests pass)
  4. I would like you, owner/contributor/stakeholder, to review it
  5. Once you +1 it, I will merge it and it will go into production

Within the pull request process, one can validate most of the things that data contracts would do (a minimal sketch of these checks follows the list):

  1. Will the schema change and if so, how? (schema diff)
  2. Will the data change and if so, how? (data diff)
  3. What impact will this change have on downstream data applications? (data lineage)
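
Here is a minimal, hand-rolled sketch of those three checks, with tables stubbed out as in-memory rows and a toy lineage mapping (this is not Datafold’s implementation):

    def schema_diff(prod_cols, staging_cols):
        """Columns added or removed by the change."""
        return {
            "added": sorted(staging_cols - prod_cols),
            "removed": sorted(prod_cols - staging_cols),
        }

    def data_diff(prod_rows, staging_rows, key):
        """Primary keys whose rows differ between production and the PR's staging build."""
        prod = {r[key]: r for r in prod_rows}
        staging = {r[key]: r for r in staging_rows}
        return [k for k in prod.keys() & staging.keys() if prod[k] != staging[k]]

    def downstream_impact(changed_columns, lineage):
        """Downstream assets that read any of the changed columns."""
        return sorted({asset for col in changed_columns for asset in lineage.get(col, [])})

    # Example run for a PR touching the temperature model.
    prod = [{"sensor_id": "a", "temperature_c": 21.5}]
    staging = [{"sensor_id": "a", "temperature_c": 21.0}]
    print(schema_diff({"sensor_id", "temperature_c"}, {"sensor_id", "temperature_c", "humidity"}))
    print(data_diff(prod, staging, key="sensor_id"))
    print(downstream_impact(["temperature_c"], {"temperature_c": ["weather_dashboard", "ml_forecast"]}))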

  • If every part of the data processing pipeline is version-controlled
  • If one can easily know the impact of every change (the Know What You Ship principle)
  • If every change is staged and reviewed as a pull request before getting into production

Then you can achieve data reliability without implementing data contracts throughout the entire data platform!

We still need contracts, but handshakes get you far enough.

If your PR concerns data-processing code, e.g. SQL within dbt/Airflow/Dagster, you know there is a caveat: it is very hard to tell from reading just the source code diff how the data will change, let alone grok the impact on dashboards and ML models multiple steps downstream. This is what Datafold solves with powerful data diffs to understand changes to data and column-level lineage for impact analysis. Take it for a spin – it’s free to try.