dbt exposures: What they are and how to use them
Learn how dbt Exposures extend your dbt DAG to identify downstream assets like BI dashboards, and why they must be manually defined and maintained in YAML.
The output of our work as data practitioners is data products: datasets, dashboards, reports, ML models. No matter how complex or lengthy the pipelines are, it's the final data products that make an impact on the business.
Therefore, it’s essential to understand how data consumers use the data produced by dbt pipelines—usually through some form of data lineage. While dbt ships with automatic table-level lineage (and dbt Cloud provides column-level lineage) via dbt docs, this lineage automatically tracks only dbt source tables and models, not how those models are used by the business.
dbt Exposures extend dbt's native docs and allow dbt developers to document end-user data products and their dependencies within the dbt DAG (directed acyclic graph). Exposures can help answer questions such as:
- Which dbt models are upstream of dashboard X?
- Who owns dashboard X?
- Which downstream applications may break if we update model Y?
dbt exposures: Example of usage
Exposures are defined in YAML files nested under the "exposures" key.
Code example of a BI dashboard as a dbt Exposure
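A minimal sketch of what such a definition looks like, following the standard dbt exposure schema (the model names, URL, and owner details here are hypothetical):

```yaml
version: 2

exposures:
  - name: weekly_kpi_dashboard
    label: Weekly KPI Dashboard
    type: dashboard            # one of: dashboard, notebook, analysis, ml, application
    maturity: high
    url: https://bi.example.com/dashboards/42
    description: >
      Executive dashboard tracking weekly revenue and activation KPIs.
    depends_on:                # upstream dbt models this dashboard reads from
      - ref('fct_orders')
      - ref('dim_customers')
    owner:
      name: Data Team
      email: data@example.com
```

The `depends_on` list is what wires the exposure into the DAG: each `ref()` creates an edge from the upstream model to the exposure node.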
Exposures in the dbt DAG
Once added to your dbt project via YAML, exposures appear in your dbt DAG (both in the local docs site and in the dbt Cloud documentation site) as orange nodes. As with most lineage graphs, you can click on specific nodes and paths for your exposures to clearly understand upstream and downstream dependencies.
Exposures best practices
As with any governance feature, it’s important to think about best practices when implementing exposures.
#1: Start small with the business-critical exposures
While exposures are a powerful feature, adding exposure tracking for a mature project with thousands of BI and other dependencies can be overwhelming. When getting started with exposures, it’s best to begin by adding them for the ten or so most important data products. These are usually well known, e.g., an executive KPI dashboard or a reverse-ETL sync into the CRM. Starting small lets you and the team get familiar with the exposures framework and facilitates wider adoption.
#2: Establish team guidelines
Once you’ve added exposure tracking for the essential assets, it may be a good time to establish team guidelines, e.g., "every data/analytics engineer should maintain exposures for the BI assets they own" or "when creating a dashboard for stakeholders, always add an exposure."
Having clear guidelines makes it easy to maintain and enforce team-wide curation of exposures.
#3: Keep exposures healthy
As with dbt tests, it’s essential to keep exposures up to date. Once the information in exposures becomes stale (e.g., listed owners are no longer with the company, the BI tool URL is broken, or a deprecated dashboard is still tracked), data team members and business users will eventually lose trust in exposures and stop using them, which is the opposite of what we want. Returning to best practice #1: it’s better to have a few high-quality exposures that stay up to date than hundreds that are stale and untrusted.
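One lightweight way to signal exposure health in the YAML itself is the optional maturity field. In this hypothetical snippet, a dashboard slated for removal is downgraded and its description updated, so anyone browsing the docs sees its status at a glance:

```yaml
exposures:
  - name: legacy_marketing_dashboard
    type: dashboard
    maturity: low   # downgraded: dashboard is scheduled for deprecation
    description: "Deprecated in favor of the new marketing KPI dashboard; remove after Q3."
    depends_on:
      - ref('stg_campaigns')
    owner:
      name: Marketing Analytics
      email: marketing-analytics@example.com
```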
dbt exposures limitations
While exposures are a simple and powerful way to document downstream data applications in a dbt project, they have two fundamental limitations.
dbt exposures must be manually created and maintained with YAML, which does not scale effectively
The more widely data is adopted in the organization (a good thing), the harder it is for the data team to keep track of all downstream uses of the data they produce. In my data engineering days at Lyft, we had over 100 major dashboards across Looker and Tableau and over 10,000 reports in Mode.
Exposures don’t detect potential breakages during code changes
One big reason to have visibility into downstream data uses is to avoid breaking data products when changing dbt code upstream. While exposures make defined dependencies visible in dbt docs, someone still has to go through a (sometimes giant) graph of dependencies to identify potential breaking changes.
Automating exposures with dbt + Datafold
Datafold complements dbt with automated column-level lineage that integrates with all major BI tools. Unlike exposures (which must be defined manually) and dbt’s own data lineage (which is limited to dbt-project assets), Datafold performs full semantic parsing of SQL logs from your data warehouse and combines that with metadata from BI tools to form a complete dependency graph that covers the entire data warehouse, including, but not limited to, data models and BI assets.
Furthermore, when integrated into CI, Datafold automatically computes data diffs showing how the data changes when dbt code is modified, and identifies impacted downstream applications, such as Looker or Tableau dashboards, directly in the pull request.
Conclusion
dbt Exposures are code-defined extensions of your dbt project that identify downstream data assets (like a reverse ETL sync, BI dashboard, or data science model), and they are a useful way to extend your default dbt DAG. Each exposure must be manually created and maintained in YAML. While they can be an effective way to understand the downstream use of your dbt models, they can be challenging to implement at scale.