June 20, 2023

dbt, Modern Data Stack, GitHub, Data Management, Data Quality Best Practices

Top dbt utils and how to use them in your project

Why write complicated logic in dbt when you can use someone else’s? This post will run through how to install and use some popular (and some unsung) dbt utils in your project.

No items found.

Datafold Team

Top dbt utils and how to use them in your project

Why write complicated logic in dbt when you can use someone else’s? dbt-utils is a collection of pre-written macros that helps you with things like pivoting, writing generic tests, generating a data spine, and a lot more. This post will run through how to install and use some popular (and some unsung) dbt utils in your project.

Different types of dbt utils + installation

The dbt-utils project in general is maintained by (duh) dbt Labs. Its 125 contributors include a mix of developers from both dbt Labs and the wider data community. At the time of writing, the project repo on GitHub has a little under 900 stars.

Generally speaking, you can categorize most dbt utils into five major groups:

SQL generators
Generic tests
Jinja helpers
Web macros
Introspective macros

This list is not exhaustive, but it encompasses most of the commonly used (and widely used) utils chosen by data teams working with dbt. The GitHub README shows the full list of utils, but we’ll highlight some here to give an idea of the range of what’s available:

SQL generators
i. star
ii. group_by
iii. pivot
iv. union_relations
Generic tests
i. equal_rowcount
ii. at_least_one
iii. not_accepted_values
iv. not_null_proportion
Jinja helpers
i. pretty_time
ii. log_info
Web macros
i. get_url_parameter
ii. get_url_host
iii. get_url_path
Introspective macros
i. get_column_values
ii. get_filtered_columns_in_relation
iii. get_relations_by_pattern
iv. get_query_results_as_dict

To get started with dbt utils, you’ll need to add a '''packages.yml''' file to your project folder. It should contain the following:


packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

At the time of writing, the latest version is 1.1.1. Run the following command to install the new dependencies.


$ dbt deps

20:58:33  Running with dbt=1.5.1
20:58:33  Installing dbt-labs/dbt_utils
20:58:34  Installed from version 1.1.1
20:58:34  Up to date!

Now, we’re ready to walk through some examples. Let’s start with SQL generators.

SQL generators

Imagine you're a Data Scientist at Amazon, and you need to organize your data into a few downstream tables to prep it for analysis. You have tables related to orders, users, and products. Using star and pivot macros can help simplify your SQL queries and make them easier to maintain.

You want to create a consolidated view from your orders, users, and products tables—but you want to exclude repetitive ID fields, prefix the column names from the users and products tables to avoid confusion. You can use dbt utils!

Star macro

The star macro is used to generate a comma-separated list of fields in the FROM clause of a SELECT statement for multiple tables, excluding certain columns if indicated. You can optionally set a relation_alias, prefix, and suffix.

Let's say you have a fact table called orders, along with two dimension tables for users and products. You can use the star macro like this:


WITH base AS (
  SELECT 
    {{ dbt_utils.star(from=ref('orders'),
                      except=['users_id', 'products_id']) }},
    {{ dbt_utils.star(from=ref('users'),
                      except=['id'],
                      prefix='user_') }},
    {{ dbt_utils.star(from=ref('products'),
                      except=['id'], prefix='product_') }}
  FROM {{ ref('orders') }}
  LEFT JOIN {{ ref('users') }} ON orders.users_id = users.id
  LEFT JOIN {{ ref('products') }} ON orders.products_id = products.id
)

SELECT * FROM base

In this scenario, the star macro helps you to select all columns from each table while excluding specified columns (like id), and add prefixes to all columns from the users and products tables to avoid naming conflicts.

Pivot macro

The pivot macro is another popular dbt util that helps with a common yet hard-to-remember task: writing pivot functions with SQL (sigh). The pivot macro makes this task very easy, pivoting values from rows to columns.

‍

Let's say you've conducted A/B tests on three different types of product displays: Display A, Display B, and Display C. You've collected data in a table called display_test with date, display_type, and click_through_rate columns. Now, you want to pivot this data such that each display type becomes a separate column. You can use the pivot macro to do this:


WITH base AS (
  SELECT date, display_type, click_through_rate
  FROM {{ ref('display_test') }}
),

pivoted AS (
  SELECT 
    date,
    {{ dbt_utils.pivot(column_name='display_type',
                       values=['Display A', 'Display B', 'Display C'],
                       column_prefix='ctr_for_') }}
  FROM base
)

SELECT * FROM pivoted

Here, the pivot macro pivots the display_test table on the display_type column, turning each unique display type value into its own column prefixed with ctr_for_, and populating these columns with the corresponding click_through_rate.

Under the hood, the pivot macro is just taking the values that were passed and iterating through them with ({% for value in values %}). With each iteration, it applies the aggregation ({{ agg }}) to each value, and continues until the end of the loop of values is reached.

Generic Testing

If you’ve set up any dbt tests, you’ve probably used some of the generic ones like not_null, of which dbt natively supports only 4. dbt utils gives you, well, more of them. We’ll cover two examples but there are tons more.

equal_rowcount

Although it’s quite basic, the equal_rowcount macro is probably one of the most important tests that can be applied to any data set. This macro simply asserts that two relations—tables, views, or models—have the same number of rows. It returns true if the row count in both relations is the same and false if it is not. Super useful when you’re making a model change and want to verify that the table in dev matches the one in prod.

In this example, we're testing the orders model against the stg_orders model (which could be a staging table for the orders data). The test will pass if both models have the same number of rows, and it will fail otherwise.


version: 2

models:
  - name: orders
    columns:
      - name: id
        tests:
          - dbt_utils.equal_rowcount:
              compare_model: ref('stg_orders')

After defining the test, you can run it at the command line with dbt test. dbt outputs the results of the test (among your other tests), showing you whether it passed or failed.

not_accepted_values

Another macro that is perfect for testing is not_accepted_values, allowing you to assert that certain columns of a table never take on values from among a specified set. This macro helps to prevent incorrect or inconsistent data from being saved in the database. Note that dbt ships with the accepted_values generic test, and this is just the inverse of that.

Let's assume you have a users table, and you want to ensure that the status column never has the value inactive. You would define this as follows:


version: 2

models:
  - name: users
    columns:
      - name: status
        tests:
          - dbt_utils.not_accepted_values:
              values: ['inactive']

The values parameter is a list of unacceptable values.

Looking to level up data testing? Download our new guide to testing with dbt

Discover the goals of testing with dbt, types of tests, principles and best practices for an effective testing strategy.

Jinja helpers

dbt utils include a bunch of useful Jinja helpers. Let’s look at two of them.

pretty_time

pretty_time returns the current timestamp according to the format that you specify. This can be helpful when logging time, troubleshooting, or creating more readable time information. Here's how you would use the pretty_time macro:


SELECT
  {{ dbt_utils.pretty_time(format='%Y-%m-%d %H:%M:%S') }} AS pretty_now

log_info

Another extremely common task in pipelines is the logging of formatted messages to STDOUT. This is useful for debugging and monitoring data transformations and ETL pipelines. One such macro to help with this is log_info. For example, let's say you have a transformation for a table called orders, and you want to log a message before and after the transformation. Here is how you would do that:


{{ dbt_utils.log_info('Starting transformation for the orders table') }}

WITH base AS (
  SELECT * FROM {{ ref('orders') }}
),

transformations AS (
  SELECT
    id,
    date,
    product_id,
    quantity
  FROM base
)

SELECT * FROM transformations

{{ dbt_utils.log_info('Finished transformation for the orders table') }}

Web Macros

Another set of extremely useful macros helps you work with URLs in strings. If you’re working with web traffic data, these can be quite clutch.

get_url_parameter

The get_url_parameter macro extracts a URL parameter out of a column containing a web URL.


SELECT
    url,
    {{ dbt_utils.get_url_parameter(
        field='page_url',
        url_parameter='utm_source'
       ) }} as utm_source
FROM 
    {{ ref('web_traffic') }}

get_url_host

The get_url_host web macro is similar to the previous macro, but it extracts only the host parameter out of a column containing a web URL. A use case could be counting the number of visitors that come from a specific host.


SELECT
    COUNT(*) as cnt,
    {{ dbt_utils.get_url_host(field='page_url') }} as host

FROM 
    {{ ref('web_traffic') }}
    GROUP BY {{ dbt_utils.get_url_host(field='page_url') }}

get_url_path

The last of the three web macros available in dbt utils extracts a page path parameter out of a column that has a web URL.


SELECT
    {{ dbt_utils.get_url_path(field='page_url') }}
 as path

FROM 
    {{ ref('web_traffic') }}

Introspective Macros

Other more complex macros in dbt utils are introspective macros. They are described this way:

"These macros run a query and return the results of the query as objects. They are typically abstractions over the statement blocks in dbt."

A statement block in Jinja is surrounded by {% %}. Because the results are objects, which in turn can be acted on, this is a very powerful abstraction. Let’s consider a few examples.

get_column_values

The get_column_values macro allows you to get a list of unique values in a column and then use these values dynamically elsewhere in your SQL code. For example, suppose you have a users table with a country column, and you want to get a list of all unique countries.

Here is how you might use the get_column_values macro:


{% set unique_countries = dbt_utils.get_column_values(
  table='users',
  column='country',
  max_records=100
) %}

-- now you can use `unique_countries` elsewhere in your code
select *
from {{ ref('orders') }}
where country in ({{ unique_countries | join(', ') }})

This helps you avoid creating your own CTE to get the unique country data yourself.

get_filtered_columns_in_relation

The get_filtered_columns_in_relation macro returns an iterable Jinja list of columns for a specified relation. In addition, columns can be excluded. Note that the method is case-insensitive. A "relation" refers to any kind of database object that contains data and can be queried. This could be a table, a view, or a materialized view, among other types of database objects. dbt uses the term “relation” as a way to refer to these objects generically, regardless of their specific type.


{% set column_names = 
     dbt_utils.get_filtered_columns_in_relation(from=ref('users'),
     except=["user_name", "password"])
%}


...


{% for column_name in column_names %}
     max({{ column_name }}) ... as max_'{{ column_name }}',
{% endfor %}

This can be helpful when renaming columns, building new columns, etc.

get_relations_by_pattern

The get_relations_by_pattern macro obtains a list of relations from your database schema that match a particular pattern. It can be useful in situations where you need to dynamically generate SQL based on the existing structure of your database.

Below is a basic example of how it might be used:


{% set relations = dbt_utils.get_relations_by_pattern(pattern='my_schema.my_prefix*') %}

{% for relation in relations %}
    {{ log("Found relation: " ~ relation, info=True) }}
{% endfor %}

In this example, get_relations_by_pattern is used to get a list of all relations in the my_schema schema that start with my_prefix. The log function is then used to print the name of each relation found.

get_query_results_as_dict

Lastly, let’s look at an introspective macro that deals with dictionaries. This macro returns a dictionary from a query.

In the example below, we’re pulling cities and states from our users table, converting that data into a dict, and creating a tally of the count of users in each city or state.


{% set sql_statement %}
    select city, state from {{ ref('users') }}
{% endset %}

{%- set places = dbt_utils.get_query_results_as_dict(sql_statement) -%}

select

    {% for city in places['CITY'] | unique -%}
      sum(case when city = '{{ city }}' then 1 else 0 end) as users_in_{{ dbt_utils.slugify(city) }},
    {% endfor %}

    {% for state in places['STATE'] | unique -%}
      sum(case when state = '{{ state }}' then 1 else 0 end) as users_in_{{ state }},
    {% endfor %}

    count(*) as total_total

from {{ ref('users') }}