Top dbt utils and how to use them in your project
Why write complicated logic in dbt when you can use someone else’s? This post will run through how to install and use some popular (and some unsung) dbt utils in your project.
Why write complicated logic in dbt when you can use someone else’s? dbt-utils is a collection of pre-written macros that helps you with things like pivoting, writing generic tests, generating a data spine, and a lot more. This post will run through how to install and use some popular (and some unsung) dbt utils in your project.
Different types of dbt utils + installation
The dbt-utils project in general is maintained by (duh) dbt Labs. Its 125 contributors include a mix of developers from both dbt Labs and the wider data community. At the time of writing, the project repo on GitHub has a little under 900 stars.
Generally speaking, you can categorize most dbt utils into five major groups:
- SQL generators
- Generic tests
- Jinja helpers
- Web macros
- Introspective macros
This list is not exhaustive, but it encompasses most of the commonly used (and widely used) utils chosen by data teams working with dbt. The GitHub README shows the full list of utils, but we’ll highlight some here to give an idea of the range of what’s available:
- SQL generators
i. star
ii. group_by
iii. pivot
iv. union_relations - Generic tests
i. equal_rowcount
ii. at_least_one
iii. not_accepted_values
iv. not_null_proportion - Jinja helpers
i. pretty_time
ii. log_info - Web macros
i. get_url_parameter
ii. get_url_host
iii. get_url_path - Introspective macros
i. get_column_values
ii. get_filtered_columns_in_relation
iii. get_relations_by_pattern
iv. get_query_results_as_dict
To get started with dbt utils, you’ll need to add a '''packages.yml''' file to your project folder. It should contain the following:
At the time of writing, the latest version is 1.1.1. Run the following command to install the new dependencies.
Now, we’re ready to walk through some examples. Let’s start with SQL generators.
SQL generators
Imagine you're a Data Scientist at Amazon, and you need to organize your data into a few downstream tables to prep it for analysis. You have tables related to orders, users, and products. Using star and pivot macros can help simplify your SQL queries and make them easier to maintain.
You want to create a consolidated view from your orders, users, and products tables—but you want to exclude repetitive ID fields, prefix the column names from the users and products tables to avoid confusion. You can use dbt utils!
Star macro
The star macro is used to generate a comma-separated list of fields in the FROM clause of a SELECT statement for multiple tables, excluding certain columns if indicated. You can optionally set a relation_alias, prefix, and suffix.
Let's say you have a fact table called orders, along with two dimension tables for users and products. You can use the star macro like this:
In this scenario, the star macro helps you to select all columns from each table while excluding specified columns (like id), and add prefixes to all columns from the users and products tables to avoid naming conflicts.
Pivot macro
The pivot macro is another popular dbt util that helps with a common yet hard-to-remember task: writing pivot functions with SQL (sigh). The pivot macro makes this task very easy, pivoting values from rows to columns.
Let's say you've conducted A/B tests on three different types of product displays: Display A, Display B, and Display C. You've collected data in a table called display_test with date, display_type, and click_through_rate columns. Now, you want to pivot this data such that each display type becomes a separate column. You can use the pivot macro to do this:
Here, the pivot macro pivots the display_test table on the display_type column, turning each unique display type value into its own column prefixed with ctr_for_, and populating these columns with the corresponding click_through_rate.
Under the hood, the pivot macro is just taking the values that were passed and iterating through them with ({% for value in values %}). With each iteration, it applies the aggregation ({{ agg }}) to each value, and continues until the end of the loop of values is reached.
Generic Testing
If you’ve set up any dbt tests, you’ve probably used some of the generic ones like not_null, of which dbt natively supports only 4. dbt utils gives you, well, more of them. We’ll cover two examples but there are tons more.
equal_rowcount
Although it’s quite basic, the equal_rowcount macro is probably one of the most important tests that can be applied to any data set. This macro simply asserts that two relations—tables, views, or models—have the same number of rows. It returns true if the row count in both relations is the same and false if it is not. Super useful when you’re making a model change and want to verify that the table in dev matches the one in prod.
In this example, we're testing the orders model against the stg_orders model (which could be a staging table for the orders data). The test will pass if both models have the same number of rows, and it will fail otherwise.
After defining the test, you can run it at the command line with dbt test. dbt outputs the results of the test (among your other tests), showing you whether it passed or failed.
not_accepted_values
Another macro that is perfect for testing is not_accepted_values, allowing you to assert that certain columns of a table never take on values from among a specified set. This macro helps to prevent incorrect or inconsistent data from being saved in the database. Note that dbt ships with the accepted_values generic test, and this is just the inverse of that.
Let's assume you have a users table, and you want to ensure that the status column never has the value inactive. You would define this as follows:
The values parameter is a list of unacceptable values.
Jinja helpers
dbt utils include a bunch of useful Jinja helpers. Let’s look at two of them.
pretty_time
pretty_time returns the current timestamp according to the format that you specify. This can be helpful when logging time, troubleshooting, or creating more readable time information. Here's how you would use the pretty_time macro:
log_info
Another extremely common task in pipelines is the logging of formatted messages to STDOUT. This is useful for debugging and monitoring data transformations and ETL pipelines. One such macro to help with this is log_info. For example, let's say you have a transformation for a table called orders, and you want to log a message before and after the transformation. Here is how you would do that:
Web Macros
Another set of extremely useful macros helps you work with URLs in strings. If you’re working with web traffic data, these can be quite clutch.
get_url_parameter
The get_url_parameter macro extracts a URL parameter out of a column containing a web URL.
get_url_host
The get_url_host web macro is similar to the previous macro, but it extracts only the host parameter out of a column containing a web URL. A use case could be counting the number of visitors that come from a specific host.
get_url_path
The last of the three web macros available in dbt utils extracts a page path parameter out of a column that has a web URL.
Introspective Macros
Other more complex macros in dbt utils are introspective macros. They are described this way:
"These macros run a query and return the results of the query as objects. They are typically abstractions over the statement blocks in dbt."
A statement block in Jinja is surrounded by {% %}. Because the results are objects, which in turn can be acted on, this is a very powerful abstraction. Let’s consider a few examples.
get_column_values
The get_column_values macro allows you to get a list of unique values in a column and then use these values dynamically elsewhere in your SQL code. For example, suppose you have a users table with a country column, and you want to get a list of all unique countries.
Here is how you might use the get_column_values macro:
This helps you avoid creating your own CTE to get the unique country data yourself.
get_filtered_columns_in_relation
The get_filtered_columns_in_relation macro returns an iterable Jinja list of columns for a specified relation. In addition, columns can be excluded. Note that the method is case-insensitive. A "relation" refers to any kind of database object that contains data and can be queried. This could be a table, a view, or a materialized view, among other types of database objects. dbt uses the term “relation” as a way to refer to these objects generically, regardless of their specific type.
This can be helpful when renaming columns, building new columns, etc.
get_relations_by_pattern
The get_relations_by_pattern macro obtains a list of relations from your database schema that match a particular pattern. It can be useful in situations where you need to dynamically generate SQL based on the existing structure of your database.
Below is a basic example of how it might be used:
In this example, get_relations_by_pattern is used to get a list of all relations in the my_schema schema that start with my_prefix. The log function is then used to print the name of each relation found.
get_query_results_as_dict
Lastly, let’s look at an introspective macro that deals with dictionaries. This macro returns a dictionary from a query.
In the example below, we’re pulling cities and states from our users table, converting that data into a dict, and creating a tally of the count of users in each city or state.
Any other dbt utils that we missed? Let us know!