How to use dbt source freshness tests to detect stale data

You’ve probably heard the term freshness thrown around in the data world. But what does freshness even mean? Freshness refers to how up-to-date your data is. It’s an important measure of data quality because it tells you whether data has been ingested, transformed, or orchestrated on the schedule you expect. Freshness becomes even more critical for businesses using near real-time data, since they need to make decisions on the most recent data rather than stale data.

Importance of testing data sources for freshness 

While testing freshness at every level is a best practice, it is most important to test it at the source. When I say testing at the source, I mean testing the raw data that is being ingested into your data warehouse. This is essentially the first step in using data for analytics and eventually data transformation. 

Testing at the source helps you detect issues in your data as it is ingested, before it has the chance to affect your data models. It also helps in the debugging process if one of your downstream data models fails. You can home in on where the issue occurred (at the source or at the model level) rather than needing to check both. If both a source and a model fail, it’s most likely a source issue. If the model fails but the source does not, you know to debug the transformation logic first.

Freshness ensures you have data being ingested and/or generated at the time cadence you expect. I’ve dealt with issues in the past where ingestion tools said data was being ingested into the warehouse, only to find out that it wasn’t properly working for over two weeks. This resulted in two and a half weeks of lost data! If I had added freshness tests at the source earlier, this would have never happened. 

Adding freshness tests to your sources

dbt source freshness is a dbt command to test whether your source tables have been updated within a desired frequency. Freshness tests are configured using a field named loaded_at_field, a timestamp column you wish to use to check freshness. dbt looks at this timestamp column in your source and compares it to a time cadence you set: you choose a period of minute, hour, day, etc. and specify a numeric count. If the most recent value in that column is older than the time the test is run minus that cadence, the test will either fail or warn you.

Luckily, dbt makes it easy to add freshness tests to all of your data sources. You simply add a freshness block to a source or source table in the YAML file where it is defined. 

sources:
  - name: google_sheets
    database: raw
    schema: google_sheets
    tables:
      - name: customer_emails
        freshness:
          warn_after:
            count: 1
            period: day
        loaded_at_field: created_at

The loaded_at_field is where you specify the column in your source table that you want to use to check freshness. Because the loaded_at_field is compared to the time the freshness check is run in your data warehouse, make sure the timezones match. I recommend converting all of the columns you use for this field to UTC in order to prevent timezone errors.
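
The loaded_at_field also accepts a SQL expression, so one option is to do that conversion inline. Here is a minimal sketch, assuming a Snowflake-style convert_timezone function and a hypothetical created_at_local column:

      - name: customer_emails
        freshness:
          warn_after:
            count: 1
            period: day
        loaded_at_field: "convert_timezone('UTC', created_at_local)"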

It’s important to note that if you add a freshness test to an entire source, rather than to an individual source table, the loaded_at_field column needs to exist with the same name in every table within that source. You also need to ensure that the same freshness thresholds make sense for every table. I tend to recommend against this, because it’s unlikely that every source table is updated in the same way.
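
For completeness, here is a sketch of what a source-level configuration looks like, with freshness and loaded_at_field set once for the whole google_sheets source (the customer_addresses table is a made-up example, and individual tables can still override these values):

sources:
  - name: google_sheets
    database: raw
    schema: google_sheets
    freshness:
      warn_after:
        count: 1
        period: day
    loaded_at_field: created_at
    tables:
      - name: customer_emails
      - name: customer_addresses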

dbt source freshness tests give you the option to configure both error_after and warn_after thresholds, or one and not the other. An error_after threshold will fail the test if the data is older than that cutoff, while a warn_after threshold will simply make you aware that the data doesn’t meet your expectations.

I typically like to set the warn_after to be stricter, so a shorter time period, and the error_after a few hours or days longer depending on your use case. This way, you have time to investigate and fix the issue before the test fails entirely.
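
As a sketch, that pattern looks something like this (the specific counts are just an example):

      - name: customer_emails
        freshness:
          warn_after:
            count: 1
            period: day
          error_after:
            count: 3
            period: day
        loaded_at_field: created_at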

Filtering your freshness tests

While this is not necessary, dbt also gives you the option to filter the data that your freshness tests are applied to. You simply specify a filter block with the condition you wish to filter your data by. dbt adds this condition as a WHERE clause to the compiled SQL when the freshness test is run.

    tables:
      - name: customer_emails
        freshness:
          warn_after:
            count: 1
            period: day
          filter: is_deleted = 'false'
        loaded_at_field: created_at

Here, we are filtering for records that haven’t been deleted. If a record was deleted in our data source, we don’t care about its freshness. By filtering out all of the records that have been deleted, our freshness test will run faster and we may save on compute costs. 
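
To make the behavior concrete, the SQL dbt compiles for this check looks roughly like the following. This is a simplified sketch; the exact query varies by warehouse and dbt version:

SELECT
  MAX(created_at) AS max_loaded_at,
  CURRENT_TIMESTAMP AS snapshotted_at
FROM raw.google_sheets.customer_emails
WHERE is_deleted = 'false'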

Determining your freshness values

Choosing the right freshness values is the most important thing when implementing freshness tests. It is important to consider how often you refresh your data sources in your data warehouse. You need to consider what is normal for each source table. For example, you wouldn’t want to test the freshness of your users table for new records every hour if it’s unlikely for your website to get new users every hour.

I recommend running a query to look at the maximum time between your loaded_at_field column values, for each source table. This will help you to judge what is normal for a particular source where ingestion times are less predictable. 

WITH next_time AS (

  SELECT
    customer_id,
    created_at,
    LEAD(created_at) OVER (ORDER BY created_at ASC) AS next_time
  FROM {{ source('google_sheets', 'customer_emails') }}
),

time_difference AS (

  SELECT
    datediff(hour, created_at, next_time) AS time_difference
  FROM next_time
)

SELECT MAX(time_difference) FROM time_difference

Once you run this query, consider this number alongside the jobs that refresh your sources and ingest new data. Remember, freshness is also determined by how often a source’s data is ingested into your warehouse. For example, users may be visiting your website every minute, but if your application data is only ingested twice daily, it’s not reasonable to set a freshness test for 1 hour. Instead, it should be every 12 hours, matching when the table is refreshed.
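
In that scenario, the configuration would look something like this (a sketch using a hypothetical application_db source and users table):

sources:
  - name: application_db
    tables:
      - name: users
        freshness:
          warn_after:
            count: 12
            period: hour
        loaded_at_field: created_at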

It’s also important to consider the potential for alert fatigue when determining your freshness values. If you are getting too many warnings or errors, ask yourself if these tests are even necessary. Can you readjust the freshness period you set? What’s the criticality of the data? Ask yourself, if you get an alert and don’t jump to fix it, do you need that alert at all?

In my last role, I had set a freshness alert on a Google Sheet that the marketing team was supposed to update weekly. In reality, the person was updating it monthly rather than weekly. I kept getting alerts and ignoring them because I knew they probably hadn’t gotten around to updating it yet. Rather than continually getting these alerts, I updated the freshness to a monthly check. 

How to run freshness tests

I always recommend that freshness tests be run before any of your dbt data models run. There are a few different ways to do this, depending on whether you are using dbt Core or dbt Cloud.

The first thing to note is that the dbt build command does not include source freshness checks. While this is a popular command for running data models and their corresponding tests, it will not run freshness tests on your sources. 

For this reason, I always recommend explicitly specifying the command(s) that will run your freshness tests. dbt source freshness is the most common way to do this because, with no arguments, it runs the freshness tests on all of your sources. However, if you want to check a specific source, you can use the --select flag to do so.
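
Run with no arguments, the command checks every source that has a freshness block defined:

dbt source freshness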

If I wanted to run all of the freshness tests on my Google Sheets source, I would run:

dbt source freshness --select source:google_sheets 

I can also run a freshness test for a specific table within a source. If I wanted to run this for the customer_emails table within Google Sheets, I would run:

dbt source freshness --select source:google_sheets.customer_emails

With dbt Cloud, you have the option to run freshness checks on all of your sources before your job executes. By doing it this way, rather than adding a dbt source freshness command to the job, your job will still run even if a freshness check fails. When you use the dbt source freshness command as a step within the job itself, a failed check means your subsequent models will never be triggered to run.
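
For example, if you did list the command as an explicit step in a job, or in a dbt Core orchestration script, a failure in the first command would stop the build step from ever running. A sketch of that step list:

dbt source freshness
dbt build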

Once freshness tests are run, the results are written to a file called sources.json in the target directory, alongside the other artifacts dbt compiles. Each invocation overwrites this file, so it reflects the latest run; if you store these artifacts over time, they give you a good overview of the history of your freshness tests and which ones seem to fail most often.
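
As a quick sketch of how you might inspect the latest results from the command line (assuming jq is installed; the field names follow the artifact schema in recent dbt versions):

jq '.results[] | {unique_id, status, max_loaded_at}' target/sources.json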

Other ways to check for data freshness

In addition to checking sources for freshness, you can also add freshness tests to any of your data models. However, freshness tests for models are not offered as dbt generic tests. Instead, you need to use dbt packages like dbt_expectations, dbt_utils, or re_data.
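
These packages are installed through your packages.yml file and dbt deps. Here is a sketch; the version ranges are illustrative, so check each package’s hub page for current versions:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
  - package: re-data/re_data
    version: [">=0.10.0", "<0.11.0"]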

Using dbt_expectations 

dbt_expectations also offers a way to test the freshness of your models. You can use the expect_row_values_to_have_recent_data test on any timestamp column in a model, specifying the time period within which you expect the most recent value in that column to fall.

models:
  - name: stg_customer_emails
    columns:
      - name: created_at 
        tests:
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: day
              interval: 10

Here, if the column doesn’t have data as recent as 10 days ago, the test will fail. With this test, it is important that you specify it on a column that represents the freshness of the table as a whole. Unlike source freshness tests, you are testing an individual column rather than the source table.

Using dbt_utils

The dbt_utils package contains a recency test which also tests for freshness. This test is added to a model and asserts that a timestamp column in the model contains data that is at least as recent as the specified date interval. 

models:
  - name: stg_customer_emails
    tests:
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 10

As you can see, this test uses inputs similar to the other two tests.

Using re_data

Lastly, the dbt package called re_data also offers freshness tests. While re_data allows you to monitor many metrics beyond freshness, it includes freshness as one you can calculate. It uses the same timestamp column, which you specify as the time_filter, for all of its calculated metrics. As with the other tests, this should be a column that represents when a record was created.

{{
  config(
    re_data_monitored=true,
    re_data_time_filter='created_at',
    re_data_metrics={'table': ['freshness']},
    re_data_anomaly_detector={'name': 'modified_z_score', 'threshold': 3.0}
  )
}}

re_data tests are configured in the dbt_project.yml file or added as a config at the top of the model you are applying the test to. Above is an example of what a config block for adding a freshness test would look like. 
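
If you prefer the dbt_project.yml route, a sketch of the equivalent configuration might look like this; the project and folder names are placeholders, and re_data’s docs should be checked for the exact keys your version supports:

models:
  my_project:
    staging:
      +re_data_monitored: true
      +re_data_time_filter: created_at
      +re_data_metrics:
        table:
          - freshness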

Conclusion

Testing your data sources for freshness is essential for having accurate and reliable data models. When the source data is not fresh, you can’t count on the data quality of the downstream data models and dashboards to be high. Adding dbt generic freshness tests is a great first step to ensuring you are always ingesting the data you think you are.

I recommend re-evaluating and adjusting these freshness tests to fit your ever-changing needs. The warn_after and error_after periods you set today may not always make sense. As the data and how you use it changes, so should the level of freshness required of your data sources. Testing is an ever-changing game that needs close attention.

And, when you are confident with your source freshness, you also have the option to add in freshness at other layers within your data stack using dbt packages and other data quality tools.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
