September 27, 2022
dbt, Data Monitoring

How to Create dbt Freshness Alerts

Learn how to turn dbt freshness tests into production monitoring alerts using Datafold.

Kyle McNair

dbt freshness tests are a fantastic way to monitor source data, and to ensure that new records are arriving at an expected cadence. However, these tests require you to define this expected cadence, which can be difficult to manage at scale.

So how can we turn dbt tests into alerts that are constantly monitoring production?

For example, one might configure an event source to expect new records every day, like so:



version: 2

sources:
  - name: events
    database: analytics
    schema: events
    freshness:
      error_after:
        count: 1
        period: day
    tables:
      - name: user_signed_up
      - name: page_viewed
      - name: new_feature_configured


Great! We’d definitely expect page events every day, probably expect to see new users sign up every day, and maybe expect our new feature configuration event to occur every couple of days.

After digging around the events data, we realize that the daily period needs a little tuning, and we update the freshness thresholds to match reality. Users sign up every day, page view events should arrive every 12 hours, and our new feature configuration occurs once a week.



version: 2

sources:
  - name: events
    database: analytics
    schema: events
    tables:
      - name: user_signed_up
        freshness:
          error_after:
            count: 1
            period: day
      - name: page_viewed
        freshness:
          error_after:
            count: 12
            period: hour
      - name: new_feature_configured
        freshness:
          error_after:
            count: 7
            period: day


There’s nothing inherently wrong with this, but it’s incredibly manual, and the right thresholds can change over time. What if adoption of our new feature increases dramatically? What if it’s a flop and usage declines? Either scenario would require us to manually update the freshness period, and could result in false alarms or silent failures.

Just the other day, I woke up to a Slack alert that a source table was stale. 😱

My work was interrupted by the alert as I frantically scrambled to investigate. I ran some ad-hoc queries, only to discover that the freshness period on this event simply needed to be updated. To close the loop, I opened a pull request to update the test, which required attention from a colleague. None of this added real value to the business; it only distracted from higher-priority tasks. I’ve personally had this scenario happen to me at least a dozen times.

A potential downside of rigorous testing is alert fatigue. An overwhelming number of alerts, most of which are false alarms, trains the people responsible for responding to become desensitized, and ultimately ignore alerts.

What if . . . you could dynamically configure expected thresholds, and never have to deal with noisy alerts or manually determine thresholds again? Datafold alerts to the rescue!

Datafold alerts utilize an ML-based forecasting model that adapts to seasonality and trends in your data to create dynamic thresholds. You can adjust the sensitivity or add custom anomaly detection to optimize your signal-to-noise ratio.
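Datafold’s forecasting model is proprietary, but the core idea of a dynamic threshold can be illustrated with a toy rolling z-score check. This is a sketch, not Datafold’s actual algorithm; the window size and sensitivity `k` are illustrative assumptions:

```python
from statistics import mean, stdev

def is_anomaly(history, value, window=14, k=3.0):
    """Flag `value` if it falls outside mean +/- k*stdev of the trailing
    `window` observations. A toy stand-in for a real forecasting model:
    the threshold moves with the data instead of being hard-coded."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(recent), stdev(recent)
    return abs(value - mu) > k * max(sigma, 1e-9)

# Daily new-row counts for a steadily growing table...
counts = [100, 110, 105, 120, 115, 125, 118, 122]
print(is_anomaly(counts, 121))  # a typical day → False
print(is_anomaly(counts, 0))    # the table suddenly stopped growing → True
```

A hard-coded `error_after` would need to be retuned as the table’s growth rate changes; a moving baseline like this adapts on its own.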


Ok! So Datafold has a cool chart thingy that can detect anomalies - how can we use this to detect stale data?

High level steps:

  1. Create a dbt snapshot of the information schema to record table metadata over time
  2. Write a SQL query to summarize table changes
  3. Put query results into a Datafold alert

Create a dbt Snapshot of the Information Schema

Add the information schema as a source. In this example, I’m targeting our `analytics` database.



version: 2

sources:
  - name: information_schema
    database: analytics
    schema: information_schema
    tables:
      - name: tables


Create a snapshot of the information schema tables source. This will allow us to see table metadata over time.



{% snapshot information_schema__tables_history %}

{{
    config(
        target_schema='analytics',
        unique_key='unique_key',
        strategy='check',
        check_cols='all',
    )
}}

select *
    , table_schema || table_name as unique_key
from {{ source('information_schema', 'tables') }}

{% endsnapshot %}


Each update to the table will be captured in our snapshot.
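To build intuition for what the `check` strategy produces, here is a toy Python sketch of the validity intervals dbt maintains. Real dbt does this in SQL, and with `check_cols='all'` it compares every column; this sketch watches only `row_count` for brevity:

```python
def apply_snapshot(history, reading, ts):
    """Toy version of dbt's `check` snapshot strategy: when a watched
    column changes for a key, the open row is closed out (dbt_valid_to
    set) and a new row is opened. Unchanged readings are ignored."""
    current = next((r for r in history
                    if r["unique_key"] == reading["unique_key"]
                    and r["dbt_valid_to"] is None), None)
    if current is not None:
        if current["row_count"] == reading["row_count"]:
            return  # nothing changed; keep the open interval
        current["dbt_valid_to"] = ts  # close out the old version
    history.append({**reading, "dbt_valid_from": ts, "dbt_valid_to": None})

history = []
apply_snapshot(history, {"unique_key": "ANALYTICS.PAGEVIEWS", "row_count": 1000}, "2022-09-01")
apply_snapshot(history, {"unique_key": "ANALYTICS.PAGEVIEWS", "row_count": 1000}, "2022-09-02")  # unchanged: no new row
apply_snapshot(history, {"unique_key": "ANALYTICS.PAGEVIEWS", "row_count": 1500}, "2022-09-03")
print(len(history))  # → 2: one version closed on 09-03, one still open
```

Those `dbt_valid_from` / `dbt_valid_to` intervals are exactly what the query in the next section joins against.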

Write a SQL Query to Summarize Table Changes

I was interested in changing row counts per table, so this query focuses on changes in table record counts from day to day. 




with dates as ( -- date spine
    select date as _date
    from analytics.analytics.dim_date
    where _date between current_date - 90 and current_date
)

, metadata as (
    select
        table_name
        , row_count
        , dbt_valid_from
        , dbt_valid_to
    from analytics.analytics.information_schema__tables_history
    where table_schema = 'ANALYTICS'
        and table_type = 'BASE TABLE'
        and table_name in (
            'GRAPHQL_REQUESTS'
            , 'CI_RUNS'
            , 'DIM_DATE'
            , 'ORGANIZATIONS'
            , 'PAGEVIEWS'
        )
)

, daily_row_counts as (
    select
        _date
        , table_name
        , max(row_count) as daily_row_count
    from dates
    inner join metadata
        on _date between metadata.dbt_valid_from::date and ifnull(dbt_valid_to, current_timestamp)
    group by 1, 2
)

, new_row_counts as (
    select
        *
        , lag(daily_row_count, 1) over (partition by table_name order by _date) as prior_day_rows
        , daily_row_count - prior_day_rows as row_diff
    from daily_row_counts
)

select
    _date
    , table_name
    , sum(row_diff) as daily_row_diff
from new_row_counts
where row_diff is not null
group by _date, table_name
;
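The lag-and-diff step is the heart of the query. The same logic in plain Python, with hypothetical daily counts standing in for the snapshot output, makes it easy to see what the window function computes:

```python
from collections import defaultdict

def daily_row_diffs(daily_counts):
    """daily_counts: (date, table_name, row_count) tuples, one per table
    per day. Mirrors lag(daily_row_count) over (partition by table_name
    order by _date): the first day per table has no prior row, so it is
    dropped, just like the `where row_diff is not null` filter."""
    by_table = defaultdict(list)
    for _date, table, count in sorted(daily_counts):
        by_table[table].append((_date, count))
    diffs = []
    for table, rows in by_table.items():
        for (_prev_date, prev), (cur_date, cur) in zip(rows, rows[1:]):
            diffs.append((cur_date, table, cur - prev))
    return diffs

counts = [
    ("2022-09-01", "PAGEVIEWS", 1000),
    ("2022-09-02", "PAGEVIEWS", 1500),
    ("2022-09-03", "PAGEVIEWS", 1500),  # no new rows: diff of 0
]
print(daily_row_diffs(counts))
# → [('2022-09-02', 'PAGEVIEWS', 500), ('2022-09-03', 'PAGEVIEWS', 0)]
```

Note that a diff of 0 is a real data point, not a gap; it’s exactly the signal that lets the alert catch a table that should be growing but isn’t.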

Example output with date, table name, and the daily change in record counts:

Put Query Results into a Datafold Alert

Finally, creating a Datafold alert is as simple as dropping the above query into the Datafold alert UI. To properly scale, it might make sense to park this query in a dbt model.

Voila! 

Table with thousands of new daily records:

Table with few new daily records:

Table that should never change:

You now have alerts with dynamically determined thresholds, and can subscribe to Slack alerts or emails with just a few clicks.

Notes:

  • This method only applies to materialized tables. Views have no record counts, and thus we can’t detect changes in row counts for views. 
  • This guide is Snowflake specific, but could be easily adapted to the information schema of another database type.