How a Global Credit Rating Agency ensures data quality for their AI strategy with Datafold

About

In the financial analytics industry, few players are as central to the ecosystem as this global credit rating and risk assessment leader, which provides data and AI solutions that help businesses and governments make data-driven decisions. Delivering accurate data to clients is core to that mission. Despite having a robust CI/CD testing workflow, the data engineering team struggled with novel data issues that slipped through and delayed client deliverables. By integrating Datafold Cloud's data diffing capabilities into their CI process and dbt project, the team boosted data quality and deployment confidence at scale.

Data team size: 20+
Total employees: 1000+
Data stack: dbt Cloud, Amazon Redshift, PostgreSQL, GitHub
Key metric: 99% certainty that changes won't break data

Customer quote

"With Datafold, we know that we are delivering reliable data to our downstream consumers and machine learning models. We don't have to worry as much."

Senior Data Engineer

The challenge: Preventing novel data issues from undermining client deliverables 

The data engineering team already had a CI/CD workflow with dbt tests to ensure that any code changes underwent standardized testing before deployment to production environments. Ensuring high data quality was critical to their AI strategy, as their ML chatbots depend on the integrity of the data pipelines feeding into them. 

Whenever a pull request was opened, the CI/CD pipeline ran dbt tests only on the modified dbt models (i.e., Slim CI). Because the team averaged 25 PRs a week, they also ran a second, more comprehensive round of checks overnight to keep the number of concurrent PRs manageable. During the overnight process, all open PRs were merged into a branch separate from the main branch of the repository, and the dbt tests were then executed on every dbt model. These tests included referential integrity checks and other data quality checks to validate all code changes and the overall integrity of the data pipeline.
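
The two test scopes described above can be sketched as follows. This is only an illustration, not the team's actual pipeline: it builds the dbt argument lists using dbt's `state:modified` selector and `--state` flag, and the helper name `dbt_test_command` and the `prod-artifacts` directory are hypothetical.

```python
from typing import List, Optional


def dbt_test_command(full_run: bool, state_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for a dbt test invocation.

    full_run=True  -> overnight job: test every model in the project.
    full_run=False -> Slim CI: test only models modified relative to the
                      production artifacts stored in state_dir.
    """
    if full_run:
        return ["dbt", "test"]
    if state_dir is None:
        raise ValueError("Slim CI needs a state_dir with production artifacts")
    return ["dbt", "test", "--select", "state:modified", "--state", state_dir]


# "prod-artifacts" is a hypothetical location for the production manifest.
pr_command = dbt_test_command(full_run=False, state_dir="prod-artifacts")
nightly_command = dbt_test_command(full_run=True)
```

Either list could be passed to `subprocess.run` by a CI job. In both cases only the tests that someone has already written are executed, which is the limitation the incident below exposed.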

Despite this robust setup, the team ran into a significant data issue not covered by preexisting tests. During their month-end run, a QA engineer found a discrepancy between the aggregate numbers projected for the current month and the previous month, which halted data delivery to the customer. In their retrospective, they discovered that a single PR with a novel data issue was not detected by the CI/CD pipeline and was allowed to merge into production. 

This data quality incident underscored the need for a data quality testing tool that went beyond the limited scope of dbt tests. Those tests can only catch problems that developers have anticipated and written a test for, not the much wider range of failure scenarios that nobody thought to check.

The solution: Datafold's data diffs surface the downstream impact of all code changes

Datafold emerged as a perfect CI/CD integration for conducting more thorough and proactive testing of code changes. By integrating Datafold into their existing workflow, the company could not only run more checks in the background for open PRs, but also analyze the potential impact on data quality and integrity before merging to production. This proactive approach empowered the team to identify issues early in the development cycle, preventing data discrepancies from reaching production environments.

Scrutinizing each data point with Data Diffs

Unlike other data quality frameworks, Datafold Cloud runs value-level diffs, which means that the credit rating agency could now see exactly which values changed between two versions of a dataset. This granularity allowed the team to capture both anticipated and unexpected data changes. Because Datafold integrates into the CI pipelines for their dbt projects, the organization saw the benefits of greater data transparency and higher data quality immediately. As a senior data engineer at the company remarked, integrating Datafold into their workflow was seamless: "Datafold [was] very easy to just snap into this architecture."
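
To make "value-level" concrete, here is a minimal sketch of the idea in plain Python. It is not Datafold's implementation: it assumes both versions of a table fit in memory as lists of dictionaries and share a primary key column, and the function name `value_level_diff` and the sample rows are made up for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def value_level_diff(prod: List[Row], dev: List[Row], key: str) -> Dict[str, list]:
    """Compare two versions of a table row by row and column by column."""
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    # Keys that exist on only one side are added or removed rows.
    added = sorted(dev_by_key.keys() - prod_by_key.keys())
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())

    # For keys present on both sides, record every column whose value differs.
    changed = []
    for k in sorted(prod_by_key.keys() & dev_by_key.keys()):
        before, after = prod_by_key[k], dev_by_key[k]
        for column in sorted(before.keys() | after.keys()):
            if before.get(column) != after.get(column):
                changed.append((k, column, before.get(column), after.get(column)))

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample data: production build vs. the build from a PR branch.
prod_rows = [
    {"id": 1, "rating": "AA", "score": 90},
    {"id": 2, "rating": "B", "score": 55},
]
dev_rows = [
    {"id": 1, "rating": "AA", "score": 91},
    {"id": 3, "rating": "A", "score": 80},
]
diff = value_level_diff(prod_rows, dev_rows, key="id")
```

In this example every row on both sides would pass a typical not-null or uniqueness test, yet the diff still reports one added row, one removed row, and one changed `score` value.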

This new architecture ensured efficient and comprehensive coverage of data quality testing, enabling the team to catch potential data issues early in the development process. 

Adding labels to run Datafold on PRs without merging them 

The company devised a workflow with custom labels to run data diffs on PRs without actually merging them. When a PR is labeled as "Datafold", Datafold is triggered to run in the background specifically for that PR. This ensured complete coverage of Datafold's data diffs for all PRs designated for further testing. 
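
The label gate can be pictured with a short sketch. It assumes the event is a dictionary shaped like a GitHub pull request webhook payload (a `pull_request` object carrying `state` and a `labels` list of objects with a `name`); the function name `should_run_diff` is hypothetical, and how the company actually wired the trigger is not described in this case study.

```python
from typing import Any, Dict

DIFF_LABEL = "Datafold"  # the label name mentioned in the text


def should_run_diff(event: Dict[str, Any]) -> bool:
    """Return True when an open PR carries the label that requests a data diff."""
    pull_request = event.get("pull_request", {})
    if pull_request.get("state") != "open":
        return False
    labels = {label.get("name") for label in pull_request.get("labels", [])}
    return DIFF_LABEL in labels


# Hypothetical payloads for illustration.
labeled_pr = {"pull_request": {"state": "open", "labels": [{"name": "Datafold"}]}}
unlabeled_pr = {"pull_request": {"state": "open", "labels": [{"name": "bug"}]}}
```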

Once completed, the Datafold bot comments directly on the PR with a high-level summary of the results for the team to review easily. They first focus on two metrics: row count and column values. 

For example, a significant increase in the row count usually points to a changed filter or join condition and prompts further scrutiny to understand the implications of the modification. Conversely, if the code change was expected to increase the row count and no increase is observed, that is a warning sign to investigate potential unintended consequences.

The team also carefully reviews differences in column values for records that the PR neither created nor deleted, looking for unexpected alterations and potential red flags.
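
The review heuristics above can be expressed as a small check over the summary numbers. This is a sketch of the reasoning, not a Datafold feature: the function name `review_flags`, the message wording, and the 5% threshold for a "significant" row count change are all assumptions.

```python
from typing import Dict, List


def review_flags(
    rows_before: int,
    rows_after: int,
    expected_row_change: bool,
    changed_values: Dict[str, int],
    threshold: float = 0.05,  # assumed cutoff for a "significant" change
) -> List[str]:
    """Turn a diff summary into a list of items a reviewer should look at."""
    if rows_before == 0:
        relative = 0.0 if rows_after == 0 else float("inf")
    else:
        relative = (rows_after - rows_before) / rows_before
    significant = abs(relative) > threshold

    flags = []
    if significant:
        flags.append("row count changed significantly: check filters and joins")
    if expected_row_change and not significant:
        flags.append("expected row count change not observed")
    # changed_values counts differing values per column among rows that
    # exist both before and after the PR.
    for column, count in sorted(changed_values.items()):
        if count > 0:
            flags.append(f"{column}: {count} changed values")
    return flags


unexpected_growth = review_flags(1000, 1200, False, {"score": 3, "rating": 0})
missing_growth = review_flags(1000, 1000, True, {})
```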

When reviewing an intended change, the team continues to investigate the downstream impact of a PR within the user-friendly Datafold app. Datafold's visualization of changed values and the downstream impact allows the team to scan through and see which values changed, down to the row level. They can now assess the broader implications of the PR and make informed decisions regarding its implementation. 

The result: Greater confidence and transparency from development to deployment

By providing full visibility into the changes introduced by pull requests and their potential impact on the data pipeline, Datafold Cloud helped the company's developers feel more confident about what they deployed to production and delivered to clients. 

Democratized and improved PR reviews 

Datafold's integration into the company's CI/CD workflow has led to significant improvements in PR quality by democratizing the review process. Because Datafold Cloud provides full visibility into the impact of code changes, developers no longer need deep prior domain knowledge to review that impact.

Developer confidence to merge into production

With visibility into the entire data pipeline, from upstream dimension models to dashboard models, data engineers feel more confident about reviewing complex PRs. Reviewing changes in Datafold Cloud's app gave the engineering team a "99% certainty" that their changes wouldn't break anything downstream, along with confidence that they were delivering reliable data to all internal downstream consumers and their client-facing AI chatbots.

Transparency with stakeholders

Datafold Cloud's integration also played a pivotal role in increasing transparency with stakeholders, particularly business analysts, at the organization. Previously, when data engineers presented SQL code changes to business analysts, there was often a risk of misunderstanding as analysts did not fully understand the impact of these changes until they were deployed to production. Datafold Cloud's app made these code changes much less abstract: analysts could now see how the data would be affected before it was deployed, enabling them to make informed approvals. Business stakeholders could also verify first-hand how their various AI initiatives and apps are built on a foundation of high-quality data. 

Data-driven decision-making

Datafold's robust profiling capabilities led to unexpected efficiencies in the analytics workflow. Business analysts could now examine data attributes such as fill rates and value distributions on a large scale. This in-depth analysis allowed them to identify patterns, anomalies, and trends within the data more effectively. With a clearer understanding of the data, analysts could provide better business requirements to data engineers, reducing the need for rework and speeding up development cycles. 
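
For readers unfamiliar with the two profiling attributes named above, the sketch below shows one way to compute them. It is an illustration only: it assumes rows are in-memory dictionaries, treats `None` as a missing value, and the function name `profile_column` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(rows: List[Dict[str, Any]], column: str) -> Dict[str, Any]:
    """Compute the fill rate and value distribution of one column."""
    values = [row.get(column) for row in rows]
    filled = [value for value in values if value is not None]
    # Fill rate: share of rows where the column has a value.
    fill_rate = len(filled) / len(values) if values else 0.0
    # Value distribution: how often each non-missing value occurs.
    return {"fill_rate": fill_rate, "distribution": dict(Counter(filled))}


# Hypothetical sample rows.
sample_rows = [
    {"rating": "AA"},
    {"rating": "AA"},
    {"rating": None},
    {"rating": "B"},
]
rating_profile = profile_column(sample_rows, "rating")
```

A drop in fill rate or a shift in the distribution between two builds is the kind of signal an analyst could raise before a change ships.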

Safeguarding their AI strategy against the long tail of unknown unknowns

Given the company's position as a foundational player in the global financial system, data quality is essential to how they operate. With hundreds of models and thousands of columns powering their LLM-powered AI chatbots, it's impractical to anticipate every potential scenario that could go awry. Datafold's capability to capture the long tail of unknown potential data issues, especially in complex data environments with numerous models and columns, empowers teams to cover the entirety of their code changes and better mitigate risks while enabling them to ship better data and production-grade AI applications. 

If you want to learn more about how this leading credit rating agency used Datafold Cloud to increase data quality and deployment confidence, contact us.
