Data quality is your moat: this is your guide
Fortify your data, fortify your business: Why high-quality data is your ultimate defense.
Why you should care about data quality
Before we begin our quest to grapple with one of the biggest challenges facing data practitioners, let’s first make the case for why anyone should care about data quality.
If you’re reading this, chances are that it’s obvious why.
But it’s one thing to acknowledge that data quality is important, and another to understand how to implement a systematic framework for maintaining high data quality standards. If you’re looking to learn how to start implementing data quality at your organization, searching the web provides no easy answers. Everyone’s doing it differently, with different philosophies and tool stacks.
This has arguably led to more confusion than clarity: a piece-by-piece approach doesn’t work for something as interdependent as data quality. New tools can provide a one-off boost: built-in dbt tests, like wooden palisades thrown up around a castle, can be put in place quickly and offer some immediate protection. But one-off, disjointed strategies are generally reactive and limited in what they cover. Pre-packaged tests offer some level of assurance, but they may not capture all the nuances and intricacies of your data ecosystem, leaving blind spots in data quality assessment. Settling for convenience over thoroughness will eventually lead to gaps in data quality management.
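To make the palisade analogy concrete, here is a minimal sketch of what those pre-packaged checks look like in practice, using dbt’s built-in generic tests (the model and column names below are hypothetical):

```yaml
# models/schema.yml -- model and column names are hypothetical
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique     # built-in generic test: no duplicate keys
          - not_null   # built-in generic test: no missing keys
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Checks like these are cheap to stand up and catch obvious breakage, but notice what they cannot see: a subtle drift in order values, a join that silently drops rows, a source that starts arriving late. Those are the blind spots described above.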
Many data practitioners already describe their days with phrases like ‘most of my day is spent putting out fires!’, which speaks to the tedious and, frankly, often thankless grind of validating and fixing data at each step of an analytics workflow. What you need instead is a systematic approach to data quality: something as enduring as a well-designed moat.
Before we outline what such an approach could look like for your organization, let’s first run through some scenarios you might already be all too familiar with.
The data engineer and analytics engineer in their endless battle against broken pipelines
Imagine you're a data engineer who starts your day with a cup of coffee and a hopeful outlook, ready to tackle the day's tasks. However, as soon as you dive into your work, you're greeted by a flood of frustrated DMs and error alerts signaling broken data pipelines. Each alert feels like a fire to put out, but just as you put out one, another starts. You find yourself in a constant battle against inconsistent data formats, missing values, and unexpected changes in source data.
With every setback, the frustration grows, and the pressure mounts to deliver critical data to stakeholders on time. It's a relentless cycle of troubleshooting, testing, and tweaking, with no end in sight. You spend precious time firefighting and, ultimately, lose the trust of the business while growing fatigued with the job.
The data analyst/business user with misaligned metrics and breaking dashboards
You’re a newly hired data analyst who runs into obstacles from day one. You’re eager to dive into the latest metrics and uncover actionable insights for the business, proving your analytical chops. However, what you find is far from what you expected. Sales figures don't match up, customer demographics are riddled with missing information, and key performance indicators fluctuate wildly without rhyme or reason. Your attempts to make sense of the data only lead to more confusion, leaving you feeling like you're navigating a maze with no clear path forward.
But you soon realize that the consequences are much worse than confusion because other teams rely on the data team’s core analytics work. The downstream trickle of bad data affects every inch of the business. The machine learning models that power your recommendation systems train on incorrect data and produce irrelevant suggestions. Reverse ETL syncs, which power expensive ad campaigns, rely on inaccurate data and lead to wasted ad spend.
With each discrepancy, the trust in your analyses wanes, and the pressure to provide accurate insights mounts. It's a frustrating ordeal that tests your patience and challenges your credibility as a data-driven decision-maker.
Data managers struggling to balance stakeholder trust and team velocity
You’re a data manager with your sights set on the next promotion (which means more power to advocate for your team, allocate budget toward important data work, and further your own career). But lately, you find yourself increasingly caught between two critical priorities: maintaining stakeholder trust in the reliability and accuracy of the data ecosystem while ensuring your team's velocity in delivering valuable insights.
Despite your best efforts to enforce data governance policies and foster collaboration across teams, the pace of analytics development outpaces your team’s ability to maintain data quality standards.
With each new data incident, the dream of reinforcing the importance of the data team (and a promotion!) feels further out of reach, as the pressure to restore confidence in the data ecosystem intensifies. It's a race against time to regain control and ensure that your company can rely on you to make things right.
Data quality is your business' moat
Across these three examples, it's clear that data quality delivers three critical outcomes:
- Ensures business trust in the data: Don't let your team's work go to waste.
- Increases development velocity: By automating tedious testing, accelerating code reviews, and eliminating reactive firefighting.
- Enhances your team's quality of life: By removing the toil and firefighting, everyone can focus on creative and impactful work.
It's clear: data quality is vital to how data teams develop models and work with other teams, and to how businesses function. And as data practitioners, we're not building the moat just for ourselves, or just to improve the quality of our work lives.
The data produced by analytics and data engineering teams often powers machine learning models, AI initiatives, reverse ETL syncs to ad platform audiences, reporting in Salesforce, and a (truly endless) list of other use cases that extend beyond the data team. None of these efforts can succeed unless the data powering them is accurate, timely, and usable.
Data quality is your moat to create machine learning and data science models that have high accuracy.
Data quality is your moat to run ad campaigns that drive incremental results at an efficient cost per acquisition.
Data quality is your ultimate defense to succeed as a data team, and succeed as a business.
How to use this guide
If you identify with any of these scenarios, you already know how hard it is to find resources you can actually use. It’s not just about selecting the right framework or learning the right set of tools, but also about implementing processes to raise data quality standards across your organization. And you feel the pressure to get the data right.
We know that data quality is not just for data engineers, because it goes beyond building sophisticated CI pipelines. It’s about cultivating a culture of data quality that draws on different roles and teams to build better safeguards. Doing all of this is really challenging when everything is always on fire and things are perpetually behind schedule.
We created this guide to set out an integrated approach, one that treats high data quality as part of both data pipelines and business processes. It’s relevant for data engineers, analytics engineers, data analysts and data scientists, data managers and the C-suite, and even accountants, as you’ll see in our case studies later on.
Ultimately, what we want to give you is an enduring moat for your business, whether it’s an emerging e-commerce brand or the next generative AI company. Data quality is the key to building trust in your data team and accelerating development velocity.
Across our chapters, we’ll talk about three pillars of a mature data quality system that together serve as the moat for your business:
- Proactive data quality testing by shifting left
- Automated pre-production testing instead of inconsistent manual validation
- Fostering a data quality culture within and outside of the data team
We’ll look at proactive mental models and workflows; at automating wherever possible to reduce error and increase efficiency; and at what fostering a data quality culture looks like within data teams and across the organization.
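As a preview of the second pillar, here is a hedged sketch of what automated pre-production testing can look like: a CI job that builds and tests dbt models on every pull request, before changes reach production. The workflow below is a hypothetical GitHub Actions example; the warehouse adapter, the `ci` profile target, and the file path are assumptions about your setup:

```yaml
# .github/workflows/dbt_ci.yml -- hypothetical workflow; the adapter
# and the 'ci' profile target are assumptions about your project
name: dbt CI

on:
  pull_request:  # run before changes merge, i.e. pre-production

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-core dbt-postgres  # swap in your warehouse adapter
      - name: Build models and run tests in an isolated CI environment
        run: |
          dbt deps
          dbt build --target ci  # assumes a 'ci' target in profiles.yml
```

The design point is simple: every proposed change runs the full battery of tests before it merges, turning inconsistent manual validation into a consistent, automated gate.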
We hope this guide will serve three purposes:
- A practical reference for implementing new data quality concepts, tools, and heuristics at the workplace
- A technical roadmap on applying these concepts effectively within your codebase
- A strategic blueprint to progressively improve data quality practices at work and increase organizational awareness of how data quality can impact high-level business outcomes
A note before you continue on this journey: the data industry has evolved rapidly over the last decade, and with it has come a bloom of technologies that have revolutionized the way data teams store, transform, expose, and use data. As a result, this guide is a living document that will be added to and updated as data tooling, problems, and solutions mature.