Data quality is your moat: this is your guide
Fortify your data, fortify your business: Why high-quality data is your ultimate defense.
Data governance and data quality
Everyone in tech is talking about data governance and data quality as though these are normal, everyday terms. Data engineers are familiar with data quality, but have typically worked at a distance from governance — a term primarily used by security and compliance wonks until fairly recently.
There are plenty of other resources out there with their takes on data governance; we're not here to repeat any of that. Instead, this section aims to answer the real questions you have about data governance. Maybe you're just starting your journey toward governing your data, or maybe you're getting pulled into meetings about data governance. You might have questions like:
1. What does governance have to do with data?
2. Do you really need to understand data governance to achieve great data quality?
3. How does data governance relate to a data quality strategy?
4. What are some data quality best practices typically used for governance initiatives?
We'll answer these questions (and others) to equip you with what you need to know to be seen as an informed leader in your organization's data governance roadmap.
What is data governance? How does governance relate to data?
Data governance is a set of practices and processes to make sure that data is managed properly throughout its lifecycle. Just as a royal council sets rules and policies to govern a kingdom, data governance establishes guidelines and protocols to manage data effectively.
When data governance works well, it ensures that data is secure, private, high-quality, available and usable. Lack of proper governance is like having a weak council that can't seem to agree on anything or agrees on the wrong things. Or maybe there's no council at all and decisions are being made in ad hoc ways. The castle becomes vulnerable to attacks, enemies slip through unnoticed, and trust in the kingdom's ability to protect its treasures diminishes.
Why is data governance increasingly important?
If it feels like you're hearing more about data governance, whether offline or online, it's because it has entered the spotlight thanks to the attention and investment being given to AI and the emphasis on regulations like GDPR and CCPA. There's concern about how data is used, not so much in terms of reports and analytics, but in terms of ethics, data privacy, and administrivia. Governance has become part of the data zeitgeist, so it's important for everyone, from data engineers to business users, to understand exactly what it is and how it works.
Companies across the globe are collecting lots of data and using it in novel ways that may cause some concern. We're seeing governance crop up for a variety of reasons:
- New and evolving international regulation around data controls (think: GDPR, HIPAA)
- Businesses wishing to build predictive analytics for specific demographics and market conditions
- The increasing importance of data in our day-to-day lives (pandemics, airport security wait times)
- Fear that AI could lead to disastrous consequences
Data governance is a “big deal” right now. We’re handing over more data than ever and we don’t need a nefarious sorcerer plotting to exploit it against us.
Is data governance a technical or non-technical concept? Who's responsible for it?
Data governance is a bit of both. It involves technical aspects, like setting up systems for data management, and non-technical considerations, such as defining policies and ensuring compliance with regulations.
It includes both technical and non-technical business rules for how data is collected, stored, accessed, and used across different projects and teams. This spans reporting and analytics work, security policies around users and roles, data privacy rules, and how data pipelines are configured.
As for who's responsible, the short answer is that it's a shared effort. Think of it like a round table in the medieval court—everyone has a role to play. Data executives set the vision and priorities, data engineers implement the technical solutions, data stewards oversee data quality and integrity, and legal and compliance teams ensure adherence to regulations.
Data governance vs. data quality: What’s the difference?
At a high level, data governance and data quality are quite different. There are eight dimensions of data quality, but there are approximately eight kajillion dimensions of data governance. (Yes, that's a real number; no, we don't have a link.) Data governance is a comprehensive business strategy that concerns any person, process, or technology that produces, consumes, or alters data.
Data quality is a component of data governance
Data quality, then, is one facet of data governance. Governance encompasses the management of data as a whole, which includes ensuring that data is accurate, complete, consistent, reliable, and relevant for the business processes it supports.
Governance can't always be measured easily, while quality can. Governance doesn't necessarily have a specific set of activities that define it; data quality does.
Data quality management fits into a broader data governance strategy, giving businesses data that is trustworthy and fit for purpose while supporting accurate analytics, decision-making, and operational processes. Improving and maintaining data quality involves several key activities, such as:
- Data Preparation: Collecting, transforming, cleaning, and other steps to correct errors, inconsistencies, and duplicates in data.
- Data Testing: Verifying the correctness of data preparation processes through rigorous testing before deployment to production environments.
- Data Monitoring: Continuously tracking data quality metrics in production to identify any deviations or anomalies that may require attention (a simple example follows this list).
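To make the monitoring activity concrete, here's a minimal sketch of the kind of check a monitor might run: flagging a table whose daily row count deviates sharply from its recent baseline. The table name, counts, and threshold are illustrative, and a real data observability tool does far more than this.

```python
import statistics

def row_count_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count as anomalous if it deviates from the recent
    baseline by more than z_threshold standard deviations. A deliberately
    simple stand-in for a data observability tool."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Daily row counts for a hypothetical `orders` table over the past week
recent_counts = [10_120, 10_340, 9_980, 10_410, 10_250, 10_190, 10_300]
if row_count_anomaly(recent_counts, today=4_870):
    print("Alert: orders row count deviates sharply from its recent baseline")
```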
Data quality best practices that support data governance
As the old saying goes: you manage what you measure. Though data governance isn't quantifiably measurable on its own, a broad suite of metrics and reporting can paint a picture of governance. People who monitor governance might ask: "Are we proactively managing our data quality? Do we have data quality testing standards in place that everyone must follow?"
Because data quality is a component of data governance, it’s important that every measurable aspect of data quality is made available to governance reporting.
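As a rough illustration of what making measurable data quality available to governance reporting can look like, here's a small sketch that rolls test and monitor results up into a pass-rate summary a governance stakeholder could read. The metric names and numbers are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class QualityMetric:
    name: str
    passed: int
    failed: int

    @property
    def pass_rate(self) -> float:
        total = self.passed + self.failed
        return self.passed / total if total else 1.0

# Hypothetical results collected from tests and monitors across pipelines
metrics = [
    QualityMetric("not_null checks", passed=482, failed=3),
    QualityMetric("uniqueness checks", passed=120, failed=0),
    QualityMetric("freshness monitors", passed=57, failed=2),
]

# Roll the raw numbers up into a governance-facing summary
print(f"{'metric':<22}{'pass rate':>10}")
for m in metrics:
    print(f"{m.name:<22}{m.pass_rate:>10.1%}")
overall = sum(m.passed for m in metrics) / sum(m.passed + m.failed for m in metrics)
print(f"{'overall':<22}{overall:>10.1%}")
```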
Circling back to one of our questions: Do you really need to understand data governance to achieve great data quality? The answer is: yes! While data engineering primarily focuses on the technical aspects of building data infrastructure, a solid understanding of data governance principles is indispensable for ensuring data reliability, compliance, efficiency, collaboration, and security.
A proactive data quality testing approach can enable effective data governance. We'll get into what proactive data quality means a little later, but for now, let's look at how its three core principles directly contribute to effective data governance:
1. Shift-left testing
Effective data governance requires maintaining high data quality standards throughout the entire data lifecycle. Shift-left testing involves testing data quality early in the data pipeline, ideally as close to the data source as possible. By catching and addressing data quality issues early on, organizations can prevent the propagation of errors downstream. This ensures that data quality is addressed proactively, aligning with governance goals of ensuring accurate and reliable data.
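As a concrete (if simplified) illustration, the sketch below runs a few quality checks on a freshly extracted batch before it is loaded anywhere downstream. It assumes a pandas DataFrame from a hypothetical orders extract; the column names and rules are placeholders for whatever your own sources require.

```python
import pandas as pd

def validate_extracted_orders(df: pd.DataFrame) -> list[str]:
    """Run lightweight quality checks on a freshly extracted batch,
    before it is loaded anywhere downstream."""
    failures = []

    # Completeness: key business columns should never be null
    for col in ("order_id", "customer_id", "order_total"):
        if df[col].isnull().any():
            failures.append(f"null values found in {col}")

    # Uniqueness: the primary key must not contain duplicates
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values detected")

    # Validity: order totals should never be negative
    if (df["order_total"] < 0).any():
        failures.append("negative order_total values detected")

    return failures

# Fail fast at the source so bad data never propagates downstream
batch = pd.read_csv("orders_extract.csv")  # hypothetical extract file
problems = validate_extracted_orders(batch)
if problems:
    raise ValueError(f"Shift-left checks failed: {problems}")
```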
2. A data checklist before shipping anything
Before data is moved to production environments, it's essential to ensure that it meets predefined quality criteria. Checklists help enforce these criteria by specifying what needs to be validated before data is released.
Incorporating data quality checks into these checklists aligns with data governance objectives by establishing standardized procedures for assessing data quality before it's deployed. This ensures that only high-quality data is made available for consumption, supporting governance goals of maintaining data accuracy and integrity. And you don't need a long checklist; as we demonstrate, you only need to verify three things.
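Here's one way such a checklist might be encoded, with three illustrative gates: tests pass, row counts stay within tolerance, and the schema matches production. These are examples of the kind of checks a team might choose, not a prescription of the exact three items.

```python
# A minimal, illustrative pre-release checklist encoded as code.

def tests_passed(test_results: dict[str, bool]) -> bool:
    # Every automated data test in staging must be green.
    return all(test_results.values())

def row_counts_within_tolerance(staging_rows: int, prod_rows: int, tolerance: float = 0.05) -> bool:
    # Row counts should not swing unexpectedly between environments.
    if prod_rows == 0:
        return staging_rows == 0
    return abs(staging_rows - prod_rows) / prod_rows <= tolerance

def schema_unchanged(staging_columns: set[str], prod_columns: set[str]) -> bool:
    # No columns silently added or dropped without review.
    return staging_columns == prod_columns

# Hypothetical inputs standing in for real test results and table metadata
checklist = {
    "all data tests pass": tests_passed({"not_null_order_id": True, "unique_order_id": True}),
    "row counts within 5% of production": row_counts_within_tolerance(10_450, 10_212),
    "schema matches production": schema_unchanged({"order_id", "order_total"}, {"order_id", "order_total"}),
}

failed = [name for name, ok in checklist.items() if not ok]
if failed:
    raise SystemExit(f"Pre-release checklist failed: {failed}")
print("Checklist passed; data can be promoted.")
```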
3. Automate wherever possible
Automation reduces manual intervention and human error, leading to more consistent and reliable data quality assessments. By automating data quality tests and checks, organizations can ensure that data governance policies are consistently applied across datasets and pipelines.
There are many ways to upgrade and automate processes across the data lifecycle to support good data governance, including:
- Continuous integration (CI): Build an automated pipeline for all of your data-related code, using a code repository (like GitHub and its built-in tooling), test environments, and automated workflows
- Automated testing: Rigorously test your data and any data-related changes in pre-production environments using fully-automated regression tests and data diffs (see the sketch after this list)
- Automated data monitoring: Proactively monitor your source and production data for anomalies and quality issues using data observability and quality tools
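To ground the automated-testing item above, here's a deliberately simplified sketch of a data diff a CI job might run: comparing a staging version of a table against production by primary key and summarizing added, removed, and changed rows. Dedicated data-diff tooling does this far more thoroughly; the tables here are toy examples.

```python
import pandas as pd

def data_diff(prod: pd.DataFrame, staging: pd.DataFrame, key: str) -> dict:
    """Summarize row-level differences between two versions of a table,
    joined on a primary key. A much simplified stand-in for dedicated
    data-diff tooling."""
    merged = prod.merge(staging, on=key, how="outer", suffixes=("_prod", "_staging"), indicator=True)
    added = merged[merged["_merge"] == "right_only"]      # rows only in staging
    removed = merged[merged["_merge"] == "left_only"]     # rows only in production
    common = merged[merged["_merge"] == "both"]

    # A row counts as changed if any non-key column differs between versions
    value_cols = [c for c in prod.columns if c != key]
    changed = common[
        pd.concat(
            [common[f"{c}_prod"] != common[f"{c}_staging"] for c in value_cols],
            axis=1,
        ).any(axis=1)
    ]
    return {"added": len(added), "removed": len(removed), "changed": len(changed)}

# Example: a CI job could fail the pull request if a change touches more rows than expected
prod = pd.DataFrame({"order_id": [1, 2, 3], "total": [10.0, 20.0, 30.0]})
staging = pd.DataFrame({"order_id": [2, 3, 4], "total": [20.0, 35.0, 40.0]})
print(data_diff(prod, staging, key="order_id"))  # {'added': 1, 'removed': 1, 'changed': 1}
```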
When your data management practices are automated and standardized, you develop a set of best practices that can be enforced through processes like CI/CD checks. You build a data culture grounded in transparency and accountability, one that gives stakeholders what they need, meets your governance objectives, and helps you identify where you can do better.
Data engineering, quality, and governance go hand-in-hand
The distance between data engineering and data governance is closing. But fear not! This is a good thing.
As companies push for better governance, they’re also pushing for better data quality, more transparency in data usage, and higher-quality decision making. It’s going to require that data engineers level up and treat data the way software engineers have treated their code.
Modern data teams incorporate software engineering best practices and use data orchestration and transformation frameworks such as dbt and Airflow, laying the groundwork for governance best practices. Incorporating these automated processes streamlines the management of data quality and keeps governance standards consistently applied and enforced throughout the data lifecycle.
The more you proactively test, automate, and standardize, the more control you have over your data. When you implement, for example, Datafold in your CI process, you're ensuring every single pull request undergoes the same level of testing and scrutiny. Do that consistently, and you set a standard for your data quality that lets you find issues proactively and faster over time.
Now is the time for data engineers to campaign for automated data monitoring, continuous integration pipelines, automated testing, sophisticated data modeling, and a whole host of other advanced data capabilities. And businesses are more likely than ever to invest because, well, governance.