February 7, 2023
Modern Data Stack, Data Testing

Understanding Your Data: The Technology That Powers It

Whatever layer of the data stack you focus on, there is inherently one technology that is valuable to all layers: testing. Discover how testing throughout the stack leads not only to stronger proactivity, but also to more productivity.

Emilie Schario

This is the third post in a series examining data teams and data leadership from the perspective of People, Process, and Technology. Catch up on the previous posts on Data People and Data Processes.

Technical leadership doesn’t come from your manager

As we’ve laid out, leadership has three parts to it: people, process, and technology. If we look at software engineering teams, we can see that these roles tend to fall to different people. In an extreme simplification, engineering managers are focused on developing people, product managers are focused on the development process, and technical leads, often manifested in the form of staff-level or architect-level roles, are the technical leaders.

Yet, when we talk about data teams, very rarely do we recognize that this division of labor would make our teams more effective. Instead, we conflate technical experience or technical leadership with the strategic skills necessary to build and grow a data organization. 

As we’ve laid out in the lead-up to this section on technical leadership, there is a lot more that goes into leading and growing an effective and strategic data organization. As teams grow in headcount and influence, we should lean into a division of responsibilities. By saying that we expect data team managers to lead people, process, and technology, we ask them to do more with fewer resources than we would ask of engineering counterparts. It should come as no surprise that when we stretch them too thin, they do none of those things well.

If technical leadership doesn’t need to come from a data team’s manager, then what does technical leadership for a data team look like? I think the answer lies in both borrowing from our software engineering counterparts and recognizing the limits of that model. We know that there are data teams with staff-plus individual contributor roles, yet these roles feel few and far between. Without diving too much into specific tools, let's discuss some of these roles.

The best way to develop technical leadership, thus, is to help create opportunities for technical leadership by senior individual contributors. In other words, if you are a manager, step back from the technical leadership decisions and create opportunities for a team member to lead those initiatives. If you are an individual contributor looking to grow in the technical route, consider broaching the conversations with your manager and stepping into more technical leadership, including formulating opinions on architectural decisions and injecting yourself into the decision process. If you are an individual contributor who doesn’t know how to broach this conversation with your manager, send your manager a link to this guide. 

In data work (as in many other fields), it can feel like we are always working in the urgent: everything was needed yesterday. It can be tempting, thus, to defer to the people with the most experience, but doing so can limit the opportunities for team members to develop their leadership experience, technical and otherwise. We are better when we create opportunities for everyone to grow. More distributed leadership will ultimately allow teams to go faster in the long term, and thus it’s worthwhile to go slow to go fast.

Right-sizing complexity

When it comes to right-sizing complexity, there is never a single right answer to a problem. There are always tradeoffs between quick-and-dirty and slow-but-right, and between those two lies a spectrum. Where to land on that continuum depends on the specifics of each organization: its needs and the actual problems to be solved. For example, a 5-millisecond delay at Amazon and a 5-millisecond delay at SmallCo with 5 beta users have completely different consequences. Building at SmallCo as if you’re handling Amazon’s traffic or impacting as many users is wasteful.

Instead, we should focus on right-sizing complexity to the needs of our organization. What is the fastest possible way to deliver impact? How can we improve today’s experience, exactness, or other performance criteria as quickly as possible? Allowing for future use cases we can’t anticipate is hard, nearly impossible, but also necessary.

Test throughout the stack

Whatever layer of the data stack (storage, ingestion, transformation, or BI) you focus on, there is inherently one technology that is valuable to all layers: testing. Testing data and code fast can be a multiplier for data teams focused on driving results. You want to trust that you can go fast and not break things.

Too often, early-stage teams only test at the end of whatever they’re working on. You think you know what the upstream data is like, so there seems to be no need to test those assumptions. That’s the handoff from the data consumers we referenced in the People section. Or, we talk to a person, understand the state today, and ensure that the data looks like that today, failing to consider that the people creating the data, and the process or system it tracks, will likely change. Testing helps catch and alert on changes before they make their way to the production environment. Testing throughout the stack allows us to isolate where a change was made, rather than knowing only that something changed somewhere in the DAG.
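To make "testing throughout the stack" concrete, here is a minimal sketch in plain Python of checks that run at a single stage and tag each failure with the stage name. The stage name, column names, and sample rows are hypothetical, and a real team would more likely use the testing features of its transformation or orchestration tool; the point is only that assumptions about upstream data get asserted where the data enters, not at the end.

```python
from typing import Callable

Row = dict[str, object]
Check = Callable[[list[Row]], list[str]]


def not_null(column: str) -> Check:
    """Build a check that reports rows where `column` is missing or None."""
    def check(rows: list[Row]) -> list[str]:
        bad = [i for i, row in enumerate(rows) if row.get(column) is None]
        return [f"{column} is null in row {i}" for i in bad]
    return check


def unique(column: str) -> Check:
    """Build a check that reports values of `column` that appear more than once."""
    def check(rows: list[Row]) -> list[str]:
        seen, dupes = set(), set()
        for row in rows:
            value = row.get(column)
            if value in seen:
                dupes.add(value)
            seen.add(value)
        return [f"{column} has duplicate value {v!r}" for v in sorted(dupes, key=repr)]
    return check


def run_checks(stage: str, rows: list[Row], checks: list[Check]) -> list[str]:
    """Run every check against one stage's output and label failures with the stage."""
    return [f"[{stage}] {msg}" for check in checks for msg in check(rows)]


# Hypothetical ingested data: one null customer and one repeated order_id.
raw_orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": None},
    {"order_id": 2, "customer_id": 12},
]

failures = run_checks("ingestion", raw_orders, [not_null("customer_id"), unique("order_id")])
for failure in failures:
    print(failure)
```

Because every message carries the stage name, a failure points at the layer where the assumption broke rather than at the dashboard where someone eventually noticed it.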

Stronger testing throughout the stack leads not only to stronger proactivity (knowing that things will break instead of other people telling you they’re broken) but also to more productivity, in that we can more quickly diagnose, isolate, and remediate problems. 

When data quality issues are caught in production, they are usually caught by an executive, a stakeholder, a customer, or an observability tool. By this point, the damage is already done. Once data quality issues reach production, there are substantial business costs or risks, such as erroneous executive or customer dashboards, inaccurate customer information, and attrition of stakeholder trust.

For most companies, if a dashboard is broken, diagnosing the problem requires going from the dashboard to the SQL query behind it, then to the model it depends on, and then investigating each intermediate step in the transformation process one by one in order to isolate the problem. This process is as mundane and meticulous as it sounds, often similar to “finding a needle in a haystack”, and it is carried out through manual triage or ad-hoc test scripts. As such, it’s ripe for a more automated, comprehensive, and proactive data testing workflow. Enabled in staging environments before shipping to production, automated data testing offers a more productive workflow conducive to quick localization, diagnosis, and remediation.
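The manual walk from dashboard to query to model can be sketched as a traversal of a lineage graph. The example below is an illustration under stated assumptions, not a description of how any particular tool works: the node names, the `UPSTREAM` mapping, and the `BROKEN` set standing in for real test results are all hypothetical. It reports the failing nodes that have no failing parent, which are the most likely places where the problem was introduced.

```python
from typing import Callable

# Hypothetical lineage: each node maps to the nodes it reads from.
UPSTREAM = {
    "revenue_dashboard": ["fct_revenue"],
    "fct_revenue": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw_orders"],
    "stg_payments": ["raw_payments"],
    "raw_orders": [],
    "raw_payments": [],
}

# Stand-in for real test results: nodes whose data tests currently fail.
BROKEN = {"revenue_dashboard", "fct_revenue", "stg_payments"}


def find_root_causes(
    start: str,
    upstream: dict[str, list[str]],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Return failing nodes reachable from `start` that have no failing parent."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(upstream.get(node, []))

    failing = {node for node in seen if not is_healthy(node)}
    return sorted(
        node
        for node in failing
        if not any(parent in failing for parent in upstream.get(node, []))
    )


root_causes = find_root_causes("revenue_dashboard", UPSTREAM, lambda n: n not in BROKEN)
print(root_causes)
```

In this made-up graph the dashboard and the fact model both fail, but only `stg_payments` fails while its own input is healthy, so that is where triage should start.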

Data should be diffable

It is not enough to know that data is different; we need to know why. That is the fundamental premise of the idea that data should be diffable: understanding the impact of code changes on data, both upstream and downstream, before shipping to production.

One of the most common tests used to compare data before and after any process is a simple count of rows. This works when it works, but when it doesn’t, it can go very wrong. There are many ways a transformation can be wrong while the row count still comes out right. For example, the number of rows may be as expected even though the data itself has changed.

Let’s take an example of a data transformation, before and after. Before your transformation, you have 100 rows. Afterwards, you may still have 100 rows. But if your transformation included any logic at all, however simple, how can you be sure that you’ve covered every edge case? How do you know that those 100 primary keys are the same across both datasets? How do you know that you don’t have a fan-out join? Or that you didn’t forget to remove the LIMIT 100 clause in the query, and that this isn’t why your count of rows is the same?
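A small sketch shows how a row-count test can pass while the questions above all have bad answers. The data is invented for illustration: the "after" table has the same 100 rows by count, but two primary keys are missing and one key is repeated, as a fan-out join might produce. The function name `compare_keys` and the `id` column are assumptions for this example only.

```python
from collections import Counter


def compare_keys(before: list[dict], after: list[dict], key: str) -> dict:
    """Compare two tables by primary key instead of by row count alone."""
    before_keys = Counter(row[key] for row in before)
    after_keys = Counter(row[key] for row in after)
    return {
        "row_count_matches": len(before) == len(after),
        "missing_keys": sorted(set(before_keys) - set(after_keys)),
        "unexpected_keys": sorted(set(after_keys) - set(before_keys)),
        "duplicated_keys": sorted(k for k, n in after_keys.items() if n > 1),
    }


# Before: ids 1..100, one row each.
before = [{"id": i} for i in range(1, 101)]

# After: ids 1..98 plus two extra copies of id 7, so still exactly 100 rows.
after = [{"id": i} for i in range(1, 99)] + [{"id": 7}, {"id": 7}]

report = compare_keys(before, after, "id")
print(report)
```

The count check reports a match, yet the key-level comparison shows that ids 99 and 100 disappeared and id 7 was duplicated.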

It’s not enough to say that we have the same number of rows across two data sets and that should instill confidence in our data and the data transformation. We must be able to understand the nuances of our transformations, such as being able to explain the differences before and after any changes, as well as the downstream impact of those changes. 
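Explaining the differences means going one level deeper than keys, to the values themselves. The following is a minimal sketch of a value-level diff keyed on a primary key; it is not Datafold's implementation, and the `diff_rows` name, the column names, and the sample rows are hypothetical. It assumes the key is unique in both tables and only compares rows present on both sides.

```python
def diff_rows(before: list[dict], after: list[dict], key: str) -> list[tuple]:
    """List (key, column, old_value, new_value) for rows present in both tables."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    changes = []
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        for column in sorted(old.keys() | new.keys()):
            if old.get(column) != new.get(column):
                changes.append((k, column, old.get(column), new.get(column)))
    return changes


# Same keys and same row count, but one row changed in two columns.
before = [
    {"id": 1, "amount": 100, "status": "paid"},
    {"id": 2, "amount": 250, "status": "paid"},
]
after = [
    {"id": 1, "amount": 100, "status": "paid"},
    {"id": 2, "amount": 25, "status": "refunded"},
]

changes = diff_rows(before, after, "id")
for change in changes:
    print(change)
```

A list like this turns "the data is different" into "row 2 changed in `amount` and `status`", which is the kind of explanation a reviewer can check against the intent of the code change.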

Data must be diffable. We should anticipate the way in which things can go wrong before they actually do. With tools like Datafold and its data diffing capabilities, we are more empowered than ever before to be proactive rather than reactive in understanding the scope of our data changes. We don’t need to reinvent the wheel. We just want to ensure that our workflows and technological tools are focused on enabling us to go as quickly as possible, with confidence in the quality of our work. This principle helps us reshape what the standards for before and after analyses should look like. 

With the Data Diff approach, you and your team can ensure that every data code change has only intended consequences. Or, if issues are surfaced, you can proactively address data quality issues before they become production incidents.
