Folding Data #32
There is such a thing as too much data testing
Mainstream frameworks such as dbt and Dagster ship with test tooling, and great_expectations can be plugged into pretty much any stack. Writing tests has finally become easy enough, and adding as many of them as possible may give data teams a sense of control and confidence in a world of constantly breaking data.
While maximizing unit test coverage makes a lot of sense in programming, data tests (assertions that validate particular business assumptions about the data, such as user_email IS NOT NULL) are trickier: unlike software unit tests, which operate on fixed inputs and outputs, data tests run against constantly changing data. That means having to deal with a lot of false positives. The problem with false positives is that investigating frequently failing tests feels much more like reactive firefighting than productive work. Then comes the critical point: once we let a data pipeline run even once with failing tests without fixing them, getting back to "everything green" becomes orders of magnitude harder. That's why the "broken windows" analogy that Mikkel suggests so perfectly describes the treadmill that many data teams put themselves on by setting high data test coverage as a goal.
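To make the contrast concrete, here is a minimal Python sketch. The function names and sample rows are hypothetical and not taken from any framework mentioned above; the only point is that the same data assertion can pass on one day's batch and fail on the next without anyone touching the code.

```python
def normalize_email(raw):
    """Ordinary function under unit test: fixed input, fixed output."""
    return raw.strip().lower()


def null_email_rows(rows):
    """Data test: return the rows that violate 'user_email is not null'."""
    return [row for row in rows if row.get("user_email") is None]


# Unit test: deterministic, passes until someone changes normalize_email.
assert normalize_email("  A@Example.com ") == "a@example.com"

# Data test: the code stays the same, but the data under it keeps moving.
monday_batch = [{"user_id": 1, "user_email": "a@example.com"}]
tuesday_batch = monday_batch + [{"user_id": 2, "user_email": None}]

print(len(null_email_rows(monday_batch)))   # 0 violations, test passes
print(len(null_email_rows(tuesday_batch)))  # 1 violation, test fails
```

Whether Tuesday's failure is a real incident or an acceptable edge case (say, a guest checkout with no email) is exactly the investigation work that piles up as coverage grows.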
So how do we do better?
- Prioritizing tests for high-impact datasets. How do we know which datasets those are? One way is to leverage data lineage, which can tell us who or what uses each table and column, how, and how often.
- Using a data diff to automate regression testing. Data diff is a tool that identifies differences between datasets and can be used to validate SQL code changes by showing how a change in the source code affects the data being produced. While explicit tests for the most important tables and columns are still necessary, a data diff can help minimize the number of tests that need to be written and maintained to catch all possible edge cases.
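As a rough illustration of the data diff idea, the sketch below compares two versions of the same table by primary key and reports added rows, removed rows, and changed columns. It is a toy in plain Python, not the implementation of any particular tool; `data_diff`, the `order_id` key, and the sample rows are all made up for the example.

```python
def data_diff(before, after, key):
    """Compare two versions of a table (lists of dicts) by primary key."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}

    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())

    # For keys present in both versions, list the columns whose values differ.
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        columns = sorted(
            col
            for col in old[k].keys() | new[k].keys()
            if old[k].get(col) != new[k].get(col)
        )
        if columns:
            changed[k] = columns

    return {"added": added, "removed": removed, "changed": changed}


# Table as built by the current SQL (prod) vs. by the modified SQL (dev).
prod = [
    {"order_id": 1, "revenue": 100},
    {"order_id": 2, "revenue": 250},
    {"order_id": 3, "revenue": 75},
]
dev = [
    {"order_id": 1, "revenue": 100},
    {"order_id": 2, "revenue": 225},
    {"order_id": 4, "revenue": 60},
]

print(data_diff(prod, dev, key="order_id"))
# {'added': [4], 'removed': [3], 'changed': {2: ['revenue']}}
```

No assertion had to be written up front for any of these three differences; the diff surfaces them, and the reviewer decides whether each one is the intended effect of the code change.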
Don't let your data become like the NYC subway in the 70s
Tool of the week: Enso, again
We in the data world seem to be so obsessed with tools and, unlike software engineers, show little interest in exploring new languages, having accepted SQL and Python as the default standards. Languages are fundamental because, unlike tools that get us from point A to point B by solving particular problems, programming languages have the power to shape our minds. They can move us across levels of abstraction and direct how we think about (model) the world, and dictate the rules of the creative process.
When it comes to new languages in the data domain, there are two directions I am personally excited about. The first is the transition from relational to semantic modeling as the backbone of analytics. The second is the emergence of new ways to define data pipelines (as in series of data transformations).
While Dagster and Hex modernized the approach to orchestrating and prototyping data applications, respectively, within the existing SQL/Python stack, Enso turned everything upside down and proposed both a new language and a new, highly visual way to express data applications. Inventing a language enables you to do things previously considered impossible or impractical. For example, Enso assumes that you can do both ad-hoc data wrangling and productionize pipelines using the same toolkit. In the Python ecosystem, by contrast, you would not even want to think about merging Airflow with Jupyter. Everything about Enso is weird – different. And that's what makes it so intriguing.
As PG once said, "if you want to expand your concept of what programming can be, one way to do it is by learning weird languages. [...] What can you say in this language that would be impossibly inconvenient to say in others? In the process of learning how to say things you couldn't previously say, you'll probably be learning how to think things you couldn't previously think."
It's exciting that Enso just raised $16.5M, as such bold projects have the potential to evolve the data space exponentially by giving us new ways of thinking.
OK, I am intrigued
Before You Go
Good luck with those finals, kids