Folding Data #19
An Interesting Read: Reflecting on Four Years at Databricks
In a world where the average technical worker stays at a company for 1.5 years, this blog post reflecting on four years at Databricks offers a realistic peek at what it's like to experience multiple stages of hyper-growth at one of the foundational companies in the data space.
Reflecting on Four Years at Databricks
Interesting Data: How Long Does It Take Ordinary People To "Get Good" At Chess?
If, like much of the world, you binge-watched The Queen's Gambit and decided to buy a chessboard, you might be wondering how long it will take to "get good" at the game. Well, there's data on that! Drawing on 5.5 years of data covering 2.3 million players and 450 million games, the research shows that most beginners improve their rating by "100 lichess rating points" within 3-6 months. Returns diminish over time, though: "experienced" players in the 1400-1800 rating range take 3-4 years to gain the same amount.
Playing more games doesn’t make you better, faster
Tool of the Week: Temporal
Temporal is a commercial fork of Uber's Cadence workflow orchestration platform, developed by Cadence's original creators. The most intriguing aspect of Temporal is the variety of applications and use cases that can be built on top of it: from running CI/CD pipelines to processing Uber Eats orders to data integration workflows (Airbyte). Will it also replace Airflow?
Check out Temporal on GitHub ✨
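To give a flavor of the programming model, here is a minimal sketch of a Temporal workflow using the Python SDK (temporalio). The activity and workflow names are illustrative, and actually executing this would additionally require a running Temporal server plus a Client and Worker.

```python
from datetime import timedelta

from temporalio import activity, workflow


# A hypothetical activity: in a real app this would call an external
# service; activities are where side effects live in Temporal.
@activity.defn
async def send_welcome_email(user_id: str) -> str:
    return f"sent welcome email to {user_id}"


@workflow.defn
class OnboardingWorkflow:
    @workflow.run
    async def run(self, user_id: str) -> str:
        # Temporal persists workflow state and retries failed activities,
        # so this logic survives worker crashes and restarts.
        return await workflow.execute_activity(
            send_welcome_email,
            user_id,
            start_to_close_timeout=timedelta(seconds=30),
        )
```

The durable-execution model is what makes the use cases above possible: the same workflow abstraction works whether a step takes milliseconds (a CI job) or days (a data integration backfill).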
What to Look for with Data Diffs
Although standard in software engineering, regression testing – checking whether changes to source code have introduced errors (ideally as part of CI/CD) – is still gradually making its way into the data world. Data diff is a tool that compares relational datasets and shows how a change in the source code (e.g. SQL) affects the resulting data. Our fellow community member Sarah Krasnik explains how to interpret a data diff report to spot regressions in pipelines and dashboards.
What to Look for with Data Diffs
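For a taste of what diffing two datasets looks like in practice, here is a minimal sketch using data-diff's Python API. The connection strings, table name, and key column below are placeholders, not real endpoints.

```python
from data_diff import connect_to_table, diff_tables

# Placeholder connection strings and table names: point these at the
# "before" and "after" versions of the dataset you want to compare,
# keyed on the "id" column.
prod = connect_to_table("postgresql://user@host/analytics", "orders", "id")
dev = connect_to_table("postgresql://user@host/analytics_dev", "orders", "id")

# diff_tables yields ("-", row) for rows present only in the first table
# and ("+", row) for rows present only in the second, so unchanged rows
# never appear in the output.
for sign, row in diff_tables(prod, dev):
    print(sign, row)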
Before You Go