Folding Data #14
An Interesting Read: Airbnb’s Data Protection Platform
If the words "CCPA" and "GDPR" have come up in your quarterly planning sessions, this one is for you. As if data teams didn't have enough to worry about, they are now also responsible for securing, de-personalizing, and otherwise protecting customer data. Rightly so, to be honest, because who else in the organization can make sense of the mess that we call a data warehouse?
And of course, Airbnb has already built a sophisticated specialized platform full of cool-sounding names such as "Madoka", "Cipher", and "Angmar".
In the meantime, for the rest of us, a good start with data protection would be setting up column-level lineage in the warehouse.
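To make the lineage point concrete, here is a minimal sketch of why column-level lineage is a useful first step: once you know which columns feed which, you can list every downstream column that inherits data from a sensitive source. The lineage mapping, the column names, and the `downstream_columns` helper are all hypothetical and written for illustration only; a real warehouse would derive the mapping from its own query history or a lineage tool.

```python
from collections import deque

# Hypothetical column-level lineage: each key is an upstream column,
# each value lists the columns computed directly from it.
LINEAGE = {
    "raw.users.email": ["staging.users.email_hash", "analytics.contacts.email"],
    "staging.users.email_hash": ["analytics.user_stats.user_key"],
    "raw.users.signup_date": ["analytics.user_stats.signup_month"],
}


def downstream_columns(lineage, source):
    """Return every column derived, directly or transitively, from `source`."""
    seen = set()
    queue = deque([source])
    while queue:
        column = queue.popleft()
        for child in lineage.get(column, []):
            if child not in seen:  # also guards against cycles in the graph
                seen.add(child)
                queue.append(child)
    return seen


# Columns that would need the same protection as the raw email column.
for column in sorted(downstream_columns(LINEAGE, "raw.users.email")):
    print(column)
```

The traversal is a plain breadth-first search, so tagging one source column as personal data is enough to surface every table that needs masking or deletion handling.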
Dive into Airbnb’s Data Protection Platform (DPP)
Tool of the Week: ClickHouse
With Snowflake & Databricks together worth over $120B, you don't often hear about disruption in data warehousing tech. But ClickHouse definitely made news last week when the core team behind it raised a $50 million Series A round from Benchmark & Index. Originally developed at Yandex, ClickHouse is an OLAP RDBMS written in C++ with coupled storage & compute. It aims to take on Redshift as well as open-source products such as Pinot & Druid (here's an excellent article that compares all three). With such institutional firepower, ClickHouse solidifies its position in the Modern [Open Source] Data Stack.
Check out ClickHouse on GitHub
What I'm Proud of
Whether or not you believe in a looming consolidation of all vendors comprising the Modern Data Stack into a few mega-platforms, one thing is clear: everyone wants better interoperability between the tools. For example, if you bought a data observability solution, why can't you leverage its metadata in other tools? Benn Stancil's Modern Data Experience lays out a vision for a data stack where components don't just send around tables and SQL code but provide a cohesive user experience. In that spirit, we at Datafold opened up a GraphQL Metadata API for exporting valuable information, such as column-level lineage (which Datafold computes by analyzing SQL queries from your warehouse), into other tools, for example open-source data catalogs such as Amundsen & DataHub.
Tell me how to take my lineage into other platforms
Data People Pave the Path to a Transparent Job Market
The tech job market is as hot as ever, but we are still not that far removed from a world where top tech execs were colluding to limit talent mobility in the market. One trend that is rapidly picking up is the increased transparency of comp levels, specifically in the data domain. For example, the Open Data Science Slack group, which counts over 60K (!) members, requires employers to post meaningful salary ranges for each open req on the job board. Other community-driven job boards, including AI Jobs, do the same and also provide an open crowdsourced dataset of comp levels.
Before You Go