Folding Data #20
An Interesting Read: Benchmark Wars
Snowflake adds support for Python and DataFrames. Databricks adds serverless SQL. The mega data platforms are now on a head-on collision course, which is also reflected in recent benchmark skirmishes: Databricks claimed a "new world record" in the TPC-DS benchmark, which consists of 99 queries on 100 TB of data and aims to simulate analytical workloads with a wide range of queries. Notably, Databricks claimed to complete the benchmark in 2.7x less time than Snowflake and with 7.4x better price efficiency. Ten days later, Snowflake published their own benchmark results, which showed only a 20% performance gap and a 10% price gap, still in favor of Databricks. To clarify, Snowflake did not question Databricks's report of its own performance, but rather Databricks's results for Snowflake's power test.
My take: I expect the price efficiency and performance of the major data warehousing platforms to converge to some market level over time, with the true differentiating factors being end-user experience and data ops functionality. Whatever platform you choose, it will be expensive at scale, and the data team's ability to monitor and control usage and to keep their environment in a healthy state will determine cost efficiency to a large extent.
Industry Benchmarks and Competing with Integrity
Tool of the Week: Data Profiler
It always strikes me how little even professional end users know about the data they work with. Before you can do actual analysis or write transformation logic, you most often need a pre-analysis to understand data distributions, data quality, and edge cases.
DataProfiler is a Python library developed at Capital One, designed to make data analysis, monitoring, and sensitive data detection easier by providing key insights about any dataset.
Besides Pandas DataFrames, the library supports popular open-source data formats such as Parquet and Avro, and identifies the schema, statistics, entities (PII/NPI), and more.
Interestingly, DataProfiler comes with a pre-trained deep learning model used to efficiently identify sensitive data (PII/NPI). If desired, it's easy to add new entities to the existing pre-trained model or plug in an entirely new pipeline for entity recognition.
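To give a flavor of how little code a first profile takes, here's a minimal sketch based on the library's documented API (the file name my_data.csv is a placeholder):

```python
import json

import dataprofiler as dp

# Data() auto-detects the input format (CSV, JSON, Parquet, Avro, ...)
data = dp.Data("my_data.csv")

# Profiler() computes the schema and per-column statistics, and runs
# the pre-trained labeler to flag sensitive entities (PII/NPI)
profile = dp.Profiler(data)

# "compact" trims the more verbose statistics from the report
report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report, indent=4))
```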
Check out DataProfiler on GitHub ✨
Data Quality Meetup #6 Digest
Last week was a flurry of activity thanks to the announcement of our successful Series A funding. That meant we waited an extra week before sharing the blog digest of our most recent Data Quality Meetup. If you missed the event, or you attended and want a recap, you can check out the blog and watch the corresponding videos.
Read the DQM #6 Digest
Before You Go
As seen on Twitter.