Folding Data #12
When I started this newsletter months ago, I wasn't sure I'd be able to find worthy and interesting tools and stories to share with you every week. Luckily, the data space is evolving ever faster, and the deeper you dive into what seems like a small domain, the more you find. At some point, we should talk about the data tech singularity, but I'll stop here for now. 🙂
An Interesting Read: Timeseries Anomaly Detection at Scale with Thirdeye
Today, perhaps no one would argue that having people monitor metrics for anomalies is a good (or even attainable) idea. While multiple proprietary solutions have evolved, for a while there wasn't much in the open-source space besides the good old Prophet. Thirdeye is interesting in that it provides an end-to-end flow: it detects anomalies in time series, breaks them down by dimension for root cause analysis, and even offers some basic collaboration features for issue triage. Although the velocity of the project hasn't been high lately, the code is well structured and easy to study for anyone looking to adopt it or build something internally. But before you dive into the code, check out an awesome article about ABTasty's journey integrating Thirdeye into their BigQuery-based data pipeline.
Go on ABTasty's Data Quality Journey
Tool of the Week: Made with ML
What is better than a tool that helps you learn a new field? Made with ML is among the top ML repos on GitHub, offering introductory courses for anyone looking to get into machine learning or uplevel their skills. The field is evolving so fast that it never hurts to catch up on the latest trends in the space.
Give Made with ML a GitHub star ✨
Data Quality Management According to Lyft, Shopify, and Thumbtack
Managing a two-sided market is no joke, especially when trying to grow one in a sustainable (i.e., not burning billions in cash a year) way. It is no coincidence that marketplace tech companies are among the most invested in data and its quality: a decision based on incorrect data can easily throw the market off balance and cause a poor experience for a large number of users. For example, if Lyft's driver ETA prediction model drifts toward the higher end, pricing algorithms can start setting higher prices for passengers, which can result in fewer rides requested and lower earnings for drivers. We've got a chance to learn how these three data-driven companies approach data quality management. What's interesting: each has a unique approach, but it all comes down to reliable change management and proactive testing of data.
Show me what Lyft, Shopify, and Thumbtack are doing
Before You Go
Yup, it's time for a meme about everyone's favorite thing - documentation!