Folding Data #3
Events
On May 20th at 9AM PST, I am hosting the 4th Data Quality Meetup: a gathering for hands-on data people to learn and discuss how to do data engineering better, from tools to culture. I started the meetup because data quality war stories are common, but finding a community to share them with and learn from has been hard.
If you decide to attend, here’s what you can expect:
⚡ 7-minute talks
🎤 Discussion + Q&A with data leaders from dbt, HealthJoy, Lyft, Shopify, and Convoy, among others!
Check out the digest of our previous meetup
RSVP 🙌
Opinion: how Shopify, Spotify & Lyft approach data quality
Enabling high-quality, rapid data product development on a 200+ person team is no joke. We spoke with data leaders from three incredibly data-driven companies to find out how their teams achieve that tricky balance. TL;DR – although the three companies have very different products and data platform technologies, all seem to follow similar principles for scaling data development:
- Decentralize the data team (embedded within business/product teams) as opposed to a centralized service-oriented model.
- Lower the bar for building high-quality data products: introduce tools that make it very easy to both build and follow best practices (e.g. dbt at Shopify).
- Automate testing of changes in CI using assertions, data diff, etc. – manual testing doesn't scale and simply isn't getting done.
Read/listen to the full conversation here (recorded on Clubhouse back when Clubhouse was a thing).
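To make the third principle concrete, here is a minimal Python sketch of what "assertions plus data diff in CI" can look like. It is an illustration only, not how any of the three companies implement it: the function names (check_assertions, data_diff, run_checks), the in-memory row format, and the pass/fail policy are all assumptions made for this example. In practice the rows would come from a production table and from the same table built from the branch under review.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def check_assertions(rows: Sequence[Row], key: str,
                     not_null: Sequence[str] = ()) -> List[str]:
    """Return human-readable failures; an empty list means all checks pass.

    Checks (hypothetical, chosen for illustration): the key column is
    non-null and unique, and every column in `not_null` is populated.
    """
    failures: List[str] = []
    seen = set()
    for i, row in enumerate(rows):
        k = row.get(key)
        if k is None:
            failures.append(f"row {i}: {key} is null")
        elif k in seen:
            failures.append(f"row {i}: duplicate {key}={k!r}")
        else:
            seen.add(k)
        for col in not_null:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
    return failures


def data_diff(before: Sequence[Row], after: Sequence[Row],
              key: str) -> Dict[str, list]:
    """Compare two versions of a table by primary key.

    Assumes keys are unique, non-null, and mutually sortable,
    i.e. that check_assertions has already passed on both inputs.
    """
    old = {r[key]: r for r in before}
    new = {r[key]: r for r in after}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys()
                          if old[k] != new[k]),
    }


def run_checks(before: Sequence[Row], after: Sequence[Row], key: str,
               not_null: Sequence[str] = ()) -> int:
    """Return a CI-style exit code: 0 if assertions pass, 1 otherwise.

    Assumed policy: assertion failures block the change, while the diff
    is only printed so a reviewer can judge whether it is expected.
    """
    failures = check_assertions(after, key, not_null)
    for failure in failures:
        print("ASSERTION FAILED:", failure)
    if failures:
        return 1
    for kind, keys in data_diff(before, after, key).items():
        print(f"{kind}: {len(keys)} row(s) {keys}")
    return 0
```

A CI job could call run_checks(prod_rows, branch_rows, "id", not_null=("amount",)) and pass the result to sys.exit, so that a failed assertion fails the build while the printed diff summary lands in the job log for the reviewer.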
Tools: Notebooks
Data Science Notebooks – I like having multiple tea flavors around, but I don't know how many different implementations of Data Science Notebooks we need 😉 According to datasciencenotebook.org there are at least 20❗️, and many are far more powerful and collaborative than your plain old Jupyter.
SQL-centric notebooks
Mode pioneered the concept of a collaborative "SQL > Chart > Notebook" workflow 8 years ago and found a lot of success among high-growth data teams. In fact, it's so easy to create SQL analyses in Mode that most teams eventually face the problem of exploding content, and some have even built entire ETL pipelines to clean up the mess.
Over the past year, two challengers came in:
- PopSQL – built by an ex-Instacart team, offers proprietary desktop & web apps
- Querybook – open-sourced by Pinterest
Both tools feature built-in data documentation and discovery to help curate the content, as well as the ability to host them on-prem.
Buy not build: Reverse ETL
For those who place the comma after "buy": what is your team building, or considering building, that you could likely buy off the shelf, freeing up resources for things unique to your business?
Operationalizing the data, i.e. getting your operations or sales teams to act on it, has been a challenge primarily because analytical data ends up locked in everyone's DWH and BI tools. If you were about to build custom pipelines to sync the data into ERP / CRM / CSM, etc., you no longer have to: HeadsUp, Census, and Hightouch, among others, offer an almost 1-click solution to that problem.
Data Stories
Do you have a data war story that would benefit the data quality community?
Reply to this email to share the know-how, clever hacks, and tools you have built or introduced to make your (and your team's) work more productive and enjoyable. The most interesting stories will be featured at our upcoming meetup and in the newsletter!