Thumbtack leverages Datafold to save hundreds of hours and gain peace of mind about data quality
Introduction
Founded in 2009, Thumbtack facilitates five million projects every year in nearly 1,100 unique categories, covering every zip code in the United States. With more than 10 million users and growing, Thumbtack has raised more than $698.2 million and is valued at $3.2 billion.
"You can see right off the bat whether your data quality is what you were expecting, and reviewers can see it, too. Now we’re at the rate where we’re automating code reviews, or close to it, on 100 pull requests per month. And this is just the start."
The Problem
Thumbtack, an online marketplace that connects local professionals with customers who need services, has built a highly successful data-driven product. Thumbtack’s data team consists of 50+ analysts and five data engineers, who together were submitting over 100 pull requests to the SQL pipelines per month.
The business was scaling quickly, which meant the product could evolve as quickly as data became available. However, data quality was a business risk: a single bug could corrupt an entire table, and because that table might feed other tables, the damage could cascade. This was more than a matter of breaking the CEO’s dashboards. It could have serious business implications, because many parts of the product, such as search, were powered by ML models trained on analytical data.
Whenever a data outage happened, the entire data team had to drop everything to find and fix the broken data, which compounded their existing workloads.
Thumbtack's manual testing: Data issues were a natural consequence of empowering so many analysts to submit so many SQL code changes. To minimize data outages, and the stress about their potential fallout, Thumbtack implemented a proactive manual process that followed a code review playbook, with the goal of not breaking data in the first place.
Analysts would write SQL queries to check that each pull request (PR) impacted only the expected rows and columns, and would load each change’s details into spreadsheets for tracking. The process typically took between one and two hours per PR, although some PRs could take as long as half a day. Some analysts simply skipped the process or did not track changes adequately. Ensuring data quality and enforcing the change management workflow became increasingly difficult as the team grew. The manual review process helped reduce data outages but kept the Thumbtack team from moving fast and at scale.
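To make the manual check concrete, here is a minimal sketch of the kind of comparison an analyst would perform by hand: matching rows between the production table and the table built from a PR branch, then listing which rows and columns differ. It is an illustration only, not Thumbtack's playbook or Datafold's implementation. The `diff_tables` function, the sample tables, and the column names are all hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def diff_tables(before: list[Row], after: list[Row], key: str) -> dict[str, Any]:
    """Compare two versions of a table, matching rows on a primary key column."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Rows that exist in only one of the two versions.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    # For rows present in both versions, record which columns changed.
    changed: dict[Any, list[str]] = {}
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        columns = sorted(c for c in old.keys() | new.keys() if old.get(c) != new.get(c))
        if columns:
            changed[k] = columns

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical data: the production table vs. the table built from the PR branch.
prod = [
    {"id": 1, "category": "plumbing", "price": 120},
    {"id": 2, "category": "painting", "price": 300},
    {"id": 3, "category": "moving", "price": 450},
]
branch = [
    {"id": 1, "category": "plumbing", "price": 120},
    {"id": 2, "category": "painting", "price": 310},
    {"id": 4, "category": "cleaning", "price": 90},
]

report = diff_tables(prod, branch, key="id")
print(report)
# {'added': [4], 'removed': [3], 'changed': {2: ['price']}}
```

A reviewer reading this report can immediately ask whether the PR was supposed to drop row 3, add row 4, and change `price` on row 2. Answering that question by hand for every PR is what consumed one to two hours each time.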
The Solution
The team tested Datafold's Data Diff after a two-hour deployment in Thumbtack’s own cloud environment. After successful ad hoc use of Data Diff, the team built Datafold into the continuous integration (CI) pipeline in GitHub. Now every change to SQL code is validated automatically through the Datafold API, and a detailed impact analysis report is published to the pull request discussion for each change. Besides using Data Diff to test code changes, the Thumbtack team uses Datafold’s column-level lineage feature to identify the downstream implications of a change, a task that would otherwise take hours of chasing dependencies through the massive codebase.
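The idea behind using column-level lineage for impact analysis can be sketched as a graph traversal: given a map from each column to the columns derived from it, walk outward from the changed column to collect everything downstream. This is a simplified illustration under assumed data, not Datafold's API or implementation. The `downstream_columns` function and every table and column name in the lineage map are hypothetical.

```python
from collections import deque


def downstream_columns(lineage: dict[str, list[str]], changed: str) -> list[str]:
    """Return every column reachable from `changed`, in breadth-first order."""
    seen = {changed}
    order: list[str] = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against revisiting shared dependencies
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: each column maps to the columns computed from it.
lineage = {
    "raw_requests.price": ["fct_projects.price_usd"],
    "fct_projects.price_usd": [
        "agg_daily_revenue.revenue",
        "ml_search_features.avg_price",
    ],
    "agg_daily_revenue.revenue": ["dash_exec.revenue"],
}

impacted = downstream_columns(lineage, "raw_requests.price")
for column in impacted:
    print(column)
```

In this made-up example, a change to `raw_requests.price` reaches both an executive dashboard column and an ML feature column, which is the kind of cascading dependency the case study describes.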
The Results
- Improved data quality. Column-level lineage makes it easy for analysts and engineers to see when their changes could impact other people’s data and to give advance notice. Data Diff ensures that all changes are thoroughly tested and easy to review, which prevents breaking issues and keeps data quality high.
- 200+ hours saved per month. By automating the testing and review process with Datafold, the team is saving multiple hours per PR. According to a Thumbtack analyst, "When everything is correct, Datafold clearly saves time on testing; but when something is wrong or there’s an error, it saves unimaginable amounts of time that would go into finding and fixing bad data."
- Peace of mind. Whether the data team is merging to production or doing a review on a colleague’s PR, Datafold does the hard work of validation. Simply seeing the diff report from Datafold on GitHub gives Thumbtack’s data analysts, engineers, and team leaders greater confidence in their data quality.
- Increased productivity by 20+%. Datafold saves time and eases anxiety and frustration by reducing or eliminating manual testing and tracking processes. Now the data team and its leaders can focus on creative data work that helps the product and company scale and evolve.