October 4, 2024
Modern Data Stack

The Modern Data Stack: Open-source edition

If you're looking to build the ideal modern data stack for analytics using only open-source tools, we cover the top alternatives to your favorite paid solutions.

Gleb Mezhanskiy

Note: While some technologies listed in this guide are not open-source but source-available per their licenses, we believe it's important to list and evaluate them alongside since the license, in most cases, wouldn't have a material effect on how a typical data team would leverage the technology.

Putting together an effective, scalable, cost-efficient data stack is challenging. The industry has made enormous strides over the past 10 years, and many technologies and products exist.

So why is it hard to put together a cohesive data stack? 

  • Data stacks are fragmented: they require stitching multiple tools together.
  • Many technologies and vendors overlap, and it’s hard to understand what’s great for what.
  • The ecosystem is changing fast: new promising entrants pop up, and consolidation and economic hardship kill products. You need to place your bets wisely.

As data practitioners who grapple with the same challenges, we aim to provide an overview of the top technology choices for each part of a modern data stack. While we do our best to ground our judgment in objective criteria, looking into the development and release velocity of projects and diving deep into documentation, it is inevitably a subjective opinion, as any technology recommendation is. Opinions are our own, based on decades of combined experience as data leaders at multiple companies, including Autodesk and Lyft.

Why open source?

I firmly believe that open source alone is not a compelling enough reason to choose a technology for the data stack. However, there are a few good reasons to evaluate open source data stack elements:

  1. Some parts of the data stack, such as orchestrators or in-memory data processing engines, greatly benefit from being open source as their users are likely to embed them in their internal applications and/or extend them.
  2. For some organizations, open source may be the only possible solution, such as in a highly regulated industry or due to local privacy and data security legislation that may prohibit the use of foreign SaaS vendors.
  3. Unlike closed-source SaaS, which often hides its full functionality behind demos and layers of overpromising marketing that can be hard to cut through, open-source software is fully transparent in its capabilities and, therefore, easier to research and compare.

Elements of the modern data stack

Since even a basic data stack has many moving parts and technologies to choose from, this post organizes the analysis according to the steps in the data value chain.

Organizations need data to make better decisions, either human (analysis) or machine (algorithms and machine learning). Before data can be used for that, it needs to go through a sophisticated multi-step process, which typically follows these steps:

  1. Collection & Integration: getting data into the data warehouse/lake for further processing
    1. Event Processing: Collecting events and getting them in the warehouse
    2. Data Integration: Adding data from various sources into the warehouse and vice-versa (Reverse ETL)
  2. Data Lake: Storing, processing, and serving data in a lake architecture, along with its Storage, File, Metadata, and Compute layers.
  3. Data Warehousing: Similar to a data lake but an all-in-one solution with all the layers more tightly integrated.
  4. Streaming & Realtime: Emerging category for low-latency or operational use cases
  5. Data Orchestration: Managing how data pipelines are defined, structured, and orchestrated.
  6. Data Catalogs: Finding the right data asset for the problem
  7. BI & Data Apps: The consumption layer of the data stack with subcategories, including Notebooks, Self-serve, Dashboards, Code-first Data Apps, and Product/Web Analytics.

Most of the open-source products listed below are open-core, i.e., primarily maintained and developed by teams that profit from consulting about, hosting, and offering enterprise features for those technologies. In this post, we do not consider the features only available through the SaaS/Enterprise versions, thereby comparing only openly available solutions.

Overview of open-source modern data stack products

Evaluation criteria

For each open-source project, we use the following criteria:

Feature completeness – how closely the product matches the best-in-class SaaS offering.

Traction

  • # of stargazers
  • # of contributors
  • Velocity, expressed in commits and releases per month. If it's very low, you risk adopting a stale project, requiring you to fix bugs and build integrations on your own.

Maturity – beyond features, how easy and reliable the product will be to use in production. This is a qualitative and subjective assessment.

  • Promising – the technology demonstrates basic functionality, but using it in production will likely require some development and deployment work.
  • Mature – the technology is production-ready out of the box, is used by multiple large organizations, and is considered one of the go-to solutions in its category.
  • Legacy – the technology is well-adopted, but its popularity and momentum have already peaked, and new entrants are actively disrupting it.

Collection and integration

Best-in-class SaaS: Segment (event processing), Fivetran (data integration), Hightouch (reverse ETL)

Collecting event data from client- and server-side applications as well as third-party systems.

Event collection and ingestion 

RudderStack appears to be the most well-rounded, mature, and actively developed platform in the "open-source Segment alternative" class. It's the open-source leader in integration at the event level. Snowplow is another popular solution; it pursues the ETL (instead of ELT) pattern and enables in-flight transformations. While it had strong open-source momentum a few years ago, these days it seems to be focused on its cloud offering, with almost no development on the open-source project. Jitsu, on the other hand, is a younger contender that is more focused on event ingestion into the warehouse.

Data integration + reverse-ETL

Airbyte is currently the leading open-source data integration solution, followed by Meltano, a project that originated at GitLab.

Airbyte and Meltano used to share the same core—the Singer protocol for extracting and loading data. Some time ago, Airbyte pivoted to its own connector framework while Meltano continued building on Singer. Another distinction is that Airbyte’s database replication is based on Debezium, arguably more mature and scalable than Meltano’s CDC based on Singer “taps”. Airbyte and Meltano focus more on data integration – bringing data into the warehouse from multiple sources – but offer some reverse-ETL (pushing data from the warehouse into business apps) functionality.
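
To make the ELT pattern concrete, here is a minimal sketch using dlt (covered in the comparison table below), assuming a local DuckDB destination; the source data here is hypothetical, whereas real pipelines would use dlt's sources and connectors.

import dlt

# Hypothetical source data; in practice this would come from an API or database.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]

# Load into a local DuckDB file; the destination could be BigQuery, Snowflake, etc.
pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb", dataset_name="raw")
load_info = pipeline.run(rows, table_name="users")
print(load_info)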

Here are the six contenders for open-source alternatives to Segment and Fivetran:

| Feature | RudderStack | Jitsu | Snowplow | Airbyte | Meltano | dlt |
| --- | --- | --- | --- | --- | --- | --- |
| Closest SaaS | Segment | Segment | Segment | Fivetran | Fivetran/Hightouch | Fivetran |
| API as a source | No | No | No | Yes | Yes (via Singer) | Yes |
| Event collection | Yes | Yes | Yes | No | No | No |
| Database replication | No | No | No | Yes (via Debezium) | Yes (via Singer) | No |
| API as destination (reverse ETL) | No | No | No | No | Yes (via Singer) | Yes |
| Database as destination | Yes | Yes | Yes | Yes | Yes (via Singer) | Yes |
| Maturity | Mature | New | Mature | Mature | Mature | Mature |
| Repo | rudderlabs/rudder-server | jitsucom/jitsu | snowplow/snowplow | airbytehq/airbyte | meltano/meltano | dlt-hub/dlt |
| Stargazers | 4.1k | 4.1k | 6.8k | 15.7k | 1.8k | 2.4k |
| # of contributors (180d) | 24 | 7 | 2 | 178 | 12 | 42 |
| # of weekly commits (180d) | 11 | 8 | 0 | 247 | 9 | 18 |
| Primary language | Go | TypeScript | Scala | Python | Python | Python |
| License | Elastic License 2.0 | MIT | Apache 2.0 | MIT | MIT | Apache 2.0 |

Data lake & warehousing

The heart of any data stack! Let's get clear on the definitions first:

Data lake vs. data warehouse

| | Data Lake | Data Warehouse |
| --- | --- | --- |
| Architecture | Assembled from individual components; storage and compute are decoupled | Storage and compute are more coupled |
| Query Engine | Supports multiple query engines (e.g. Spark, DuckDB operating on a unified data lake) | Typically has one query engine |
| Scalability | Highly scalable; compute and storage can be scaled separately | Depends on the architecture, usually less scalable than a data lake |
| Structure | Supports both unstructured and structured (via metastore) data | More structured |
| Performance | Depends on the query engines and architecture | Typically more performant due to tighter coupling of storage and compute |
| Management overhead | Higher due to multiple components that need to work together | Depends on the specific technology; usually lower due to fewer moving parts |

Typically, a data lake approach is chosen when scalability and flexibility are more important than performance and low management overhead.

Data lake components

The data lake architecture is organized into layers, each responsible for its own part:

  1. Storage Layer – operates the physical storage
  2. File Layer – determines in what format the data is stored in the storage layer
  3. Metadata Layer – defines structure (e.g., which files correspond to which tables) and simplifies data operations and governance
  4. Compute Layer – performs data processing and querying

Storage layer

Best-in-class SaaS: AWS S3, GCP GCS, Azure Data Lake Storage

Object storage is the prevalent way to store data due to its simplicity and infinite scalability, where you can add more data without worrying about available space. Modern object storage solutions by major cloud providers replaced open source HDFS (Hadoop Distributed File System), which provided a file system spread across clusters of machines and required a lot of overhead to manage while being far less elastic.

The storage layer is one that many data platforms, even those that lean heavily on open source, decide to buy from a vendor such as AWS. However, if you want to go 100% open source, MinIO is the leading technology for self-managed blob storage that is compatible with the S3 API (and hence highly interoperable within the data ecosystem). It can be easily deployed and managed through Kubernetes.
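
Because MinIO speaks the S3 API, existing S3 tooling works against it with only an endpoint override. A minimal sketch using boto3, with hypothetical endpoint, credentials, and bucket names:

import boto3

# Point the standard S3 client at a self-hosted MinIO endpoint (hypothetical values).
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.internal:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

s3.create_bucket(Bucket="datalake")
s3.upload_file("events.parquet", "datalake", "raw/events/events.parquet")
print(s3.list_objects_v2(Bucket="datalake")["Contents"])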

File layer

The File Layer determines how the data is stored on the disk. Choosing the right storage format can have a massive impact on the performance of data lake operations. Luckily, given the flexibility of the data lake architecture, you can mix and match different storage formats. Apache Parquet is the leading storage format for the file layer due to its balanced performance, flexibility, and maturity. Avro is a more specialized option for specific applications.
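
As a quick illustration of the file layer, here is a small sketch of writing and reading Parquet via pandas (with PyArrow under the hood); the data and file name are illustrative:

import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "BR"], "revenue": [10.0, 7.5, 3.2]})

# Write a columnar, compressed Parquet file.
df.to_parquet("revenue.parquet", compression="snappy")

# Read back only the columns needed -- a key benefit of columnar storage.
subset = pd.read_parquet("revenue.parquet", columns=["country", "revenue"])
print(subset)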

| Feature | Parquet | Avro |
| --- | --- | --- |
| Data orientation | Columnar | Row |
| Compression | High | Medium |
| Optimized for | Read-heavy analytical workloads, classic ELT architectures | Row-oriented workloads, data exchange, streaming |
| Schema evolution | Partial | Full |
| Interoperability | Best | Good |
| Repo | apache/parquet-format | apache/avro |
| Stargazers | 1.8k | 2.9k |
| # of contributors (180d) | 11 | 37 |
| # of weekly commits (180d) | 0 | 12 |
| Maturity | Mature | Mature |
| License | Apache 2.0 | Apache 2.0 |

Metadata layer

Since in the data lake architecture, unlike a data warehouse, we operate with files directly, the metadata layer (also called “table format” in modern solutions such as Apache Iceberg) is a key component that enables efficient data workloads at scale while abstracting the end users away from the complexity of dealing with files. It serves two primary purposes:

  1. Linking files to tables allows the user to query a table without knowing in what file(s) the data is stored.
  2. Enabling data ops such as concurrent writes, ACID transactions, versioning, and file compaction that simplify data management and ensure high robustness.

Hive Metastore, originally built within the Hadoop/Hive ecosystem, has been the prevalent metadata layer for over a decade, primarily focused on the first use case. As data complexity, usage, and volume increased, "just" storing data as tables was no longer enough. Today, Hive Metastore, while still widely adopted, is considered legacy: it has been superseded by newer data lake platforms such as Apache Iceberg, which provide not only a metastore but also much more powerful table management capabilities that enable sophisticated data workflows.

Apache Hudi and Delta Lake should be mentioned as more specialized data lake platforms originally optimized for Apache Spark as an engine.

As of now, Apache Iceberg seems to be winning as the standard data lake platform providing table and metadata management for a broad variety of use cases.
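
To give a feel for how the metadata layer hides files from the end user, here is a rough sketch using the PyIceberg client, assuming a catalog named "default" is already configured and contains a hypothetical analytics.events table:

from pyiceberg.catalog import load_catalog

# The catalog knows which files make up each table -- no file paths in user code.
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# Scan with a filter; Iceberg metadata is used to prune files before reading.
arrow_table = table.scan(row_filter="event_date >= '2024-01-01'").to_arrow()
print(arrow_table.num_rows)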

| Feature | Iceberg | Hive Metastore | Hudi | Delta Lake |
| --- | --- | --- | --- | --- |
| Originated at | Netflix | Facebook | Uber | Databricks |
| File formats | Parquet, Avro, etc. | Parquet, Avro, etc. | Parquet, Avro, etc. | Parquet |
| Schema evolution | Yes | Limited | Limited | Yes |
| Transactions | ACID | No | ACID | ACID |
| Concurrency | High | Limited | High | High |
| Table Versioning | Yes | No | Yes | Yes |
| Interoperability | High (Spark, Trino, Flink, BigQuery) | High (Hive, Trino, Spark) | Limited (optimized for Apache Spark) | Limited (optimized for Apache Spark) |
| Repo | apache/iceberg | apache/hive | apache/hudi | delta-io/delta |
| Stargazers | 6.2k | 5.5k | 5.4k | 7.5k |
| # of contributors (180d) | 95 | 77 | 84 | 81 |
| # of weekly commits (180d) | 21 | 15 | 19 | 20 |
| Maturity | Mature | Legacy | Mature | Mature |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |

Compute layer

The Compute Layer is the most user-facing part of the data lake architecture since it’s what data engineers, analysts, and other data users leverage to run queries and workloads.

There are multiple ways compute engines can be compared:

  1. Interface: SQL, Python, Scala, etc.
  2. Architecture: in-process vs. distributed
  3. Primary processing: in-memory vs. streaming
  4. Capabilities: querying, transformations, ML

We believe that today, the architecture – in-process vs. distributed – is the most important distinction, since other aspects, e.g., supported languages or capabilities, can be added over time.

In-process compute engines

Over the past few years, in-process compute engines emerged as a novel way of running compute on data lakes. Unlike distributed compute engines like Hive, Trino, and Spark which run their own clusters of nodes that are queried over the network, in-process engines are invoked from and run within data science notebooks or orchestrator jobs. I.e., they function similarly to libraries that are called by the code that runs the flow. To illustrate, with DuckDB you can do the following from a Python application:


import duckdb

# Runs entirely inside the Python process - no cluster or server to manage.
duckdb.connect().execute("SELECT * FROM your_table").fetchall()

Whereas working with a distributed system like Spark requires first spinning it up as a standalone cluster and then connecting to it over the network.

The power of the in-process pattern is simplicity and efficiency: distributed engines like Spark and Trino require a significant dedicated effort to operate, whereas in-process engines can run inside an Airflow task or a Dagster job and require no DevOps or additional infrastructure.

There are three main limitations of in-process engines:

  1. Size scalability: the data queried generally needs to fit in the memory available to the process/machine, although spilling to disk is increasingly supported as a fallback.
  2. Compute scalability: parallelism is limited to the cores available on a single machine; work cannot be distributed across multiple machines.
  3. Limited support for data apps: while in-process engines can be used for various workloads, including ingestion, transformation, and machine learning, most BI tools and business applications require a distributed engine with an online server endpoint. One cannot connect to DuckDB from, say, Looker since DuckDB runs within a process and does not expose a server endpoint, unlike PostgreSQL.

As a rule of thumb, in-process engines are optimal and efficient for small-to-medium (up to hundreds of gigabytes) data operations.

The following three technologies are leading in-process data engines:

DuckDB – a column-oriented SQL in-process engine

Pandas – a popular Python library for processing data frames

Polars – widely considered "faster Pandas", Polars is a new-generation data frame engine optimized for performance.

It's worth noting that DuckDB and Polars are rapidly growing their interoperability with other technologies through the adoption of the highly efficient Apache Arrow (ADBC) as a data transfer protocol and Iceberg as a table format.
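
A small sketch of that interoperability, assuming a local Parquet file: DuckDB runs the SQL, hands the result over as an Arrow table, and Polars continues the analysis on the same in-memory data.

import duckdb
import polars as pl

# Query a Parquet file with SQL in DuckDB and return the result as an Arrow table.
arrow_table = duckdb.sql("SELECT country, sum(revenue) AS revenue FROM 'revenue.parquet' GROUP BY 1").arrow()

# Continue the analysis in Polars without leaving the Arrow memory format.
df = pl.from_arrow(arrow_table)
print(df.sort("revenue", descending=True))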

| Feature | DuckDB | Pandas | Polars |
| --- | --- | --- | --- |
| Repo | duckdb/duckdb | pandas-dev/pandas | pola-rs/polars |
| Querying Interface | SQL | Python | Python, Rust, SQL |
| File IO | JSON, CSV, Parquet, Excel, etc. | JSON, CSV, Parquet, Excel, etc. | JSON, CSV, Parquet, Excel, etc. |
| Database integrations | Native PostgreSQL, MySQL; any via ADBC | Any via ADBC or SQLAlchemy | Any via ADBC or SQLAlchemy |
| Parallel (multi-core) execution | Yes | No (through extensions) | Yes |
| Metastore Support | Iceberg | None | Iceberg (emerging) |
| Written in | C++ | Python | Rust |
| Stargazers | 2.4k | 43.4k | 29.6k |
| # of contributors (180d) | 42 | 201 | 149 |
| # of weekly commits (180d) | 18 | 33 | 63 |
| Maturity | Mature | Mature | Mature |
| License | MIT | BSD 3-Clause | MIT |

Distributed compute engines

Best-in-class SaaS: Snowflake, Databricks SQL, BigQuery

We previously discussed criteria for choosing a data warehouse and common misconceptions about the advantages of open-source data processing technologies. We highlighted Snowflake, BigQuery, and Databricks SQL as optimal solutions for typical analytical needs.

Assuming the typical pattern for analytics:

  1. Current or anticipated large (terabyte-petabyte) and growing data volume
  2. Many (10-100+) data consumers (people and systems)
  3. Load is distributed in bursts (e.g. more people query data in the morning but few at night)
  4. Query latency is important but not critical; 5-30s is satisfactory
  5. The primary interface for transforming and querying data is SQL

To meet the assumptions above, teams frequently use a data lake architecture (decoupling storage from compute).

The leading open-source project that meets these criteria is Trino. Trino has the following advantages:

  • Feature-rich SQL interface
  • Works well for a wide range of use cases from ETL to serving BI tools
  • Close to matching best-in-class SaaS offerings in terms of usability and performance
  • Provides a great trade-off between latency and scalability

Trino has been so successful as an open-source Snowflake alternative largely because of a substantial improvement in usability: it abstracts the user (and DevOps) away from the underlying complexity and just works. Forget the hundreds of critical tuning parameters needed to make older systems like Hive work.
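
That usability extends to the client side: connecting takes a few lines with the official trino Python package. A hedged sketch with hypothetical host, catalog, and schema names:

from trino.dbapi import connect

# Hypothetical coordinator host and an Iceberg catalog configured in the cluster.
conn = connect(host="trino.internal", port=8080, user="analyst", catalog="iceberg", schema="analytics")
cur = conn.cursor()
cur.execute("SELECT country, count(*) FROM events GROUP BY 1 ORDER BY 2 DESC LIMIT 10")
for row in cur.fetchall():
    print(row)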

ML & specialized jobs

Spark is a true workhorse of modern data computing with a polyglot interface (SQL, Python, Java & Scala) and unmatched interoperability with other systems. It is also extremely versatile and handles a wide range of workloads, from classic batch ETL to streaming to ML and graph analytics.
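
A minimal PySpark sketch of the batch side of that versatility, assuming a local session and a hypothetical Parquet dataset in the lake:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-etl").getOrCreate()

# Classic batch ETL: read Parquet, aggregate, write the result back to the lake.
events = spark.read.parquet("s3a://datalake/raw/events/")
daily = events.groupBy("event_date").agg(F.countDistinct("user_id").alias("active_users"))
daily.write.mode("overwrite").parquet("s3a://datalake/marts/daily_active_users/")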

Do I need Trino if Spark can handle everything?

While Spark is a Swiss army knife for ETL and ML, it is not optimized for interactive query performance. Usually, it has significantly higher latency for average BI queries, as shown by benchmarks.

A popular pattern is to have both Trino & Spark used for different types of workflows while sharing data and metadata:

| Feature | Trino | Spark |
| --- | --- | --- |
| Repo | trinodb/trino | apache/spark |
| Architecture | Distributed | Distributed |
| Querying Interface | SQL | Scala, Java, Python, R, SQL |
| Fault Tolerance | None - partial failures require full restart | Fault-tolerant, can restart from checkpoints |
| Latency | Lower | Higher |
| Streaming | No | Yes |
| Metastore Support | Hive Metastore, Iceberg, Delta Lake (emerging) | Delta Lake (optimal), Hive Metastore, Iceberg |
| Machine learning | No | Yes |
| Written in | Java | Scala |
| Stargazers | 10.3k | 39.4k |
| # of contributors (180d) | 122 | 180 |
| # of weekly commits (180d) | 100 | 73 |
| Maturity | Mature | Mature |
| License | Apache 2.0 | Apache 2.0 |

Notable mentions:

PrestoDB 

PrestoDB shares a common ancestry with Trino (previously called PrestoSQL). Side story: Trino is a fork of PrestoDB, which was initially developed at Facebook as a replacement for Hive. Facebook enforced its trademark and ownership rights on the software, prompting the fork and rename. You can read more on that from the core team behind PrestoDB/PrestoSQL. We recommend Trino over PrestoDB due to its larger community (PrestoDB is primarily maintained by Facebook/Meta and therefore prioritizes its internal needs).

Hive – Hive was once a major breakthrough in large-scale data compute, offering a SQL interface to Hadoop MapReduce. However, by now it's mostly obsolete and has no advantages over modern systems like Trino and Spark while being much slower and harder to maintain and use.

Data warehouses

Two good reasons for choosing a data warehouse (a coupled storage-and-compute architecture) are:

  1. Performance: use cases that require sub-second query latencies, which are hard to achieve with the data lake architecture shown above (since data is sent over the network for every query). For such use cases, there are warehousing technologies that bundle storage and compute on the same nodes and, as a result, are optimized for performance (at the expense of scalability), such as ClickHouse and Hydra.
  2. Simplicity: operating a data warehouse is usually significantly simpler than a data lake due to fewer components and moving parts. It may be an optimal choice when:
    1. The volume of data and queries is not high enough to justify running a data lake.
    2. A data lake already exists but certain workloads require lower latency (e.g., BI). As far as simplicity goes, projects like Hydra that are based on PostgreSQL are hard to match in terms of ease of deployment, with scalability as a potential tradeoff.

Both ClickHouse and Hydra take advantage of the tighter coupling of storage and compute that is standard for the data warehousing paradigm:

  • Column-oriented data storage optimized for analytical workloads
  • Vectorized data processing allowing the use of SIMD on modern CPUs
  • Efficient data compression

However, they have several important distinctions:

ClickHouse is one of the most mature and prolific open-source data warehousing technologies, originating at Yandex (aka the Google of Russia). It powers Yandex Metrica (a competitor to Google Analytics) and processes over 20 billion events daily in a multi-petabyte deployment.

Hydra is a newer player that builds upon PostgreSQL. As an extension to PostgreSQL, Hydra has excellent interoperability with other tools while adding powerful analytical features and optimizations to PostgreSQL. Hydra’s primary limitation is its scalability – unlike ClickHouse, which can scale to hundreds of nodes and handle petabytes of data, Hydra currently can only scale vertically on a single node.
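
Both expose a plain SQL interface over standard protocols. As an illustration, here is a hedged sketch of querying ClickHouse from Python with the clickhouse-connect driver, assuming a hypothetical server and events table:

import clickhouse_connect

# Connect over ClickHouse's HTTP interface (hypothetical host and credentials).
client = clickhouse_connect.get_client(host="clickhouse.internal", port=8123, username="default")

result = client.query("SELECT toDate(ts) AS day, count() AS events FROM events GROUP BY day ORDER BY day")
for day, events in result.result_rows:
    print(day, events)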

| Feature | ClickHouse | Hydra |
| --- | --- | --- |
| Repo | ClickHouse/ClickHouse | hydradatabase/hydra |
| Architecture | Distributed | Single node (based on PostgreSQL) |
| Data volume | Up to petabytes | Up to hundreds of gigabytes |
| Scaling by | Adding nodes | Increasing instance size and storage (limited to 1x instance) |
| Querying Interface | SQL | SQL |
| External data sources | PostgreSQL, MySQL, Snowflake, BigQuery, Kafka, S3, GCS, Redshift | S3, PostgreSQL, MySQL, Google Sheets |
| Streaming Inserts | Yes | Emerging |
| Written in | C++ | C |
| Stargazers | 36.9k | 2.8k |
| # of contributors (180d) | 253 | 4 |
| # of weekly commits (180d) | 570 | 0 |
| Maturity | Mature | Promising |
| License | Apache 2.0 | Apache 2.0 |

Notable mentions:

Pinot and Druid are distributed OLAP data engines that are more similar to ClickHouse than to Hydra. Originally developed at LinkedIn and Metamarkets respectively, those systems have largely been superseded by ClickHouse, which offers lower maintenance costs, faster performance, and a wider range of features (for example, Pinot and Druid have very limited support for JOIN operations, which are a key part of modern data workflows).

Data orchestration

The role of the data orchestrator is to help you develop and reliably execute code that extracts, loads, and transforms data in the data lake or warehouse, as well as performs tasks such as training machine learning models. Given the variety of available frameworks and tools, it is important to consider the major use cases:

SQL-centric workflows

Premise: your transformations are expressed predominantly in SQL.

dbt was our top choice even when compared to SaaS tools for the following reasons:

  • Comprehensive yet lightweight: captures most use cases and is easy to start with
  • Open source yet smooth: on top of all the benefits of OSS, it’s also easy to run
  • Opinionated yet reasonable: relies on common conventions and best practices
  • Great documentation and an active, devoted community

SQLMesh is a new player that positions itself as an alternative to dbt and claims to provide a better developer experience with the same paradigm.

Key differentiators:

  1. Code Understanding – unlike dbt, which relies on Jinja templating, SQLMesh performs full SQL parsing and code manipulation via SQLGlot. This, in turn, enables it to solve several problems:
    1. Column-level lineage – SQLMesh tracks dependencies for all models that it manages on the columnar level.
    2. Impact analysis – SQLMesh analyzes changes, determines which are breaking vs. not, and simplifies dataops such as backfills and deploys.
  2. More robust handling of incremental/partitioned and slowly-changing dimension tables. Incremental (or partitioned) models are needed when data is too large to be processed in one query run. In dbt, such models are handled in a non-idempotent way, meaning that each run can produce a different, non-deterministic output because state is not managed. SQLMesh preserves the state of incremental models, making them easier to express and eliminating potential data leakage and data quality issues. It also provides richer templating support for various types of incremental models. Read in their own words how the dbt workflow feels compared to SQLMesh.

As of today, dbt is a de-facto standard for SQL transformations with a massive and loyal community and viral adoption. Nevertheless, SQLMesh is worth paying attention to given its strong technological core and focus on developer experience and data quality.
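
Because SQLMesh builds on SQLGlot's full SQL parser rather than string templating, it can reason about queries programmatically. A tiny illustration using SQLGlot directly (the query is hypothetical):

import sqlglot
from sqlglot import exp

sql = "SELECT o.user_id, sum(o.amount) AS total FROM orders AS o GROUP BY o.user_id"

# Parse into an AST and inspect which columns the query actually references.
tree = sqlglot.parse_one(sql)
print([col.sql() for col in tree.find_all(exp.Column)])

# Transpile the same query to another dialect -- no regexes or templating involved.
print(sqlglot.transpile(sql, read="duckdb", write="snowflake")[0])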

| Feature | dbt Core | SQLMesh |
| --- | --- | --- |
| Repo | dbt-labs/dbt-core | TobikoData/sqlmesh |
| Interface | CLI | CLI + UI |
| Column-level Lineage | No | Yes |
| Macros | Jinja | Jinja, Python |
| Unit tests | Coming in v1.8 | Yes |
| Embedded Data Diff | No | Yes |
| Contracts | Schema only | Schema + Data |
| Multiple Environments | Yes, via materialization | Virtual Data Environments |
| Package Marketplace | Yes | No |
| Multi-repository support | No | Yes |
| Written in | Python | Python |
| Stargazers | 9.7k | 1.7k |
| # of Slack community members | 34k | 2.5k |
| # of contributors (180d) | 37 | 38 |
| # of weekly commits (180d) | 9 | 24 |
| Maturity | Mature | Promising |
| License | Apache 2.0 | Apache 2.0 |

Full-featured data orchestration

Premise: in addition to (or instead of) SQL, your workflows include jobs defined in Python, PySpark, and/or other frameworks. While dbt and SQLMesh provide basic support for Python models, their focus is on SQL and a full-featured orchestrator may be warranted for running multi-language workloads.

In this case, it may be beneficial to use a full-featured data orchestrator that:

  1. Manages scheduling and dependencies of disparate tasks
  2. Handles failures and retries gracefully
  3. Provides visibility into the data pipeline status, logs, and performance

The most widely used and incredibly influential data orchestrator is Apache Airflow.

We consider Airflow a legacy technology that should no longer be the default choice for a general-purpose orchestrator. While it's incredibly widely adopted and has many extensions, more modern orchestrators such as Dagster are better suited to present-day data complexity and development lifecycles, and have reached the maturity to be run at an enterprise level.

In the last few years, multiple teams have attempted to rethink and improve orchestration, and at the moment, we recommend taking a closer look at the following leading open-source orchestration frameworks.

Prefect follows the Airflow paradigm and aims to provide a better developer experience and better data-ops visibility.

The most promising full-featured orchestrator that we are excited about today is Dagster. Unlike Prefect, which can be seen as the "modern Airflow", Dagster takes a bolder approach and proposes a few paradigm shifts.

The biggest one is Software-defined Assets: unlike Airflow where the primary concept is a unit of work – Task – with multiple Tasks chained into DAGs, Dagster defines the output – an asset – as its primary concern: "An asset is an object in persistent storage, such as a table, file, or persisted machine learning model. A Software-defined Asset is a Dagster object that couples an asset to the function and upstream assets used to produce its contents.”

While the difference may seem subtle at first glance, it turns the paradigm upside down by shifting the focus from the work being done to the actual data product produced (what the business cares about).
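
To make the concept concrete, here is a minimal sketch of Software-defined Assets using Dagster's Python API; the assets and logic are hypothetical, but the pattern of declaring assets and letting Dagster infer the dependency graph from function parameters is the core idea:

from dagster import asset, materialize

@asset
def raw_orders():
    # In reality this would load from an ingestion source or a warehouse table.
    return [{"order_id": 1, "amount": 30.0}, {"order_id": 2, "amount": 12.5}]

@asset
def order_stats(raw_orders):
    # Depending on raw_orders (by parameter name) defines the asset graph.
    return {"orders": len(raw_orders), "revenue": sum(o["amount"] for o in raw_orders)}

if __name__ == "__main__":
    # Materialize both assets; Dagster resolves the dependency order.
    materialize([raw_orders, order_stats])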

Dagster brings and implements a number of other major enhancements relative to Airflow, including:

  1. Data-aware tasks – a big step forward in ensuring the reliability of pipelines; now, dependencies between tasks can be verified both during development and in production.
  2. Declarative, SLA-driven Scheduling – in Airflow’s paradigm, DAGs of tasks are triggered on a particular schedule. It is up to the data engineer to figure out the scheduling and dependencies making it increasingly difficult to maintain Service Level Agreements at scale. Dagster takes a radically different approach with declarative SLA-driven scheduling: you define when the data asset needs to be produced or updated, and the upstream scheduling is managed for you.
  3. Integrated testing, tracing, logging
  4. First-class environment support for easier development, deployment and maintenance in production.

Mage is a newer player in the data orchestration space. Its focus seems to be on speed of development, and it may be a great choice for smaller data teams that prioritize development speed over robustness. For example, Mage comes with a built-in data integration framework based on Singer that enables data syncing and removes the need for standalone data integration software.

| Feature | Airflow | Dagster | Prefect | Mage |
| --- | --- | --- | --- | --- |
| Repo | apache/airflow | dagster-io/dagster | PrefectHQ/prefect | mage-ai/mage-ai |
| Data contracts | No | Yes | Yes | Yes |
| Multiple Environments | No | Yes | Yes | Yes |
| SLA-driven Scheduling | No | Yes | No | No |
| dbt integration | No/limited | Yes | No/limited | Yes |
| Written in | Python | Python | Python | Python |
| Stargazers | 36.5k | 11.3k | 15.9k | 7.8k |
| # of Slack community members | 24.4k | 10.6k | 14.4k | 5.8k |
| # of contributors (180d) | 335 | 111 | 74 | 45 |
| # of weekly commits (180d) | 79 | 99 | 59 | 11 |
| Maturity | Legacy | Mature | Mature | Promising |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |

Streaming & real time

Streaming and real-time data processing is a vast space with a lot of nuance. In this guide, we offer a high-level overview of concepts and mention a few foundational and breakthrough technologies.

The streaming paradigm is key in a few use cases:

  1. Collecting data (e.g., events) and ingesting it into a data warehouse/lake
  2. Processing data in-flight for enrichment before loading into the long-term storage
  3. Real-time analytics and machine learning, e.g. fraud detection, operational analysis, etc. – where latency matters.

The cornerstone of data streaming is Apache Kafka. Kafka implements a publisher-subscriber (pub/sub) event hub allowing it to collect, store, and deliver vast amounts of data with low latency. Many technologies integrate with Kafka as a source or a sink.
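
As a minimal sketch of the pub/sub pattern, here's publishing a JSON event with the kafka-python client, assuming a local broker and a hypothetical page_views topic:

import json
from kafka import KafkaProducer

# Publish JSON-encoded events to a topic; downstream consumers subscribe independently.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("page_views", {"user_id": 42, "url": "/pricing", "ts": "2024-10-04T12:00:00Z"})
producer.flush()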

When it comes to streaming transformations and compute, three technologies are worth a close look: Apache Spark (mentioned earlier), Apache Flink, and Materialize.

Spark is a popular choice for implementing streaming applications thanks to its mature, multi-language Structured Streaming framework and versatility allowing data teams to leverage Spark both for their real-time and batch workloads.
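
For example, here is a hedged Structured Streaming sketch that reads a Kafka topic and appends it to the lake; the topic, paths, and connector setup are assumptions (the Kafka source requires the spark-sql-kafka connector package):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-stream").getOrCreate()

# Read the Kafka topic as an unbounded stream and write it to Parquet in micro-batches.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page_views")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

query = (
    stream.writeStream.format("parquet")
    .option("path", "s3a://datalake/raw/page_views/")
    .option("checkpointLocation", "s3a://datalake/checkpoints/page_views/")
    .start()
)
query.awaitTermination()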

Flink is a streaming-first engine with a Java/Scala API that offers a few advantages over Spark:

  • Native support of stateful streaming transformations (e.g. to calculate metrics over user sessions)
  • Low latency (millisecond scale vs. second with Spark)

While multiple deeply technical (e.g., memory management) and operational aspects need to be weighed when comparing Flink to Spark, from a strategic perspective Spark is likely the better choice for straightforward ingestion/transformation jobs and would be easy to pick up by teams already using Spark for ML or batch transformations. Flink, on the other hand, enables more advanced production workflows developed by professional data engineers.

Materialize is a newer player in the real-time world offering a new paradigm: “a cloud-native data warehouse purpose-built for operational workloads where an analytical data warehouse would be too slow, and a stream processor would be too complicated.” Materialize’s framework is centered around real-time views allowing the developers to implement applications in a familiar table/view pattern but leverage real-time updates. It is therefore best suited for real-time analytics and applications that require up-to-the-second insights, such as monitoring dashboards, personalization features, and financial alerts.
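
Since Materialize speaks the PostgreSQL wire protocol, working with it looks like running SQL through a standard Postgres driver. A rough sketch, assuming a hypothetical streaming source is already defined:

import psycopg2

# Materialize is wire-compatible with PostgreSQL, so a standard driver works.
conn = psycopg2.connect(host="materialize.internal", port=6875, user="materialize", dbname="materialize")
conn.autocommit = True
cur = conn.cursor()

# An incrementally maintained view over a streaming source (page_views_source is assumed to exist).
cur.execute("""
    CREATE MATERIALIZED VIEW page_views_per_minute AS
    SELECT date_trunc('minute', ts) AS minute, count(*) AS views
    FROM page_views_source
    GROUP BY 1
""")

cur.execute("SELECT * FROM page_views_per_minute ORDER BY minute DESC LIMIT 5")
print(cur.fetchall())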

| Feature | Spark | Flink | Materialize |
| --- | --- | --- | --- |
| Optimized for | Versatility across ETL/ELT, large-scale processing, and machine learning | Low-latency, high-throughput stream processing | Low-latency updates and queries over streaming data with an emphasis on consistency and accuracy |
| Support for batch paradigm | Yes | Yes | No |
| Latency | Seconds | Subsecond | Subsecond |
| Scalability | High | High | High |
| Interface | Scala, Java, Python, SQL | Scala, Java | SQL |
| Written in | Scala | Java | Rust |
| Repo | apache/spark | apache/flink | MaterializeInc/materialize |
| Stargazers | 39.4k | 23.9k | 5.7k |
| # of Slack community members | N/A | 4.3k | 2.3k |
| # of contributors (180d) | 180 | 115 | 49 |
| # of weekly commits (180d) | 73 | 30 | 208 |
| Maturity | Mature | Mature | Mature |
| License | Apache 2.0 | Apache 2.0 | Business Source |

Notable mentions:

Apache Beam, unlike Spark and Flink, is a framework (think dbt, but for streaming) that allows developers to write batch or streaming transformations and execute them on a "runner" of their choice, such as Spark or Flink. It's certainly a powerful concept, and Beam's API is designed to be expressive and robust. However, as most of the momentum behind Beam seems to originate at Google (Beam is the official interface to Google Dataflow), we have yet to see it adopted by the larger open-source community. For that reason, we rank it as a turnkey but not yet mature technology.

Data cataloging

With all the awesome tools for instrumenting, collecting, integrating, and processing data at your disposal, it has indeed become quite easy to create new datasets and analytical content. 

That, however, led to another problem - data discovery:

  1. Where do I find the data / prior analytical work on a topic?
  2. Can I trust that data? Who else uses it? When was it last updated? Are there any known quality issues?

Open-source tooling in the space is still young, but two projects, Amundsen (by Lyft) and DataHub (by LinkedIn), show a lot of promise in the area. Out of the box, they provide a basic data catalog with search and offer a framework for augmenting the catalog with additional metadata (e.g. links to source code, ownership, etc.).

The open-source nature of Amundsen and DataHub allows their users to tailor them according to their stack, workflows, and team cultures, while their plugin architecture allows them to easily add metadata from various sources. For example, column-level lineage information can be synced to Amundsen from Datafold (which is not open source) via API and visualized in the Amundsen UI.

| Feature | Amundsen | DataHub |
| --- | --- | --- |
| Originally developed at | Lyft | LinkedIn |
| Optimized for | Data discovery and search | Data discovery and metadata management |
| Lineage | Table-level, bring-your-own lineage | Column-level lineage for a subset of integrations |
| UI | Emphasizing simplicity of discovery | More feature-rich and complex |
| Written in | Python | Java, Python |
| Repo | amundsen-io/amundsen | datahub-project/datahub |
| Stargazers | 4.4k | 9.7k |
| # of Slack community members | 3.3k | 11k |
| # of contributors (180d) | 5 | 123 |
| # of weekly commits (180d) | 0 | 36 |
| Maturity | Mature | Mature |
| License | Apache 2.0 | Apache 2.0 |

Analysis

In our previous post, we laid out the main use cases for BI:

  1. Self-serve analytics – no-code data analysis
  2. Dashboards – combinations of charts for passive review
  3. Ad-hoc analysis – advanced analysis by technical users
  4. Code-first data app frameworks for building highly customizable interactive data applications embedded into other products.
  5. End-to-end product analytics

Self-serve analytics & dashboards

SaaS state of the art for SQL ad-hoc analysis: Looker, Tableau

While Metabase supports a direct SQL > chart > dashboard flow, it has, among the open-source BI tools, the most advanced self-serve UX, allowing less technical users (or anyone, really) to explore data with simple aggregations.

The UX falls in between Looker and Mode - closer to the former, but without a semantic data model.

Another open-source alternative to Looker, Lightdash promises to connect to your dbt project and allow everyone to answer their own data questions. Lightdash is definitely not the most mature open-source BI product relative to Metabase or Superset, but the LookML-like modeling layer and tight integration with dbt make me believe they are onto something big! Read more about why I'm intrigued by Lightdash in my recent blog.

Started at Airbnb by Maxime Beauchemin, the creator of Apache Airflow, Superset is perhaps the most advanced open-source BI tool in terms of visualization capabilities and has massive momentum behind it.

Keep in mind: there is a learning curve even for technical users.

| Feature | Metabase | Lightdash | Superset |
| --- | --- | --- | --- |
| Repo | metabase/metabase | lightdash/lightdash | apache/superset |
| Closest SaaS | Tableau | Looker | Tableau |
| Optimized for | Self-serve data exploration and dashboards | Self-serve data exploration and dashboards | Dashboards with advanced data visualization |
| Focus | Intuitive UI over extensibility | Deep integration with dbt as a semantic layer | Extensibility with plugins, custom visualizations |
| Written in | Clojure | TypeScript | TypeScript, Python |
| Stargazers | 38.3k | 3.8k | 62.1k |
| # of Slack community members | Website-based forum | 1.6k | 15.5k |
| # of contributors (180d) | 73 | 34 | 132 |
| # of weekly commits (180d) | 96 | 57 | 36 |
| Maturity | Mature | Promising | Mature |
| License | Apache 2.0 | MIT | Apache 2.0 |

Ad-hoc analysis & notebooks

SaaS state of the art for SQL ad-hoc analysis: Hex

Querybook is a promising notebook-like open-source IDE for data analysis originally developed at Pinterest. In line with the modern trend, it also features lightweight data discovery functionality.

For data science that typically heavily relies on R, Python, or Julia, Jupyter is still a safe choice.

Evidence.dev is an interesting project that pushes the simplicity of creating dashboards and visualizations to the limit by centering the workflow on SQL and Markdown.

| Feature | Jupyter | Querybook | Evidence |
| --- | --- | --- | --- |
| Primary interface | Python | SQL | SQL + Markdown |
| Optimized for | Data science, ad-hoc analytics, machine learning | Ad-hoc SQL analysis and data discovery | Getting analytics reports through SQL and Markdown |
| Written in | Python | TypeScript, Python | JavaScript |
| Repo | jupyter/notebook | pinterest/querybook | evidence-dev/evidence |
| Stargazers | 11.6k | 1.9k | 4.2k |
| # of Slack community members | No Slack community, only a public forum | 600 | 1.3k |
| # of contributors (180d) | 8 | 8 | 23 |
| # of weekly commits (180d) | 1 | 1 | 78 |
| Maturity | Mature | Mature | Promising |
| License | BSD 3-Clause | Apache 2.0 | MIT |

Additional tools:

Polynote is a Scala-first notebook experience with support for Python and SQL, which could be a great option for teams building on Spark.

End-to-end product analytics

While SQL + charts tools give a lot of flexibility in terms of the types of analysis you can perform, for certain tasks such as optimizing UX and conversion in your app or website, end-to-end analytics tools offer pre-baked analysis and powerful visualizations such as funnels, cohort charts, and customer segmentation. These tools typically handle the entire cycle of event collection, storage, transformation, and visualization - hence "end-to-end".

PostHog is an open-source alternative to Mixpanel, Amplitude, and Heap.

Plausible is an open-source alternative to Google Analytics.

| Feature | PostHog | Plausible |
| --- | --- | --- |
| Optimized for | Enabling product analytics in one platform | Web analytics alternative to Google Analytics |
| Written in | Python, TypeScript | Elixir, JavaScript |
| Repo | PostHog/posthog | plausible/analytics |
| Stargazers | 21.2k | 19.9k |
| # of Slack community members | Website-based forum | Website-based forum |
| # of contributors (180d) | 75 | 13 |
| # of weekly commits (180d) | 108 | 16 |
| Maturity | Mature | Mature |
| License | MIT | AGPLv3 |

Code-first data apps

Plotly Dash

Developed by Plotly, Dash is designed for creating analytical web applications. It's Python-based and heavily integrates with Plotly for visualization, which makes it ideal for building highly interactive, data-driven applications. Dash is particularly popular among data scientists who are familiar with Python and want to create complex, customizable web apps without extensive web development experience.
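
A minimal, hedged Dash sketch: a single-page app rendering one Plotly chart from illustrative data.

from dash import Dash, dcc, html
import plotly.express as px

app = Dash(__name__)

# A simple bar chart rendered by Plotly inside a Dash layout.
fig = px.bar(x=["Mon", "Tue", "Wed"], y=[120, 90, 150], labels={"x": "day", "y": "signups"})
app.layout = html.Div([html.H1("Signups by day"), dcc.Graph(figure=fig)])

if __name__ == "__main__":
    app.run(debug=True)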

Streamlit

Streamlit is another Python-based tool that allows for the rapid creation of web apps specifically for data science and machine learning projects. Its main selling point is its simplicity and efficiency, allowing developers to turn data scripts into shareable web apps using straightforward Python code. Streamlit was recently acquired by Snowflake and is getting increasingly integrated into the Snowflake ecosystem.
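
A minimal Streamlit sketch showing how little code a working data app takes (run with "streamlit run app.py"); the data is illustrative:

import pandas as pd
import streamlit as st

st.title("Signups by day")

df = pd.DataFrame({"day": ["Mon", "Tue", "Wed"], "signups": [120, 90, 150]})

# Interactive widgets re-run the script on change -- Streamlit's core model.
min_signups = st.slider("Minimum signups", 0, 200, 0)
st.bar_chart(df[df["signups"] >= min_signups].set_index("day"))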

Observable

Observable is a new entrant in the data app world offering a framework for creating interactive visualizations and narratives using Markdown, JavaScript, SQL, etc.: “You write simple Markdown pages — with interactive charts and inputs in reactive JavaScript, and with data snapshots generated by loaders in any programming language (SQL, Python, R, and more) — and Observable Framework compiles it into a static site with instant page loads for a great user experience.”

| Feature | Plotly Dash | Streamlit | Observable Framework |
| --- | --- | --- | --- |
| Primary interface | Python | Python | JavaScript |
| Optimized for | Building interactive analytical web apps | Fast prototyping of data apps | Building data narratives with Markdown + code |
| Written in | Python | TypeScript, Python | JavaScript/TypeScript |
| Repo | plotly/dash | streamlit/streamlit | observablehq/framework |
| Stargazers | 21.3k | 34.9k | 2.4k |
| # of Slack community members | Website-based forum | Website-based forum | 2.3k |
| # of contributors (180d) | 18 | 38 | 20 |
| # of weekly commits (180d) | 11 | 17 | 10 |
| Maturity | Mature | Mature | Promising |
| License | MIT | Apache 2.0 | ISC |

Additional tools:

Cube

Unlike Dash and Streamlit, Cube is not directly a tool for building applications but a headless BI platform that acts as an analytics API layer between your databases and applications. It is designed to provide a consistent way to access data for analytics and visualizations across different client-side frameworks (including Dash and Streamlit), making it a powerful backend for data applications.

Putting it all together

When I set out to write this post, I didn't have a clear answer as to whether a 100% open-source data stack could be as effective as one assembled without the open-source constraint. But after quite a bit of research and tinkering, I can say "YES", with the caveat that everyone should still choose open source for the right reasons and be mindful of the implications for support, hosting, maintenance, and development.

An example open-source data workflow