Book Digest: 97 Things Every Data Engineer Should Know
Whether you're looking to find which piece from the book is most interesting for you, or you're cheating at book club, this digest of 97 Things Every Data Engineer Should Know has you covered.
In the summer of 2021, Datafold founder and CEO Gleb Mezhanskiy contributed to the book 97 Things Every Data Engineer Should Know. Since then, we’ve heard that some data engineers are getting together and reading it book-club style, sharing thoughts on a different chapter at each session.
If you’ve bought the book and haven’t had a chance to read through it, or you’re considering getting the book but aren’t sure which sections will be the most interesting for you, this blog is for you! Here is a quick digest of each chapter so that you know what to expect, and which sections to flip to when you get the book. (Or you can use it to cheat at book club, we won’t tell.)
Things 1-10
1: A (Book) Case for Eventual Consistency
Written by: Denise Koessler Gosnell, PhD
Using the example of owning a bookstore (and, eventually, a bookstore chain), Gosnell makes the case for why the eventual consistency model works.
“Swapping to an eventually consistent system with read repair to address inconsistencies will keep your customers from the peril of long lines at the register. And it will keep your architects feeling less stress about protecting the availability of your master register’s database.”
2: A/B and How to Be
Written by: Sonia Mehta
Mehta explains how to conduct A/B testing, as well as the value of A/A testing to ensure instrumentation and logging are correct. She also prepares you for the constant questions in your future.
“Expect most experiments to fail. [...] Knowing this, expect a lot of questions! Expect questions around the accuracy of the instrumentation and logging. When you see a dramatic improvement with an experiment, be skeptical.”
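As a rough illustration of the kind of sanity check Mehta describes, here is a minimal two-proportion z-test in Python. The variant names and conversion counts are invented for the example, not taken from the book.

```python
import math

def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return the z-statistic comparing two conversion rates."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled conversion rate under the null hypothesis of "no difference".
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / se

# A/A test: both arms see the same experience, so a large |z| here points at
# broken instrumentation or logging rather than a real effect.
print(two_proportion_z(480, 10_000, 495, 10_000))  # ~0.5, i.e., nothing suspicious
```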
3: About the Storage Layer
Written by: Julien Le Dem
Understanding the storage layer will make you a better data engineer, even if you only look at it once, according to Le Dem. Once you know how the trade-off between I/O cost and increasing CPU cost works, you can pick the approach that works best for you.
“Those implementation details are usually hidden behind what is commonly known as pushdowns. These push query operations into the storage layer to minimize the cost of loading the data. They come in a few flavors.”
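To make the pushdown idea concrete, here is a small sketch using pyarrow, one of several libraries that expose this behavior; the file name and column names are invented, and exact argument support varies by pyarrow version.

```python
import pyarrow.parquet as pq

# Projection pushdown: only the listed columns are read from storage.
# Predicate pushdown: row groups whose statistics rule out the filter can be
# skipped entirely, trading a little CPU for much less I/O.
table = pq.read_table(
    "events.parquet",                  # hypothetical file
    columns=["user_id", "amount"],     # projection pushdown
    filters=[("country", "=", "NL")],  # predicate pushdown
)
print(table.num_rows)
```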
4: Analytics as the Secret Glue for Microservice Architectures
Written by: Elias Nema
Company-wide analytics can be complicated but necessary. Nema highlights that a feature that increases one team’s KPI might lower another team’s performance. Data can help teams ensure they aren’t all pulling in different directions, resulting in lots of movement with no progress.
“That’s why, when stepping onto the path of microservices, a company-wide analytical and experimentation culture should be among the prerequisites, not an afterthought. A rich analytical platform can become the glue that connects separate elements of a system.”
5: Automate Your Infrastructure
Written by: Christiano Anderson
With so many components going into a full data pipeline, Anderson argues that data engineers need to learn how to automate their code. He provides guidelines, including making things modular, using a version control system, and of course testing code before applying changes.
“The time and effort required will be worthwhile: you will have full control of your infrastructure, and it will enable you to deploy a brand-new data pipeline in minutes, by just executing your infrastructure code.”
6: Automate Your Pipeline Tests
Written by: Tom White
White highlights the value in treating data engineering like software engineering to write well-factored, reliable, and robust pipelines.
“Data files should be diff-able, so you can quickly see what’s happening when a test fails. You can check the input and expected outputs into version control and track changes over time.”
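A minimal sketch of the kind of diff-able pipeline test White describes, written with pandas and runnable under pytest; the transform function and fixture values are placeholders for your own pipeline.

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Toy pipeline step: keep completed orders and compute revenue."""
    out = df[df["status"] == "completed"].copy()
    out["revenue"] = out["quantity"] * out["unit_price"]
    return out[["order_id", "revenue"]]

def test_transform_matches_expected():
    # Small, human-readable fixtures checked into version control make
    # failures easy to diff when the pipeline's behavior changes.
    input_df = pd.DataFrame({
        "order_id": [1, 2],
        "status": ["completed", "cancelled"],
        "quantity": [2, 5],
        "unit_price": [10.0, 3.0],
    })
    expected = pd.DataFrame({"order_id": [1], "revenue": [20.0]})
    result = transform(input_df).reset_index(drop=True)
    pd.testing.assert_frame_equal(result, expected)
```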
7: Be Intentional About the Batching Model in Your Data Pipelines
Written by: Raghotham Murthy
When ingesting data records in batches, you’ll need to decide how to create the batches over a period of time. Murthy discusses the pros and cons of the Data Time Window (DTW) batching model and Arrival Time Window (ATW) batching model, plus how to combine the two.
“This example shows that the trade-offs around completeness and latency requirements can be incorporated into the same data pipeline. Analysts can then make an informed decision on when to perform the analysis, before or after closing books.”
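A rough sketch of the distinction, assuming each record carries both an event timestamp and an arrival timestamp (the field names are invented): Data Time Window batching groups by when the event happened, Arrival Time Window batching by when it reached the pipeline.

```python
from collections import defaultdict
from datetime import datetime

records = [
    # The second record is generated just before midnight but arrives a day late.
    {"event_time": "2021-06-01T23:50:00", "arrival_time": "2021-06-01T23:55:00"},
    {"event_time": "2021-06-01T23:58:00", "arrival_time": "2021-06-02T09:00:00"},
]

def day(ts: str) -> str:
    return datetime.fromisoformat(ts).date().isoformat()

dtw_batches = defaultdict(list)  # DTW: complete batches, but you must wait for stragglers
atw_batches = defaultdict(list)  # ATW: low latency, but late data lands in a later batch
for r in records:
    dtw_batches[day(r["event_time"])].append(r)
    atw_batches[day(r["arrival_time"])].append(r)

print(sorted(dtw_batches), sorted(atw_batches))
# ['2021-06-01'] vs ['2021-06-01', '2021-06-02']
```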
8: Beware of Silver-Bullet Syndrome
Written by: Thomas Nield
Anyone who is passionate about their favorite tools or technology needs to read the wake-up call from Nield. He showcases how this can actually work against us, particularly when we tie our professional identity to a specific tool or approach.
“Do you really want your professional identity to be simply a tool stack? [...] Build your professional identity on skills, problem-solving, and adaptability - not a fleeting technology.”
9: Building a Career as a Data Engineer
Written by: Vijay Kiran
For those looking to start or expand their career in data engineering, Kiran shares three standout skills that can give you a head start: solid experience in the software development life cycle, knowledge of SQL, and specialization in data engineering sub-roles like data processing or analytics.
“Data engineering encompasses many overlapping disciplines. It is hard to chart a single route to becoming a data engineer. [...] Strong data engineers on my team have joined from roles as diverse as sales, operations, and even marketing.”
10: Business Dashboards for Data Pipelines
Written by: Valliappa (Lak) Lakshmanan
Building on the premise of “Show them the data; they’ll tell you when it’s wrong,” Lakshmanan explains that giving people a visual representation of how business data flows through your data pipeline can actually help you improve that pipeline.
“People are drawn to real-time dashboards like cats are to catnip. The day that those outlier values are produced for some reason other than malfunctioning equipment, someone will call you and let you know.”
Things 11-20
11: Caution: Data Science Projects Can Turn into the Emperor’s New Clothes
Written by: Shweta Katre
As companies rush to develop predictive models and algorithms to “win” in an increasingly competitive, data-driven world, data science projects are exploding. However, not every project results in beautiful output, which is why Katre explains how to avoid these embarrassing moments.
“Roughly 80% of project time is spent on data collection/selection, data preparation, and exploratory analysis. [...] The promised predictive model or algorithm is not revealed in the early or even middle stages. Sometimes, in the evaluation or validation stage, the possibility arises of scrapping the entire analysis and going back to the drawing board.”
12: Change Data Capture
Written by: Raghotham Murthy
When you want to analyze your most valuable data in your production databases without adding more load to those production databases, you’ll probably rely on a data warehouse or data lake. But reliably replicating that data can be tricky at scale, which is what Change Data Capture (CDC) solves.
“In CDC, a tool reads this write-ahead log and applies the changes to the data warehouse. This technique is a lot more robust than batch exports of the tables and has a low footprint on the production database.”
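As a sketch of the mechanics, consider a simplified change event of the kind a CDC tool might emit after reading the write-ahead log, applied as an upsert to the warehouse copy. The event shape, table, and values here are illustrative, not any specific tool’s format.

```python
# Simplified change event, roughly in the spirit of what CDC tools emit.
change_event = {
    "table": "customers",
    "op": "u",                      # c = insert, u = update, d = delete
    "key": {"id": 42},
    "after": {"id": 42, "email": "new@example.com"},
}

# Stand-in for the warehouse copy of the table, keyed by (table, primary key).
warehouse = {("customers", 42): {"id": 42, "email": "old@example.com"}}

def apply_change(event, store):
    key = (event["table"], event["key"]["id"])
    if event["op"] == "d":
        store.pop(key, None)         # delete
    else:
        store[key] = event["after"]  # insert or update (upsert)

apply_change(change_event, warehouse)
print(warehouse)
```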
13: Column Names as Contracts
Written by: Emily Riederer
Because data tables exist in a weird middle place, neither engineered like a service nor designed like an application, engineers are often confused about why users aren’t satisfied and data consumers can end up frustrated that the data is never quite right. But a shared vocabulary for naming fields could help.
“Of course, no single, silver-bullet solution exists for data quality, discoverability, and communication. But using column names to form contracts is a useful way to start communicating both with your users and your workflow tools.”
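A minimal sketch of what such a contract check might look like, assuming a made-up vocabulary in which prefixes encode the promise each column makes (identifiers, dates, numeric measures, indicators).

```python
# Hypothetical controlled vocabulary: column prefix -> the contract it implies.
VOCAB = {
    "ID": "identifier",
    "DT": "date",
    "N": "numeric measure",
    "IND": "0/1 indicator",
}

def check_column_names(columns):
    """Flag columns whose names don't start with an agreed-upon prefix."""
    violations = []
    for col in columns:
        prefix = col.split("_", 1)[0]
        if prefix not in VOCAB:
            violations.append(col)
    return violations

print(check_column_names(["ID_customer", "DT_order", "total_amount"]))
# ['total_amount'] -- rename to something like N_total_amount to honor the contract.
```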
14: Consensual, Privacy-Aware Data Collection
Written by: Katharine Jarmul
GDPR and data privacy are increasingly hot topics in data that can impact how data practitioners collect and move data. Jarmul breaks down some ideas to get started, particularly related to consent metadata, data provenance, and dropping fields.
“We should apply data-protection mechanisms when we know the sensitive fields our data may contain. Do we really need usernames for aggregate analytics? Nope? Then drop them (or don’t collect them in the first place). Do we need email addresses for our chatbot training? Nope? Then make sure that they don’t make it into the model.”
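A small sketch of the “drop what you don’t need” idea, with invented field names; a real pipeline would also record consent metadata and provenance alongside each record.

```python
import hashlib

# Fields the analytics use case actually needs; everything else is dropped.
ALLOWED_FIELDS = {"event_type", "timestamp", "country"}

def scrub(record: dict) -> dict:
    cleaned = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    # If a join key is still required, keep a one-way hash rather than the raw username.
    if "username" in record:
        cleaned["user_hash"] = hashlib.sha256(record["username"].encode()).hexdigest()
    return cleaned

print(scrub({
    "username": "jdoe",
    "email": "jdoe@example.com",   # never needed downstream, so never kept
    "event_type": "click",
    "timestamp": "2021-06-01T12:00:00",
    "country": "DE",
}))
```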
15: Cultivate Good Working Relationships with Data Consumers
Written by: Ido Shlomo
As companies become more data-driven, it’s increasingly likely that your data consumers will have some sense of how to work with data. But Shlomo argues that data engineers shouldn’t just hand over ownership of the data to the consumers as that can compromise the data and the relationships you need to build.
“Put a premium on knowing what data consumers actually do. Data consumers rely on data infrastructure to do their respective jobs. Their level of comfort, productivity, and adoption depends on the fit between that infrastructure and the dynamics of their work.”
16: Data Engineering != Spark
Written by: Jesse Anderson
Anderson explains that you need more for your data pipeline than just Apache Spark. In fact, you’ll need three general types of technologies to create a data pipeline: Computation, Storage, and Messaging.
“Put another way:
Data Engineering = Computation + Storage + Messaging + Coding + Architecture + Domain Knowledge + Use Cases”
17: Data Engineering for Autonomy and Rapid Innovation
Written by Jeff Magnusson
Data engineering, and by extension data pipelines, is often considered complex and specialized, outside the domain of other teams in the organization. However, data-flow logic often requires input from the teams requesting the work, which Magnusson says can cause issues that can be overcome with specific strategies.
“Data engineers specialize in implementing data-flow logic, but often must implement other logic to spec based on the desires or needs of the team requesting the work, and without the ability to autonomously adjust those requirements. [...] Instead, look for ways to decouple data-flow logic from other forms of logic within the pipeline.”
18: Data Engineering from a Data Scientist’s Perspective
Written by: Bill Franks
People have been working on ingestion and management of data for decades, but data engineering is a fairly new role. Franks explains this is because for a long time, a small number of mature tools interfaced with mature data repositories, making life simple. But as the scale and variety of data have expanded, new tools are needed, and because these aren’t as mature, creative data engineers are required to piece it all together.
“It takes hard work to get the pieces of a data pipeline to work together efficiently and securely, and it often requires more energy input and complexity than should be needed. [...] Data engineers must focus on integration and optimization across tools and platforms as opposed to optimizing workloads within a given tool or platform.”
19: Data Pipeline Design Patterns for Reusability and Extensibility
Written by: Mukul Sood
When designing data pipelines, data engineers can see similarities in problems, which is an indicator of common themes. That’s when it’s possible to start thinking in terms of design patterns, which Sood defines as creational or structural patterns, behavioral patterns, and the facade pattern.
“When we combine these patterns, we realize the benefits of design principles, as the separation of responsibilities allows the modularization of the code-base at multiple levels. [...] This provides building blocks for generalizing the data pipelines to move away from writing custom code to more generic modules, templates, and configuration.”
20: Data Quality for Data Engineers
Written by: Katharine Jarmul
Drawing parallels with “real-world” pipelines for oil or water, Jarmul urges data engineers to focus on the quality of what’s running through their data pipelines. It doesn’t matter how much data is collected each day if it’s essentially useless for the necessary tasks.
“Take the time to determine what quality and validation measurements make sense for your data source and destination, and set up ways to ensure that you can meet those standards. Not only will your data science and business teams thank you for the increase in data quality and utility, but you can also feel proud of your title: engineer.”
Things 21-30
21: Data Security for Data Engineers
Written by: Katharine Jarmul
Data engineers are often managing the most valuable company resource, so it makes sense to learn and apply security engineering to the data. Jarmul gives some helpful starting points and advice if this has suddenly appeared on your radar.
“Scheduling a regular security sprint into your planning is a great way to stay on top of these issues and improve security over time. When faced with those questions again, you and your team can respond with ease of mind, knowing your data-engineering workflows are secure.”
22: Data Validation is More Than Summary Statistics
Written by: Emily Riederer
While data-quality management is critical, approaches can vary widely. Riederer highlights that summary statistics and anomaly detection without context ignore the nuance and can cause engineers to waste precious time on pernicious errors.
“Defining context-enriched business rules as checks on data quality can complement statistical approaches to data validation by encoding domain knowledge. [...] While these checks can still be simple arithmetic, they encode a level of intuition that no autonomous approach could hope to find and ask questions that might be more representative of the way our ETL process could break.”
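A sketch of the difference, with invented column names: the first check is a context-free summary statistic, while the second encodes a business rule that a purely statistical monitor would not know to ask about.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-06-01", "2021-06-02", "2021-06-03"]),
    "ship_date": pd.to_datetime(["2021-06-02", "2021-06-01", "2021-06-05"]),
    "amount": [120.0, 80.0, 95.0],
})

# Context-free summary statistics: nothing here looks anomalous.
print(orders["amount"].describe())

# Context-enriched business rule: an order can never ship before it was placed.
violations = orders[orders["ship_date"] < orders["order_date"]]
print(f"{len(violations)} orders violate the ship-after-order rule")  # 1
```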
23: Data Warehouses Are the Past, Present, and Future
Written by: James Densmore
Even as data warehouses are long prophesied to disappear, their value has endured. Densmore explores the evolution of the data warehouse and how they will continue to thrive in the future.
“It’s now better to focus on extracting data and loading it into a data warehouse, and then performing the necessary transformations. With ELT, data engineers can focus on the extract and load steps, while analysts can utilize SQL to transform the data that’s been ingested for reporting and analysis.”
24: Defining and Managing Messaging in Log-Centric Architectures
Written by: Boris Lublinsky
With messaging systems changing how data is exposed, message definitions become more important. Logs are becoming a centerpiece of the architecture, which encourages the creation of canonical data models.
“Ideally, this canonical data model should never change, but in reality it does. [...] A solution to this problem, similar to API versioning best practices, is creation of a new deployment of the service with a new topic. [...] This approach to versioning allows functionality to evolve independently without breaking the rest of the system.”
25: Demystify the Source and Illuminate the Data Pipeline
Written by: Meghan Kwartler
Good data documentation can be rare. Kwartler gives excellent first steps to take when working on a new project, team, or company to understand the data and set up a solid foundation. With tips on how to investigate where data originates as well as how to examine the metadata, you’ll truly understand the data you need to work with.
“During your investigation, create documentation if it doesn’t exist. [...] It can be simple and yet make a strong impact at the same time. Pay it forward and help the next person who will be tasked with using data to benefit the business.”
26: Develop Communities, Not Just Code
Written by: Emily Riederer
Data engineering job descriptions and value are usually based on building a data pipeline. However, developing a data culture along with your data products can go a long way and improve the impact that you have on your organization.
“Your data is poised to be a shared platform that can connect a network of users with common objectives. Empowered users not only will derive more value from your data, but also may be able to self-service or crowd-source more of their data needs instead of continuing to rely on a data-engineering team for ad hoc requests.”
27: Effective Data Engineering in the Cloud World
Written by: Dipti Borkar
Moving from on-prem to the cloud has turned data engineers from specialists focused purely on data infrastructure into almost full-stack engineers. Skills are needed across compute, containers, storage, data movement, performance, and network.
“After data lands in the enterprise, it should not be copied around except, of course, for backup, recovery, and disaster-recovery scenarios. How to make this data accessible to as many business units, data scientists, and analysts as possible with as few new copies created as possible is the data engineer’s puzzle to solve.”
28: Embrace the Data Lake Architecture
Written by: Vinoth Chandar
While it’s easy in the short term to build a single-stage pipeline to extract data, transform it, and enable other parts of the organization to query the resulting datasets, this doesn’t work when you scale to tera/petabytes.
“The data lake architecture has grown in popularity. In this model, source data is first extracted with little to no transformation into a first set of raw datasets. [...] All data pipelines that express business-specific transformations are then executed on top of these raw datasets.”
29: Embracing Data Silos
Written by: Bin Fan & Amelia Wong
Data engineers frequently blame data silos as the biggest obstacle to extracting value from data efficiently. However, there are a number of reasons why data silos exist. Rather than fight the system, Fan and Wong argue that engineers should embrace it.
“Instead of eliminating silos, we propose leveraging a data orchestration system, which sits between computing frameworks and storage systems, to resolve data access challenges. [...] With a data orchestration system, data engineers can easily access data stored across various storage systems.”
30: Engineering Reproducible Data Science Projects
Written by: Tianhui Michael Li
When data engineers start off in their careers, many dive into “cutting edge” elements of the field. Li advocates for engineers to focus on the foundations to create reproducible results - just like scientists in other fields.
“It’s always smart to begin a data science project with some idea of how it will be put into production. For instance, designing a pipeline that uses the same data format during the research and production phases will prevent bugs and data corruption issues later.”
Things 31-40
31: Five Best Practices for Stable Data Processing
Written by: Christian Lauer
When implementing data processes like ELT or ETL, Lauer highlights the best practices to keep in mind, including how to prevent errors, set fair processing times, use data-quality measurement jobs, ensure transaction security, and consider dependency on other systems.
“You should always keep in mind that data quality is important. Otherwise, you might experience a lack of trust from users and business departments.”
32: Focus on Maintainability and Break Up Those ETL Tasks
Written by: Chris Moradi
With ever more democratization of data, data scientists can do well when they can build their own pipelines. Moradi explains how data scientists and data engineers can adapt their typical ETLs for a full-stack approach.
“Breaking pipelines into small tasks may carry computational costs as the work can’t be optimized across these boundaries. However, we sometimes focus too much on runtime performance when we should instead focus on the speed of innovation that’s enabled.”
33: Friends Don’t Let Friends do Dual-Writes
Written by: Gunnar Morling
Writes to various distributed resources without shared transaction semantics aren’t just error-prone, they can lead to inconsistencies in your data. Morling explains why this is an issue and how to avoid it.
“This is where change data capture (CDC) comes in: it allows users to react to all the changes in a database and transmit them as events to downstream consumers.”
34: Fundamental Knowledge
Written by: Pedro Marcelino
Information and knowledge are doubling faster than ever, which can make deciding what to learn even more difficult, particularly for knowledge workers. Marcelino argues that focusing on fundamentals stands the test of time and helps to make sense of all future information.
“On the one hand, the growth of knowledge leads to an unmanageable need to keep up-to-date with several new concepts, technologies, and frameworks. On the other hand, knowledge is becoming more transient, and what we learn today may be obsolete tomorrow.”
35: Getting the “Structured” Back into SQL
Written by: Elias Nema
In a rather humorous article, Nema takes on one of the few problems that has persisted in computer science for about 50 years: how to write SQL. He argues that it’s necessary to start with structure, optimizing for readability.
“Once you have learned how to get the data you need from different sources and documented it in the form of a readable structure, the query will tell the story of your analysis by itself.”
36: Give Data Products a Frontend with Latent Documentation
Written by: Emily Riederer
As DataOps mirrors DevOps, Riederer argues that data engineers need to also incorporate principles from design and product management. Specifically, she looks at ways data engineers can build latent documentation for a low-cost data frontend.
“Many of the artifacts that data consumers want can be created with little to no incremental effort if engineers embrace latent documentation: systematically documenting their own thought processes and decision making during the engineering process in a way that can be easily shared with and interpreted by users.”
37: How Data Pipelines Evolve
Written by: Chris Heinzmann
ETL can be intimidating, which is why Heinzmann breaks down alternatives and when you might need them.
“Pipelines are constructed to get faster insights into the business from data. This is true no matter the scale. The architecture of the pipeline will depend on the scale of the business as well as how well the company has reached product-market fit.”
38: How to Build your Data Platform like a Product
Written by: Barr Moses and Atul Gupte
Moses and Gupte share best practices to avoid common pitfalls when building the data platform of your dreams. These include aligning your product’s goals with the goals of the business, gaining feedback and buy-in from the right stakeholders, and prioritizing long-term growth and sustainability over short-term gains.
“It doesn’t matter how great your data platform is if you can’t trust your data, but data quality means different things to different stakeholders. Consequently, your data platform won’t be successful if you and your stakeholders aren’t aligned on this definition.”
39: How to Prevent a Data Mutiny
Written by: Sean Knapp
DevOps helped more people build increasingly complex software products faster and more safely, so it only stands to reason that data teams can learn from these lessons. Knapp showcases how modular architectures, declarative configurations, and automated systems can help companies avoid data mutinies.
“These same trends can be leveraged for data teams, providing harmonious balance among otherwise conflicting goals, and helping to prevent data mutinies. In return, data teams gain the benefits of improved productivity, flexibility, and speed to delivery on new, innovative data products.”
40: Know the Value per Byte of Your Data
Written by: Dhruba Borthakur
While data engineers used to love flaunting the size of their datasets, Borthakur argues that a new metric is needed, particularly at large-scale enterprises. Value per byte showcases the value that the enterprise extracts from the data.
“For my multiterabyte dataset, I found that my value per byte is 2.5%. This means that for every 100 bytes of data that I help manage, I’m using only the information stored in 2.5 bytes.”
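The arithmetic behind the quote is simple; here is a hedged illustration with made-up numbers that happen to land on the same 2.5%.

```python
# Hypothetical numbers: a multi-terabyte dataset where only a sliver is ever used.
bytes_managed = 10 * 1024**4          # 10 TiB under management
bytes_actually_used = 0.25 * 1024**4  # ~0.25 TiB actually feeding reports, models, queries

value_per_byte = bytes_actually_used / bytes_managed
print(f"value per byte: {value_per_byte:.1%}")  # value per byte: 2.5%
```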
Things 41-50
41: Know Your Latencies
Written by: Dhruba Borthakur
Most data engineers are familiar with the size of their data, but might be less concerned with the recency of the data and latency of queries. Borthakur now always asks: what is my data latency, what is my query latency, and what are my queries per second?
“The answers to these three questions determine the type of data system you should use. [...] The ETL process adds latencies to your data, and a shorter pipeline means that you get to query your most recent data.”
42: Learn to Use a NoSQL Database, but Not like an RDBMS
Written by: Kirk Kirkconnell
Too many data engineers create a schema in their NoSQL database that is like a relational schema, and then perform what Kirkconnell calls a “naive migration”. This fails to take advantage of the NoSQL database, and can be an expensive mistake.
“The more you try to use a NoSQL database as a general-purpose database, the more you get into the “jack of all trades, master of none” arena that RDBMSs have unfortunately been shoehorned into. For best performance, scalability, and cost, asking questions of your data should be the minority of the requests in OLTP-type NoSQL databases.”
43: Let the Robots Enforce the Rules
Written by: Anthony Burdi
Asking other human beings to follow specific rules and processes, or even read docs, can be emotionally demanding and make life harder for data professionals. Instead, Burdi offers suggestions for ways to automate tasks with some easy validation-robot job description ideas.
“Adding validation everywhere that you can (within reason!) yields two clear benefits: you get cleaner data, and the computer can be the bad guy. Save your emotional reserves for more important conversations.”
44: Listen to Your Users - But Not Too Much
Written by: Amanda Tomlinson
Working with data is becoming ever more complicated, not that Tomlinson needs to tell the data engineers reading this book that! However, she points out that the hard work of managing the data and fulfilling a range of requests isn’t the only part of the job for data practitioners.
“A certain level of satisfaction comes from delivering requirements and closing tickets, but simply churning out solutions creates long-term problems. A fine balance needs to be struck between delivering what your users need and maintaining a sustainable, scalable data function.”
45: Low-Cost Sensors and the Quality of Data
Written by: Dr. Shivanand Prabhoolall Guness
Telling the story of a project involving Raspberry Pis and Arduinos, Guness highlights the importance of having redundancy built into any project, and the value of building towards quality data from day one. Thanks to hardware failures, internet interruptions, and other preventable issues, the project was not nearly as expansive as it could have been.
“Because of the problems we encountered, the usable data from our one year of collection was reduced to a period of five months.”
46: Maintain your Mechanical Sympathy
Written by: Tobias Macey
As you grow in your knowledge and understanding of technology, you’ll inevitably become less connected to the hardware that is churning through your instructions. Macey argues that it’s vital to learn enough about the physical considerations of your requirements to understand their limitations and to know what you don’t know.
“For data engineers, some of the foundational principles that will prove most useful are the different access speeds for networked services, hard drives of various types, and system memory. All of these contribute greatly to the concept of data gravity, which will influence your decisions about how and when to move information between systems, or when to leave it in place and bring your computation to the data.”
47: Metadata ≥ Data
Written by: Jonathan Seidman
Referring back to a period at Orbitz Worldwide, Seidman describes how they ended up with numerous Hive tables that basically represented the same entities. He argues that it’s best to plan your data management strategy early, in parallel with any new data initiative or project.
“Having a data management infrastructure that includes things like metadata management isn’t only critical for allowing users to perform data discovery and make optimal use of your data. It’s also crucial for things like complying with existing and new government regulations around data.”
48: Metadata Services as a Core Component of the Data Platform
Written by: Lohit VijayaRenu
With massive amounts of both structured and unstructured data flooding organizations, data engineers need to prioritize metadata services. VijayaRenu highlights four key elements to evaluate: Discoverability, Security Control, Schema Management, and Application Interface and Service Guarantee.
“As the amount of data and the complexity of its usage has increased, considering a unified metadata service as part of your data platform has become even more important. [...] While no single system may provide all these features, data engineers should consider these requirements while choosing one or more services to build their metadata service.”
49: Mind the Gap: Your Data Lake Provides No ACID Guarantees
Written by: Einat Orr
While data lakes are cost-effective and allow high throughput when ingesting or consuming data, Orr explains there are some persistent challenges when working with them. These include a lack of isolation, no atomicity, no guarantee of cross-collection consistency, no reproducibility, and low manageability.
“First and foremost, we must know not to expect those guarantees. Once you know what you are in for, you will make sure to put in place the guarantees you need, depending on your requirements and contracts you have with your customers who consume data from the data lake.”
50: Modern Metadata for the Modern Data Stack
Written by: Prukalpa Sankar
While the modern data stack is great for speed, scalability, and reduced overhead, it lacks governance, trust, and context. Sankar explains how the need for modern metadata has expanded with the modern data stack. For example, she highlights how traditional data catalogs were built on the premise that tables were the only asset that needed to be managed, whereas today BI dashboards, code snippets, SQL queries, and more are all data assets.
“We’re at an inflection point in metadata management - a shift from slow, on-premises solutions to a new era, built on the principles of embedded collaboration common in modern tools.”
Things 51-60
51: Most Data Problems Are Not Big Data Problems
Written by: Thomas Nield
Depending on the feature needed, or the way your data is structured, you may be best served by embracing a SQL database rather than a NoSQL competitor. Even as data grows, Nield argues that simply dumping big data into a horizontally scalable cluster isn’t always the right approach.
“The truth is, most data problems are not big data problems. Anecdotally, 99.9% of problems I’ve encountered are best solved with a traditional relational database.”
52: Moving from Software Engineering to Data Engineering
Written by: John Salinas
If you’re considering a transition from software engineering to data engineering, this is the article for you. Salinas explains the similarities and differences between the roles, and why he enjoys his work in data engineering.
“All my experience is still relevant, and even more important, so are skills like troubleshooting, scaling enterprise applications, API development, networking, programming, and scripting. With data engineering, you solve problems similar to those in software engineering, but at a larger scale.”
53: Observability for Data Engineers
Written by: Barr Moses
As companies become more data driven, Moses highlights the ways that good data can turn bad. She argues that data observability will continue to follow the principles of software engineering to address data freshness, distribution, volume, schema, and lineage.
“As data leaders increasingly invest in data-reliability solutions that leverage data observability, I anticipate that this field will continue to intersect with other major trends in data engineering, including data meshes, machine learning, cloud data architectures, and the delivery of data products as platforms.”
54: Perfect Is the Enemy of Good
Written by: Bob Haffner
Haffner encourages data engineers not to be haunted by their experiences of rushed implementations or by a natural aversion to risk. While he doesn’t advocate for shortcuts in data, he does promote an agile approach to data engineering.
“If your leaders require three metrics to run the business and you have only one completed, ship it. Driving the organization with some insight is always better than driving with none.”
55: Pipe Dreams
Written by: Scott Haines
In discussing data pipeline architecture, Haines describes how this grew out of the concept of distributed systems - all the components were available, just waiting to be assembled. However, the focus of the article is on message passing, particularly failure in the face of partially processed messages, and how it led to offset tracking and checkpointing.
“Now there was a reliable, highly available, and highly scalable system that could be used for more than just message passing, and this essentially created the foundation of the streaming pipeline architecture.”
56: Preventing the Data Lake Abyss
Written by: Scott Haines
After describing the evolution of data lakes, Haines explains that they can turn into black holes for data, particularly when it comes to legacy data. He argues for data contracts to address this issue, eventually leading to compiled libraries for producing and consuming data.
“Instead of road maps being derailed with long periods of let’s play find the missing data [...], you can get back to having a data lake that is an invaluable shared resource used for any of your data needs: from analytics to data sciences and on into the deep learning horizons.”
57: Prioritizing User Experience in Messaging Systems
Written by: Jowanza Joseph
Drawing upon tangibly awful user experiences such as applying for a loan from a bank with multiple copies of bank statements issued by the same bank, or providing medical history to providers repeatedly during a single doctor’s visit, Joseph makes the case for resource efficiency in messaging systems.
“While initially designed to prevent duplicate collection of medical histories, the system ended up having a far-reaching impact on the hospital system. Not only did the hospital system save money, but the patient experience improved dramatically.”
58: Privacy Is Your Problem
Written by: Stephen Bailey, PhD
Bailey argues that data engineers should be the ones to take ownership of the privacy problem, forging a new path. He highlights that this doesn’t have to be a boring exercise, but rather a chance at an exciting technical challenge.
“Ultimately, data engineers should know the practice of privacy by design as intuitively as they do the principle of least privilege.”
59: QA and All Its Sexiness
Written by: Sonia Mehta
Drawing comparisons with inspecting a home before buying it, Mehta advocates for QA testing data before sending it to production. She explains the types of tests that are ideal, as well as advice on the QA process.
“While implementing rigorous QA standards does not make you immune from blunders, it does help decrease their frequency and improve overall data sentiment. When folks can focus on the information that the data is telling us, organizations can expect to see higher data literacy and innovation.”
60: Seven Things Data Engineers Need to Watch Out for in ML Projects
Written by: Dr. Sandeep Uttamchandani
With 87% of machine learning projects failing, Dr. Uttamchandani highlights the main reasons that can be attributed to data engineering. These include using datasets without proper documentation, metrics without clear definitions, and changing data source schemas, among others.
“I believe ML projects are a team sport involving data engineers, data scientists, statisticians, DataOps/MLOps engineers, and business domain experts. Each player needs to play their role in making the project successful.”
Things 61-70
61: Six Dimensions for Picking an Analytical Data Warehouse
Written by: Gleb Mezhanskiy
The data ecosystem revolves around the data warehouse, which is why picking the right one is vital for any data team. Mezhanskiy highlights top considerations when shopping for a data warehouse, including scalability, price elasticity, interoperability, and speed.
“It is also often the most expensive piece of data infrastructure to replace, so it’s important to choose the right solution and one that can work well for at least seven years. Since analytics is used to power important business decisions, picking the wrong DWH is a sure way to create a costly bottleneck for your business.”
62: Small Files in a Big Data World
Written by: Adi Polak
Small files, those that are significantly smaller than the storage block size, can cause outsized problems. Polak explains why this happens and how to detect and mitigate the issue.
“Be aware of small files when designing data pipelines. Try to avoid them, but know that you can fix them, too!”
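One common mitigation in this space is compaction: rewriting many small files into fewer, larger ones. A minimal sketch with pyarrow, assuming a directory of small Parquet files (the paths are invented); it reads everything into memory, which is fine for modest volumes but would be batched or streamed in a real job.

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Read a directory full of small Parquet files as one logical dataset...
small_files = ds.dataset("raw/events/", format="parquet")

# ...and rewrite it as one larger file with bigger row groups,
# closer to the storage block size the quote warns about.
pq.write_table(
    small_files.to_table(),
    "compacted/events.parquet",
    row_group_size=1_000_000,  # tune toward your storage block size
)
```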
63: Streaming is Different from Batch
Written by: Dean Wampler, PhD
Batch processes can cause potentially long delays, which raises the question of why companies don’t move these analytics to streaming. Wampler explains the challenges of “simply” switching to streaming.
“The solution is to adopt the tools pioneered by the microservices community to keep long-running systems healthy, more resilient against failures, and easier to scale dynamically.”
64: Tardy Data
Written by: Ariel Shaqed
In a play on Shakespeare, Shaqed describes how time-based data can be born late, achieve lateness, or have lateness thrust upon it. His solutions include updating existing data, adding serialized storage arrival time, and ignoring late data, all depending on the circumstances.
“Lateness occurs at all levels of a collection pipeline. Most collection pipelines are distributed, and late data arrives significantly out of order. Lateness is unavoidable; handling it robustly is essential.”
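A tiny sketch of one way to route late data, using an allowed-lateness cutoff relative to arrival time; the field names and the six-hour threshold are invented for illustration.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(hours=6)

def route(event_time: str, arrival_time: str) -> str:
    """Decide what to do with a record based on how late it arrived."""
    lateness = datetime.fromisoformat(arrival_time) - datetime.fromisoformat(event_time)
    if lateness <= ALLOWED_LATENESS:
        return "process normally"
    # Beyond the cutoff: update the already-published data, or park the record
    # with its serialized arrival time for a later backfill.
    return "late path: update existing data or store arrival time for backfill"

print(route("2021-06-01T10:00:00", "2021-06-01T11:00:00"))  # process normally
print(route("2021-06-01T10:00:00", "2021-06-02T10:00:00"))  # late path
```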
65: Tech Should Take a Back Seat for Data Project Success
Written by: Andrew Stevenson
For projects to be successful, Stevenson argues that the biggest challenge is to understand the business context. Too often, organizations try to use technology to overcome these hurdles, giving data engineers incredible technology without providing the necessary context.
“The greatest success and innovation I’ve witnessed occurs when end users (business users, data analysts, data scientists) are given the correct tooling and the access to explore, process, and operate data themselves.”
66: Ten Must-Ask Questions for Data-Engineering Projects
Written by: Haidar Hadi
Before you give an estimate on delivery or begin coding, Hadi provides a checklist of questions to ask. These range from touchpoints and algorithms to due dates and why due dates were set when they were.
“Many projects will be maintained by other people. Ask about their skill level and the kind of documentation they need to operate your data pipeline.”
67: The Data Pipeline Is Not About Speed
Written by: Rustem Feyzkhanov
Data-processing pipelines used to be about speed, but now they’re increasingly about scalability. Once you’ve achieved perfect horizontal scalability, you can return to execution time, this time as a matter of optimizing for cost.
“The emerging opportunity is to design data pipelines to optimize unit costs and enable scalability from the initial phases, allowing transparent communication between data engineers and other stakeholders such as project managers and data scientists.”
68: The Dos and Don’ts of Data Engineering
Written by: Christopher Bergh
Everyone asks, “What did you wish you had known when you started?” Bergh answers with an honest and authentic take on the heroism and burnout that many data engineers face.
“The bottom line here is that methodology and tools matter more than heroism. Automate everything that can be automated, and focus your attention on the creativity that requires a human touch.”
69: The End of ETL as We Know It
Written by: Paul Singman
While ETL is incredibly popular, Singman suggests Intentional Data Transfer (IDT) as an alternative. If ETL exists because people don’t build their user database or content management system with downstream analytics in mind, IDT instead has you add logic to the application code that first processes events.
“Moving from ETL to IDT isn’t a transformation that will happen for all your datasets overnight. [...] My advice is to find a use case that will clearly benefit from real-time data processing and then transition it from ETL to the IDT pattern.”
70: The Haiku Approach to Writing Software
Written by: Mitch Seymour
Drawing upon the deeply evocative Japanese poetry, Seymour advocates for a thoughtful, artistic, and precise approach to software and code. In a surprisingly impactful argument, he showcases how removing unnecessary complexity is difficult but rewarding.
“The haiku approach is to be careful and intentional with early decisions, so that they become a strong foundation for everything that comes after.”
Things 71-80
71: The Hidden Cost of Data Input/Output
Written by: Lohit VijayaRenu
While data engineers use a range of libraries and helper functions to read and write data, there are a few hidden details that you might be missing. VijayaRenu highlights elements around data compression, format, and serialization to consider when optimizing your applications.
“Data engineers should take time to understand them and profile their applications to break down hidden costs associated with I/O.”
72: The Holy War Between Proprietary and Open Source Is a Lie
Written by: Paige Roberts
Data engineers should focus on how to get a project into production most efficiently, not whether the software is open source or proprietary. Sometimes open source software is more flexible with better integrations, but sometimes proprietary software wins in that regard.
“Choose software that works and plays well with others, particularly other software that you already know your project requires. And look at how much of the job a single application can accomplish. If one thing can do multiple parts of your task, that will reduce your integration burden.”
73: The Implications of the CAP Theorem
Written by: Paul Doran
The CAP theorem describes the balancing act data engineers face between consistency, availability, and partition tolerance in distributed data systems. However, it’s only the first step in the considerations that data engineers need to keep in mind.
“The CAP theorem is an important result, but it is only a small part of the picture. It allows us to frame our thinking and understanding of the technologies we use, but it does not allow us to stop thinking about who our customers are and what they need.”
74: The Importance of Data Lineage
Written by: Julien Le Dem
Data lineage is vital for understanding how a dataset was derived from another one. Le Dem explores how operational lineage goes a step further by tracing how and when that transformation happened. This lets data engineers uncover more about privacy, discovery, compliance, and governance.
“As the number of datasets and jobs grows within an organization, these questions quickly become impossible to answer without collecting data-lineage metadata.”
75: The Many Meanings of Missingness
Written by: Emily Riederer
Missing data is a fundamental topic in data management, and preventing and detecting null values is a vital task for many data engineers. However, sometimes there’s more context behind those null values that actually offers additional information.
“Nulls that represent an issue loading data raise concerns about overall data quality; nulls from random sampling can possibly be safely ignored; nulls that represent aspects of user feedback could themselves become features in a model or analysis.”
76: The Six Words That Will Destroy Your Career
Written by: Bartosz Mikulski
The six words Mikulski claims can destroy your credibility and jeopardize your career are, “This number does not look correct.” Once you lose trust, it’s a slippery slope to becoming the company scapegoat.
“The other best practices we can copy come from site reliability engineering (SRE) teams. These include relentless automation and monitoring.”
77: The Three Invaluable Benefits of Open Source for Testing Data Quality
Written by: Tom Baeyens
Even though people know they should test data, they often don’t have the knowledge or best practices to do so. Baeyens argues that open source software is boosting the speed and creativity of data testing through community engagement.
“Open source software is opening up better data testing to those both within and outside the professional developer community. It’s free, inclusive, built for purpose, and open to community-driven innovation - three invaluable benefits to improve data quality across all sectors.”
78: The Three Rs of Data Engineering
Written by: Tobias Macey
Unlike the three Rs of education (reading, writing, arithmetic), Macey’s three Rs actually all start with R. They are reliability, reproducibility, and repeatability. With the most attention given to reliability, all three are vital for any data engineer.
“Reliability can mean a lot of things, but in this context we are talking about characteristics of your data that contribute to a high degree of confidence that the analyses you are performing can be assumed correct.”
79: The Two Types of Data Engineering and Data Engineers
Written by: Jesse Anderson
Anderson distinguishes between SQL-focused data engineering and big data-focused data engineering, with engineers tending to specialize in one or the other. Organizations and people often confuse the two, leading to failures in big data projects.
“While there are SQL interfaces for big data, you need programming skills to get the data into a state that’s queryable. For people who have never programmed before, this is a more difficult learning curve.”
80: The Yin and Yang of Big Data Scalability
Written by: Paul Brebner
Apache Cassandra and Apache Kafka offer exceptional horizontal scalability. However, tuning that scalability and optimizing performance can yield some interesting results.
“However, we found that increasing the number of partitions beyond a critical number significantly reduced the throughput of the Kafka cluster. Subsequent benchmarking revealed that Kafka cluster throughput is maximized with a “sweet spot” number of partitions [...].”
Things 81-90
81: Threading and Concurrency in Data Processing
Written by: Matthew Housley, PhD
In his teardown of the Amazon Kinesis outage of late 2020, Housley highlights the engineering issues related to the limits of thread-based concurrency. These include operating system threading, threading overhead, and solving the C10K problem.
“Cloud native engineers have internalized the idea that scaling and concurrency can solve any problem; tech companies now routinely spin up clusters consisting of thousands of instances, and single servers can manage over 10 million connections.”
82: Three Important Distributed Programming Concepts
Written by: Adi Polak
During the transform (T) step of your ETL or ELT operations, you might be working with data that fits in one machine’s memory. Often, though, you’ll need distributed parallel computation. Polak covers three models for this: the MapReduce algorithm, the distributed shared memory model, and the message-passing/actors model.
“When picking open source solutions for this, check for message guarantees. Can you guarantee that messages will be delivered at most once? At least once? Exactly once? This will influence your system’s operations.”
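As a toy, single-machine illustration of the MapReduce model Polak describes (real frameworks distribute the map, shuffle, and reduce phases across many workers), here is the classic word count; the input lines are made up.

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit (word, 1) pairs from each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all pairs by key, as the framework would do between phases.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate each key's values.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```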
83: Time (Semantics) Won’t Wait
Written by: Marta Paes Moreira and Fabian Hueske
Batch and stream processing differ in their notion of completeness. In batch processing, data is always considered complete, while stream processing needs to reason about the completeness of its input. This can be done with processing time or event time.
“A good illustration of the difference between processing time and event time is the sequence of Star Wars movies: the year each movie was released corresponds to the processing time, and the actual chronological order of the action in the plots to the event time.”
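The Star Wars analogy maps cleanly onto windowing. As a hedged sketch with made-up events, the same records land in different hourly windows depending on which notion of time you group by.

```python
from collections import defaultdict
from datetime import datetime

events = [
    # This event happened at 09:59 but was only processed at 10:03.
    {"event_time": "2021-06-01T09:59:00", "processing_time": "2021-06-01T10:03:00"},
    {"event_time": "2021-06-01T10:01:00", "processing_time": "2021-06-01T10:02:00"},
]

def hour_window(ts: str) -> str:
    return datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")

by_event_time, by_processing_time = defaultdict(int), defaultdict(int)
for e in events:
    by_event_time[hour_window(e["event_time"])] += 1
    by_processing_time[hour_window(e["processing_time"])] += 1

print(dict(by_event_time))       # counts split across the 09:00 and 10:00 windows
print(dict(by_processing_time))  # both events counted in the 10:00 window
```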
84: Tools Don’t Matter, Patterns and Practices Do
Written by: Bas Geerdink
Instead of diving into learning the wide array of tools, frameworks, languages, and engines, Geerdink advises people to focus on concepts, best practices, and techniques. This way, when you find a cool piece of technology or an intriguing term, you can understand the why, how, and what behind it.
“Don’t focus on details such as syntax and configuration too much, but rather keep asking yourself the why questions.”
85: Total Opportunity Cost of Ownership
Written by: Joe Reis
While total cost of ownership (TCO) has been around for decades, Reis raises the concept of total opportunity cost of ownership (TOCO). For example, if you adopted on-prem Hadoop in 2010, you probably felt like it was a smart move, but by 2020 it seemed antiquated.
“Total opportunity cost of ownership (TOCO) is the cost of being captive to technology X and paradigm Y, while no longer benefiting from new technologies and platforms. It’s the price you pay by going all in on technology X and paradigm Y, and not being able to make the transition to new paradigms.”
86: Understanding the Ways Different Data Domains Solve Problems
Written by: Matthew Seal
Even in the same organization, data silos can exist across teams working on data engineering, machine learning, and data infrastructure. These different pursuits often use different design approaches, which can cause misunderstandings between the teams - something Seal attempts to reconcile.
“What’s important to take from this is not that one group’s focus is better than another’s, but rather that if you’re in one of these groups, you should be aware of your own design bias.”
87: What is a Data Engineer? Clue: We’re Data Science Enablers
Written by: Lewis Gavin
When 80% of a data scientist’s job is cleaning and preparing data, they usually aren’t happy. Gavin argues that this is where data engineers can thrive, providing repeatable, consistent sources of fresh, clean data.
“The efficiencies gained mean that getting to the point of building the model will be quicker, and the resulting models will undoubtedly be better, as the data scientists have more time to spend tweaking and improving them.”
88: What is a Data Mesh, and How Not to Mesh It Up
Written by: Barr Moses and Lior Gavish
Defining the data mesh as the data platform version of microservices, Moses and Gavish envision it as supporting data as a product, with each domain handling its own data pipelines.
“Some of our customers worry that the unforeseen autonomy and democratization of a data mesh introduces new risks related to data discovery and health, as well as data management. [...] Instead of introducing these risks, a data mesh actually mandates scalable, self-serve observability in your data.”
89: What Is Big Data?
Written by: Ami Levin
While the term “big data” has been used often for many years, the clear definition of what we mean by it is elusive. It can mean everything from Hadoop to social media or IoT devices, to unstructured data. Anyone can claim their products or datasets are “big data” because it’s impossible to refute this claim.
“The truth is that there is no such thing as big data. Large-volume, high-speed, varying-source data has always challenged our ability to derive value from it. Nothing fundamental has changed in the last decade to warrant a new term.”
90: What to Do When You Don’t Get Any Credit
Written by: Jesse Anderson
Praise often rains down on data scientists while data engineers live in a praise drought. Realistically, executives don’t care if you’re using the latest technology or have unit tests or even a scalable system; they only care about the business value created by the data - and if something goes wrong.
“We need to get in the habit of talking in language that they understand and care about. We need to explain that the analytics are possible only as a direct result of the data pipelines we’ve created.”
Things 91-97
91: When our Data Science Team Didn’t Produce Value
Written by: Joel Nantais
Even as your team delivers amazing ML and forecasting tools, the perception of the data team can decline if those wins are balanced against rejected ad hoc requests. The data’s usefulness is certainly subjective, but you can influence how your team is perceived.
“Understand how your tools can provide solutions. Balance long-term solutions with short-term needs. Usually, today’s problem matters more than next year’s problem.”
92: When to Avoid the Naive Approach
Written by: Nimrod Parasol
Data engineers are always told to avoid overengineering solutions, to keep things simple. However, when it comes to your data store, you’ll want to avoid complicated and costly data migration projects in the future.
“When you do decide to go with the naive approach, you should know that if you let too much data accumulate before the optimizations, it will be expensive to do them later. In that case, it’s best to mark milestones for revisiting the design before it becomes too late.”
93: When to Be Cautious About Sharing Data
Written by: Thomas Nield
While silos are generally considered bad and even toxic to organizations, sometimes they exist for legitimate reasons. Obviously sensitive data, or data that’s expensive to query, has access control, but Nield reveals another reason why some data should have gatekeepers.
“Sometimes navigating and interpreting specialized data correctly can be difficult, because doing so requires an immense amount of domain knowledge. If you know nothing about revenue forecasting, do you really have any business going through a revenue-forecasting database?”
94: When to Talk and When to Listen
Written by: Steven Finkelstein
In this retrospective of a one-year digitization project, Finkelstein shares the micro- and macro-level obstacles faced along the way. While the project sounds simple enough, the challenges were quite tricky and universal for other organizations.
“Without the constant pushback from our team to reduce the complexity of the project, we would be nowhere near “go live”. The project provided many valuable lessons. Aiming for simplicity is essential in all discussions; however, understanding when it is our time to talk versus listen is extremely subjective.”
95: Why Data Science Teams Need Generalists, Not Specialists
Written by: Eric Colson
While Adam Smith argued for the productivity gains of specialization, Colson advocates for a general model for faster coordination and iteration. This also speeds up learning and development.
“Data scientists can be shielded from the inner workings of containerization, distributed processing, and automatic failover. This allows the data scientists to focus more on the science side of things, learning and developing solutions through iteration.”
96: With Great Data Comes Great Responsibility
Written by: Lohit VijayaRenu
Even as data-driven applications improve the user experience, VijayaRenu reminds data engineers that user information can be exposed to a wider range of applications and users. As such, data engineers have a responsibility to consider user privacy, ethics, and security.
“It is now more important than ever before for organizations to have a holistic view of how user information is used, what is derived from it, and who has access to this information.”
97: Your Data Tests Failed! Now What?
Written by: Sam Bail, PhD
It’s wonderful to build a data-quality strategy including data tests, but what do you do when those tests fail? Bail evaluates system response, logging and alerting, alert response, stakeholder communication, root cause identification, and issue resolution.
“I recommend not taking test failures at face value. For example, a test for null values in a column could fail because some rows have actual null values or because that column no longer exists. It makes sense to dig into the cause of an issue only after it’s clear what that issue actually is!”
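A small sketch of the distinction Bail draws at the end of the quote, with an invented table: the same “null check failed” symptom can have two very different root causes.

```python
import pandas as pd

def diagnose_null_failure(df: pd.DataFrame, column: str) -> str:
    """Separate 'the column is gone' from 'the column has real null values'."""
    if column not in df.columns:
        return f"schema change: column '{column}' no longer exists"
    null_count = int(df[column].isna().sum())
    if null_count > 0:
        return f"data issue: {null_count} null value(s) in '{column}'"
    return "test passes"

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, None, 12]})
print(diagnose_null_failure(orders, "customer_id"))  # data issue: 1 null value(s)...
print(diagnose_null_failure(orders, "email"))        # schema change: column missing
```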