Data Engineering Resources

Learning data engineering means mastering many different skills and technologies, which can feel quite daunting. Luckily, whether you prefer to learn from books, practical exercises, talks or blog posts, you’ll find a lot to chew on below. This post is a curated list of the data engineering resources I personally found most useful for learning the field.

Books
Blog Posts
Talks
Exercises
Podcasts
Projects
Papers
Others

Books

Designing Data-Intensive Applications, Martin Kleppmann. This 2017 book would be my recommendation if you could only choose one resource from this entire post. Instead of focusing on specific technologies, Martin discusses every aspect of distributed data systems from first principles, painting a coherent picture of the entire Big Data landscape.
Big Data: Principles and best practices of scalable realtime data systems, Nathan Marz. This book could just as well have been called “The Lambda Architecture”, as that is its main focus. It is nonetheless a fundamental architecture to grasp, and the book explains it in a thorough and well-organized manner.
High Performance Spark, Holden Karau & Rachel Warren. I believe this book is worth reading and keeping around for anyone working with Apache Spark, as it is full of tips on the performance of your jobs, as well as insights into Spark’s internals.
Hadoop: The Definitive Guide, Tom White. If you’re a professional Data Engineer working with the Hadoop stack, this is a book to keep around. It is quite a comprehensive overview of the ecosystem, written by a long-time Hadoop contributor.

Blog Posts

The Rise of the Data Engineer, The Downfall of the Data Engineer and Functional Data Engineering, Maxime Beauchemin. Three must-read posts, full of clarity and insight, from the creator of Apache Airflow. The first is a great description of the field, and the second is its less rosy counterpart. The third is a great overview of how to write ETL with a functional taste.
A Beginner’s Guide to Data Engineering, part 1, part 2, and part 3, Robert Chang. A three-post series in which the author goes from describing ETL best practices to explaining how to build ETL frameworks. While these posts focus heavily on Airflow, I believe they are a nice read regardless of which ETL tools you use.
Questioning the Lambda Architecture, Jay Kreps. The original post that contests the Lambda Architecture and proposes an alternative: “Maybe we could call this the Kappa Architecture, though it may be too simple of an idea to merit a Greek letter”.
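
To give a feel for what ETL “with a functional taste” can look like, here is a minimal Python sketch. It assumes the usual reading of the idea: a task is a pure function of its input partition and overwrites its whole output partition, so re-running it is safe. The in-memory `warehouse` dictionary, the table names and the `run_task` helper are all hypothetical and are not taken from the posts above.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical in-memory "warehouse": table name -> partition key -> rows.
Warehouse = Dict[str, Dict[str, List[dict]]]


def daily_revenue(events: List[dict]) -> List[dict]:
    """Pure transform: the output depends only on the input rows."""
    totals = defaultdict(float)
    for event in events:
        totals[event["country"]] += event["amount"]
    return [{"country": c, "revenue": t} for c, t in sorted(totals.items())]


def run_task(warehouse: Warehouse, ds: str) -> None:
    """Rebuild one target partition from one source partition.

    The target partition is overwritten as a whole, never appended to,
    so running the task twice for the same date leaves the same result.
    """
    source = warehouse["events"].get(ds, [])
    warehouse.setdefault("daily_revenue", {})[ds] = daily_revenue(source)


warehouse: Warehouse = {
    "events": {
        "2019-01-01": [
            {"country": "PT", "amount": 10.0},
            {"country": "PT", "amount": 5.0},
            {"country": "US", "amount": 7.5},
        ]
    }
}

run_task(warehouse, "2019-01-01")
first_run = list(warehouse["daily_revenue"]["2019-01-01"])

run_task(warehouse, "2019-01-01")  # simulate a retry or a backfill
second_run = warehouse["daily_revenue"]["2019-01-01"]
```

Because the task overwrites the partition instead of appending to it, `first_run` and `second_run` are identical, which is what makes retries and backfills predictable.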

Talks

All talks by Martin Kleppmann. I find that Martin’s talks mix his industry experience at Internet companies (LinkedIn) with relevant academic concepts in a way that offers a very fundamental perspective on the topics he covers. A good one to start with is Turning the database inside out with Apache Samza.

Exercises

LeetCode. If getting your hands dirty is your cup of tea, the LeetCode Database section is a great place to brush up on your SQL skills. The Algorithms section is also good for honing your coding skills in Python, Scala, Java or various other languages.
HackerRank. HackerRank is another useful resource to improve SQL and coding proficiency.

Podcasts

Data Engineering Podcast. This podcast started in 2017 and is going strong, covering many topics in the data engineering space.
Software Engineering Daily. This one isn’t specific to data engineering, but has many great episodes on the topic. Searching for something like big data will bring up many interesting interviews.
The Airflow Podcast. A podcast by Astronomer that goes deep into Apache Airflow’s capabilities in an eight-episode series.

Projects

This section lists some open-source projects that I came across and thought were worth keeping an eye on:

Apache Arrow. Arrow is a project under quite active development that specifies a language-agnostic format for in-memory columnar storage. It looks like projects such as Spark intend to take advantage of Apache Arrow, with promising performance gains.
Iceberg. An interesting project open-sourced by Netflix, which defines a new table format for Hive/Presto/Spark. It promises benefits such as the support of snapshot isolation with atomic updates, and improved performance on S3 by removing the need to walk partition paths to find files.
Scio. Another interesting project, this one by Spotify, which provides a Scala API similar to Spark’s for writing Apache Beam / Google Cloud Dataflow jobs. This looks especially useful if you find yourself migrating existing Spark applications to Dataflow on GCP.
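
To make the “in-memory columnar storage” idea from the Apache Arrow entry a bit more concrete, here is a small plain-Python sketch. It does not use Arrow’s API or its actual memory format. It only contrasts a row-oriented layout with a column-oriented one, using the standard library’s `array` module as a stand-in for contiguous typed buffers. The sample data and variable names are made up for illustration.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 3.0},
    {"user_id": 3, "amount": 7.25},
]

# Column-oriented layout: one contiguous, typed buffer per column.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),  # 64-bit integers
    "amount": array("d", (r["amount"] for r in rows)),    # 64-bit floats
}

# Aggregating one column touches every record in the row layout...
total_from_rows = sum(r["amount"] for r in rows)

# ...but only a single compact buffer in the columnar layout.
total_from_columns = sum(columns["amount"])
```

Both totals are equal. The difference is that the columnar version scans one tightly packed buffer of floats, which is the kind of access pattern analytical engines tend to benefit from.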

Papers

For my list of the most influential data engineering research papers, check out this other post: Classic Big Data Papers.

Others

Data Engineering Weekly. If you like to be on top of every development in the world of Data Engineering, this may be a good mailing list to subscribe to, or to look around in the website’s archives. I personally enjoy receiving the Monday morning email with the latest movements in the industry.

I’ll try to keep updating this list of resources whenever I find something interesting. Let me know if I’m missing something and have fun learning!

Diogo Franco

I love data, distributed systems, machine learning, code and science!