Classic Big Data Papers

This post lists some of the classic big data papers that I believe have had the most impact and influence on the data engineering field, in chronological order:

The Google File System

The first paper in this list (2003) is also one of the most foundational for the field - The Google File System introduced GFS, which later inspired important open-source efforts such as HDFS. The authors start by laying out design goals for the system that differ from those of previous distributed file systems. While GFS shares some common goals - scalability, reliability and availability - the paper discusses new assumptions that guided its development:

  • Component failures are the norm rather than the exception.
  • The system only needs to store a modest number of files (a few million), but must handle large files (100MB+) efficiently.
  • Random writes within a file are practically non-existent; most operations consist of large sequential appends and streaming reads.
  • Well-defined, atomic semantics for multiple clients appending concurrently to the same file are a must.
  • High bandwidth is more desirable than low latency for workloads on GFS.

The paper goes on to discuss the file system’s single-master/many-chunkservers architecture, and how client communication with the master is minimized so that the latter doesn’t become a bottleneck. It also describes the metadata stored by the master - namespaces and file-to-chunk mappings are stored in memory and logged to disk, while chunk locations are polled from chunkservers on startup. Moreover, GFS’s consistency model is discussed at some length, as well as the mechanisms for snapshots, chunk replication, garbage collection and stale replica detection.
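
As a rough illustration of that split of responsibilities, here is a minimal Python sketch (hypothetical names, not GFS code) of the kind of state a GFS-style master keeps: the namespace and file-to-chunk mapping are made durable through the operation log, while chunk locations are soft state learned from the chunkservers themselves:

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed-size 64MB chunks

@dataclass
class MasterState:
    # Namespace and file-to-chunk mapping: kept in memory and made durable
    # through an operation log plus periodic checkpoints.
    namespace: dict = field(default_factory=dict)        # path -> file metadata
    file_chunks: dict = field(default_factory=dict)      # path -> [chunk_handle, ...]
    # Chunk locations are soft state: not persisted, but polled from
    # chunkservers at startup and refreshed through heartbeat messages.
    chunk_locations: dict = field(default_factory=dict)  # chunk_handle -> [chunkserver, ...]

    def lookup(self, path: str, offset: int):
        """A client asks the master only for a chunk handle and its replica
        locations; all actual data traffic then goes directly to chunkservers."""
        handle = self.file_chunks[path][offset // CHUNK_SIZE]
        return handle, self.chunk_locations.get(handle, [])
```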

Lastly, some performance metrics under different workloads are presented, as well as a quite interesting discussion of various technical and operational issues faced when developing GFS.

MapReduce: Simplified Data Processing on Large Clusters

The renowned 2004 MapReduce paper introduced a new computational model to process large datasets on clusters of commodity machines at Google. This model abstracts away the complexities of distributed systems such as parallelization, partitioning, task scheduling and machine failure, allowing developers to focus on the application logic. This article, together with The Google File System, was highly influential in that it inspired the open-source development of Apache Hadoop a couple of years later.

The authors start by explaining the MapReduce programming model, where users supply a Map function that processes input key/value pairs and emits a set of intermediate key/value pairs, which the MapReduce library then groups by key and passes to a Reduce function. The output of the Reduce step is the final result of the computation. They go on to discuss the system’s implementation, from how input data is partitioned into splits and how the master detects and handles worker failures, to how MapReduce takes advantage of input data locality to conserve network bandwidth.
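
The canonical illustration of this model is word count. Below is a minimal, single-process Python sketch of the programming model only - the Map and Reduce functions plus the grouping step the library performs between them - not of the distributed runtime:

```python
from collections import defaultdict

# Map: for each input (key, value) pair, emit intermediate (key, value) pairs.
def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1

# Reduce: receive one intermediate key and all of its values, emit the result.
def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # The library groups intermediate pairs by key (the "shuffle" phase)
    # before handing each group to the Reduce function.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    output = []
    for ikey in sorted(groups):
        output.extend(reduce_fn(ikey, groups[ikey]))
    return output

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```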

The paper also presents the results of experiments with different types of workloads, showing good performance overall. One interesting optimization discussed is the Backup Tasks mechanism: near the end of a MapReduce computation, the master schedules backup executions for the tasks still in progress, to alleviate the problem of stragglers - tasks that take unusually long to finish. They show that without this optimization, some workloads take over 40% longer to complete.

Bigtable: A Distributed Storage System for Structured Data

The Bigtable paper was another highly influential publication by Google in 2006, following the two articles discussed previously. This paper triggered the NoSQL movement, as well as the development of an open-source version of Bigtable on top of Hadoop, called Apache HBase. Various other systems were highly influenced by its design, such as Cassandra and Apache Accumulo.

One design goal was for Bigtable to be a very flexible storage system, capable of handling both high-throughput batch processing and low-latency workloads. The authors describe its data model as a sparse, sorted map, indexed by row key, column and timestamp. They also introduce the concept of column families, noting that the system works best with few (< 100) column families that rarely change - in contrast, the number of columns under each column family (qualifiers) can be unbounded.

The paper then discusses Bigtable’s implementation, from its reliance on Chubby, a distributed lock service developed at Google, to how a table is dynamically split into tablets of around 100-200MB under the hood. This is followed by a detailed explanation of how tablet locations are stored, how tablets are assigned to servers, and what minor and major compactions are. The authors also discuss various optimizations, such as data compression, two levels of caching and in-memory Bloom filters that reduce disk IO.
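
As a rough sketch of that data model (plain Python, not Bigtable’s API), a table behaves like a sorted map from (row key, column family:qualifier, timestamp) to an uninterpreted byte string; the row and column values below echo the webtable example used in the paper:

```python
import time

# A Bigtable table is a sparse, sorted map indexed by
# (row key, "family:qualifier", timestamp) -> bytes.
# Column families are few and stable; qualifiers under a family are unbounded.
table = {}

def put(row, column, value, ts=None):
    # Each write adds a new timestamped version of the cell.
    table[(row, column, ts or time.time_ns())] = value

def scan_row(row):
    # Rows are the unit of atomicity, and data is kept sorted by row key,
    # so reading one row is a small contiguous range scan.
    return sorted((k, v) for k, v in table.items() if k[0] == row)

put("com.cnn.www", "contents:", b"<html>...</html>")
put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
print(scan_row("com.cnn.www"))
```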

Finally, the authors present benchmarks showing good performance for random reads, random writes, sequential reads, sequential writes and scans, both in synthetic tests and in real applications deployed at Google. Bigtable is nowadays offered as a managed service on Google Cloud Platform. If you’re still interested in distributed key-value stores after studying this one, check out the Dynamo and Cassandra papers!

Hive: a warehousing solution over a map-reduce framework

This paper came out of Facebook in 2009, presenting a data-warehousing solution on top of Hadoop and MapReduce, Apache Hive. The authors describe the motivation for building such a system by alluding to the difficulty of scaling traditional RDBMSs to Facebook’s data volumes, as well as the complexity of writing low-level MapReduce jobs for tasks that could easily be expressed in SQL. Thus, Hive was born to provide an SQL layer on top of Hadoop.

The publication goes over the design of Hive, from the supported type system and the details of the query language (HiveQL), to how tables, partitions and buckets map to HDFS and how custom data formats are supported with the SerDe interface. It also describes Hive’s architectural components, such as the Metastore, the HiveServer and the Query Compiler.
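
To make the mapping onto HDFS concrete, the small Python sketch below (hypothetical table, partition and bucket names; default warehouse location assumed) shows roughly how a partitioned, bucketed table lays out as directories and files:

```python
# Default warehouse directory; configurable via hive.metastore.warehouse.dir.
WAREHOUSE = "/user/hive/warehouse"

def table_path(table):
    # A managed table becomes a directory under the warehouse.
    return f"{WAREHOUSE}/{table}"

def partition_path(table, **partition_values):
    # Each partition column/value pair becomes a nested subdirectory,
    # e.g. .../page_views/dt=2009-01-01
    parts = "/".join(f"{col}={val}" for col, val in partition_values.items())
    return f"{table_path(table)}/{parts}"

def bucket_files(table, n_buckets, **partition_values):
    # Buckets (from CLUSTERED BY ... INTO n BUCKETS) end up as separate
    # files inside the partition directory.
    base = partition_path(table, **partition_values)
    return [f"{base}/{i:06d}_0" for i in range(n_buckets)]

print(partition_path("page_views", dt="2009-01-01"))
# /user/hive/warehouse/page_views/dt=2009-01-01
print(bucket_files("page_views", 4, dt="2009-01-01"))
```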

One interesting tidbit at the end of the article is how the unpredictability of ad-hoc queries run by analysts at Facebook was impacting the reporting jobs, which are often time-constrained. Their solution was to use separate Hive clusters based on the type of workload (exploration vs reporting). This strategy is still widely used in many data architectures today, so that ad-hoc user queries cannot impact production jobs.

Dremel: Interactive Analysis of Web-Scale Datasets

This 2010 paper describes Dremel, the query engine that now powers Google BigQuery, and the inspiration behind many interactive query systems such as Apache Drill, Cloudera’s Impala and Facebook’s Presto. It also discusses a columnar format that can store nested data structures, which inspired the development of Parquet by Twitter and Cloudera.

The article starts by laying out the motivation for an interactive, large-scale query engine, with many examples of use cases at Google, where the common theme is analysts running ad-hoc queries with much lower latency than with MapReduce-based systems. It goes on to describe the columnar data model used by Dremel and how it supports arbitrarily nested data, introducing the concepts of repetition and definition levels. The algorithm described for converting between nested records and the flat columnar format is now used in Parquet.
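
To make those two levels a bit more tangible, here is a small hand-worked Python sketch (my own toy schema, not the paper’s Document example). The repetition level records at which repeated field in the value’s path a new value repeats (0 means the start of a new record), and the definition level records how many optional or repeated fields along the path are actually present, which is what lets NULLs and record boundaries be reconstructed from a flat column:

```python
# Assumed schema (Protocol-Buffer style, for illustration only):
#   message Doc {
#     repeated group Name {
#       optional string Url;
#     }
#   }
#
# Two records:
#   r1: Name{Url: "A"}, Name{}, Name{Url: "C"}
#   r2: Name{Url: "D"}
#
# Column Name.Url stored as (value, repetition_level, definition_level):
name_url = [
    ("A",  0, 2),  # r=0: first value of a new record; d=2: Name and Url both present
    (None, 1, 1),  # r=1: repeats at Name; d=1: Name present but Url missing -> NULL
    ("C",  1, 2),  # r=1: another Name within the same record
    ("D",  0, 2),  # r=0: starts record r2
]

MAX_DEF = 2  # repeated Name (1) + optional Url (1)

def reassemble(column):
    """Rebuild the per-record lists of Url values from the flat column."""
    records = []
    for value, r, d in column:
        if r == 0:              # repetition level 0 marks the start of a new record
            records.append([])
        records[-1].append(value if d == MAX_DEF else None)
    return records

print(reassemble(name_url))  # [['A', None, 'C'], ['D']]
```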

The query execution model is also discussed, presenting Dremel’s multi-level serving tree architecture. Lastly, the authors describe experiments comparing row-based with columnar storage, as well as MapReduce with Dremel - they show that on a simple SQL aggregation, MapReduce on columnar storage is an order of magnitude faster than on row-oriented storage, and Dremel is yet another order of magnitude faster than MapReduce on columnar storage.
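
As a toy illustration of that execution model (not Dremel’s actual code), the sketch below aggregates a COUNT(*) through a tree: leaf servers scan their share of the data, and each inner level simply combines the partial results of its children before passing them up:

```python
# Minimal sketch of aggregation through a serving tree.
def leaf_count(rows):
    # A leaf server scans the rows stored in its tablets.
    return len(rows)

def serve(partitions, fanout=3):
    # Base case: a single partition is handled by one leaf server.
    if len(partitions) <= 1:
        return leaf_count(partitions[0]) if partitions else 0
    # Inner node: split the work among `fanout` children and merge partials.
    chunk = max(1, len(partitions) // fanout)
    children = [partitions[i:i + chunk] for i in range(0, len(partitions), chunk)]
    return sum(serve(child, fanout) for child in children)

data = [[1] * 10 for _ in range(9)]   # 9 partitions of 10 rows each
print(serve(data))                    # 90
```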

Spark: Cluster Computing with Working Sets

Apache Spark is nowadays the Swiss Army knife of the data processing world, and one of the most active open-source projects under the Apache Software Foundation, with more than 1200 contributors. Before being donated to Apache, it was developed at UC Berkeley, and this was the paper that revealed it to the world in 2010.

The publication starts by explaining the use cases for which MapReduce comes up short:

  • Iterative jobs, such as machine learning algorithms that use the same data multiple times.
  • Interactive analysis, such as ad-hoc exploratory queries.

The authors go on to present the Spark framework and the RDD abstraction, which address these shortcomings by supporting in-memory cluster computing while providing the same fault-tolerance guarantees as MapReduce. After discussing the Spark programming model, the parallel operations supported on RDDs and the two types of shared variables (broadcast variables and accumulators), some examples of Spark programs are shown. Finally, the paper covers implementation details, such as how tasks are shipped to workers, and presents promising results for three different experiments. The common theme of the experiments is data re-use: performance is comparable to MapReduce the first time a dataset is used (typically when it is loaded from disk), but many times faster for subsequent uses of the same data.
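
The gain from re-use is easy to see in a minimal PySpark sketch in the spirit of the paper’s log-mining example (the paper’s own code is Scala, and the file path and filter strings here are hypothetical): the first action pays the full cost of reading and filtering the file, while later actions run against the cached, in-memory RDD:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-reuse-sketch")

# Build an RDD of error lines from a (hypothetical) log file and cache it.
lines = sc.textFile("hdfs://namenode:8020/logs/app.log")   # assumed path
errors = lines.filter(lambda line: "ERROR" in line).cache()

# First action: reads from disk, filters, and materializes the cache.
print(errors.count())

# Subsequent actions reuse the in-memory partitions instead of re-reading the file.
print(errors.filter(lambda line: "timeout" in line).count())
print(errors.take(5))

sc.stop()
```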

The End

Please let me know if I’m missing any other highly influential big data research. If you’re interested in learning data engineering from somewhere other than research papers, check out my other Data Engineering Resources post!

Diogo Franco

I love data, distributed systems, machine learning, code and science!