Spark: Cluster Computing with Working Sets

One of the aspects you can’t miss even as you just begin reading this paper is the strong scent of functional programming that the design of Spark bears. FP idioms are used widely across Spark’s architecture: the ability to rebuild a lost partition by applying a closure, operations such as reduce and map/collect, distributed accumulators, and so on. It would suffice to say that it is a very functional system. Pun intended!

Spark is written in Scala and is well suited for the class of applications that reuse a working set of data across multiple parallel operations. The paper claims Spark outperforms Hadoop by 10x on iterative machine learning jobs, and that it has been used to interactively query a 39 GB dataset with sub-second response time!
It is built on top of Mesos, a resource management infrastructure that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster.

Developers write a driver program that orchestrates various parallel operations. Spark’s programming model provides two abstractions for working with large datasets: resilient distributed datasets (RDDs) and parallel operations. In addition, it supports two kinds of shared variables.

Resilient Distributed Datasets
A resilient distributed dataset (RDD) is a collection of read-only objects partitioned across multiple machines. A unique characteristic of this partitioned collection is that it can be rebuilt completely in the event of a partition loss. It can offer this ability because a handle to an RDD contains enough information to compute and derive the elements of the collection, starting from data seeded in some reliable storage.
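
As a small sketch in the style of the paper’s log-mining example (where “spark” stands for the handle the driver program holds, and the HDFS path is a placeholder), an RDD can be defined from a file in reliable storage, derived from by transformations, and cached for reuse:

    // RDD backed by a file in reliable storage (HDFS path is a placeholder)
    val file = spark.textFile("hdfs://...")
    // Derived RDD; Spark remembers how to recompute it if a partition is lost
    val errs = file.filter(_.contains("ERROR"))
    // Hint to keep this working set in memory after it is first computed
    val cachedErrs = errs.cache()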

Parallel Operations
Parallel operations that can be performed on an RDD include map/collect, combine/reduce, and iterating over the dataset. These operations are invoked by passing closures to Spark, i.e., code is passed around as data. To provide a “regular” functional experience, any variables accessed by these code blocks are copied over to the worker nodes where the computation is performed. Since Scala closures are Java objects that can be serialized, this mechanism is used to transfer the code to the workers.
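
Continuing the sketch above, the anonymous functions passed to map and reduce are exactly such closures: they are serialized, shipped to the workers, applied to the cached partitions there, and only the final count travels back to the driver.

    // Closure shipped to workers; maps each cached error line to 1
    val ones = cachedErrs.map(_ => 1)
    // Associative sum computed on the workers; the result returns to the driver
    val count = ones.reduce(_ + _)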

Over and above these basics, Spark offers two types of variables with special semantics: broadcast variables and accumulators.
Broadcast variables are created when a large piece of read-only data needs to be made repeatedly available to all the worker nodes. They are effectively an optimization that avoids redundant copy/transfer operations of the same piece of data.
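
A minimal sketch of the idea, where loadLookupTable() and the records RDD are hypothetical placeholders: the broadcast wrapper ensures the table is distributed to each worker only once, after which workers read it locally through its value.

    // Hypothetical helper returning a large read-only Map; broadcast ships it once per worker
    val lookup = spark.broadcast(loadLookupTable())
    // Workers read lookup.value locally instead of re-copying the table with every closure
    val tagged = records.map(r => (r, lookup.value.get(r.key)))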

Accumulators have the same semantics we know from other languages/programming models. They are variables that workers can only add to and that the driver can read, effectively representing a sum over a collection. Any datatype you can define an “add” operation (and a “zero” value) for can leverage accumulators.
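
As a hedged illustration (the blank-line counting use case is invented, and the API names follow the familiar Spark style, which may differ slightly from the paper’s prototype), workers add to the accumulator inside a parallel operation and only the driver reads the total:

    // Integer accumulator created on the driver, starting at zero
    val blankLines = spark.accumulator(0)
    // Workers can only add to it
    file.foreach(line => if (line.isEmpty) blankLines += 1)
    // Only the driver reads the accumulated result
    println(blankLines.value)
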
Both kinds of shared variables are implemented as custom classes with specific serialization semantics; the details are explained in section 4 of the paper. The paper also has a good collection of parallel programs written in this programming model. Although Spark is still evolving, it has found takers in the industry. Check out the description of Conviva’s data processing system, which uses Spark.

Previewing from http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf