The Limitation of MapReduce: A Probing Case and a Lightweight Solution

While we usually see plenty of papers that deal with applications of the MapReduce programming model, this one, for a change, addresses the limitations of the MR model. It argues that MR only allows a program to scale up to process very large data sets, but constrains a program’s ability to process smaller data items. This ability or inability (depending on how you see it) is what it terms “one-way scalability”. This “one-wayness” was a requirement for Google, but here the authors turn our attention to how it affects the application of the framework to other forms of computation.

The system they base their argument on is a distributed compiler, and their solution is a scaled-“down” parallelization framework called MRlite that handles more moderate volumes of data. The workload characteristics of a compiler are a bit different from those of analytical workloads. The primary difference is that compilation workloads deal with much more modest volumes of data, albeit with much greater interdependence among the files.

mrcc, the distributed compiler, follows a master-slave model. The main mrcc program runs on the master node. The “map” component, mrcc-map, runs on the slave nodes.

A Cycle of Distributed Compilation

The compilation cycle starts with the code base of a project being submitted to mrcc, the master program. After scanning the arguments passed to the compiler, the master program forks a “preprocessor” process. This preprocessor merges the header files into each source file, in preparation for the next step, which distributes the preprocessed files to different slaves. To keep the preprocessed files accessible to the slaves, they are kept on a network file system, much like what we see with GFS and HDFS.
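As a rough illustration of the preprocessing step, the sketch below builds the command a master could run for one source file. It relies on gcc's standard `-E` (preprocess only) and `-o` flags. The function name `preprocess_command`, the `.i` output naming and the shared directory path are hypothetical and are not taken from the paper.

```python
from pathlib import Path


def preprocess_command(source, shared_dir, compiler_args=()):
    """Build a `gcc -E` command that writes the preprocessed file to the shared directory.

    Returns (command as an argument list, path of the preprocessed output).
    """
    # Assumed naming scheme: foo.c becomes foo.i on the shared file system.
    out = Path(shared_dir) / (Path(source).stem + ".i")
    cmd = ["gcc", "-E", *compiler_args, str(source), "-o", str(out)]
    return cmd, out


if __name__ == "__main__":
    cmd, out = preprocess_command("src/foo.c", "/mnt/shared", ["-Iinclude"])
    print(" ".join(cmd))
```

Returning the command as a list rather than a string means it can be handed to `subprocess.run` without shell quoting problems.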

mrcc then initiates the remote compilation of the preprocessed files on multiple slave machines. mrcc-map, the program that runs on each slave, performs the compilation.
It first parses its arguments to obtain the source file name on the network file system and the compiler arguments. It then retrieves the preprocessed file from the network file system. After that, mrcc-map calls the local gcc compiler and passes the compilation arguments to it. When gcc exits with a successful return value, mrcc-map places the object file into the network file system and returns immediately.
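The four steps mrcc-map goes through can be sketched as below. This is a minimal stand-in and not the paper's implementation: `run_map_task`, `parse_task_args`, the argument layout and the use of a mounted directory as the network file system are all assumptions. The compiler is passed in as a parameter so the flow can be exercised without gcc installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def parse_task_args(argv):
    """Split task arguments into (source file name on the shared FS, compiler args)."""
    if not argv:
        raise ValueError("expected: <preprocessed-file> [compiler args...]")
    return argv[0], list(argv[1:])


def run_map_task(argv, shared_dir, compiler=("gcc",)):
    """Hypothetical stand-in for mrcc-map: fetch, compile, publish the object file."""
    source_name, compiler_args = parse_task_args(argv)
    shared_dir = Path(shared_dir)

    with tempfile.TemporaryDirectory() as work:
        # Retrieve the preprocessed file from the shared (network) file system.
        local_src = Path(work) / Path(source_name).name
        shutil.copy(shared_dir / source_name, local_src)

        # Call the local compiler, forwarding the compilation arguments.
        local_obj = local_src.with_suffix(".o")
        cmd = [*compiler, *compiler_args, "-c", str(local_src), "-o", str(local_obj)]
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode

        # On success, place the object file back on the shared file system.
        shutil.copy(local_obj, shared_dir / local_obj.name)
    return 0
```

A call such as `run_map_task(["foo.i", "-O2"], "/mnt/shared")` would compile `foo.i` from the shared directory and leave `foo.o` next to it, assuming gcc is on the path and the directory is mounted on the slave.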

Distributed Compiling using Hadoop

The paper then describes the consequences of performing the above compilation cycle on Hadoop. In summary, it appears that compilation with mrcc on Hadoop on 10 nodes takes at least twice as long as on one node (sequential compilation).
The slowness can be attributed to a) the overhead of spawning a new process for each compilation batch on the slaves and b) retrieving files from, and writing them back to, the network file system. In a nutshell, they argue that the tasking and data transportation overheads are acceptable only for the class of applications where relatively simple processing logic is applied to a large number of independent units of work.
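A back-of-the-envelope model shows how fixed per-task overhead can make ten nodes slower than one. The model and every number in it are illustrative assumptions, not measurements from the paper: each task pays a constant overhead for process start-up and file transfer, and files are spread evenly across nodes.

```python
import math


def sequential_time(n_files, compile_s):
    """Total time when one node compiles every file in turn."""
    return n_files * compile_s


def distributed_time(n_files, compile_s, overhead_s, nodes):
    """Total time when files are spread evenly and each task pays a fixed overhead."""
    rounds = math.ceil(n_files / nodes)  # tasks each node runs back to back
    return rounds * (compile_s + overhead_s)


# Illustrative numbers only (assumed, not taken from the paper).
print(sequential_time(100, 0.5))             # 50.0
print(distributed_time(100, 0.5, 10.0, 10))  # 105.0
```

Under this model, and when the files divide evenly among the nodes, distribution only pays off when the per-task overhead is below `(nodes - 1) * compile_s`, which is 4.5 seconds for the numbers above. Short compilation units therefore leave very little room for tasking overhead.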

MRlite for Distributed Compilation Workloads

So MRlite comes to the rescue. It optimizes for large-scale parallelism and low latency to provide a more general and flexible parallel execution capability.
Bearing a great deal of similarity to the classic MapReduce framework, MRlite is made up of 1) the MRlite master, 2) the MRlite slaves, 3) an in-memory NFS server and 4) the MRlite client.

The master controls the parallel execution of the tasks. The client submits a job to the master, which in turn distributes its tasks to multiple slaves.
Some key aspects of MRlite’s design include:
1) Timing control is built into the design as part of the low-latency execution mechanism
2) The master submits tasks to slaves without sophisticated queueing, to maximize the chance of finishing the job within the timeout limit
3) Run-time daemons and thread pools support the operations of the master and the slaves. This reduces the cost of creating a process
4) An NFS server runs on only one node to provide the file system abstraction. This server runs atop a virtual memory file system so that operations are as fast as in-memory operations
5) Reliability through multi-way replication is not included in MRlite
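Points 1 to 3 can be illustrated with a small sketch that submits every task to a pre-started thread pool and waits for the whole job under a single timeout. It is only an analogy on one machine: MRlite's slaves are separate nodes, and the names `run_job` and `timeout_s`, as well as the choice to return partial results on timeout, are assumptions rather than details from the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_job(tasks, pool, timeout_s):
    """Submit every task at once and wait at most timeout_s for the whole job.

    Returns (results of finished tasks in submission order, number of unfinished tasks).
    """
    # No queueing policy: hand everything to the already-running workers immediately.
    futures = [pool.submit(task) for task in tasks]
    done, not_done = wait(futures, timeout=timeout_s)
    for future in not_done:
        future.cancel()  # only prevents tasks that have not started yet
    results = [f.result() for f in futures if f in done]
    return results, len(not_done)


if __name__ == "__main__":
    # The pool is created once, like a long-lived daemon, and reused across jobs,
    # so no new process is spawned per task.
    with ThreadPoolExecutor(max_workers=4) as pool:
        tasks = [lambda: "a.o", lambda: "b.o", lambda: time.sleep(1.0)]
        print(run_job(tasks, pool, timeout_s=0.3))
```

Keeping the workers alive between jobs is what removes the per-task start-up cost, and the single deadline on the whole job gives the caller a predictable upper bound on how long it waits.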
