Iterative Map Reduce – Prior Art

There have been several attempts in the recent past at extending Hadoop to support efficient iterative data processing on clusters. To facilitate understanding this problem better here is a collection of some prior art relating to this problem space.

HaLoop: Efficient Iterative Data Processing on Large Clusters – It extends MapReduce by adding programming support for iterative applications, and also improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms

iMapReduce – It allows users to specify the iterative operations with map and reduce functions, while supporting the iterative processing automatically without the need of users’ involvement. More importantly, iMapReduce significantly improves the performance of iterative algorithms by (1) reducing the overhead of creating a new task in every iteration, (2) eliminating the shuffling of the static data in the shuffle stage of MapReduce, and (3) allowing asynchronous execution of each iteration, i.e., an iteration can start before all tasks of a previous iteration have finished.

Twister: A Runtime for Iterative MapReduce – It uses a publish/subscribe messaging infrastructure for communication and data transfers, and supports long running map/reduce tasks, which can be used in “configure once and use many times” approach. In addition it provides programming extensions to MapReduce with “broadcast” and “scatter” type data transfers. These improvements allow Twister to support iterative MapReduce computations highly efficiently compared to other MapReduce runtimes.

Peregrine – From the Sipnn3r folks.

iHadoop: Asynchronous Iterations for MapReduce – The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoop’s task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine allowing a fast local data transfer. For those iterative applications that require satisfying certain criteria before termination, iHadoop runs the check concurrently during the execution of the subsequent iteration to further reduce the application’s latency