There have been several attempts in the recent past at extending Hadoop to support efficient iterative data processing on clusters. To facilitate understanding this problem better here is a collection of some prior art relating to this problem space.
HaLoop: Efï¬cient Iterative Data Processing on Large Clusters – It extends MapReduce by adding programming support for iterative applications, and also improves their efï¬ciency by making the task scheduler loop-aware and by adding various caching mechanisms
iMapReduce – It allows users to specify the iterative operations with map and reduce functions, while supporting the iterative processing automatically without the need of usersâ€™ involvement. More importantly, iMapReduce signiï¬cantly improves the performance of iterative algorithms by (1) reducing the overhead of creating a new task in every iteration, (2) eliminating the shufï¬‚ing of the static data in the shufï¬‚e stage of MapReduce, and (3) allowing asynchronous execution of each iteration, i.e., an iteration can start before all tasks of a previous iteration have ï¬nished.
Twister: A Runtime for Iterative MapReduce – It uses a publish/subscribe messaging infrastructure for communication and data transfers, and supports long running map/reduce tasks, which can be used in â€œconfigure once and use many timesâ€ approach. In addition it provides programming extensions to MapReduce with â€œbroadcastâ€ and â€œscatterâ€ type data transfers. These improvements allow Twister to support iterative MapReduce computations highly efficiently compared to other MapReduce runtimes.
Peregrine – From the Sipnn3r folks.
iHadoop: Asynchronous Iterations for MapReduce – The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoopâ€™s task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine allowing a fast local data transfer. For those iterative applications that require satisfying certain criteria before termination, iHadoop runs the check concurrently during the execution of the subsequent iteration to further reduce the applicationâ€™s latency