Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework

Bibliographic details
Published in:	2015 IEEE First International Conference on Big Data Computing Service and Applications, pp. 176-183
Main authors:	Jian Lin, Fan Liang, Xiaoyi Lu, Li Zha, Zhiwei Xu
Format:	Conference paper
Language:	English
Published:	IEEE 01.03.2015
Subjects:	Analytical models; Business; checkpoint; Computational modeling; data computing; Data models; Fault tolerance; Fault tolerant systems; MapReduce; MPI; Synchronization

Description
Summary:	Fault tolerance is a significant property for distributed and parallel computing systems. An emerging trend in Big Data computing is to combine MPI and MapReduce technologies in a single framework. The distinctive state model of this kind of framework makes it challenging to design an efficient and transparent fault-tolerance mechanism. In this paper, a state model analysis method is proposed for uniformly modeling independent MPI, MapReduce, and MPI-based MapReduce data computing frameworks. Based on this analysis, a library-level fault-tolerance mechanism with a global persistent state model is proposed, and a checkpoint approach based on data staging and routine sharing is designed within this mechanism. The proposed mechanism has been implemented in DataMPI, a communication library supporting MPI-based MapReduce data computing applications. The experiments show that it can transparently enable fault tolerance for applications. Taking TeraSort as an example, it introduces only 6.8% time overhead and 11% space overhead. For a failure-resume execution, it has a 10%-32% performance advantage over naive checkpoint solutions based on local or parallel storage. The proposed mechanism also provides better performance and resource utilization than Hadoop for both fault-free and failure-resume executions.
DOI:	10.1109/BigDataService.2015.33