Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI

Bibliographic Details
Published in:	Conference on High Performance Networking and Computing: Proceedings of the 2006 ACM/IEEE conference on Supercomputing; 11-17 Nov. 2006, p. 127-es
Main Authors:	Coti, Camille; Herault, Thomas; Lemarinier, Pierre; Pilard, Laurence; Rezmerita, Ala; Rodriguez, Eric; Cappello, Franck
Format:	Conference Paper
Language:	English
Published:	New York, NY, USA: ACM, 11 Nov. 2006
Series:	ACM Conferences
Subjects:	Computing methodologies > Concurrent computing methodologies > Concurrent programming languages; General and reference > Cross-computing tools and techniques > Performance; Software and its engineering > Software creation and management > Software verification and validation > Operational analysis; Software and its engineering > Software notations and tools > General programming languages > Language types > Concurrent programming languages; Software and its engineering > Software organization and properties > Extra-functional properties > Software fault tolerance > Checkpoint / restart; Software and its engineering > Software organization and properties > Software system structures > Distributed systems organizing principles
ISBN:	0769527000, 9780769527000

Description
Summary:	A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is with coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and nonblocking. However, they have never been compared quantitatively and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluate their performance using the NAS parallel benchmarks.
DOI:	10.1145/1188455.1188587