Reducing Application-level Checkpoint File Sizes: Towards Scalable Fault Tolerance Solutions

Bibliographic details
Published in:	2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, pp. 371-378
Main authors:	Cores, I., Rodriguez, G., Martin, M. J., González, P.
Format:	Conference paper
Language:	English
Published:	IEEE, 1 July 2012
Subjects:	Arrays, Checkpointing, Fault tolerance, Fault tolerant systems, Libraries, MPI, Multicore processing, Optimization, Parallel Programming
ISBN:	1467316318, 9781467316316
ISSN:	2158-9178

Description
Summary:	Systems intended for the execution of long-running parallel applications require fault tolerant capabilities, since the probability of failure increases with the execution time and the number of nodes. Checkpointing and rollback recovery is one of the most popular techniques to provide fault tolerance support. However, in order to be useful for large scale systems, current checkpoint-recovery techniques should tackle the problem of reducing checkpointing cost. This paper addresses this issue through the reduction of the checkpoint file sizes. Different solutions to reduce the size of the checkpoints generated at application level are proposed and implemented in a checkpointing tool. Detailed experimental results on two multicore clusters show the effectiveness of the proposed methods.
DOI:	10.1109/ISPA.2012.55