Fault-Tolerant Protocol for Hybrid Task-Parallel Message-Passing Applications

We present a fault-tolerant protocol that mitigates transient errors in task-parallel message-passing applications. The protocol requires restarting only the task that experienced the error and transparently handles any MPI calls inside that task. The protocol is implemented in Nanos -- a dataflow...

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Cluster Computing, pp. 563-570
Main Authors: Martsinkevich, Tatiana; Subasi, Omer; Unsal, Osman; Cappello, Franck; Labarta, Jesus
Format: Conference Proceeding
Language: English
Published: IEEE 01.09.2015
ISSN: 1552-5244