Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing


Bibliographic Details
Published in: Parallel Computing, Vol. 49, pp. 117-135
Main Authors: Göddeke, Dominik, Altenbernd, Mirco, Ribbrock, Dirk
Format: Journal Article
Language:English
Published: Elsevier B.V 01.11.2015
ISSN:0167-8191, 1872-7336
Description
Summary:
• Fault-tolerant and robust multigrid methods.
• Hierarchical finite element compression.
• Asynchronous checkpointing with local restart.
We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local failure local recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.
DOI:10.1016/j.parco.2015.07.003