Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
| Published in: | Parallel Computing, Vol. 49, pp. 117-135 |
|---|---|
| Main authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 1 November 2015 |
| ISSN: | 0167-8191, 1872-7336 |
| Summary: | Highlights: • Fault-tolerant and robust multigrid methods. • Hierarchical finite element compression. • Asynchronous checkpointing with local restart. Abstract: We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local-failure local-recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques. |
| DOI: | 10.1016/j.parco.2015.07.003 |