Multi-level checkpointing and silent error detection for linear workflows

Bibliographic details
Published in: Journal of computational science, Vol. 28, pp. 398-415
Main authors: Benoit, Anne; Cavelan, Aurélien; Robert, Yves; Sun, Hongyang
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 September 2018
ISSN: 1877-7503, 1877-7511
Online access: Full text
Description
Summary:
• A brand new section to deal with multi-level fail-stop errors.
• A brand new section to deal with the combination of multi-level fail-stop errors and silent errors (with in-memory checkpointing, and partial/guaranteed verifications).
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm.
DOI: 10.1016/j.jocs.2017.03.024