NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart
| Published in: | Proceedings - Euromicro Workshop on Parallel and Distributed Processing, pp. 99-102 |
|---|---|
| Main authors: | , , , , |
| Format: | Conference Paper, Journal Article |
| Language: | English |
| Published: | IEEE, 1 March 2015 |
| ISSN: | 1066-6192 |
| Summary: | In this paper, we present NanoCheckpoints, a lightweight software-based checkpoint/restart scheme for task-parallel HPC applications. We leverage OmpSs, a task-based programming model (PM) derived from OpenMP, and its Nanos asynchronous dataflow runtime. NanoCheckpoints achieves minimal overhead by checkpointing only tasks' inputs, which are available for free in the OmpSs PM. We evaluate NanoCheckpoints with both pure task-parallel shared-memory benchmarks (up to 16 cores) and hybrid OmpSs+MPI applications (up to 1024 cores). The results indicate that NanoCheckpoints incurs an average overhead of 3% for shared-memory benchmarks. The dataflow semantics of Nanos, in which both checkpointing and error recovery are asynchronous, allow NanoCheckpoints to scale to large core counts even at high error rates. For hybrid OmpSs+MPI benchmarks, NanoCheckpoints has very low overhead, 2% on average, and high scalability. |
| DOI: | 10.1109/PDP.2015.17 |