NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart
In this paper, we present NanoCheckpoints, a lightweight software-based checkpoint/restart scheme for task-parallel HPC applications. We leverage OmpSs, a task-based programming model (PM) derived from OpenMP, and its Nanos asynchronous dataflow runtime. NanoCheckpoints achieves minimal overhea...
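The idea summarized in the abstract, checkpointing only a task's declared inputs and re-executing just that task on an error, can be illustrated with a minimal Python sketch. This is not the Nanos runtime or the OmpSs API: the names `TransientError`, `run_task_with_input_checkpoint`, `make_flaky_scale`, and the `in_keys` parameter are hypothetical, and the sketch runs synchronously, whereas the paper describes asynchronous checkpointing and recovery.

```python
import copy


class TransientError(Exception):
    """Hypothetical stand-in for a soft error detected during a task."""


def run_task_with_input_checkpoint(task, data, in_keys, max_retries=3):
    """Run task(data) after snapshotting only its declared inputs.

    On a detected error the inputs are restored from the snapshot and
    only this task is re-executed; nothing else is rolled back.
    Returns the number of re-executions that were needed.
    """
    # Checkpoint: copy only the declared inputs, not the whole state.
    snapshot = {k: copy.deepcopy(data[k]) for k in in_keys}
    for attempt in range(max_retries + 1):
        try:
            task(data)
            return attempt
        except TransientError:
            # Restart: undo any partial in-place updates to the inputs.
            for k, v in snapshot.items():
                data[k] = copy.deepcopy(v)
    raise RuntimeError("task failed after %d retries" % max_retries)


def make_flaky_scale(fail_times):
    """Build a task that doubles data['x'] in place and fails midway
    for the first `fail_times` executions (fault injection)."""
    state = {"left": fail_times}

    def scale(data):
        x = data["x"]
        for i in range(len(x)):
            if i == len(x) // 2 and state["left"] > 0:
                state["left"] -= 1
                raise TransientError("injected fault")
            x[i] *= 2

    return scale


# 'unrelated' is never copied because the task does not declare it.
data = {"x": [1, 2, 3, 4], "unrelated": [0] * 1000}
retries = run_task_with_input_checkpoint(
    make_flaky_scale(1), data, in_keys=["x"]
)
```

In this sketch the first execution fails after partially updating `x`; restoring the snapshot lets the second execution start from the original input. The checkpoint cost scales with the size of the declared inputs rather than with the full application state, which is the property the abstract attributes to input-only checkpointing.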
| Published in: | Proceedings - Euromicro Workshop on Parallel and Distributed Processing pp. 99 - 102 |
|---|---|
| Main Authors: | Subasi, Omer; Arias, Javier; Unsal, Osman; Labarta, Jesus; Cristal, Adrian |
| Format: | Conference Proceeding; Journal Article |
| Language: | English |
| Published: | IEEE, 01.03.2015 |
| Subjects: | |
| ISSN: | 1066-6192 |
| Abstract | In this paper, we present NanoCheckpoints, a lightweight software-based checkpoint/restart scheme for task-parallel HPC applications. We leverage OmpSs, a task-based programming model (PM) derived from OpenMP, and its Nanos asynchronous dataflow runtime. NanoCheckpoints achieves minimal overhead by checkpointing only tasks' inputs, which are available for free in the OmpSs PM. We evaluate NanoCheckpoints with both pure task-parallel shared-memory benchmarks (up to 16 cores) and hybrid OmpSs+MPI applications (up to 1024 cores). The results indicate that NanoCheckpoints has an average overhead of 3% for shared-memory benchmarks. The dataflow semantics of Nanos, in which both checkpointing and error recovery are asynchronous, allow NanoCheckpoints to scale to large core counts even at high error rates. For hybrid OmpSs+MPI benchmarks, NanoCheckpoints has very low overhead, 2% on average, and high scalability. |
|---|---|
| Author | Labarta, Jesus; Subasi, Omer; Arias, Javier; Unsal, Osman; Cristal, Adrian |
| Author_xml | – sequence: 1 givenname: Omer surname: Subasi fullname: Subasi, Omer organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain – sequence: 2 givenname: Javier surname: Arias fullname: Arias, Javier email: javier.arias@bsc.es organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain – sequence: 3 givenname: Osman surname: Unsal fullname: Unsal, Osman email: osman.unsal@bsc.es organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain – sequence: 4 givenname: Jesus surname: Labarta fullname: Labarta, Jesus email: jesus.labarta@bsc.es organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain – sequence: 5 givenname: Adrian surname: Cristal fullname: Cristal, Adrian email: adrian.cristal@bsc.es organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding Journal Article |
| DOI | 10.1109/PDP.2015.17 |
| Discipline | Computer Science |
| EISBN | 9781479984916 1479984914 |
| EndPage | 102 |
| ExternalDocumentID | 7092706 |
| Genre | orig-research |
| ISICitedReferencesCount | 18 |
| ISSN | 1066-6192 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 4 |
| PublicationCentury | 2000 |
| PublicationDate | 20150301 |
| PublicationDateYYYYMMDD | 2015-03-01 |
| PublicationDate_xml | – month: 03 year: 2015 text: 20150301 day: 01 |
| PublicationDecade | 2010 |
| PublicationTitle | Proceedings - Euromicro Workshop on Parallel and Distributed Processing |
| PublicationTitleAbbrev | EMPDP |
| PublicationYear | 2015 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 99 |
| SubjectTerms | Arrays; Benchmark testing; Benchmarks; Checkpoint/restart; Checkpointing; Conferences; Dataflow; Derivatives; Error recovery; Instruction sets; Nanostructure; Networks; Reliability; Runtime; Scalability; Task parallelism; Tasks; Weight reduction |
| Title | NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart |
| URI | https://ieeexplore.ieee.org/document/7092706 https://www.proquest.com/docview/1685795832 |