NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart

Bibliographic Details
Published in: Proceedings - Euromicro Workshop on Parallel and Distributed Processing, pp. 99-102
Main Authors: Subasi, Omer; Arias, Javier; Unsal, Osman; Labarta, Jesus; Cristal, Adrian
Format: Conference Proceeding; Journal Article
Language: English
Published: IEEE, 2015-03-01
ISSN:1066-6192
Abstract In this paper, we present NanoCheckpoints, a lightweight software-based checkpoint/restart scheme for task-parallel HPC applications. We leverage OmpSs, a task-based programming model (PM) derived from OpenMP, and its asynchronous dataflow runtime, Nanos. NanoCheckpoints achieves minimal overhead by checkpointing only tasks' inputs, which are available for free in the OmpSs PM. We evaluate NanoCheckpoints with both pure task-parallel shared-memory benchmarks (up to 16 cores) and hybrid OmpSs+MPI applications (up to 1024 cores). The results indicate that NanoCheckpoints incurs an average overhead of 3% for shared-memory benchmarks. The dataflow semantics of Nanos, under which both checkpointing and error recovery are asynchronous, allow NanoCheckpoints to scale to large core counts even at high error rates. For hybrid OmpSs+MPI benchmarks, NanoCheckpoints has very low overhead, 2% on average, and high scalability.
Author_xml – sequence: 1
  givenname: Omer
  surname: Subasi
  fullname: Subasi, Omer
  organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain
– sequence: 2
  givenname: Javier
  surname: Arias
  fullname: Arias, Javier
  email: javier.arias@bsc.es
  organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain
– sequence: 3
  givenname: Osman
  surname: Unsal
  fullname: Unsal, Osman
  email: osman.unsal@bsc.es
  organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain
– sequence: 4
  givenname: Jesus
  surname: Labarta
  fullname: Labarta, Jesus
  email: jesus.labarta@bsc.es
  organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain
– sequence: 5
  givenname: Adrian
  surname: Cristal
  fullname: Cristal, Adrian
  email: adrian.cristal@bsc.es
  organization: Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain
CODEN IEEPAD
ContentType Conference Proceeding
Journal Article
DOI 10.1109/PDP.2015.17
Discipline Computer Science
EISBN 9781479984916
1479984914
EndPage 102
ExternalDocumentID 7092706
Genre orig-research
SubjectTerms Arrays
Benchmark testing
Benchmarks
Checkpoint/restart
Checkpointing
Conferences
Dataflow
Derivatives
Error recovery
Instruction sets
Nanostructure
Networks
Reliability
Runtime
Scalability
Task parallelism
Tasks
Weight reduction
URI https://ieeexplore.ieee.org/document/7092706
https://www.proquest.com/docview/1685795832