Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model


Bibliographic details
Published in: IEEE Transactions on Parallel and Distributed Systems, Volume 28; Issue 1; pp. 244 - 259
Main authors: Di, Sheng; Robert, Yves; Vivien, Frederic; Cappello, Franck
Format: Journal Article
Language: English
Publication details: New York, IEEE, 01.01.2017
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Institute of Electrical and Electronics Engineers
ISSN: 1045-9219, 1558-2183
Abstract The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint intervals for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals with hardware crashes such as node failures. Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution without requiring knowledge of the job length in advance, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed constructed based on a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve the performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3 percent over that of other state-of-the-art approaches. Finally, a brute-force comparison with all possible patterns shows that our solution is always within 1 percent of the best pattern in the experiments.
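Illustration (not taken from the paper): the abstract describes periodic two-level checkpoint patterns chosen online, that is, without knowing the job length. The Python sketch below shows only what such a periodic pattern could look like in a failure-free run. The class name, field names, and the assumed pattern shape (a level-1 checkpoint after every segment, plus one level-2 checkpoint closing each period) are hypothetical and are not the authors' model or their optimal interval formulas.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, Tuple


@dataclass(frozen=True)
class PeriodicPattern:
    """Illustrative periodic pattern; every field name here is hypothetical."""
    work: float      # seconds of computation per segment
    segments: int    # number of segments per period
    c1: float        # assumed cost of one level-1 checkpoint (seconds)
    c2: float        # assumed cost of one level-2 checkpoint (seconds)

    def __post_init__(self) -> None:
        if self.work <= 0 or self.segments < 1:
            raise ValueError("work must be positive and segments >= 1")

    def period_length(self) -> float:
        # Each segment ends with a level-1 checkpoint; the period
        # closes with one additional level-2 checkpoint.
        return self.segments * (self.work + self.c1) + self.c2

    def failure_free_overhead(self) -> float:
        # Fraction of extra time spent checkpointing when no error occurs.
        useful = self.segments * self.work
        return self.period_length() / useful - 1.0


def checkpoint_events(p: PeriodicPattern) -> Iterator[Tuple[float, int]]:
    """Yield (completion_time, level) forever; no job length is required."""
    t = 0.0
    while True:
        for i in range(1, p.segments + 1):
            t += p.work + p.c1
            yield (t, 1)
            if i == p.segments:
                t += p.c2
                yield (t, 2)


if __name__ == "__main__":
    pattern = PeriodicPattern(work=10.0, segments=2, c1=1.0, c2=5.0)
    for time, level in islice(checkpoint_events(pattern), 6):
        print(f"t={time:6.1f}s  level-{level} checkpoint done")
```

Because the generator never needs a total job length, the schedule can be consumed for as long as the job runs, which is the sense of "online" used in this sketch. Error arrivals and recovery costs are deliberately left out.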
Author Robert, Yves
Cappello, Franck
Sheng Di
Vivien, Frederic
Author_xml – sequence: 1
  surname: Sheng Di
  fullname: Sheng Di
  email: disheng222@gmail.com
  organization: Math. & Comput. Sci. (MCS) Div., Argonne Nat. Lab., Chicago, IL, USA
– sequence: 2
  givenname: Yves
  surname: Robert
  fullname: Robert, Yves
  email: Yves.Robert@inria.fr
  organization: Lab. LIP, UCB Lyon, Lyon, France
– sequence: 3
  givenname: Frederic
  surname: Vivien
  fullname: Vivien, Frederic
  email: frederic.vivien@inria.fr
  organization: Lab. LIP, UCB Lyon, Lyon, France
– sequence: 4
  givenname: Franck
  surname: Cappello
  fullname: Cappello, Franck
  email: cappello@mcs.anl.gov
  organization: Math. & Comput. Sci. (MCS) Div., Argonne Nat. Lab., Chicago, IL, USA
BackLink https://inria.hal.science/hal-01353871 (View record in HAL)
CODEN ITDSEO
CitedBy_id crossref_primary_10_1007_s11704_022_2096_3
crossref_primary_10_1016_j_cosrev_2024_100660
crossref_primary_10_3390_a12090197
crossref_primary_10_1109_TPDS_2018_2844210
crossref_primary_10_1109_TC_2016_2643660
crossref_primary_10_1002_spe_3021
crossref_primary_10_1016_j_future_2024_07_022
crossref_primary_10_1109_TSUSC_2018_2797890
crossref_primary_10_1109_TPDS_2020_3015805
crossref_primary_10_1016_j_future_2020_04_019
crossref_primary_10_3390_drones7050286
crossref_primary_10_1145_3624560
crossref_primary_10_1007_s10586_021_03464_4
crossref_primary_10_1109_ACCESS_2019_2903588
crossref_primary_10_1109_TCC_2021_3057422
crossref_primary_10_1007_s11227_018_2621_1
crossref_primary_10_1109_TPDS_2019_2896894
crossref_primary_10_1145_3403956
crossref_primary_10_1109_TASE_2022_3195958
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2017
Distributed under a Creative Commons Attribution 4.0 International License
DOI 10.1109/TPDS.2016.2546248
DatabaseName IEEE Xplore (IEEE)
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Hyper Article en Ligne (HAL)
Hyper Article en Ligne (HAL) (Open Access)
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList Technology Research Database


Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 1558-2183
EndPage 259
ExternalDocumentID oai:HAL:hal-01353871v1
10_1109_TPDS_2016_2546248
7442859
Genre orig-research
GrantInformation_xml – fundername: Office of Science
  funderid: 10.13039/100006132
– fundername: US Department of Energy
  funderid: 10.13039/100000015
– fundername: European project SCoRPiO
– fundername: Institut Universitaire de France
  funderid: 10.13039/501100004795
– fundername: US Department of Energy Office of Science laboratory
  grantid: DE-AC02-06CH11357
– fundername: Advanced Scientific Computing Research Program
  grantid: DE-AC02-06CH11357
ISICitedReferencesCount 39
ISSN 1045-9219
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Keywords High-Performance Computing
Multilevel Checkpoint
Optimization
Fault Tolerance
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0
ORCID 0000-0003-2361-055X
0000-0002-0663-6152
OpenAccessLink https://inria.hal.science/hal-01353871
PQID 1848276460
PQPubID 85437
PageCount 16
ParticipantIDs crossref_primary_10_1109_TPDS_2016_2546248
crossref_citationtrail_10_1109_TPDS_2016_2546248
hal_primary_oai_HAL_hal_01353871v1
ieee_primary_7442859
proquest_journals_1848276460
PublicationCentury 2000
PublicationDate 2017-01-01
PublicationDecade 2010
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on parallel and distributed systems
PublicationTitleAbbrev TPDS
PublicationYear 2017
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Institute of Electrical and Electronics Engineers
SSID ssj0014504
Score 2.3900602
SourceID hal
proquest
crossref
ieee
SourceType Open Access Repository
Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 244
SubjectTerms Checkpointing
Computational modeling
Computer crashes
Computer Science
Data Structures and Algorithms
Distributed, Parallel, and Cluster Computing
Error recovery
Fault tolerance
Fault tolerant systems
High-performance computing
Mathematical models
Multilevel
multilevel checkpoint
Optimization
Transient analysis
Title Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model
URI https://ieeexplore.ieee.org/document/7442859
https://www.proquest.com/docview/1848276460
https://inria.hal.science/hal-01353871
Volume 28