Towards Optimal Multi-Level Checkpointing

Bibliographic details
Published in: IEEE Transactions on Computers, Vol. 66, No. 7, pp. 1212-1226
Main authors: Benoit, Anne; Cavelan, Aurelien; Le Fevre, Valentin; Robert, Yves; Sun, Hongyang
Format: Journal Article
Language: English
Published: New York, IEEE, 1 July 2017
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Institute of Electrical and Electronics Engineers
ISSN: 0018-9340, 1557-9956
Abstract We provide a framework to analyze multi-level checkpointing protocols, by formally defining a $k$-level checkpointing pattern. We provide a first-order approximation to the optimal checkpointing period, and show that the corresponding overhead is in the order of $\sum_{\ell=1}^{k}\sqrt{2\lambda_\ell C_\ell}$, where $\lambda_\ell$ is the error rate at level $\ell$, and $C_\ell$ the checkpointing cost at level $\ell$. This nicely extends the classical Young/Daly formula on single-level checkpointing. Furthermore, we are able to fully characterize the shape of the optimal pattern (number and positions of checkpoints), and we provide a dynamic programming algorithm to determine the optimal subset of levels to be used. Finally, we perform simulations to check the accuracy of the theoretical study and to confirm the optimality of the subset of levels returned by the dynamic programming algorithm. The results nicely corroborate the theoretical study, and demonstrate the usefulness of multi-level checkpointing with the optimal subset of levels.
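The overhead expression in the abstract can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the subset search is a plain exhaustive enumeration rather than the paper's dynamic programming algorithm. It also assumes, beyond what this record states, that the highest level is always kept and that errors of a skipped level are recovered from the next selected higher level, so their rate is added to that level.

```python
import math
from itertools import combinations


def young_daly_period(error_rate, checkpoint_cost):
    """Classical single-level first-order period: W = sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * checkpoint_cost / error_rate)


def first_order_overhead(error_rates, checkpoint_costs):
    """Sum over levels of sqrt(2 * lambda_l * C_l), as stated in the abstract."""
    return sum(math.sqrt(2.0 * lam * cost)
               for lam, cost in zip(error_rates, checkpoint_costs))


def best_subset_exhaustive(error_rates, checkpoint_costs):
    """Try every subset of levels that contains the top level.

    ASSUMPTION (not taken from this record): errors of a skipped level are
    handled by the next selected higher level, so its rate is merged there.
    Returns (overhead, tuple of selected 0-based level indices).
    """
    k = len(error_rates)
    best = (math.inf, ())
    for size in range(k):
        for chosen in combinations(range(k - 1), size):
            levels = chosen + (k - 1,)  # the top level is always selected
            merged_rates, pending = [], 0.0
            for level in range(k):
                pending += error_rates[level]
                if level in levels:
                    merged_rates.append(pending)
                    pending = 0.0
            costs = [checkpoint_costs[level] for level in levels]
            overhead = first_order_overhead(merged_rates, costs)
            best = min(best, (overhead, levels))
    return best
```

With a single level, `first_order_overhead([lam], [C])` reduces to the Young/Daly overhead $\sqrt{2\lambda C}$, which equals $2C/W$ for the period returned by `young_daly_period(lam, C)`. Calling `best_subset_exhaustive` on a few rate and cost vectors shows, under the stated merging assumption, that dropping a cheap-to-skip level can lower the first-order overhead.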
Author Benoit, Anne
Le Fevre, Valentin
Robert, Yves
Cavelan, Aurelien
Sun, Hongyang
Author_xml – sequence: 1
  givenname: Anne
  surname: Benoit
  fullname: Benoit, Anne
  email: Anne.Benoit@ens-lyon.fr
  organization: Ecole Normale Supérieure de Lyon & INRIA, Lyon, France
– sequence: 2
  givenname: Aurelien
  surname: Cavelan
  fullname: Cavelan, Aurelien
  email: Aurelien.Cavelan@ens-lyon.fr
  organization: Ecole Normale Supérieure de Lyon & INRIA, Lyon, France
– sequence: 3
  givenname: Valentin
  surname: Le Fevre
  fullname: Le Fevre, Valentin
  email: Valentin.Le-Fevre@ens-lyon.fr
  organization: Ecole Normale Supérieure de Lyon & INRIA, Lyon, France
– sequence: 4
  givenname: Yves
  surname: Robert
  fullname: Robert, Yves
  email: Yves.Robert@inria.fr
  organization: Ecole Normale Supérieure de Lyon & INRIA, Lyon, France
– sequence: 5
  givenname: Hongyang
  surname: Sun
  fullname: Sun, Hongyang
  email: sunhongyang@gmail.com
  organization: Ecole Normale Supérieure de Lyon & INRIA, Lyon, France
BackLink https://inria.hal.science/hal-02082416$$DView record in HAL
CODEN ITCOB4
CitedBy_id crossref_primary_10_1007_s11704_022_2096_3
crossref_primary_10_3390_app11031169
crossref_primary_10_1109_TPDS_2018_2844210
crossref_primary_10_1145_3624560
crossref_primary_10_1016_j_future_2024_07_022
crossref_primary_10_1016_j_jocs_2017_03_024
crossref_primary_10_1109_TSUSC_2018_2797890
crossref_primary_10_1016_j_jpdc_2018_08_002
crossref_primary_10_1109_TPDS_2020_3015805
Cites_doi 10.2172/984082
10.1007/978-3-319-20943-2
10.1017/CBO9780511804441
10.1109/TPDS.2016.2546248
10.1109/IPDPS.2014.122
10.1007/978-3-319-17248-4_13
10.1109/TC.2012.17
10.1145/2063384.2063444
10.1145/361147.361115
10.1145/223586.223596
10.1145/2063384.2063427
10.1109/71.730527
10.1049/ip-sen:19982440
10.1016/j.future.2004.11.016
10.1002/cpe.3173
10.1145/2063384.2063443
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2017
Distributed under a Creative Commons Attribution 4.0 International License
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2017
– notice: Distributed under a Creative Commons Attribution 4.0 International License
DOI 10.1109/TC.2016.2643660
DatabaseName IEEE Xplore (IEEE)
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Hyper Article en Ligne (HAL)
Hyper Article en Ligne (HAL) (Open Access)
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 1557-9956
EndPage 1226
ExternalDocumentID oai:HAL:hal-02082416v1
10_1109_TC_2016_2643660
7795220
Genre orig-research
GrantInformation_xml – fundername: French National Research Agency (ANR)
  funderid: 10.13039/501100001665
– fundername: PIA
– fundername: LABEX
  funderid: 10.13039/501100004100
– fundername: Institut Universitaire de France
  funderid: 10.13039/501100004795
– fundername: ELCI
– fundername: Université de Lyon
– fundername: Investissements d’Avenir
  grantid: ANR-11-IDEX-0007
– fundername: ANR
  funderid: 10.13039/501100001665
– fundername: MILYON
  grantid: ANR-10-LABX-0070
ISICitedReferencesCount 19
ISSN 0018-9340
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 7
Keywords Optimal pattern
Fail-stop errors
Multi-level checkpointing
Resilience
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0
LinkModel DirectLink
ORCID 0000-0003-2910-3540
0000-0003-2361-055X
0000-0002-4379-4467
OpenAccessLink https://inria.hal.science/hal-02082416
PQID 2174321309
PQPubID 85452
PageCount 15
ParticipantIDs proquest_journals_2174321309
crossref_citationtrail_10_1109_TC_2016_2643660
crossref_primary_10_1109_TC_2016_2643660
ieee_primary_7795220
hal_primary_oai_HAL_hal_02082416v1
PublicationCentury 2000
PublicationDate 2017-July-1
2017-7-1
20170701
2017-07-01
PublicationDateYYYYMMDD 2017-07-01
PublicationDate_xml – month: 07
  year: 2017
  text: 2017-July-1
  day: 01
PublicationDecade 2010
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on computers
PublicationTitleAbbrev TC
PublicationYear 2017
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Institute of Electrical and Electronics Engineers
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
– name: Institute of Electrical and Electronics Engineers
SourceID hal
proquest
crossref
ieee
SourceType Open Access Repository
Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 1212
SubjectTerms Algorithms
Checkpointing
Computer Science
Computer simulation
Distributed, Parallel, and Cluster Computing
Dynamic programming
Error analysis
fail-stop errors
Heuristic algorithms
multi-level checkpointing
optimal pattern
Optimization
Optimized production technology
Protocols
Resilience
Shape
Title Towards Optimal Multi-Level Checkpointing
URI https://ieeexplore.ieee.org/document/7795220
https://www.proquest.com/docview/2174321309
https://inria.hal.science/hal-02082416
Volume 66