Towards Optimal Multi-Level Checkpointing
We provide a framework to analyze multi-level checkpointing protocols, by formally defining a $k$-level checkpointing patter...
| Published in: | IEEE Transactions on Computers, Vol. 66, No. 7, pp. 1212-1226 |
|---|---|
| Main authors: | Benoit, Anne; Cavelan, Aurelien; Le Fevre, Valentin; Robert, Yves; Sun, Hongyang |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 1 July 2017 |
| Subjects: | |
| ISSN: | 0018-9340, 1557-9956 |
| Online access: | Get full text |
| Abstract | We provide a framework to analyze multi-level checkpointing protocols, by formally defining a $k$-level checkpointing pattern. We provide a first-order approximation to the optimal checkpointing period, and show that the corresponding overhead is in the order of $\sum_{\ell=1}^{k}\sqrt{2\lambda_\ell C_\ell}$, where $\lambda_\ell$ is the error rate at level $\ell$, and $C_\ell$ the checkpointing cost at level $\ell$. This nicely extends the classical Young/Daly formula on single-level checkpointing. Furthermore, we are able to fully characterize the shape of the optimal pattern (number and positions of checkpoints), and we provide a dynamic programming algorithm to determine the optimal subset of levels to be used. Finally, we perform simulations to check the accuracy of the theoretical study and to confirm the optimality of the subset of levels returned by the dynamic programming algorithm. The results nicely corroborate the theoretical study, and demonstrate the usefulness of multi-level checkpointing with the optimal subset of levels. |
|---|---|
| Author | Benoit, Anne; Le Fevre, Valentin; Robert, Yves; Cavelan, Aurelien; Sun, Hongyang |
| Author_xml | 1. Benoit, Anne (Anne.Benoit@ens-lyon.fr); 2. Cavelan, Aurelien (Aurelien.Cavelan@ens-lyon.fr); 3. Le Fevre, Valentin (Valentin.Le-Fevre@ens-lyon.fr); 4. Robert, Yves (Yves.Robert@inria.fr); 5. Sun, Hongyang (sunhongyang@gmail.com). All authors: Ecole Normale Supérieure de Lyon & INRIA, Lyon, France |
| BackLink | https://inria.hal.science/hal-02082416 (View record in HAL) |
| CODEN | ITCOB4 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2017 Distributed under a Creative Commons Attribution 4.0 International License |
| DOI | 10.1109/TC.2016.2643660 |
| Discipline | Engineering; Computer Science |
| EISSN | 1557-9956 |
| EndPage | 1226 |
| Genre | orig-research |
| GrantInformation_xml | French National Research Agency (ANR), funder ID 10.13039/501100001665; PIA; LABEX, funder ID 10.13039/501100004100; Institut Universitaire de France, funder ID 10.13039/501100004795; ELCI; Université de Lyon; Investissements d'Avenir, grant ANR-11-IDEX-0007; ANR, funder ID 10.13039/501100001665; MILYON, grant ANR-10-LABX-0070 |
| ISICitedReferencesCount | 19 |
| ISSN | 0018-9340 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 7 |
| Keywords | Optimal pattern; Fail-stop errors; Multi-level checkpointing; Resilience |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0 |
| ORCID | 0000-0003-2910-3540 0000-0003-2361-055X 0000-0002-4379-4467 |
| OpenAccessLink | https://inria.hal.science/hal-02082416 |
| PageCount | 15 |
| PublicationCentury | 2000 |
| PublicationDate | 2017-07-01 |
| PublicationDateYYYYMMDD | 2017-07-01 |
| PublicationDecade | 2010 |
| PublicationPlace | New York |
| PublicationTitle | IEEE transactions on computers |
| PublicationTitleAbbrev | TC |
| PublicationYear | 2017 |
| Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE); Institute of Electrical and Electronics Engineers |
| StartPage | 1212 |
| SubjectTerms | Algorithms; Checkpointing; Computer Science; Computer simulation; Distributed, Parallel, and Cluster Computing; Dynamic programming; Error analysis; fail-stop errors; Heuristic algorithms; multi-level checkpointing; optimal pattern; Optimization; Optimized production technology; Protocols; Resilience; Shape |
| Title | Towards Optimal Multi-Level Checkpointing |
| URI | https://ieeexplore.ieee.org/document/7795220 https://www.proquest.com/docview/2174321309 https://inria.hal.science/hal-02082416 |
| Volume | 66 |
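
The first-order overhead quoted in the abstract, $\sum_{\ell=1}^{k}\sqrt{2\lambda_\ell C_\ell}$, can be evaluated numerically. The sketch below is only an illustration and is not taken from the paper: the function names and the sample rate and cost are hypothetical. For a single level it also computes the classical Young/Daly period $\sqrt{2C/\lambda}$, which the abstract says the multi-level result extends.

```python
import math


def first_order_overhead(rates, costs):
    """Return sum over levels of sqrt(2 * lambda_l * C_l).

    rates[l] is the error rate at level l, costs[l] the checkpointing
    cost at level l, in consistent time units. Hypothetical helper name.
    """
    if len(rates) != len(costs):
        raise ValueError("rates and costs must have one entry per level")
    return sum(math.sqrt(2.0 * lam * c) for lam, c in zip(rates, costs))


def young_daly_period(rate, cost):
    """Classical single-level Young/Daly period sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * cost / rate)


# Hypothetical single-level values, chosen only to make the arithmetic easy.
lam, c = 0.02, 1.0
period = young_daly_period(lam, c)
overhead = first_order_overhead([lam], [c])

# At the Young/Daly period, the checkpoint cost per unit of work (C / W)
# plus the expected re-execution cost (lambda * W / 2) matches the
# single-level term sqrt(2 * lambda * C).
balance = c / period + lam * period / 2.0
```

With these illustrative inputs, `overhead` and `balance` agree up to floating-point rounding, which is the single-level special case of the sum above.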