Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model
The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with differ...
| Published in: | IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 1, pp. 244-259 |
|---|---|
| Main authors: | Sheng Di, Yves Robert, Frederic Vivien, Franck Cappello |
| Format: | Journal Article |
| Language: | English |
| Publication details: | New York: IEEE, 01.01.2017. The Institute of Electrical and Electronics Engineers, Inc. (IEEE); Institute of Electrical and Electronics Engineers |
| Subject: | |
| ISSN: | 1045-9219, 1558-2183 |
| Online access: | Get full text |
| Abstract | The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint intervals for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals with hardware crashes such as node failures. Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution without requiring knowledge of the job length in advance, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed constructed based on a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve the performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3 percent over that of other state-of-the-art approaches. Finally, a brute-force comparison with all possible patterns shows that our solution is always within 1 percent of the best pattern in the experiments. |
|---|---|
| Author | Robert, Yves; Cappello, Franck; Di, Sheng; Vivien, Frederic |
| Author_xml | 1. Sheng Di (disheng222@gmail.com), Math. & Comput. Sci. (MCS) Div., Argonne Nat. Lab., Chicago, IL, USA; 2. Yves Robert (Yves.Robert@inria.fr), Lab. LIP, UCB Lyon, Lyon, France; 3. Frederic Vivien (frederic.vivien@inria.fr), Lab. LIP, UCB Lyon, Lyon, France; 4. Franck Cappello (cappello@mcs.anl.gov), Math. & Comput. Sci. (MCS) Div., Argonne Nat. Lab., Chicago, IL, USA |
| BackLink | https://inria.hal.science/hal-01353871 (View record in HAL) |
| CODEN | ITDSEO |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2017. Distributed under a Creative Commons Attribution 4.0 International License. |
| DOI | 10.1109/TPDS.2016.2546248 |
| Discipline | Engineering Computer Science |
| EISSN | 1558-2183 |
| EndPage | 259 |
| Genre | orig-research |
| GrantInformation_xml | Office of Science (funder ID 10.13039/100006132); US Department of Energy (funder ID 10.13039/100000015); European project SCoRPiO; Institut Universitaire de France (funder ID 10.13039/501100004795); US Department of Energy Office of Science laboratory (grant DE-AC02-06CH11357); Advanced Scientific Computing Research Program (grant DE-AC02-06CH11357) |
| ISICitedReferencesCount | 39 |
| ISSN | 1045-9219 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 1 |
| Keywords | High-Performance Computing; Multilevel Checkpoint; Optimization; Fault Tolerance |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html ; https://doi.org/10.15223/policy-029 ; https://doi.org/10.15223/policy-037 ; Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0 |
| ORCID | 0000-0003-2361-055X 0000-0002-0663-6152 |
| OpenAccessLink | https://inria.hal.science/hal-01353871 |
| PageCount | 16 |
| PublicationCentury | 2000 |
| PublicationDate | 2017-01-01 |
| PublicationDecade | 2010 |
| PublicationPlace | New York |
| PublicationTitle | IEEE Transactions on Parallel and Distributed Systems |
| PublicationTitleAbbrev | TPDS |
| PublicationYear | 2017 |
| Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE); Institute of Electrical and Electronics Engineers |
| StartPage | 244 |
| SubjectTerms | Checkpointing; Computational modeling; Computer crashes; Computer Science; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Error recovery; Fault tolerance; Fault tolerant systems; High-performance computing; Mathematical models; Multilevel; multilevel checkpoint; Optimization; Transient analysis |
| Title | Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model |
| URI | https://ieeexplore.ieee.org/document/7442859 ; https://www.proquest.com/docview/1848276460 ; https://inria.hal.science/hal-01353871 |
| Volume | 28 |