Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment
| Published in: | 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 276-283 |
|---|---|
| Main authors: | Hui Jin, Tao Ke, Yong Chen, Xian-He Sun |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 01.05.2012 |
| ISBN: | 1467313955, 9781467313957 |
| Abstract | Checkpointing is widely used in technical computing. However, the overhead of checkpointing has become a subject of increasing concern in recent years, especially for large-scale parallel computer systems. In these systems, checkpointing generates a huge number of concurrent I/O writes. The burst of writes, combined with the worsening I/O-wall problem, often leads to network and I/O congestion and makes overall system performance painfully slow. Recognizing contention as a dominant performance factor, in this paper we propose a systematic approach named checkpointing orchestration to reduce write contention, which combines the marshaling of concurrent checkpoint requests with the adoption of vertical data access in coordination. A prototype of the proposed checkpointing orchestration approach has been implemented at the system level under Open MPI over the PVFS2 file system. Extensive experiments based on the NPB benchmarks have been conducted to verify the design and implementation. Experimental results show that checkpointing orchestration reduced the checkpointing cost by more than 30%. Checkpointing cost was halved for 4 out of 5 of the class C NPB benchmarks. |
|---|---|
| Author | Tao Ke; Xian-He Sun; Hui Jin; Yong Chen |
| Author_xml | 1. Hui Jin, hjin6@iit.edu, Illinois Inst. of Technol., Chicago, IL, USA; 2. Tao Ke, tke1@iit.edu, Illinois Inst. of Technol., Chicago, IL, USA; 3. Yong Chen, yong.chen@ttu.edu, Texas Tech Univ., Lubbock, TX, USA; 4. Xian-He Sun, sun@cs.iit.edu, Illinois Inst. of Technol., Chicago, IL, USA |
| ContentType | Conference Proceeding |
| DOI | 10.1109/CCGrid.2012.61 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 0769546919 9780769546919 |
| EndPage | 283 |
| ExternalDocumentID | 6217432 |
| Genre | orig-research |
| ISBN | 1467313955 9781467313957 |
| IngestDate | Wed Aug 27 04:23:59 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| PageCount | 8 |
| ParticipantIDs | ieee_primary_6217432 |
| PublicationCentury | 2000 |
| PublicationDate | 2012-May |
| PublicationDateYYYYMMDD | 2012-05-01 |
| PublicationDate_xml | – month: 05 year: 2012 text: 2012-May |
| PublicationDecade | 2010 |
| PublicationTitle | 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing |
| PublicationTitleAbbrev | ccgrid |
| PublicationYear | 2012 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 276 |
| SubjectTerms | Bandwidth; Benchmark testing; Checkpointing; Fault tolerance; Fault tolerant systems; File systems; Parallel File System; Servers |
| Title | Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment |
| URI | https://ieeexplore.ieee.org/document/6217432 |
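
The abstract describes marshaling concurrent checkpoint requests so that fewer writers contend for the file system at once. The sketch below illustrates only that general idea, assuming a simple admission gate that lets at most `max_writers` processes write at a time. It is not the paper's implementation, which is built at the system level in Open MPI over PVFS2 and whose details (including vertical data access) the abstract does not give. The names `orchestrated_checkpoint`, `max_writers`, and `write_fn` are hypothetical.

```python
import threading


def orchestrated_checkpoint(num_procs, max_writers, write_fn):
    """Run write_fn(rank) for every rank, admitting at most max_writers at once.

    Returns the sorted list of ranks that finished and the peak number of
    simultaneous writers that was observed.
    """
    gate = threading.Semaphore(max_writers)  # hypothetical admission gate
    lock = threading.Lock()                  # protects the shared counters
    state = {"active": 0, "peak": 0}
    done = []

    def worker(rank):
        with gate:  # block until a write slot is free
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            try:
                write_fn(rank)  # stand-in for writing one checkpoint image
            finally:
                with lock:
                    state["active"] -= 1
                    done.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done), state["peak"]
```

Calling `orchestrated_checkpoint(8, 2, write_fn)` runs eight simulated checkpoint writes with no more than two in flight at any moment; an uncoordinated burst corresponds to setting `max_writers` equal to `num_procs`.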

