Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment

Checkpointing is widely used in technical computing. However, the overhead of checkpointing has been a subject of increasing concern in recent years, especially for large-scale parallel computer systems. In these systems, checkpointing generates a huge number of concurrent I/O writes. The burst of w...

Detailed bibliography
Published in: 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 276-283
Main authors: Hui Jin, Tao Ke, Yong Chen, Xian-He Sun
Format: Conference paper
Language: English
Publication details: IEEE, 1 May 2012
Subject:
ISBN: 1467313955, 9781467313957
Abstract Checkpointing is widely used in technical computing. However, the overhead of checkpointing has been a subject of increasing concern in recent years, especially for large-scale parallel computer systems. In these systems, checkpointing generates a huge number of concurrent I/O writes. The burst of writes, combined with the worsening I/O-wall problem, often leads to network and I/O congestion and makes overall system performance painfully slow. Recognizing contention as a dominant performance factor, in this paper we propose a systematic approach named checkpointing orchestration to reduce write contention, which combines the marshaling of concurrent checkpoint requests with the coordinated adoption of vertical data access. A prototype of the proposed checkpointing orchestration approach has been implemented at the system level under Open MPI over the PVFS2 file system. Extensive experiments based on the NPB benchmarks have been conducted to verify the design and implementation. Experimental results show that checkpointing orchestration reduced the checkpointing cost by more than 30%. Checkpointing cost was halved for 4 out of the 5 Class C NPB benchmarks.
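The abstract names the marshaling of concurrent checkpoint requests as one way to reduce write contention, but it does not describe the mechanism. The following is only a toy sketch of the general idea, assuming a simple admission gate that lets a bounded number of writers into the write phase at once. It is not the authors' Open MPI/PVFS2 implementation, it does not model vertical data access, and the names `run_checkpoint`, `max_concurrent_writers`, and `write_checkpoint` are hypothetical.

```python
import threading
import time


def run_checkpoint(num_procs, max_concurrent_writers):
    """Simulate num_procs processes writing checkpoints while admitting
    at most max_concurrent_writers into the write phase at a time.

    Returns (peak number of simultaneous writers, sorted ranks written).
    """
    # Hypothetical admission gate: an assumption of this sketch,
    # not a mechanism taken from the paper.
    gate = threading.Semaphore(max_concurrent_writers)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0, "written": []}

    def write_checkpoint(rank):
        with gate:  # block until a write slot is free
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.001)  # stand-in for the actual checkpoint write
            with lock:
                state["written"].append(rank)
                state["active"] -= 1

    threads = [
        threading.Thread(target=write_checkpoint, args=(rank,))
        for rank in range(num_procs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"], sorted(state["written"])


if __name__ == "__main__":
    peak, ranks = run_checkpoint(num_procs=16, max_concurrent_writers=4)
    print(f"peak concurrent writers: {peak}, checkpoints written: {len(ranks)}")
```

With the gate in place, the number of simultaneous writers never exceeds the configured limit, whereas an ungated run could have all 16 simulated processes writing at once. How a real system should choose the limit, and in what order requests should be admitted, is not stated in the abstract.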
Author Tao Ke
Xian-He Sun
Hui Jin
Yong Chen
Author_xml – sequence: 1
  surname: Jin
  fullname: Hui Jin
  email: hjin6@iit.edu
  organization: Illinois Institute of Technology, Chicago, IL, USA
– sequence: 2
  surname: Ke
  fullname: Tao Ke
  email: tke1@iit.edu
  organization: Illinois Institute of Technology, Chicago, IL, USA
– sequence: 3
  surname: Chen
  fullname: Yong Chen
  email: yong.chen@ttu.edu
  organization: Texas Tech University, Lubbock, TX, USA
– sequence: 4
  surname: Sun
  fullname: Xian-He Sun
  email: sun@cs.iit.edu
  organization: Illinois Institute of Technology, Chicago, IL, USA
ContentType Conference Proceeding
DOI 10.1109/CCGrid.2012.61
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 0769546919
9780769546919
EndPage 283
ExternalDocumentID 6217432
Genre orig-research
ISBN 1467313955
9781467313957
IngestDate Wed Aug 27 04:23:59 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 8
ParticipantIDs ieee_primary_6217432
PublicationCentury 2000
PublicationDate 2012-May
PublicationDateYYYYMMDD 2012-05-01
PublicationDate_xml – month: 05
  year: 2012
  text: 2012-May
PublicationDecade 2010
PublicationTitle 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
PublicationTitleAbbrev ccgrid
PublicationYear 2012
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 276
SubjectTerms Bandwidth
Benchmark testing
Checkpointing
Fault tolerance
Fault tolerant systems
File systems
Parallel File System
Servers
Title Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment
URI https://ieeexplore.ieee.org/document/6217432