Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment


Bibliographic details
Published in: 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 276-283
Main authors: Hui Jin, Tao Ke, Yong Chen, Xian-He Sun
Format: Conference paper
Language: English
Published: IEEE, 1 May 2012
ISBN: 1467313955, 9781467313957
Description
Summary: Checkpointing is widely used in technical computing. However, the overhead of checkpointing has become a subject of increasing concern in recent years, especially for large-scale parallel computer systems. In these systems, checkpointing generates a huge number of concurrent I/O writes. This burst of writes, combined with the worsening I/O-wall problem, often leads to network and I/O congestion and makes overall system performance painfully slow. Recognizing contention as a dominant performance factor, in this paper we propose a systematic approach named checkpointing orchestration to reduce write contention, which combines the marshaling of concurrent checkpoint requests and the adoption of vertical data access in coordination. A prototype of the proposed checkpointing orchestration approach has been implemented at the system level under Open MPI over the PVFS2 file system. Extensive experiments based on the NPB benchmarks have been conducted to verify the design and implementation. Experimental results show that checkpointing orchestration reduced the checkpointing cost by more than 30%. Checkpointing cost was halved for 4 of the 5 class C NPB benchmarks.
DOI: 10.1109/CCGrid.2012.61