Checkpointing Tools in a Supercomputer Center

Bibliographic details
Published in: Lobachevskii Journal of Mathematics, Vol. 41, No. 12, pp. 2603-2613
Main authors: Savin, G. I., Shabanov, B. M., Fedorov, R. S., Baranov, A. V., Telegin, P. N.
Format: Journal Article
Language: English
Published: Moscow: Pleiades Publishing, 01.12.2020
Springer Nature B.V.
ISSN: 1995-0802, 1818-9962
Description
Summary: The article addresses automatic checkpoint creation and data restoration for jobs that run on a single supercomputer node. It formulates the requirements for checkpoint and restore tools in a supercomputer job management system. The Berkeley Lab Checkpoint/Restart (BLCR), Checkpoint Restore In Userspace (CRIU), and Distributed MultiThreaded Checkpointing (DMTCP) tools are examined, and DMTCP is shown to meet the stated requirements better than the other two. Experimental estimates of computational performance and of the impact on efficiency are presented for DMTCP. The problems of integrating checkpointing tools into the SUPPZ job management system used at the JSCC RAS are considered. Recommendations on the practical use of automatic checkpoint/restore tools are given.
DOI: 10.1134/S1995080220120355