Scalable I/O aggregation for asynchronous multi-level checkpointing

Checkpointing distributed HPC applications is a common I/O pattern with many use cases: resilience, job management, reproducibility, revisiting previous intermediate results, etc. This is a difficult pattern for a large number of processes that need to capture massive data sizes and write them persi...

Bibliographic Details
Published in: Future Generation Computer Systems, Vol. 160, no. C, pp. 420-432
Main Authors: Gossman, Mikaila J.; Nicolae, Bogdan; Calhoun, Jon C.
Format: Journal Article
Language: English
Published: Netherlands: Elsevier B.V., 01.11.2024
ISSN: 0167-739X