Scalable I/O aggregation for asynchronous multi-level checkpointing


Bibliographic Details
Published in: Future Generation Computer Systems, Vol. 160, No. C, pp. 420-432
Main Authors: Gossman, Mikaila J., Nicolae, Bogdan, Calhoun, Jon C.
Format: Journal Article
Language: English
Published: Netherlands: Elsevier B.V., 01.11.2024
ISSN: 0167-739X
Description
Summary: Checkpointing distributed HPC applications is a common I/O pattern with many use cases: resilience, job management, reproducibility, revisiting previous intermediate results, etc. This is a difficult pattern for a large number of processes that need to capture massive data sizes and write them persistently to shared storage (e.g., a parallel file system), which is subject to I/O bottlenecks due to limited I/O bandwidth under concurrency. In addition to I/O performance and scalability considerations, there are often limits that users impose on the number of files or objects that can be used to capture the checkpoints. For example, users need to move checkpoints between HPC systems or parallel file systems, which is inefficient for a large number of files, or need to use the checkpoints in workflows that expect related objects to be grouped together. As a consequence, I/O aggregation is often used to reduce the number of files and objects persisted to shared storage such that it is much lower than the number of processes. However, I/O aggregation is challenging for two reasons: (1) if more than one process writes checkpointing data to the same file, this causes additional I/O contention that amplifies the I/O bottlenecks; (2) scalable state-of-the-art checkpointing techniques are asynchronous and rely on multi-level techniques that capture the data structures to local storage or memory, then flush them from there to shared storage in the background, which competes for resources (I/O, memory, network bandwidth) with the application running in the foreground. State-of-the-art approaches have addressed the problem of I/O aggregation for synchronous checkpointing but are insufficient for asynchronous checkpointing. To fill this gap, we contribute a novel I/O aggregation strategy that operates efficiently in the background to complement asynchronous checkpoint/restart (C/R).
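The multi-level asynchronous pattern the summary describes (capture state to fast local memory synchronously, then flush to slower shared storage in the background) can be sketched as follows. This is a minimal illustration under assumed names (`Checkpointer`, `checkpoint`, `finalize`), not the paper's implementation; a directory stands in for shared storage and a thread stands in for the background flusher.

```python
import os
import queue
import tempfile
import threading

class Checkpointer:
    """Two-level asynchronous checkpointing sketch (illustrative only)."""

    def __init__(self, shared_dir):
        self.shared_dir = shared_dir
        self.pending = queue.Queue()
        self.flusher = threading.Thread(target=self._flush_worker, daemon=True)
        self.flusher.start()

    def checkpoint(self, version, data: bytes):
        # Level 1: capture to local memory -- fast, blocks the app only briefly.
        local_copy = bytes(data)
        # Level 2: hand off to the background flusher -- the app resumes at once.
        self.pending.put((version, local_copy))

    def _flush_worker(self):
        # Runs in the background, competing with the foreground application
        # for I/O bandwidth -- the resource-sharing problem the paper targets.
        while True:
            item = self.pending.get()
            if item is None:  # sentinel: no more checkpoints
                break
            version, blob = item
            path = os.path.join(self.shared_dir, f"ckpt.{version}")
            with open(path, "wb") as f:
                f.write(blob)

    def finalize(self):
        self.pending.put(None)
        self.flusher.join()

with tempfile.TemporaryDirectory() as d:
    ckpt = Checkpointer(d)
    for v in range(3):
        ckpt.checkpoint(v, f"state-{v}".encode())  # returns immediately
    ckpt.finalize()
    print(sorted(os.listdir(d)))  # ['ckpt.0', 'ckpt.1', 'ckpt.2']
```

Note that each process writing its own `ckpt.*` files is exactly the one-file-per-process approach whose scalability limits motivate aggregation.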
Specifically, we explore how to (1) develop a network of efficient, thread-safe I/O proxies that persist data via limited-size write buffers, (2) prioritize remote (from non-proxy processes) and local data on I/O proxies to minimize write overhead, and (3) load-balance flushing across I/O proxies. We analyze the trade-offs of developing such strategies and discuss the performance impact on large-scale micro-benchmarks, as well as a real HPC application (HACC).
Highlights:
• Checkpointing is an increasingly frequent and needed operation of HPC applications.
• Asynchronous checkpointing frameworks overlap computation and I/O to mask latency.
• Such overlap results in applications and checkpointing frameworks sharing resources.
• Asynchronous checkpointing uses one-file-per-process writing to ease I/O bottlenecks.
• However, file-per-process writing is unsustainable for users and systems at scale.
• Aggregation is necessary to alleviate usability and performance bottlenecks.
• Yet, the impact of aggregation on asynchronous checkpointing is largely unexplored.
• We implement an optimized aggregation scheme designed for asynchronous checkpointing.
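Points (1) and (2) above can be sketched for a single I/O proxy: a bounded write buffer absorbs checkpoint fragments from local and remote (non-proxy) processes, a priority queue lets one class of data be flushed ahead of the other, and a background thread appends everything into one aggregated file. All class and parameter names (`IOProxy`, `buffer_slots`, `REMOTE`/`LOCAL`) are assumptions for illustration, not the paper's API; load balancing across multiple proxies, point (3), is omitted.

```python
import os
import queue
import tempfile
import threading

REMOTE, LOCAL = 0, 1  # lower value = flushed first (assumed policy)

class IOProxy:
    """Sketch of one thread-safe I/O proxy with a bounded write buffer."""

    def __init__(self, aggregated_file, buffer_slots=4):
        self.path = aggregated_file
        # Bounded priority queue: producers block when the buffer is full,
        # which caps the proxy's memory footprint (the limited-size buffer).
        self.buffer = queue.PriorityQueue(maxsize=buffer_slots)
        self.seq = 0
        self.lock = threading.Lock()
        self.flusher = threading.Thread(target=self._flush, daemon=True)
        self.flusher.start()

    def submit(self, origin, data: bytes):
        with self.lock:  # thread-safe: many processes feed one proxy
            self.seq += 1
            seq = self.seq
        # (priority, seq) preserves FIFO order within each priority class.
        self.buffer.put((origin, seq, data))

    def _flush(self):
        # Background flusher: appends fragments into a single aggregated
        # file, so many writers no longer contend on shared storage.
        with open(self.path, "ab") as f:
            while True:
                origin, seq, data = self.buffer.get()
                if data is None:  # sentinel
                    return
                f.write(data)
                f.flush()

    def finalize(self):
        self.buffer.put((2, 0, None))  # priority 2 sorts after real work
        self.flusher.join()

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "agg.ckpt")
    proxy = IOProxy(path)
    proxy.submit(LOCAL, b"<l0>")
    proxy.submit(REMOTE, b"<r0>")
    proxy.finalize()
    content = open(path, "rb").read()
    assert b"<l0>" in content and b"<r0>" in content
```

Prioritizing remote fragments in this sketch reflects the idea that non-proxy senders should be able to release their buffers quickly, while local data can wait without blocking anyone else.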
Bibliography: USDOE
AC02-06CH11357
DOI: 10.1016/j.future.2024.06.003