Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models

Bibliographic Details
Title: Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models
Authors: Huber, Dominik, Schreiber, Martin, Schulz, Martin, Pritchard, Howard, Holmes, Daniel
Contributors: Technische Universität München - Technical University Munich - Université Technique de Munich (TUM), Université Grenoble Alpes (UGA), Laboratoire Jean Kuntzmann (LJK), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Mathematics and computing applied to oceanic and atmospheric flows (AIRSEA), Centre Inria de l'Université Grenoble Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA)-Laboratoire Jean Kuntzmann (LJK), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), École nationale supérieure d'informatique et de mathématiques appliquées (Grenoble INP ENSIMAG), Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Los Alamos National Laboratory (LANL), Intel Corporation Santa Clara, Intel Corporation USA, European Project: 955606,H2020-JTI-EuroHPC-2019-1,H2020-JTI-EuroHPC-2019-1,DEEP-SEA(2021), European Project: 956560,H2020-JTI-EuroHPC-2019-1,H2020-JTI-EuroHPC-2019-1,REGALE(2021), European Project: 955701,H2020-JTI-EuroHPC-2019-1,H2020-JTI-EuroHPC-2019-1,TIME-X(2021)
Source: https://hal.science/hal-04523470 ; 2024.
Publisher Information: CCSD
Publication Year: 2024
Collection: Université Grenoble Alpes: HAL
Subject Terms: Dynamic Resource Management, Parallel Programming Models, High Performance Computing, [INFO]Computer Science [cs]
Description: With Dynamic Resource Management (DRM), the resources assigned to a job can be changed dynamically during its execution. From the system's perspective, DRM opens a new level of flexibility in resource allocation and job scheduling and therefore has the potential to improve system efficiency metrics such as utilization rate, job throughput, energy efficiency, and responsiveness. From the application perspective, users can tailor the resources they request to their needs, offering potential optimizations in queuing time or charged costs. Despite these obvious advantages and many attempts over the last decade to establish DRM in HPC, it remains a concept discussed in academia rather than one successfully deployed on production systems. This stems from the fact that support for DRM requires changes in all layers of the HPC system software stack, including applications, programming models, process managers, and resource management software, as well as an extensive and holistic co-design process to establish new techniques and policies for scheduling and resource optimization. In this work, we therefore start with the assumption that resources are accessible by processes either executed on them (e.g., on a CPU) or controlling them (e.g., GPU offloading). Then, the overall DRM problem can be decomposed into dynamic process management (DPM) and dynamic resource mapping or allocation (DRA). The former determines which processes (or which change in processes) must be managed, and the latter identifies the resources where they will be executed. The interfaces for such DPM/DRA in these layers need to be standardized, which requires a careful design to be interoperable while providing high flexibility. Based on a survey of existing approaches, we propose design principles that form the basis of a holistic approach to DRM in HPC and provide a prototype implementation using MPI.
Document Type: report
Language: English
Relation: info:eu-repo/semantics/altIdentifier/arxiv/2403.17107; info:eu-repo/grantAgreement//955606/EU/DEEP – SOFTWARE FOR EXASCALE ARCHITECTURES/DEEP-SEA; info:eu-repo/grantAgreement//956560/EU/An open architecture to equip next generation HPC applications with exascale capabilities/REGALE; info:eu-repo/grantAgreement//955701/EU/TIME parallelisation: for eXascale computing and beyond/TIME-X; ARXIV: 2403.17107
Availability: https://hal.science/hal-04523470
Accession Number: edsbas.FBABB60C
Database: BASE