Towards high scalability and fine-grained parallelism on distributed HPC platforms
| Title: | Towards high scalability and fine-grained parallelism on distributed HPC platforms |
|---|---|
| Authors: | De Haro Ruiz, Juan Miguel, Álvarez Martínez, Carlos, Jiménez González, Daniel, Morais, Lucas Henrique, Martorell Bofill, Xavier |
| Contributors: | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. PM - Programming Models |
| Publication Year: | 2025 |
| Collection: | Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge |
| Subject Terms: | Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors, Task scheduling, Task-based programming, MPI, FPGA, RISC-V, Accelerators, HPC, Distributed computing, Runtime systems |
| Description: | Current High-Performance Computing systems rely on massive parallelism to achieve exascale performance. They use task scheduling and message-passing programming models to exploit complementary sources of parallelism. Combining the two holds the promise of allowing seamless exploitation of both intra- and inter-node concurrency while leveraging widely known programming abstractions. Still, the interaction between the two raises coordination problems that could make work distribution excessively costly, limiting performance. This work is the first to evaluate comprehensive hardware acceleration of their combined use, integrating them in a programming model that further exploits their synergies. The hardware/software co-design approach proposed for this purpose is prototyped on a cluster of 64 FPGA nodes, each of which holds a RISC-V Rocket Chip CPU with 8 cores. On one hand, this article combines OMPIF and Picos, hardware accelerators for message passing and task scheduling, respectively, which interface with the CPU through RoCC-based custom RISC-V instructions. On the other hand, we present the Implicit Message Passing (IMP) programming model, which extends task scheduling abstractions to leverage MPI-mediated inter-node parallelism without requiring explicit MPI calls. Thus, IMP transparently allows the dataflow-style execution induced by task scheduling to span multiple nodes. We implement three benchmarks (N-body, Heat, and Cholesky), each with two different strategies (IMP and explicit MPI), and evaluate them on the multi-core FPGA-based cluster. We demonstrate that our hardware/software co-design approach achieves near-linear scalability with IMP and the OMPIF/Picos accelerators, and reduces task management overhead from 2200 to 300 cycles per task. Furthermore, when leveraging all 512 cores (split among the 64 nodes), we measure speedups of 2.04x (40x in communication), 1.25x (7x in communication), and 7.29x (25x in communication) compared with unaccelerated MPI for N-body, Heat, and Cholesky respectively. ... (see the illustrative sketch following this record) |
| Document Type: | article in journal/newspaper |
| File Description: | 22 p.; application/pdf |
| Language: | English |
| Relation: | info:eu-repo/grantAgreement/EC/H2020/946002/EU/The MareNostrum Experimental Exascale Platform/MEEP; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2023-147979NB-C21/ES/HERRAMIENTAS SOFTWARE PARA HPC - BSC/; https://hdl.handle.net/2117/451993 |
| DOI: | 10.1145/3774815 |
| Availability: | https://hdl.handle.net/2117/451993 https://doi.org/10.1145/3774815 |
| Rights: | http://creativecommons.org/licenses/by/4.0/ ; Open Access |
| Accession Number: | edsbas.FE95F89F |
| Database: | BASE |
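
The abstract contrasts IMP's implicit, dependency-driven communication with explicit MPI message passing. As a rough illustration of that idea only (not the paper's actual IMP API), the C sketch below uses standard OpenMP task/depend syntax for a blocked Heat-style stencil; under a model like IMP, the same dependency information would let the runtime and the Picos/OMPIF accelerators move halo data between nodes, with no MPI_Send/MPI_Recv in user code. Block size, function names, and the stencil itself are hypothetical placeholders.

```c
/* Illustrative sketch only: plain OpenMP 4.5 task/depend directives stand in
 * for the dataflow abstraction described in the abstract. Under IMP, these
 * dependencies would be resolved across nodes (Picos scheduling + OMPIF
 * messaging), so the programmer never writes explicit MPI calls. */
#include <stddef.h>

#define BS 1024                          /* block size (hypothetical) */

/* Simple 1-D three-point stencil over one block, using neighbour halos
 * (placeholder kernel, not the paper's Heat implementation). */
static void heat_step(const double *left, const double *right, double *block)
{
    for (size_t i = 0; i < BS; ++i) {
        double lv = (i == 0)      ? left[BS - 1] : block[i - 1];
        double rv = (i == BS - 1) ? right[0]     : block[i + 1];
        block[i] = 0.5 * block[i] + 0.25 * (lv + rv);
    }
}

void heat_timestep(double *blocks[], size_t nblocks)
{
    for (size_t b = 1; b + 1 < nblocks; ++b) {
        double *l = blocks[b - 1], *r = blocks[b + 1], *c = blocks[b];
        /* Each task reads its neighbours' halos and updates its own block;
         * the dependency graph, not explicit messages, drives data movement. */
        #pragma omp task depend(in: l[0:BS], r[0:BS]) depend(inout: c[0:BS])
        heat_step(l, r, c);
    }
    #pragma omp taskwait
}
```

In plain OpenMP these tasks would be confined to one node; the point the abstract makes is that IMP keeps this programming style while the runtime and hardware accelerators extend the dependency-driven execution across the 64-node cluster.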