Towards high scalability and fine-grained parallelism on distributed HPC platforms
| Title: | Towards high scalability and fine-grained parallelism on distributed HPC platforms |
|---|---|
| Authors: | De Haro Ruiz, Juan Miguel, Álvarez Martínez, Carlos, Jiménez González, Daniel, Morais, Lucas Henrique, Martorell Bofill, Xavier |
| Contributors: | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. PM - Programming Models |
| Publication Year: | 2025 |
| Collection: | Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge |
| Subject Terms: | Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors, Task scheduling, Task-based programming, MPI, FPGA, RISC-V, Accelerators, HPC, Distributed computing, Runtime systems |
| Description: | Current High-Performance Computing systems rely on massive parallelism to achieve exascale performance. They use task scheduling and message-passing programming models to exploit complementary sources of parallelism. Combining the two holds the promise of allowing seamless exploitation of both intra- and inter-node concurrency while leveraging widely known programming abstractions. Still, the interaction between the two raises coordination problems that could make work distribution excessively costly, limiting performance. This work is the first to evaluate comprehensive hardware acceleration of their combined use, integrating them in a programming model that further exploits their synergies. The hardware/software co-design approach proposed for this purpose is prototyped on a cluster of 64 FPGA nodes, each of which holds a RISC-V Rocket Chip CPU with 8 cores. On one hand, this article combines OMPIF and Picos, hardware accelerators for message passing and task scheduling, respectively, which interface with the CPU through RoCC-based custom RISC-V instructions. On the other hand, we present the Implicit Message Passing (IMP) programming model, which extends task scheduling abstractions to leverage MPI-mediated inter-node parallelism without requiring explicit MPI calls. Thus, IMP transparently allows the dataflow-style execution induced by task scheduling to span multiple nodes. We implement three benchmarks (N-body, Heat, and Cholesky), each with two different strategies (IMP and explicit MPI), and evaluate them on the multi-core FPGA-based cluster. We demonstrate that our hardware/software co-design approach achieves near-linear scalability with IMP and the OMPIF/Picos accelerators, and reduces task management overhead from 2200 to 300 cycles per task. Furthermore, when leveraging all 512 cores (split among the 64 nodes), we measure speedups of 2.04x (40x in communication), 1.25x (7x in communication), and 7.29x (25x in communication) compared with unaccelerated MPI for N-body, Heat, and Cholesky respectively. ... (see the illustrative sketch following this record) |
| Document Type: | article in journal/newspaper |
| File Description: | 22 p.; application/pdf |
| Language: | English |
| Relation: | info:eu-repo/grantAgreement/EC/H2020/946002/EU/The MareNostrum Experimental Exascale Platform/MEEP; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2023-147979NB-C21/ES/HERRAMIENTAS SOFTWARE PARA HPC - BSC/; https://hdl.handle.net/2117/451993 |
| DOI: | 10.1145/3774815 |
| Availability: | https://hdl.handle.net/2117/451993 https://doi.org/10.1145/3774815 |
| Rights: | http://creativecommons.org/licenses/by/4.0/ ; Open Access |
| Accession Number: | edsbas.FE95F89F |
| Database: | BASE |
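
The abstract contrasts IMP's implicit, dependency-driven communication with explicit MPI message passing. As a rough illustration of that idea only (not the paper's actual IMP API), the C sketch below uses standard OpenMP task/depend syntax for a blocked Heat-style stencil; under a model like IMP, the same dependency information would let the runtime and the Picos/OMPIF accelerators move halo data between nodes, with no MPI_Send/MPI_Recv in user code. Block size, function names, and the stencil itself are hypothetical placeholders.

```c
/* Illustrative sketch only: plain OpenMP 4.5 task/depend directives stand in
 * for the dataflow abstraction described in the abstract. Under IMP, these
 * dependencies would be resolved across nodes (Picos scheduling + OMPIF
 * messaging), so the programmer never writes explicit MPI calls. */
#include <stddef.h>

#define BS 1024                          /* block size (hypothetical) */

/* Simple 1-D three-point stencil over one block, using neighbour halos
 * (placeholder kernel, not the paper's Heat implementation). */
static void heat_step(const double *left, const double *right, double *block)
{
    for (size_t i = 0; i < BS; ++i) {
        double lv = (i == 0)      ? left[BS - 1] : block[i - 1];
        double rv = (i == BS - 1) ? right[0]     : block[i + 1];
        block[i] = 0.5 * block[i] + 0.25 * (lv + rv);
    }
}

void heat_timestep(double *blocks[], size_t nblocks)
{
    for (size_t b = 1; b + 1 < nblocks; ++b) {
        double *l = blocks[b - 1], *r = blocks[b + 1], *c = blocks[b];
        /* Each task reads its neighbours' halos and updates its own block;
         * the dependency graph, not explicit messages, drives data movement. */
        #pragma omp task depend(in: l[0:BS], r[0:BS]) depend(inout: c[0:BS])
        heat_step(l, r, c);
    }
    #pragma omp taskwait
}
```

In plain OpenMP these tasks would be confined to one node; the point the abstract makes is that IMP keeps this programming style while the runtime and hardware accelerators extend the dependency-driven execution across the 64-node cluster.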