Accelerating communication for parallel programming models on GPU systems

Bibliographic details
Published in:	Parallel computing Vol. 113; p. 102969
Main authors:	Choi, Jaemin, Fink, Zane, White, Sam, Bhat, Nitin, Richards, David F., Kale, Laxmikant V.
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 1 October 2022
Subjects:	AMPI; Charm; Charm4py; CUDA-aware MPI; GPU-aware communication; Python; UCX
ISSN:	0167-8191, 1872-7336

Description
Summary:	As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.1x in Charm++, 11.7x in AMPI, and 17.4x in Charm4py. We also observe increases in bandwidth of up to 10.1x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.
• UCX is used to support GPU-aware communication in the Charm++ ecosystem.
• New Channel API in Charm++ substantially improves communication performance.
• Performance improvements are evaluated with OSU micro-benchmarks and Jacobi3D.
DOI:	10.1016/j.parco.2022.102969