Uniconn: A Uniform High-Level Communication Library for Portable Multi-GPU Programming

Detailed Bibliography
Published in: Proceedings / IEEE International Conference on Cluster Computing, pp. 1-12
Main Authors: Sagbili, Dogan; Ekmekcibasi, Sinan; Ibrahim, Khaled Z.; Nguyen, Tan; Unat, Didem
Format: Conference paper
Language: English
Published: IEEE, 2 September 2025
ISSN: 2168-9253
DOI: 10.1109/CLUSTER59342.2025.11186498
Description
Summary: Modern HPC and AI systems increasingly rely on multi-GPU clusters, where communication libraries such as MPI, NCCL/RCCL, and NVSHMEM enable data movement across GPUs. While these libraries are widely used in frameworks and solver packages, their distinct APIs, synchronization models, and integration mechanisms introduce programming complexity and limit portability. Performance also varies across workloads and system architectures, making it difficult to achieve consistent efficiency. These issues present a significant obstacle to writing portable, high-performance code for large-scale GPU systems. We present Uniconn, a unified, portable high-level C++ communication library that supports both point-to-point and collective operations across GPU clusters. Uniconn enables seamless switching between backends and APIs (host or device) with minimal or no changes to application code. We describe its design and core constructs, and evaluate its performance using network benchmarks, a Jacobi solver, and a Conjugate Gradient solver. Across three supercomputers, we compare Uniconn's overhead against CUDA/ROCm-aware MPI, NCCL/RCCL, and NVSHMEM on up to 64 GPUs. In most cases, Uniconn incurs negligible overhead, typically under 1% for the Jacobi solver and under 2% for the Conjugate Gradient solver.
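To make the idea in the summary concrete, below is a minimal, hypothetical C++ sketch of a backend-agnostic communication facade. Every name in it (Backend, Comm, send, allreduce_sum) is invented for illustration and is not Uniconn's actual API; the library calls named in the comments (MPI_Send, ncclSend, nvshmem_putmem, MPI_Allreduce, ncclAllReduce) are real backend functions mentioned only as placeholders for where a unified layer would dispatch.

// Hypothetical sketch, not Uniconn's actual API: one facade over
// several GPU communication backends, chosen by a template parameter.
// Real backend calls would replace the comments in each branch.
#include <cstddef>
#include <cstdio>

enum class Backend { Mpi, Nccl, Nvshmem };

template <Backend B>
struct Comm {
    // Point-to-point send of a (device) buffer to a peer rank.
    void send(const void* buf, std::size_t bytes, int peer) {
        if constexpr (B == Backend::Mpi) {
            // GPU-aware MPI path, e.g. MPI_Send on a device pointer
        } else if constexpr (B == Backend::Nccl) {
            // NCCL path, e.g. ncclSend inside ncclGroupStart/ncclGroupEnd
        } else {
            // NVSHMEM path, e.g. nvshmem_putmem to a symmetric buffer
        }
        std::printf("send %zu bytes to rank %d\n", bytes, peer);
    }

    // Collective sum-reduction across all ranks; a real facade would
    // dispatch to MPI_Allreduce, ncclAllReduce, or an NVSHMEM reduction.
    void allreduce_sum(float* data, std::size_t count) {
        (void)data;
        (void)count;
    }
};

int main() {
    // Switching backends means changing one template argument;
    // the call sites below stay identical.
    Comm<Backend::Nccl> comm;
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    comm.send(x, sizeof x, /*peer=*/1);
    comm.allreduce_sum(x, 4);
    return 0;
}

A compile-time switch like this mirrors the portability claim in the summary, where changing backends requires minimal or no changes at the call sites; a production library must additionally reconcile the backends' differing synchronization models and host/device APIs, which is what the paper evaluates.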