Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores
| Published in: | 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1020 - 1029 |
|---|---|
| Main authors: | , , , , |
| Format: | Conference paper |
| Language: | English |
| Published by: | IEEE, 01.05.2018 |
| ISSN: | 1530-2075 |
| DOI: | 10.1109/IPDPS.2018.00111 |
| Online access: | Get full text |
| Abstract: | State-of-the-art designs for the hierarchical reduction collective operation in MPI that work on the concept of distributed address spaces incur the cost of intermediate copies inside the MPI library to stage the data between processes. Such additional copies can severely affect performance, especially on emerging many-core architectures like Intel Xeon/Xeon Phi and OpenPOWER. In this paper, we take up this challenge and study the trade-offs involved in designing high-performance, scalable, "shared address space"-based communication primitives on top of XPMEM using basic point-to-point primitives in MPI. We then redesign the reduction collective operations using the knowledge gained from these initial studies. Our proposed designs at the collective level enable a process to offload communication and computation operations to intra-node peers without the need for additional intermediate copies, resulting in a truly "zero-copy" design for MPI_Reduce and MPI_Allreduce. We further develop a theoretical model to analytically study the impact such designs can have on the performance of collective communication primitives. We evaluate the proposed designs with microbenchmarks as well as HPC and Deep Learning applications on three different multi-/many-core architectures (Broadwell, Knights Landing, and OpenPOWER). The proposed designs show up to 3x improvement in the latency of Reduce and Allreduce benchmarks, up to 37% improvement in the runtime of MiniAMR, and up to 19% reduction in the training time of the AlexNet deep neural network compared to existing state-of-the-art MPI libraries. To the best of our knowledge, this is the first research work to study the impact of XPMEM-based shared address space designs on the performance of collective operations in a distributed memory programming model like MPI at scale. |
|---|---|
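The abstract hinges on XPMEM's ability to let one process attach a peer process's memory directly into its own address space, so a reduction can read the peer's buffer in place instead of staging it through an intermediate copy inside the MPI library. The following C sketch illustrates only that underlying mechanism, using the public XPMEM user-space API (xpmem_make, xpmem_get, xpmem_attach). The two-rank leader/leaf split, the buffer layout, and the exchange of the segment id over MPI point-to-point are illustrative assumptions of this sketch, not the paper's actual design, which builds such primitives inside the MPI library itself.

```c
/*
 * Minimal sketch of the zero-copy idea behind XPMEM-based reductions.
 * Run with two intra-node ranks (e.g., mpirun -np 2). Rank 1 (the
 * "leaf") exposes its buffer; rank 0 (the "leader") attaches it and
 * reduces element-wise directly on the peer's memory, with no
 * intermediate copy. Error checking is omitted for brevity.
 */
#include <mpi.h>
#include <xpmem.h>
#include <stdlib.h>

#define COUNT 4096  /* 4096 doubles = 32 KiB, a whole number of pages */

int main(int argc, char **argv)
{
    int rank;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Page-align the buffer, since XPMEM maps at page granularity. */
    posix_memalign((void **)&buf, 4096, COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++)
        buf[i] = (double)rank;

    if (rank == 1) {
        /* Leaf: expose the buffer and send the segment id to rank 0. */
        xpmem_segid_t segid = xpmem_make(buf, COUNT * sizeof(double),
                                         XPMEM_PERMIT_MODE, (void *)0666);
        MPI_Send(&segid, (int)sizeof(segid), MPI_BYTE, 0, 0,
                 MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);  /* keep the segment alive */
        xpmem_remove(segid);
    } else {
        /* Leader: attach the peer's buffer and reduce in place. */
        xpmem_segid_t segid;
        MPI_Recv(&segid, (int)sizeof(segid), MPI_BYTE, 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR,
                                      XPMEM_PERMIT_MODE, (void *)0666);
        struct xpmem_addr addr = { .apid = apid, .offset = 0 };
        double *peer = (double *)xpmem_attach(addr,
                                              COUNT * sizeof(double), NULL);

        /* Zero-copy reduction: read the peer's memory directly. */
        for (int i = 0; i < COUNT; i++)
            buf[i] += peer[i];

        xpmem_detach(peer);
        xpmem_release(apid);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Once a buffer is attached this way, the load/compute/store of the reduction touches the peer's pages directly, which is what lets the paper's collective designs have one process offload reduction work over intra-node peers' buffers, for MPI_Reduce and MPI_Allreduce, without the staging copies that distributed-address-space designs require.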