Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores

Published in: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1020–1029
Main authors: Hashmi, Jahanzeb Maqbool; Chakraborty, Sourav; Bayatpour, Mohammadreza; Subramoni, Hari; Panda, Dhabaleswar K.
Format: Conference paper
Language: English
Publication details: IEEE, 01.05.2018
ISSN: 1530-2075
Description
Summary: State-of-the-art designs for the hierarchical reduction collective operation in MPI that work on the concept of distributed address spaces incur the cost of intermediate copies inside the MPI library to stage the data between processes. Such additional copies can severely affect performance, especially on emerging many-core architectures such as Intel Xeon/Xeon Phi and OpenPOWER. In this paper, we take up this challenge and study the trade-offs involved in designing high-performance and scalable "shared address space"-based communication primitives on top of XPMEM using basic point-to-point primitives in MPI. We then redesign the reduction collective operations using the knowledge gained from these initial studies. Our proposed designs at the collective level enable a process to offload communication and computation operations to intra-node peers without the need for additional intermediate copies, resulting in a truly "zero-copy" design for MPI_Reduce and MPI_Allreduce. We further develop a theoretical model to analytically study the impact such designs can have on the performance of collective communication primitives. We evaluate the proposed designs with microbenchmarks, HPC, and Deep Learning applications on three different multi-/many-core architectures (Broadwell, Knights Landing, and OpenPOWER). The proposed designs show up to 3x improvement in the latency of Reduce and Allreduce benchmarks, up to 37% improvement in the runtime of MiniAMR, and up to 19% reduction in the training time of the AlexNet deep neural network compared to existing state-of-the-art MPI libraries. To the best of our knowledge, this is the first work to study the impact of XPMEM-based shared address space designs on the performance of collective operations in a distributed-memory programming model like MPI at scale.
DOI: 10.1109/IPDPS.2018.00111
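The zero-copy idea summarized above rests on XPMEM's cross-memory attach, which lets one process map a peer's buffer into its own address space and operate on it directly. The following minimal C/MPI sketch illustrates that underlying mechanism only; it is not the authors' collective design, and it assumes two MPI ranks on the same node, the xpmem kernel module and libxpmem installed, and omits error handling for brevity.

/*
 * Minimal sketch of XPMEM cross-memory attach for a two-rank,
 * intra-node reduction: rank 0 exposes its buffer, rank 1 maps it
 * and reduces from it directly, with no intermediate staging copy.
 */
#include <mpi.h>
#include <xpmem.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        buf[i] = rank + 1.0;

    if (rank == 0) {
        /* Expose the local buffer to intra-node peers. */
        xpmem_segid_t segid =
            xpmem_make(buf, N * sizeof(double), XPMEM_PERMIT_MODE, (void *)0666);
        MPI_Send(&segid, sizeof(segid), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);   /* wait until the peer is done reading */
        xpmem_remove(segid);
    } else if (rank == 1) {
        xpmem_segid_t segid;
        MPI_Recv(&segid, sizeof(segid), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Map rank 0's buffer directly into this address space. */
        xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
        struct xpmem_addr addr = { .apid = apid, .offset = 0 };
        double *peer = xpmem_attach(addr, N * sizeof(double), NULL);

        /* Reduce straight from the peer's memory: no staging copy. */
        for (int i = 0; i < N; i++)
            buf[i] += peer[i];

        xpmem_detach(peer);
        xpmem_release(apid);
        MPI_Barrier(MPI_COMM_WORLD);
        printf("rank 1: reduced value = %f\n", buf[0]);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Because the consuming rank operates directly on the exposing rank's mapped pages, the reduction needs no intermediate buffer inside the MPI library; the paper generalizes this primitive to hierarchical, load-balanced MPI_Reduce and MPI_Allreduce across all intra-node peers.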