UWOmppro: UWOmp++ with Point-to-Point Synchronization, Reduction and Schedules

Bibliographic details
Published in: 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 27-38
Main authors: Agrawal, Aditya; Nandivada, V. Krishna
Format: Conference paper
Language: English
Published: IEEE, 21 October 2023
Description
Summary: OpenMP is one of the most popular APIs widely used to realize parallelism in C/C++ and FORTRAN programs. For efficient execution, an OpenMP program internally creates a team of threads, which share a given set of activities (for example, iterations of a parallel-for-loop). While OpenMP allows synchronization among these threads, many classes of computations can be conveniently expressed by specifying synchronization among the parallel activities. However, OpenMP currently restricts arbitrary synchronization among the parallel activities; otherwise, the behavior of the program can be unpredictable. While extensions like UWOmp++ (and UW-OpenMP) support all-to-all barriers among the activities, currently there is very limited support for performing point-to-point synchronization among them. In this paper, we present UWOmp pro as an extension to UWOmp++ (and OpenMP) to address these challenges and realize more expressive and efficient codes. UWOmp pro allows point-to-point synchronization among the activities of a parallel-for-loop and supports reduction operations (during synchronization). We present a translation scheme to compile UWOmp pro code to efficient OpenMP code, such that the translated code does not invoke any synchronization operation(s) within parallel-for-loops. Our translation takes advantage of continuation-passing-style (CPS) to efficiently realize wait and continue operations. We also present a runtime, based on a novel communication subsystem to support efficient signal, wait, and reduction operations. We have implemented our scheme in the IMOP compiler framework and performed a thorough evaluation. We show that our approach leads to highly performant codes.
DOI: 10.1109/PACT58117.2023.00011