UWOmppro: UWOmp++ with Point-to-Point Synchronization, Reduction and Schedules
| Published in: | 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 27-38 |
|---|---|
| Main authors: | , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 21 October 2023 |
| Subjects: | |
| Summary: | OpenMP is one of the most popular APIs widely used to realize parallelism in C/C++ and FORTRAN programs. For efficient execution, an OpenMP program internally creates a team of threads, which share a given set of activities (for example, iterations of a parallel-for-loop). While OpenMP allows synchronization among these threads, many classes of computations can be conveniently expressed by specifying synchronization among the parallel activities. However, OpenMP currently restricts arbitrary synchronization among the parallel activities; otherwise, the behavior of the program can be unpredictable. While extensions like UWOmp++ (and UW-OpenMP) support all-to-all barriers among the activities, currently there is very limited support for performing point-to-point synchronization among them. In this paper, we present UWOmp pro as an extension to UWOmp++ (and OpenMP) to address these challenges and realize more expressive and efficient codes. UWOmp pro allows point-to-point synchronization among the activities of a parallel-for-loop and supports reduction operations (during synchronization). We present a translation scheme to compile UWOmp pro code to efficient OpenMP code, such that the translated code does not invoke any synchronization operation(s) within parallel-for-loops. Our translation takes advantage of continuation-passing-style (CPS) to efficiently realize wait and continue operations. We also present a runtime, based on a novel communication subsystem to support efficient signal, wait, and reduction operations. We have implemented our scheme in the IMOP compiler framework and performed a thorough evaluation. We show that our approach leads to highly performant codes. |
|---|---|
| DOI: | 10.1109/PACT58117.2023.00011 |