Enhancing Kokkos with OpenACC
| Title: | Enhancing Kokkos with OpenACC |
|---|---|
| Authors: | Valero Lara, Pedro; Lee, Seyong; González Tallada, Marc; Denny, Joel; Teranishi, Keita; Vetter, Jeffrey |
| Contributors: | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. PM - Programming Models |
| Publisher Information: | SAGE publishing |
| Publication Year: | 2024 |
| Collection: | Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge |
| Subject Terms: | Àrees temàtiques de la UPC::Informàtica::Programació, OpenACC, C++ metaprogramming, Kokkos, CUDA, OpenMP target, Parallel programming models |
| Description: | C++ template metaprogramming has emerged as a prominent approach for achieving performance portability in heterogeneous computing. Kokkos represents a notable paradigm in this domain, offering programmers a suite of high-level abstractions for generic programming while deferring much of the device-specific code generation and optimization to the compiler through template specializations. Kokkos furnishes a range of device-specific code specializations across multiple back ends, including CUDA and HIP. Diverging from conventional back ends, the OpenACC implementation presents a high-level, multicompiler, multidevice, and directive-based programming model. This paper presents recent advancements in the OpenACC back end for Kokkos (i.e., KokkACC) and focuses on its integration into the Kokkos ecosystem, exploration of automatic device selection capabilities to enhance productivity, and performance evaluation on modern hardware such as NVIDIA H100 GPUs. The study includes implementation details and a thorough performance assessment across various computational benchmarks, including minibenchmarks (AXPY and DOT product), miniapps (LULESH, MiniFE, and SNAP-LAMMPS), and a scientific kernel based on the lattice Boltzmann method. This research used resources from the Experimental Computing Laboratory and the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy (DOE) under contract DE-AC05-00OR22725. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration. This research was also supported in part by the DOE Office of Science, Office of Advanced Scientific Computing Research, and Scientific Discovery through Advanced Computing program. This manuscript has been authored by UT-Battelle LLC under contract DE-AC05-00OR22725 with DOE. Peer Reviewed; Postprint (author's final draft) |
| Document Type: | article in journal/newspaper |
| File Description: | 18 p.; application/pdf |
| Language: | English |
| Relation: | https://hdl.handle.net/2117/419896 |
| DOI: | 10.1177/10943420241261987 |
| Availability: | https://hdl.handle.net/2117/419896 https://doi.org/10.1177/10943420241261987 |
| Rights: | Open Access |
| Accession Number: | edsbas.BF09E808 |
| Database: | BASE |