A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications

Detailed bibliography
Title: A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications
Authors: Trümper, Lukas, Schaad, Philipp, Ates, Berke, id_orcid:0000-0003-0242-3640, Calotoiu, Alexandru, Copik, Marcin, Hoefler, Torsten
Contributors: Doerfert, Johannes, Grosser, Tobias, Leather, Hugh, Sadayappan, Ponnuswamy
Source: Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization
Publisher information: Association for Computing Machinery
Publication year: 2025
Collection: ETH Zürich Research Collection
Subjects: loop normalization, loop optimization, polyhedral analysis, compiler, code optimization
Description: The same computations are often expressed differently across software projects and programming languages. In particular, how computations involving loops are expressed varies due to the many possibilities to permute and compose loops. Since each variant may have unique performance properties, automatic approaches to loop scheduling must support many different optimization recipes. In this paper, we propose a priori loop nest normalization to align loop nests and reduce the variation before the optimization. Specifically, we define and apply normalization criteria, mapping loop nests with different memory access patterns to the same canonical form. Since the memory access pattern is susceptible to loop variations and critical for performance, this normalization allows many loop nests to be optimized by the same optimization recipe. To evaluate our approach, we apply the normalization with optimizations designed for only the canonical form, improving the performance of many different loop nest variants. Across multiple implementations of 15 benchmarks using different languages, we outperform a baseline compiler in C on average by a factor of 21.13, state-of-the-art auto-schedulers such as Polly and the Tiramisu auto-scheduler by 2.31 and 2.89, as well as performance-oriented Python-based frameworks such as NumPy, Numba, and DaCe by 9.04, 3.92, and 1.47. Furthermore, we apply the concept to the CLOUDSC cloud microphysics scheme, an actively used component of the Integrated Forecasting System, achieving a 10% speedup over the highly-tuned Fortran code.
Document type: conference object
File description: application/pdf
Language: English
Relation: info:eu-repo/semantics/altIdentifier/isbn/979-8-4007-1275-3; info:eu-repo/grantAgreement/EC/H2020/101002047; info:eu-repo/grantAgreement/EC/H2020/101034126; http://hdl.handle.net/20.500.11850/729779
DOI: 10.3929/ethz-b-000729779
Availability: https://hdl.handle.net/20.500.11850/729779
https://doi.org/10.3929/ethz-b-000729779
Rights: info:eu-repo/semantics/openAccess ; http://creativecommons.org/licenses/by/4.0/ ; Creative Commons Attribution 4.0 International
Accession number: edsbas.B234C3C4
Database: BASE