Efficient and Distributed Computation of Electron Repulsion Integrals on AMD AI Engines

Bibliographic Details
Published in: Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Online), pp. 95-104
Main Authors: Menzel, Johannes; Plessl, Christian
Format: Conference Paper
Language: English
Published: IEEE, 4 May 2025
ISSN: 2576-2621
Description
Summary: Computing electron repulsion integrals (ERIs) is the major computational bottleneck of many quantum mechanical simulation methods, requiring trillions of ERI evaluations per time step. While the computation of independent ERIs is embarrassingly parallel, the efficient computation of individual ERIs on modern processor cores is difficult for two reasons: the cache is too small to hold the intermediates of the computation, and the irregular memory access patterns are difficult to vectorize. In this paper, we present how our implementation on the AI Engine (AIE) architecture addresses both of these problems. First, we have defined a flexible graph structure, which we call an ERI-Engine, that can be implemented for all 231 canonical ERI quartets from {ss|ss} to {hh|hh} by distributing the computation over 2 to 14 AIEs. Second, for the larger quartets, we have devised a novel vectorization scheme that leverages the advanced floating-point unit of the AIEs, while also supporting vectorization of independent ERIs for the smaller quartets. Finally, ERI-Engines are horizontally and vertically stackable to fill the entire AIE array. In particular, the vertically stacked ERI-Engines form a column that uses one or more time-shared channels to stream the results out of the AIE array, almost completely hiding the computational phases of individual ERI-Engines. In terms of absolute performance, we are competitive with recent high-performance implementations of ERI algorithms on FPGAs (SERI) and GPUs (LibintX), as well as with well-established, highly optimized CPU libraries (Libint, Libcint), while being the unequivocal leader in terms of energy efficiency.
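Note: The figure of 231 canonical quartets quoted in the summary can be reproduced with a short enumeration. The sketch below is illustrative only and is not taken from the paper. It assumes that "canonical" refers to the usual permutational ordering of shell quartets (within each pair the first shell has the higher or equal angular momentum, and the bra pair ranks at or above the ket pair); the authors may order shells differently, which would not change the count. The names SHELLS, canonical_quartets and label are hypothetical.

```python
# Illustrative sketch (not from the paper): count canonical shell quartets
# for angular momenta s through h under an assumed ordering convention.

SHELLS = "spdfgh"  # angular momenta l = 0..5


def canonical_quartets(shells=SHELLS):
    """Return quartets ((a, b), (c, d)) with l_a >= l_b, l_c >= l_d, (ab) >= (cd)."""
    # Unique shell pairs: the second shell never exceeds the first.
    pairs = [(a, b) for i, a in enumerate(shells) for b in shells[: i + 1]]
    # Unique pairs of pairs: the ket pair never ranks above the bra pair.
    return [(bra, ket) for i, bra in enumerate(pairs) for ket in pairs[: i + 1]]


def label(quartet):
    """Format a quartet in the {ab|cd} notation used in the summary."""
    (a, b), (c, d) = quartet
    return "{" + a + b + "|" + c + d + "}"


quartets = canonical_quartets()
print(len(quartets), label(quartets[0]), label(quartets[-1]))
```

Six shell types give 21 unique pairs, and 21 pairs give 21 * 22 / 2 = 231 unique quartets, running from {ss|ss} to {hh|hh}.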
DOI: 10.1109/FCCM62733.2025.00044