Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme

Computing derivatives is key to many algorithms in scientific computing and machine learning such as optimization, uncertainty quantification, and stability analysis. Enzyme is an LLVM compiler plugin that performs reverse-mode automatic differentiation (AD) and thus generates high performance gradi...

Detailed description

Bibliographic details
Published in:	SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-18
Main authors:	Moses, William S., Churavy, Valentin, Paehler, Ludger, Huckelheim, Jan, Narayanan, Sri Hari Krishna, Schanen, Michel, Doerfert, Johannes
Format:	Conference proceedings
Language:	English
Published:	ACM, 14 November 2021
Subjects:	Automatic Differentiation; Biochemistry; CUDA; GPU; Graphics processing units; HPC; LLVM; Parallel processing; ROCm; Scalability; Schedules; Scientific computing; Uncertainty
ISSN:	2167-4337
Online access:	Full text

Description
Abstract:	Computing derivatives is key to many algorithms in scientific computing and machine learning such as optimization, uncertainty quantification, and stability analysis. Enzyme is an LLVM compiler plugin that performs reverse-mode automatic differentiation (AD) and thus generates high performance gradients of programs in languages including C/C++, Fortran, Julia, and Rust. Prior to this work, Enzyme and other AD tools were not capable of generating gradients of GPU kernels. Our paper presents a combination of novel techniques that make Enzyme the first fully automatic reverse-mode AD tool to generate gradients of GPU kernels. Since, unlike other tools, Enzyme performs automatic differentiation within a general-purpose compiler, we are able to introduce several novel GPU and AD-specific optimizations. To show the generality and efficiency of our approach, we compute gradients of five GPU-based HPC applications, executed on NVIDIA and AMD GPUs. All benchmarks run within an order of magnitude of the original program's execution time. Without GPU and AD-specific optimizations, gradients of GPU kernels either fail to run from a lack of resources or have infeasible overhead. Finally, we demonstrate that increasing the problem size by either increasing the number of threads or increasing the work per thread does not substantially impact the overhead from differentiation.
DOI:	10.1145/3458817.3476165