Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs


Bibliographic details
Published in: Journal of Computational Science, Vol. 26, Issue C, pp. 237-245
Main authors: Dong, Tingxing; Haidar, Azzam; Tomov, Stanimire; Dongarra, Jack
Format: Journal Article
Language: English
Published: Netherlands, Elsevier B.V., 1 May 2018
Elsevier
ISSN:1877-7503, 1877-7511
Online access: Full text
Description
Summary:
• A batched BLAS design based on device functions and a big-tile setting was proposed for the GPU.
• The batched BLAS approach is used to optimize the solution of the batched bi-diagonalization problem.
• For the first time in research, batched BLAS in HIP code on an AMD platform was presented and compared against the NVIDIA CUDA platform.

The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and for GPUs in particular. Standard interfaces have been proposed for some of these problems, called batched problems, so that they can be targeted for optimization and used in a uniform way in applications, which call them directly from highly optimized standard numerical libraries such as (batched) BLAS and LAPACK. While most of the developments have been for one-sided factorizations and solvers, many important applications (from big data analytics to information retrieval, and low-rank approximations for solvers and preconditioners) require two-sided factorizations, most notably the SVD. To address these needs and the parallelization challenges related to them, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subprograms (BLAS) routines, in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines. We propose a methodology based on device functions, together with big-tile setting techniques, in our batched BLAS design. The different optimization techniques result in many software versions that must be tuned, for which we adopt an auto-tuning strategy to automatically derive the optimized instances of the routines. We illustrate how our batched BLAS approach progressively optimizes batched SVD bi-diagonalization on GPUs. The progression is illustrated on an NVIDIA K40c GPU and is also ported to and presented on an AMD Fiji Nano GPU, using AMD's Heterogeneous-compute Interface for Portability (HIP) C++ runtime API. We demonstrate 80% of the theoretically achievable peak performance for the overall algorithm, and significant acceleration of the required Level-2 BLAS GEMV and Level-3 BLAS GEMM routines compared to vendor-optimized libraries on GPUs and multicore CPUs. The optimization techniques in this paper are applicable to the other two-sided factorizations as well.
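
As a rough illustration of the batched GEMV idea mentioned in the summary, the following C++ sketch applies one small matrix-vector product per batch entry. It is a CPU-side analogy only and not the authors' GPU code: the names `gemv_single` and `gemv_batched`, the column-major layout, and the pointer-array interface are assumptions made for illustration. The per-matrix routine stands in for a device function, and the loop over the batch stands in for the work a GPU would distribute across thread blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-matrix building block (the analogue of a device function):
// computes y = alpha * A * x + beta * y for one column-major m x n matrix A
// with leading dimension lda.
inline void gemv_single(std::size_t m, std::size_t n, double alpha,
                        const double* A, std::size_t lda,
                        const double* x, double beta, double* y) {
    // Scale y first; when beta is zero, overwrite y so stale values are ignored.
    for (std::size_t i = 0; i < m; ++i) {
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
    }
    // Accumulate column by column, which walks A contiguously in memory.
    for (std::size_t j = 0; j < n; ++j) {
        const double ax = alpha * x[j];
        for (std::size_t i = 0; i < m; ++i) {
            y[i] += ax * A[i + j * lda];
        }
    }
}

// Hypothetical batched driver: every batch entry has the same m, n, and lda
// (a fixed-size batch). On a GPU, each iteration of this loop would
// typically be handled by its own thread block instead of running serially.
inline void gemv_batched(std::size_t m, std::size_t n, double alpha,
                         const std::vector<const double*>& A_array,
                         std::size_t lda,
                         const std::vector<const double*>& x_array,
                         double beta,
                         const std::vector<double*>& y_array) {
    assert(lda >= m);
    assert(A_array.size() == x_array.size());
    assert(x_array.size() == y_array.size());
    for (std::size_t b = 0; b < A_array.size(); ++b) {
        gemv_single(m, n, alpha, A_array[b], lda, x_array[b], beta, y_array[b]);
    }
}
```

Keeping the per-matrix routine separate from the batch driver means the same building block could be reused inside a larger fused routine, which is the kind of reuse a device-function design is meant to allow.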
Bibliography: USDOE National Nuclear Security Administration (NNSA)
DOI:10.1016/j.jocs.2018.01.007