Accelerating sparse Cholesky factorization on GPUs

Published in: Parallel Computing, Vol. 59, pp. 140-150
Main authors: Rennich, Steven C.; Stosic, Darko; Davis, Timothy A.
Medium: Journal Article
Language: English
Published: Elsevier B.V., 1 November 2016
ISSN: 0167-8191, 1872-7336
Description
Summary:
• Sparse direct factorization is a critical component of scientific computing.
• GPU acceleration of it is difficult due to algorithmic irregularity and PCIe performance.
• We present an algorithm for GPU acceleration which resolves these issues.
• The algorithm shows ~2x speedup vs. the best alternative CPU-only performance.
• Aspects of this Cholesky case should be useful in other algorithms (LDL, LU, ...).

Sparse factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. For factorizations which involve sufficient dense math, the substantial computational capability provided by GPUs (Graphics Processing Units) can help alleviate this cost. However, for many other cases, the prevalence of small/irregular dense math and the relatively slow communication between the host and device over the PCIe bus make it challenging to significantly accelerate sparse factorization using the GPU. In this paper we describe a left-looking supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream subtrees of the elimination tree through the GPU and perform the factorization of each subtree entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these subtrees, many independent, small, dense operations are batched to minimize kernel launch overhead, and many of these batched kernels are executed concurrently to maximize device utilization. Performance results for commonly studied matrices are presented along with suggested actions for further optimization.
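The subtree-streaming idea in the abstract rests on two standard ingredients: the elimination tree of the matrix, and a partition of that tree into maximal subtrees small enough to be processed entirely on the device. The following is a minimal Python/NumPy/SciPy sketch of those two steps only, not the paper's CUDA implementation; the function names (`elimination_tree`, `streamable_subtrees`) and the node-count "budget" (a stand-in for GPU memory capacity) are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csc_matrix

def elimination_tree(A):
    """Elimination tree of a sparse symmetric matrix, computed from its
    CSC pattern using the classic path-compression algorithm.
    parent[j] is the parent of node j, or -1 for a root."""
    A = csc_matrix(A)
    n = A.shape[0]
    parent = np.full(n, -1, dtype=int)
    ancestor = np.full(n, -1, dtype=int)   # path-compression workspace
    for j in range(n):
        for p in range(A.indptr[j], A.indptr[j + 1]):
            i = A.indices[p]
            # follow only strictly upper-triangular entries (i < j)
            while i != -1 and i < j:
                i_next = ancestor[i]
                ancestor[i] = j            # compress the path toward j
                if i_next == -1:
                    parent[i] = j          # i's path first reaches j here
                i = i_next
    return parent

def streamable_subtrees(parent, budget):
    """Roots of maximal subtrees whose node count fits within `budget`
    (a stand-in for 'the subtree's factorization fits on the GPU')."""
    n = len(parent)
    size = np.ones(n, dtype=int)
    for j in range(n):                     # parent[j] > j, so one pass works
        if parent[j] != -1:
            size[parent[j]] += size[j]
    return [j for j in range(n)
            if size[j] <= budget
            and (parent[j] == -1 or size[parent[j]] > budget)]

# Tiny demo: a tridiagonal matrix, whose elimination tree is a chain.
A = np.array([[2., 1., 0., 0.],
              [1., 2., 1., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 2.]])
parent = elimination_tree(A)               # chain 0 -> 1 -> 2 -> 3 (root)
roots = streamable_subtrees(parent, budget=2)
```

In the paper's scheme, each such subtree would then be shipped to the GPU once and factored there in full, with the small dense kernels inside it batched; the nodes above the selected roots remain for the top-of-tree phase.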
DOI: 10.1016/j.parco.2016.06.004