Soft error resilient QR factorization for hybrid system with GPGPU

Bibliographic details
Published in: Journal of Computational Science, Vol. 4, No. 6, pp. 457-464
Main authors: Du, Peng; Luszczek, Piotr; Tomov, Stan; Dongarra, Jack
Format: Journal Article
Language: English
Publisher: Elsevier B.V., 1 November 2013
ISSN: 1877-7503, 1877-7511
Description
Summary:
► A soft error resilient QR algorithm for hybrid architectures with CPUs and GPUs was developed.
► Soft errors occurring anywhere in the matrix during QR factorization can be detected and corrected.
► The performance impact relative to the baseline MAGMA QR implementation is low.
► Similar methods can be applied to other one-sided factorizations such as Cholesky and LU.

General purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing because of their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern than it was when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance, but it also increases the likelihood that computations are affected by soft errors, for example in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left factor Q whose performance scales on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs that efficiently reduce an upper Hessenberg matrix to upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
DOI:10.1016/j.jocs.2013.01.004
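
The summary mentions Givens rotations that reduce an upper Hessenberg matrix to upper triangular form. As a rough illustration of that one step only, the following is a minimal CPU sketch in plain Python. It is not the authors' GPGPU implementation, it assumes a square matrix stored as a list of rows, and the function names `givens` and `hessenberg_to_triangular` are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def hessenberg_to_triangular(H):
    """Reduce a square upper Hessenberg matrix (list of rows) to upper triangular form.

    An upper Hessenberg matrix has a single nonzero subdiagonal, so one Givens
    rotation per column is enough. Returns a new matrix; H is left unchanged.
    """
    n = len(H)
    R = [row[:] for row in H]
    for k in range(n - 1):
        # Rotation acting on rows k and k+1 that zeroes the entry R[k+1][k].
        c, s = givens(R[k][k], R[k + 1][k])
        for j in range(k, n):
            top, bottom = R[k][j], R[k + 1][j]
            R[k][j] = c * top + s * bottom
            R[k + 1][j] = -s * top + c * bottom
        R[k + 1][k] = 0.0  # clear any rounding residue below the diagonal
    return R


# Illustrative 3x3 upper Hessenberg matrix (values chosen arbitrarily).
H = [
    [4.0, 1.0, 2.0],
    [3.0, 5.0, 1.0],
    [0.0, 2.0, 6.0],
]
R = hessenberg_to_triangular(H)
```

Because each rotation is orthogonal, the Frobenius norm of the matrix is preserved, which gives a simple sanity check on the result. Each rotation touches only two adjacent rows, so the reduction costs O(n^2) operations in total.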