Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm
| Published in: | Parallel Computing, Volume 40, Issue 7, pp. 224-238 |
|---|---|
| Main authors: | , |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Elsevier B.V., 1 July 2014 |
| Subjects: | |
| ISSN: | 0167-8191, 1872-7336 |
| Summary: | • The manuscript presents a highly scalable preconditioned Conjugate Gradient method. • It presents a pipelined preconditioned Conjugate Residual method. • It shows how global communication can be overlapped with local work. • It shows numerical stability of the methods. • It shows improved scalability and runtime compared to CG and CR.<br>Scalability of Krylov subspace methods suffers from costly global synchronization steps that arise in dot-products and norm calculations on parallel machines. In this work, a modified preconditioned Conjugate Gradient (CG) method is presented that removes the costly global synchronization steps from the standard CG algorithm by only performing a single non-blocking reduction per iteration. This global communication phase can be overlapped with the matrix–vector product, which typically only requires local communication. The resulting algorithm will be referred to as pipelined CG. An alternative pipelined method, mathematically equivalent to the Conjugate Residual (CR) method, which makes different trade-offs with regard to scalability and serial runtime, is also considered. These methods are compared to a recently proposed asynchronous CG algorithm by Gropp. Extensive numerical experiments demonstrate the numerical stability of the methods. Moreover, it is shown that hiding the global synchronization step improves scalability on distributed memory machines using the message passing paradigm and leads to significant speedups compared to standard preconditioned CG. |
| DOI: | 10.1016/j.parco.2013.06.001 |
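
The summary describes a CG variant in which each iteration needs only one reduction, which can run while the preconditioner and matrix–vector product are applied. The following is a minimal serial sketch of that idea, not code from the article: it uses the recurrences as they are commonly stated for pipelined preconditioned CG, and all names (`pipelined_pcg`, `apply_A`, `apply_Minv`, `laplacian_1d`, `jacobi`) are hypothetical. In a distributed code, the two dot products would be combined into a single non-blocking reduction (for example MPI's `MPI_Iallreduce`) that is started before the overlapped work and waited on afterwards; here they are ordinary local sums, and comments mark where the overlap would occur.

```python
import math


def dot(a, b):
    """Local dot product; stands in for one component of a global reduction."""
    return sum(ai * bi for ai, bi in zip(a, b))


def axpy(alpha, x, y):
    """Return alpha * x + y as a new list."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def pipelined_pcg(apply_A, apply_Minv, b, x0, tol=1e-10, max_iter=200):
    """Sketch of pipelined preconditioned CG for a symmetric positive definite A.

    apply_A and apply_Minv are callables returning A*v and M^{-1}*v.
    """
    x = list(x0)
    r = axpy(-1.0, apply_A(x), b)   # r = b - A x
    u = apply_Minv(r)               # preconditioned residual
    w = apply_A(u)                  # w = A u
    zeros = [0.0] * len(b)
    z, q, s, p = zeros, zeros, zeros, zeros
    gamma_old = alpha_old = 1.0     # placeholders, unused in the first iteration
    iterations = 0
    for i in range(max_iter):
        # Both dot products would go into ONE non-blocking reduction,
        # started here in a parallel implementation.
        gamma = dot(r, u)
        delta = dot(w, u)
        # Work that does not depend on gamma or delta, so it could overlap
        # with the reduction.
        m = apply_Minv(w)
        n = apply_A(m)
        # A parallel implementation would wait for the reduction here.
        if math.sqrt(abs(gamma)) <= tol:
            break
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
        # Recurrences for the auxiliary vectors, search direction, and iterate.
        z = axpy(beta, z, n)
        q = axpy(beta, q, m)
        s = axpy(beta, s, w)
        p = axpy(beta, p, u)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, s, r)
        u = axpy(-alpha, q, u)
        w = axpy(-alpha, z, w)
        gamma_old, alpha_old = gamma, alpha
        iterations = i + 1
    return x, iterations


def laplacian_1d(v):
    """Apply the tridiagonal matrix with stencil (-1, 2, -1)."""
    last = len(v) - 1
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < last else 0.0)
            for i in range(len(v))]


def jacobi(v):
    """Jacobi preconditioner for laplacian_1d (its diagonal is 2)."""
    return [0.5 * vi for vi in v]


b = [1.0] * 8
x, iterations = pipelined_pcg(laplacian_1d, jacobi, b, [0.0] * 8)
residual = axpy(-1.0, laplacian_1d(x), b)
print(iterations, math.sqrt(dot(residual, residual)))
```

The extra vectors `z`, `q`, `s` and `w` are the price of the reordering: they let `A p` and related products be updated by recurrence rather than recomputed, so the only matrix–vector product in an iteration is the one that can run alongside the reduction.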