Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm


Bibliographic details
Published in:	Parallel Computing, Vol. 40, No. 7, pp. 224-238
Main authors:	Ghysels, P., Vanroose, W.
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 1 July 2014
Subjects:	Algorithms; Conjugate gradients; Conjugate residuals; Distributed memory; Global communication; Latency hiding; Mathematical models; Parallelization; Run time (computers); Synchronism; Synchronization
ISSN:	0167-8191, 1872-7336

Description
Summary:
• The manuscript presents a highly scalable preconditioned Conjugate Gradient method.
• It presents a pipelined preconditioned Conjugate Residual method.
• It shows how global communication can be overlapped with local work.
• It shows numerical stability of the methods.
• It shows improved scalability and runtime compared to CG and CR.

Scalability of Krylov subspace methods suffers from costly global synchronization steps that arise in dot-products and norm calculations on parallel machines. In this work, a modified preconditioned Conjugate Gradient (CG) method is presented that removes the costly global synchronization steps from the standard CG algorithm by performing only a single non-blocking reduction per iteration. This global communication phase can be overlapped with the matrix–vector product, which typically requires only local communication. The resulting algorithm is referred to as pipelined CG. An alternative pipelined method, mathematically equivalent to the Conjugate Residual (CR) method, is also considered; it makes different trade-offs with regard to scalability and serial runtime. These methods are compared to a recently proposed asynchronous CG algorithm by Gropp. Extensive numerical experiments demonstrate the numerical stability of the methods. Moreover, it is shown that hiding the global synchronization step improves scalability on distributed memory machines using the message passing paradigm and leads to significant speedups compared to standard preconditioned CG.
DOI:	10.1016/j.parco.2013.06.001
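
Illustration (not part of the catalog record): the summary describes an iteration in which both dot products are gathered in a single reduction that can overlap with the matrix–vector product. The record itself does not reproduce the algorithm, so the following is only a minimal serial sketch of an unpreconditioned pipelined CG recurrence under stated assumptions. The function names (`pipelined_cg`, `dot`, `matvec`, `axpy`), the dense list-of-lists matrix, and the stopping rule are hypothetical choices made for this example. In a distributed code, the two dot products would presumably be combined into one non-blocking reduction that is started before the matrix–vector product and completed after it; here that is indicated only by comments.

```python
def dot(u, v):
    """Inner product; in a distributed code this requires a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product; in practice typically only local communication."""
    return [dot(row, v) for row in A]


def axpy(a, x, y):
    """Return a*x + y elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def pipelined_cg(A, b, x0, tol=1e-10, max_iter=200):
    """Serial sketch of unpreconditioned pipelined CG for a symmetric
    positive definite matrix A (assumption: A is a list of rows)."""
    n = len(b)
    x = list(x0)
    r = axpy(-1.0, matvec(A, x), b)  # r = b - A x
    w = matvec(A, r)                 # w = A r
    z = [0.0] * n
    s = [0.0] * n
    p = [0.0] * n
    gamma_old = alpha_old = None

    for i in range(max_iter):
        # A parallel code would start ONE non-blocking reduction for both
        # dot products here ...
        gamma = dot(r, r)
        delta = dot(w, r)
        # ... overlap it with this matrix-vector product ...
        q = matvec(A, w)
        # ... and wait for the reduction result only at this point.

        if gamma ** 0.5 <= tol:      # residual norm small enough
            break

        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)

        # Recurrences keep z = A s, s = A p and w = A r without extra matvecs.
        z = axpy(beta, z, q)
        s = axpy(beta, s, w)
        p = axpy(beta, p, r)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, s, r)
        w = axpy(-alpha, z, w)
        gamma_old, alpha_old = gamma, alpha

    return x


# Small symmetric positive definite test system (hypothetical data).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pipelined_cg(A, b, [0.0, 0.0])
print(x)
```

For this 2×2 system the exact solution is (1/11, 7/11), which the sketch reaches after two updates. The sketch says nothing about the preconditioned variants, the pipelined CR method, or the numerical stability results reported in the paper.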