Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm
| Published in: | Parallel computing Vol. 40; no. 7; pp. 224 - 238 |
|---|---|
| Main Authors: | , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 01.07.2014 |
| Subjects: | |
| ISSN: | 0167-8191, 1872-7336 |
| Summary: | • The manuscript presents a highly scalable preconditioned Conjugate Gradient method. • It presents a pipelined preconditioned Conjugate Residual method. • It shows how global communication can be overlapped with local work. • It shows the numerical stability of the methods. • It shows improved scalability and runtime compared to CG and CR. Scalability of Krylov subspace methods suffers from the costly global synchronization steps that arise in dot products and norm calculations on parallel machines. In this work, a modified preconditioned Conjugate Gradient (CG) method is presented that removes the costly global synchronization steps from the standard CG algorithm by performing only a single non-blocking reduction per iteration. This global communication phase can be overlapped with the matrix–vector product, which typically requires only local communication. The resulting algorithm is referred to as pipelined CG. An alternative pipelined method, mathematically equivalent to the Conjugate Residual (CR) method, which makes different trade-offs with regard to scalability and serial runtime, is also considered. These methods are compared to a recently proposed asynchronous CG algorithm by Gropp. Extensive numerical experiments demonstrate the numerical stability of the methods. Moreover, it is shown that hiding the global synchronization step improves scalability on distributed-memory machines using the message-passing paradigm and leads to significant speedups compared to standard preconditioned CG. |
| DOI: | 10.1016/j.parco.2013.06.001 |
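The abstract's central idea is a communication-hiding iteration: the single global reduction of each step is issued as a non-blocking collective, and its latency is hidden behind the mostly local sparse matrix–vector product. The sketch below illustrates only that overlap pattern, not the authors' full pipelined CG recurrence; the `local_spmv` routine, the `N_LOCAL` size, and the placeholder dot products are assumptions chosen to keep the example compilable (halo exchange and the preconditioner are omitted).

```c
/*
 * Sketch of the communication-hiding pattern described in the abstract:
 * a non-blocking reduction (MPI_Iallreduce) for the dot products is
 * started first, and the local part of the matrix-vector product is
 * computed while the reduction is in flight.  This is NOT the authors'
 * full pipelined CG recurrence; the stencil, vectors and scalars are
 * stand-ins chosen only to make the example self-contained.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_LOCAL 1000   /* rows owned by this rank (assumed size) */

/* Local part of a 1D Laplacian stencil; neighbour (halo) terms omitted. */
static void local_spmv(const double *x, double *y, int n)
{
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < n - 1) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double *r = malloc(N_LOCAL * sizeof *r);   /* residual-like vector */
    double *w = malloc(N_LOCAL * sizeof *w);   /* result of A*r        */
    for (int i = 0; i < N_LOCAL; ++i) r[i] = 1.0;

    /* Local contributions to the scalars needed by the iteration. */
    double local[2] = {0.0, 0.0}, global[2];
    for (int i = 0; i < N_LOCAL; ++i) {
        local[0] += r[i] * r[i];   /* (r, r)                            */
        local[1] += r[i];          /* second dot product (placeholder)  */
    }

    /* Start the single global reduction of the iteration ...           */
    MPI_Request req;
    MPI_Iallreduce(local, global, 2, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... and hide its latency behind the (mostly local) SpMV.         */
    local_spmv(r, w, N_LOCAL);

    /* Only now are the globally reduced scalars actually needed.       */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("(r,r) = %g, sum(r) = %g\n", global[0], global[1]);

    free(r);
    free(w);
    MPI_Finalize();
    return 0;
}
```

The design point the abstract argues for is visible in the ordering: the reduction is posted before the matrix–vector product, so on a machine with asynchronous progress the allreduce completes while local work proceeds, and `MPI_Wait` is reached only when the reduced scalars are required to update the recurrences.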