Pipelined Preconditioned Conjugate Gradient Methods for real and complex linear systems for distributed memory architectures

Bibliographic details
Published in: Journal of parallel and distributed computing, Volume 163, pp. 147-155
Main authors: Tiwari, Manasi; Vadhiyar, Sathish
Format: Journal Article
Language: English
Published: Elsevier Inc, 1 May 2022
ISSN: 0743-7315, 1096-0848
Description
Summary:
• We introduce PIPECG-OATI-c to solve complex Hermitian and symmetric linear systems.
• PIPECG-OATI-c reduces synchronizations in PCG and overlaps them with computations.
• We provide optimized implementations for PIPECG-OATI-c.
• We obtain a 25% performance improvement over PCG for a problem with 1B unknowns on 16K cores.
• We obtain a 2.48X performance improvement over PCG for a problem with 20M unknowns on 16K cores.

Preconditioned Conjugate Gradient (PCG) is a popular method for solving large and sparse linear systems of equations. The performance of PCG at scale suffers from the costly global synchronization steps that arise in dot products on distributed memory systems. Pipelined PCG (PIPECG) removes these costly global synchronization steps from PCG by executing only a single non-blocking allreduce per iteration and overlapping it with independent computations. In our previous work, we developed a novel pipelined PCG algorithm called PIPECG-OATI (One Allreduce per Two Iterations) for real linear systems, which executes a single non-blocking allreduce per two iterations and provides a large overlap of global communication with independent computations at higher core counts. Our method achieves this overlap by using iteration combination and by introducing new recurrence and non-recurrence computations. We implement optimizations in the PIPECG-OATI method to use cache memory efficiently. In this work, we present the PIPECG-OATI-c method for linear systems with complex Hermitian positive definite and complex symmetric matrices. We compare our method with various pipelined CG methods on a variety of problems and demonstrate that our method always gives the lowest run times. We performed experiments with our method using 20M and 30M unknowns on up to 16K cores and obtained up to 2.48X performance improvement over PCG and 2.14X performance improvement over PIPECG methods. We also experimented with up to 1 billion unknowns on 16K cores, the largest problem size explored for the CG problem, to our knowledge, and obtained about 25% improvement over PCG.
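To make the idea of a single reduction per iteration concrete, here is a minimal serial sketch of the standard PIPECG recurrences in NumPy. It is not the PIPECG-OATI or PIPECG-OATI-c method, whose recurrences the summary does not give. The function name `pipecg_sketch`, the Jacobi (diagonal) preconditioner, and the test matrix are assumptions made for illustration. In a distributed code, the grouped dot products would presumably become one non-blocking allreduce, overlapped with the preconditioner application and matrix-vector product that follow them.

```python
import numpy as np

def pipecg_sketch(A, b, m_inv_diag, tol=1e-10, max_iter=500):
    """Serial sketch of pipelined PCG with a diagonal preconditioner.

    Hypothetical helper for illustration only. A must be symmetric (or
    Hermitian) positive definite; m_inv_diag holds the inverse diagonal.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    u = m_inv_diag * r              # u = M^{-1} r
    w = A @ u                       # w = A u
    z = np.zeros_like(b)
    q = np.zeros_like(b)
    s = np.zeros_like(b)
    p = np.zeros_like(b)
    gamma_old = alpha_old = 1.0
    b_norm = np.linalg.norm(b)
    for i in range(max_iter):
        # All reductions of the iteration are grouped here. On distributed
        # memory they could be packed into one non-blocking allreduce.
        # np.vdot conjugates its first argument, so Hermitian input also works.
        gamma = np.vdot(r, u).real
        delta = np.vdot(u, w).real
        r_norm = np.linalg.norm(r)
        # These two steps do not depend on gamma or delta, so they are the
        # work that can overlap with the reduction.
        m = m_inv_diag * w          # m = M^{-1} w
        n = A @ m                   # n = A m
        if r_norm <= tol * b_norm:
            return x, i
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
        # Recurrences replace the extra matrix-vector products.
        z = n + beta * z            # z = A q
        q = m + beta * q            # q = M^{-1} s
        s = w + beta * s            # s = A p
        p = u + beta * p            # search direction
        x = x + alpha * p
        r = r - alpha * s
        u = u - alpha * q
        w = w - alpha * z
        gamma_old, alpha_old = gamma, alpha
    return x, max_iter

# Small demonstration on a random, well-conditioned SPD matrix (assumed data).
rng = np.random.default_rng(0)
size = 40
B = rng.standard_normal((size, size))
A_demo = B @ B.T + size * np.eye(size)
b_demo = rng.standard_normal(size)
x_demo, iters = pipecg_sketch(A_demo, b_demo, 1.0 / np.diag(A_demo))
print("iterations:", iters)
print("residual norm:", np.linalg.norm(A_demo @ x_demo - b_demo))
```

The sketch covers real symmetric and complex Hermitian positive definite input only. The complex symmetric case mentioned in the summary is not handled here.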
DOI: 10.1016/j.jpdc.2022.01.008