Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm

Bibliographic details
Published in:	Parallel Computing, Volume 40, Issue 7, pp. 224-238
Main authors:	Ghysels, P., Vanroose, W.
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 1 July 2014
Subjects:	Algorithms; Conjugate gradients; Conjugate residuals; Distributed memory; Global communication; Latency hiding; Mathematical models; Parallelization; Run time (computers); Synchronism; Synchronization
ISSN:	0167-8191, 1872-7336

Abstract
• The manuscript presents a highly scalable preconditioned Conjugate Gradient method.
• It presents a pipelined preconditioned Conjugate Residual method.
• It shows how global communication can be overlapped with local work.
• It shows numerical stability of the methods.
• It shows improved scalability and runtime compared to CG and CR.

Scalability of Krylov subspace methods suffers from costly global synchronization steps that arise in dot products and norm calculations on parallel machines. In this work, a modified preconditioned Conjugate Gradient (CG) method is presented that removes the costly global synchronization steps from the standard CG algorithm by performing only a single non-blocking reduction per iteration. This global communication phase can be overlapped with the matrix-vector product, which typically requires only local communication. The resulting algorithm is referred to as pipelined CG. An alternative pipelined method, mathematically equivalent to the Conjugate Residual (CR) method, which makes different trade-offs with regard to scalability and serial runtime, is also considered. These methods are compared to a recently proposed asynchronous CG algorithm by Gropp. Extensive numerical experiments demonstrate the numerical stability of the methods. Moreover, it is shown that hiding the global synchronization step improves scalability on distributed memory machines using the message passing paradigm and leads to significant speedups compared to standard preconditioned CG.
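The idea in the abstract can be illustrated with a minimal serial sketch of an unpreconditioned pipelined CG recurrence. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `dot`, `axpy`, `pipelined_cg`) and the 1D Laplacian test operator are hypothetical, preconditioning is omitted, and no message passing is actually performed. The comments mark where a parallel code would start the single non-blocking reduction and where it would wait for the result.

```python
def matvec(v):
    """Apply the 1D Laplacian stencil [-1, 2, -1]; a stand-in for a sparse SPD matrix."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def dot(u, v):
    """Local dot product; on a parallel machine this needs a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def axpy(a, x, y):
    """Return a*x + y for plain Python lists."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def pipelined_cg(matvec, b, x0, tol=1e-10, max_iter=200):
    """Unpreconditioned pipelined CG sketch; returns (x, iterations)."""
    n = len(b)
    x = list(x0)
    r = axpy(-1.0, matvec(x), b)      # r = b - A x
    w = matvec(r)                     # w = A r
    p = [0.0] * n                     # search direction
    s = [0.0] * n                     # s = A p
    z = [0.0] * n                     # z = A s
    gamma_old = alpha_old = None
    for it in range(max_iter):
        # Both dot products would go into ONE non-blocking reduction,
        # started here ...
        gamma = dot(r, r)
        delta = dot(w, r)
        # ... and overlapped with this matrix-vector product, which
        # typically needs only neighbour (local) communication.
        q = matvec(w)
        # A parallel code would wait for the reduction result here.
        if gamma ** 0.5 <= tol:
            return x, it
        if it == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
        # Recurrences replace the extra matrix-vector products.
        z = axpy(beta, z, q)          # z = q + beta z
        s = axpy(beta, s, w)          # s = w + beta s
        p = axpy(beta, p, r)          # p = r + beta p
        x = axpy(alpha, p, x)         # x = x + alpha p
        r = axpy(-alpha, s, r)        # r = r - alpha s
        w = axpy(-alpha, z, w)        # w = w - alpha z
        gamma_old, alpha_old = gamma, alpha
    return x, max_iter


# Small check: choose b = A * ones so the exact solution is a vector of ones.
n = 20
b = matvec([1.0] * n)
x, iters = pipelined_cg(matvec, b, [0.0] * n)
```

In this sketch each iteration has one point where global results are needed, whereas a textbook CG iteration needs the results of its dot products at two separate points. The price is three extra vector recurrences (`z`, `s`, `w`), which is the kind of trade-off between scalability and serial runtime that the abstract mentions.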
Author affiliations:
P. Ghysels (pieter.ghysels@ua.ac.be), University of Antwerp, Department of Mathematics and Computer Science, Middelheimlaan 1, B-2020 Antwerp, Belgium
W. Vanroose (wim.vanroose@ua.ac.be), University of Antwerp, Department of Mathematics and Computer Science, Middelheimlaan 1, B-2020 Antwerp, Belgium
Copyright:	2013 Elsevier B.V.
DOI:	10.1016/j.parco.2013.06.001
Discipline:	Computer Science
Peer reviewed:	Yes
Page count:	15
SourceID	proquest crossref elsevier
SourceType	Aggregation Database Enrichment Source Index Database Publisher
StartPage	224
SubjectTerms	Algorithms; Conjugate gradients; Conjugate residuals; Distributed memory; Global communication; Latency hiding; Mathematical models; Parallelization; Run time (computers); Synchronism; Synchronization
Title	Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm
URI	https://dx.doi.org/10.1016/j.parco.2013.06.001 https://www.proquest.com/docview/1559715838
Volume	40
WOSCitedRecordID	wos000339598400007