Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm

Detailed bibliography
Published in: Parallel Computing, Volume 40, Issue 7, pp. 224–238
Main authors: Ghysels, P., Vanroose, W.
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 July 2014
ISSN:0167-8191, 1872-7336
Abstract
• The manuscript presents a highly scalable preconditioned Conjugate Gradient method.
• It presents a pipelined preconditioned Conjugate Residual method.
• It shows how global communication can be overlapped with local work.
• It shows numerical stability of the methods.
• It shows improved scalability and runtime compared to CG and CR.

Scalability of Krylov subspace methods suffers from costly global synchronization steps that arise in dot products and norm calculations on parallel machines. In this work, a modified preconditioned Conjugate Gradient (CG) method is presented that removes the costly global synchronization steps from the standard CG algorithm by performing only a single non-blocking reduction per iteration. This global communication phase can be overlapped with the matrix–vector product, which typically requires only local communication. The resulting algorithm is referred to as pipelined CG. An alternative pipelined method, mathematically equivalent to the Conjugate Residual (CR) method, that makes different trade-offs with regard to scalability and serial runtime, is also considered. These methods are compared to a recently proposed asynchronous CG algorithm by Gropp. Extensive numerical experiments demonstrate the numerical stability of the methods. Moreover, it is shown that hiding the global synchronization step improves scalability on distributed memory machines using the message passing paradigm and leads to significant speedups compared to standard preconditioned CG.
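The idea summarized in the abstract can be illustrated with a minimal serial sketch. It is not the authors' implementation: the function name `pipelined_cg` and its parameters are hypothetical, the preconditioner is omitted, and NumPy on a single process stands in for a distributed-memory code. The sketch uses the commonly published pipelined CG recurrences, in which both dot products of an iteration are computed together (so that a parallel version could combine them in one non-blocking reduction) and the matrix–vector product is placed where that reduction would be in flight.

```python
import numpy as np


def pipelined_cg(A, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned pipelined CG sketch for a symmetric positive definite A.

    Serial illustration only; comments mark where a parallel code would
    start and complete its single reduction per iteration.
    """
    x = np.zeros(b.shape, dtype=float)
    r = b - A @ x          # residual
    w = A @ r              # auxiliary vector, kept equal to A r by recurrence
    p = np.zeros_like(r)   # search direction
    s = np.zeros_like(r)   # recurrence for A p
    z = np.zeros_like(r)   # recurrence for A s
    b_norm = np.linalg.norm(b) or 1.0
    gamma_old = alpha_old = 1.0  # placeholders, unused in the first iteration
    iters = 0
    for i in range(max_iter):
        # Both dot products use vectors that are already available, so a
        # parallel version could start ONE non-blocking reduction here
        # (for example MPI_Iallreduce on the pair (gamma, delta)).
        gamma = r @ r
        delta = w @ r
        # The matrix-vector product needs no result of the reduction, so it
        # could run while the reduction is still in progress.
        q = A @ w
        # A parallel version would wait for the reduction to complete here.
        if np.sqrt(gamma) <= tol * b_norm:
            break
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
        # Vector recurrences replace the extra matrix-vector products.
        z = q + beta * z
        s = w + beta * s
        p = r + beta * p
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        gamma_old, alpha_old = gamma, alpha
        iters += 1
    return x, iters
```

Compared with textbook CG, this variant carries three extra vectors (`w`, `s`, `z`) and their updates, which is the kind of trade-off between serial work and scalability that the abstract refers to.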
Author Vanroose, W.
Ghysels, P.
Author_xml – sequence: 1
  givenname: P.
  surname: Ghysels
  fullname: Ghysels, P.
  email: pieter.ghysels@ua.ac.be
  organization: University of Antwerp, Department of Mathematics and Computer Science, Middelheimlaan 1, B-2020 Antwerp, Belgium
– sequence: 2
  givenname: W.
  surname: Vanroose
  fullname: Vanroose, W.
  email: wim.vanroose@ua.ac.be
  organization: University of Antwerp, Department of Mathematics and Computer Science, Middelheimlaan 1, B-2020 Antwerp, Belgium
Cites_doi 10.1007/BF02309342
10.1137/0728088
10.2172/10176473
10.6028/jres.049.044
10.1016/j.parco.2007.04.004
10.1109/IPDPS.2008.4536305
10.1002/nla.1808
10.1137/12086563X
10.1016/0377-0427(89)90045-9
10.1016/0168-9274(95)00079-A
10.1137/0910073
10.1080/00207169208804107
10.1137/0905015
10.1007/978-3-642-33078-0_31
10.21236/ADA561766
10.1177/109434209200600411
10.1093/imanum/14.4.563
10.1137/S1064827599353865
10.1016/0167-8191(87)90037-8
10.1017/S096249290000235X
10.1016/0167-8191(96)00022-1
10.21236/ADA555879
ContentType Journal Article
Copyright 2013 Elsevier B.V.
DOI 10.1016/j.parco.2013.06.001
Discipline Computer Science
EISSN 1872-7336
EndPage 238
ExternalDocumentID 10_1016_j_parco_2013_06_001
S0167819113000719
ISICitedReferencesCount 99
ISSN 0167-8191
IsPeerReviewed true
IsScholarly true
Issue 7
Keywords Conjugate gradients
Conjugate residuals
Latency hiding
Global communication
Parallelization
Language English
PageCount 15
PublicationCentury 2000
PublicationDate 2014-07-01
PublicationDateYYYYMMDD 2014-07-01
PublicationDate_xml – month: 07
  year: 2014
  text: 2014-07-01
  day: 01
PublicationDecade 2010
PublicationTitle Parallel computing
PublicationYear 2014
Publisher Elsevier B.V
Publisher_xml – name: Elsevier B.V
References J. Brown, User-defined nonblocking collectives must make progress, in: IEEE Technical Committee on Scalable Computing (TCSC), 2012.
Hockney, Eastwood (b0120) 1988
Joubert, Carey (b0140) 1992; 44
Hestenes, Stiefel (b0115) 1952; 49
Yang, Lin (b0235) 1997
T. Hoefler, J. Squyres, G. Bosilca, G. Fagg, A. Lumsdaine, W. Rehm, Non-blocking collective operations for MPI-2, Open Systems Lab, Indiana University, Tech. Rep, 8, 2006.
Christen, Schenk, Burkhart (b0045) 2011
De Sturler, Van der Vorst (b0075) 1995; 18
Van der Vorst (b0200) 2003; vol. 13
M. Hoemmen, Communication-avoiding Krylov subspace methods, Ph.D. Thesis, University of California, 2010.
Chronopoulos (b0050) 1991; 28
E.F. D’Azevedo , C.H. Romine, Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors, Technical report, Oak Ridge National Lab, TN, 1992.
S.A. Toledo, Quantitative performance modeling of scientific computations and creating locality in numerical algorithms, Ph.D. Thesis, Massachusetts Institute of Technology, 1995.
Hoefler, Schneider, Lumsdaine (b0125) 2010
E. Carson, N. Knight, J. Demmel, Avoiding communication in two-sided Krylov subspace methods, Technical report, University of California, Berkeley, CA, USA, 2011.
Saad (b0170) 2003
L.C. McInnes, B. Smith, H. Zhang, R. Tran Mills. Hierarchical and nested Krylov methods for extreme-scale computing, Technical Report ANL/MCS-P2097-0612, Argonne National Laboratory, 2012.
E. Carson, J. Demmel, A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods, Technical Report UCB/EECS-2012-44, University of California, Berkeley, CA, USA, 2012.
W. Gropp, Update on libraries for blue waters
Saad (b0160) 1984; 5
Williams, Kalamkar, Singh, Deshpande, Van Straalen, Smelyanskiy, Almgren, Dubey, Shalf, Oliker (b0210) 2012
Schäfer, Fey (b0175) 2008
Ashby, Ghysels, Heirman, Vanroose (b0005) 2012
L. Grigori, S. Moufawad, Communication avoiding ILU(0) preconditioner, Rapport de recherche RR-8266, INRIA, March 2013.
Van Der Vorst, Ye (b0205) 1999; 22
Yang, Brent (b0225) 2002
Kim, Chronopoulos (b0145) 1992; 6
Meurant (b0155) 1987; 5
Barrett, Berry, Chan, Demmel, Donato, Dongarra, Eijkhout, Pozo, Romine, Van der Vorst (b0020) 1994
Chronopoulos, Gear (b0055) 1989; 25
Bai, Hu, Reichel (b0010) 1994; 14
S. Balay, J. Brown, K. Buschelman, W.D. Gropp, D. Kaushik, M.G. Knepley, L. Curfman McInnes, B.F. Smith, H. Zhang, PETSc Web page, 2013
.
Yang (b0220) 2002
Demmel, Heath, Van Der Vorst (b0080) 1993; 2
Ghysels, Ashby, Meerbergen, Vanroose (b0090) 2013; 35
J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick, Avoiding communication in sparse matrix computations, in: 2008 IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1–12.
Strakoš, Tichỳ (b0190) 2002; 13
Ghysels, Kłosiewicz, Vanroose (b0095) 2012; 19
D. Xie, L.R. Scott, An analysis of parallel U-cycle multigrid method.
Jed Brown, Barry F. Smith, Aron Ahmadia, Achieving textbook multigrid efficiency for hydrostatic ice flow, SIAM Journal on Scientific Computing 35 (2) (2013) 359–375. Also, preprint ANL/MCS-P743-1298.
Yang, Brent (b0230) 2003
Chronopoulos, Swanson (b0060) 1996; 22
Saad (b0165) 1989; 10
Hernandez, Roman, Tomas (b0110) 2007; 33
J.R. Shewchuk, An introduction to the conjugate gradient method without the agonizing pain, 1994.
E.F. D’Azevedo, V.L. Eijkhout, C.H. Romine, Lapack working Note 56 conjugate gradient algorithms with reduced synchronization overhead on distributed memory multiprocessors, 1999.
Sleijpen, van der Vorst (b0185) 1996; 56
10.1016/j.parco.2013.06.001_b0035
Hernandez (10.1016/j.parco.2013.06.001_b0110) 2007; 33
Meurant (10.1016/j.parco.2013.06.001_b0155) 1987; 5
10.1016/j.parco.2013.06.001_b0215
Ghysels (10.1016/j.parco.2013.06.001_b0090) 2013; 35
10.1016/j.parco.2013.06.001_b0015
Williams (10.1016/j.parco.2013.06.001_b0210) 2012
10.1016/j.parco.2013.06.001_b0135
Barrett (10.1016/j.parco.2013.06.001_b0020) 1994
10.1016/j.parco.2013.06.001_b0070
Saad (10.1016/j.parco.2013.06.001_b0160) 1984; 5
Sleijpen (10.1016/j.parco.2013.06.001_b0185) 1996; 56
Saad (10.1016/j.parco.2013.06.001_b0170) 2003
10.1016/j.parco.2013.06.001_b0130
10.1016/j.parco.2013.06.001_b0030
10.1016/j.parco.2013.06.001_b0195
Chronopoulos (10.1016/j.parco.2013.06.001_b0060) 1996; 22
Hestenes (10.1016/j.parco.2013.06.001_b0115) 1952; 49
10.1016/j.parco.2013.06.001_b0150
Yang (10.1016/j.parco.2013.06.001_b0235) 1997
Bai (10.1016/j.parco.2013.06.001_b0010) 1994; 14
Christen (10.1016/j.parco.2013.06.001_b0045) 2011
Schäfer (10.1016/j.parco.2013.06.001_b0175) 2008
Yang (10.1016/j.parco.2013.06.001_b0225) 2002
Van Der Vorst (10.1016/j.parco.2013.06.001_b0205) 1999; 22
Van der Vorst (10.1016/j.parco.2013.06.001_b0200) 2003; vol. 13
Yang (10.1016/j.parco.2013.06.001_b0220) 2002
Ashby (10.1016/j.parco.2013.06.001_b0005) 2012
Strakoš (10.1016/j.parco.2013.06.001_b0190) 2002; 13
Chronopoulos (10.1016/j.parco.2013.06.001_b0055) 1989; 25
10.1016/j.parco.2013.06.001_b0100
Kim (10.1016/j.parco.2013.06.001_b0145) 1992; 6
10.1016/j.parco.2013.06.001_b0065
10.1016/j.parco.2013.06.001_b0105
Ghysels (10.1016/j.parco.2013.06.001_b0095) 2012; 19
10.1016/j.parco.2013.06.001_b0025
Yang (10.1016/j.parco.2013.06.001_b0230) 2003
Saad (10.1016/j.parco.2013.06.001_b0165) 1989; 10
10.1016/j.parco.2013.06.001_b0180
10.1016/j.parco.2013.06.001_b0085
10.1016/j.parco.2013.06.001_b0040
Hoefler (10.1016/j.parco.2013.06.001_b0125) 2010
Hockney (10.1016/j.parco.2013.06.001_b0120) 1988
De Sturler (10.1016/j.parco.2013.06.001_b0075) 1995; 18
Demmel (10.1016/j.parco.2013.06.001_b0080) 1993; 2
Chronopoulos (10.1016/j.parco.2013.06.001_b0050) 1991; 28
Joubert (10.1016/j.parco.2013.06.001_b0140) 1992; 44
References_xml – reference: L. Grigori, S. Moufawad, Communication avoiding ILU(0) preconditioner, Rapport de recherche RR-8266, INRIA, March 2013.
– reference: D. Xie, L.R. Scott, An analysis of parallel U-cycle multigrid method.
– volume: 19
  start-page: 253
  year: 2012
  end-page: 267
  ident: b0095
  article-title: Improving the arithmetic intensity of multigrid with the help of polynomial smoothers
  publication-title: Numerical linear algebra with applications
– reference: M. Hoemmen, Communication-avoiding Krylov subspace methods, Ph.D. Thesis, University of California, 2010.
– year: 2003
  ident: b0230
  article-title: The improved parallel BiCG method for large and sparse unsymmetric linear systems on distributed memory architectures
  publication-title: Proceedings of the 16th International Parallel and Distributed Processing Symposium, IPDPS 2002
– reference: J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick, Avoiding communication in sparse matrix computations, in: 2008 IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1–12.
– volume: 2
  start-page: 111
  year: 1993
  end-page: 197
  ident: b0080
  article-title: Parallel numerical linear algebra
  publication-title: Acta Numerica
– volume: 10
  start-page: 1200
  year: 1989
  ident: b0165
  article-title: Krylov subspace methods on supercomputers
  publication-title: SIAM Journal on Scientific and Statistical Computing
– start-page: 389
  year: 1997
  end-page: 399
  ident: b0235
  article-title: The improved quasi-minimal residual method on massively distributed memory computers
  publication-title: High-Performance Computing and Networking
– year: 1994
  ident: b0020
  article-title: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
– reference: S. Balay, J. Brown, K. Buschelman, W.D. Gropp, D. Kaushik, M.G. Knepley, L. Curfman McInnes, B.F. Smith, H. Zhang, PETSc Web page, 2013,
– reference: J.R. Shewchuk, An introduction to the conjugate gradient method without the agonizing pain, 1994.
– reference: W. Gropp, Update on libraries for blue waters,
– start-page: 324
  year: 2002
  end-page: 328
  ident: b0225
  article-title: The improved BiCGStab method for large and sparse unsymmetric linear systems on parallel distributed memory architectures
  publication-title: Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002
– reference: E.F. D’Azevedo , C.H. Romine, Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors, Technical report, Oak Ridge National Lab, TN, 1992.
– volume: 33
  start-page: 521
  year: 2007
  end-page: 540
  ident: b0110
  article-title: Parallel Arnoldi eigensolvers with enhanced scalability via global communications rearrangement
  publication-title: Parallel Computing
– volume: 6
  start-page: 407
  year: 1992
  end-page: 420
  ident: b0145
  article-title: An efficient parallel algorithm for extreme eigenvalues of sparse nonsymmetric matrices
  publication-title: International Journal of High Performance Computing Applications
– start-page: 285
  year: 2008
  end-page: 294
  ident: b0175
  article-title: LibGeoDecomp: a grid-enabled library for geometric decomposition codes
  publication-title: Proceedings of the 15th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
– volume: 5
  start-page: 267
  year: 1987
  end-page: 280
  ident: b0155
  article-title: Multitasking the conjugate gradient method on the CRAY X-MP/48
  publication-title: Parallel Computing
– volume: 25
  start-page: 153
  year: 1989
  end-page: 168
  ident: b0055
  article-title: s-Step iterative methods for symmetric linear systems
  publication-title: Journal of Computational and Applied Mathematics
– start-page: 597
  year: 2010
  end-page: 604
  ident: b0125
  article-title: LogGOPSim – simulating large-scale applications in the LogGOPS model
  publication-title: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
– volume: 56
  start-page: 141
  year: 1996
  end-page: 163
  ident: b0185
  article-title: Reliable updated residuals in hybrid Bi-CG methods
  publication-title: Computing
– reference: S.A. Toledo, Quantitative performance modeling of scientific computations and creating locality in numerical algorithms, Ph.D. Thesis, Massachusetts Institute of Technology, 1995.
– reference: Jed Brown, Barry F. Smith, Aron Ahmadia, Achieving textbook multigrid efficiency for hydrostatic ice flow, SIAM Journal on Scientific Computing 35 (2) (2013) 359–375. Also, preprint ANL/MCS-P743-1298.
– volume: 49
  year: 1952
  ident: b0115
  article-title: Methods of conjugate gradients for solving linear systems
  publication-title: Journal of Research of the National Bureau of Standards
– reference: E. Carson, J. Demmel, A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods, Technical Report UCB/EECS-2012-44, University of California, Berkeley, CA, USA, 2012.
– volume: 28
  start-page: 1776
  year: 1991
  end-page: 1789
  ident: b0050
  article-title: s-Step iterative methods for (non) symmetric (in) definite linear systems
  publication-title: SIAM Journal on Numerical Analysis
– reference: T. Hoefler, J. Squyres, G. Bosilca, G. Fagg, A. Lumsdaine, W. Rehm, Non-blocking collective operations for MPI-2, Open Systems Lab, Indiana University, Tech. Rep, 8, 2006.
– volume: 5
  start-page: 203
  year: 1984
  end-page: 228
  ident: b0160
  article-title: Practical use of some Krylov subspace methods for solving indefinite and nonsymmetric linear systems
  publication-title: SIAM Journal on Scientific and Statistical Computing
– reference: L.C. McInnes, B. Smith, H. Zhang, R. Tran Mills. Hierarchical and nested Krylov methods for extreme-scale computing, Technical Report ANL/MCS-P2097-0612, Argonne National Laboratory, 2012.
– volume: vol. 13
  year: 2003
  ident: b0200
  publication-title: Iterative Krylov Methods for Large Linear Systems
– volume: 22
  start-page: 835
  year: 1999
  end-page: 852
  ident: b0205
  article-title: Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals
  publication-title: SIAM Journal on Scientific Computing
– reference: E. Carson, N. Knight, J. Demmel, Avoiding communication in two-sided Krylov subspace methods, Technical report, University of California, Berkeley, CA, USA, 2011.
– volume: 13
  start-page: 56
  year: 2002
  end-page: 80
  ident: b0190
  article-title: On error estimation in the conjugate gradient method and why it works in finite precision computations
  publication-title: Electronic Transactions on Numerical Analysis
– volume: 35
  start-page: C48
  year: 2013
  end-page: C71
  ident: b0090
  article-title: Hiding global communication latency in the GMRES algorithm on massively parallel machines
  publication-title: SIAM Journal on Scientific Computing
– start-page: 232
  year: 2002
  end-page: 237
  ident: b0220
  article-title: The improved CGS method for large and sparse linear systems on bulk synchronous parallel architectures
  publication-title: Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002
– start-page: 96
  year: 2012
  ident: b0210
  article-title: Optimization of geometric multigrid for emerging multi-and manycore processors
  publication-title: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
– reference: .
– start-page: 428
  year: 2012
  end-page: 442
  ident: b0005
  article-title: The impact of global communication latency at extreme scales on Krylov methods
  publication-title: Algorithms and Architectures for Parallel Processing
– reference: J. Brown, User-defined nonblocking collectives must make progress, in: IEEE Technical Committee on Scalable Computing (TCSC), 2012.
– reference: E.F. D’Azevedo, V.L. Eijkhout, C.H. Romine, Lapack working Note 56 conjugate gradient algorithms with reduced synchronization overhead on distributed memory multiprocessors, 1999.
– volume: 18
  start-page: 441
  year: 1995
  end-page: 459
  ident: b0075
  article-title: Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers
  publication-title: Applied Numerical Mathematics
– year: 1988
  ident: b0120
  article-title: Computer Simulation Using Particles
– start-page: 676
  year: 2011
  end-page: 687
  ident: b0045
  article-title: Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures
  publication-title: Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
– volume: 44
  start-page: 243
  year: 1992
  end-page: 267
  ident: b0140
  article-title: Parallelizable restarted iterative methods for nonsymmetric linear systems. Part I: Theory
  publication-title: International Journal of Computer Mathematics
– volume: 14
  start-page: 563
  year: 1994
  end-page: 581
  ident: b0010
  article-title: A Newton basis GMRES implementation
  publication-title: IMA Journal of Numerical Analysis
– year: 2003
  ident: b0170
  article-title: Iterative Methods for Sparse Linear Systems
– volume: 22
  start-page: 623
  year: 1996
  end-page: 641
  ident: b0060
  article-title: Parallel iterative s-step methods for unsymmetric linear systems
  publication-title: Parallel Computing
– start-page: 597
  year: 2010
  ident: 10.1016/j.parco.2013.06.001_b0125
  article-title: LogGOPSim – simulating large-scale applications in the LogGOPS model
– volume: 56
  start-page: 141
  issue: 2
  year: 1996
  ident: 10.1016/j.parco.2013.06.001_b0185
  article-title: Reliable updated residuals in hybrid Bi-CG methods
  publication-title: Computing
  doi: 10.1007/BF02309342
– volume: 28
  start-page: 1776
  issue: 6
  year: 1991
  ident: 10.1016/j.parco.2013.06.001_b0050
  article-title: s-Step iterative methods for (non) symmetric (in) definite linear systems
  publication-title: SIAM Journal on Numerical Analysis
  doi: 10.1137/0728088
– ident: 10.1016/j.parco.2013.06.001_b0070
  doi: 10.2172/10176473
– volume: 49
  issue: 6
  year: 1952
  ident: 10.1016/j.parco.2013.06.001_b0115
  article-title: Methods of conjugate gradients for solving linear systems
  publication-title: Journal of Research of the National Bureau of Standards
  doi: 10.6028/jres.049.044
– volume: 33
  start-page: 521
  issue: 7–8
  year: 2007
  ident: 10.1016/j.parco.2013.06.001_b0110
  article-title: Parallel Arnoldi eigensolvers with enhanced scalability via global communications rearrangement
  publication-title: Parallel Computing
  doi: 10.1016/j.parco.2007.04.004
– ident: 10.1016/j.parco.2013.06.001_b0085
  doi: 10.1109/IPDPS.2008.4536305
– year: 1988
  ident: 10.1016/j.parco.2013.06.001_b0120
– ident: 10.1016/j.parco.2013.06.001_b0025
– ident: 10.1016/j.parco.2013.06.001_b0130
– volume: 19
  start-page: 253
  issue: 2
  year: 2012
  ident: 10.1016/j.parco.2013.06.001_b0095
  article-title: Improving the arithmetic intensity of multigrid with the help of polynomial smoothers
  publication-title: Numerical linear algebra with applications
  doi: 10.1002/nla.1808
– volume: 35
  start-page: C48
  issue: 1
  year: 2013
  ident: 10.1016/j.parco.2013.06.001_b0090
  article-title: Hiding global communication latency in the GMRES algorithm on massively parallel machines
  publication-title: SIAM Journal on Scientific Computing
  doi: 10.1137/12086563X
– ident: 10.1016/j.parco.2013.06.001_b0150
– year: 1994
  ident: 10.1016/j.parco.2013.06.001_b0020
– volume: 25
  start-page: 153
  issue: 2
  year: 1989
  ident: 10.1016/j.parco.2013.06.001_b0055
  article-title: s-Step iterative methods for symmetric linear systems
  publication-title: Journal of Computational and Applied Mathematics
  doi: 10.1016/0377-0427(89)90045-9
– start-page: 324
  year: 2002
  ident: 10.1016/j.parco.2013.06.001_b0225
  article-title: The improved BiCGStab method for large and sparse unsymmetric linear systems on parallel distributed memory architectures
– volume: 18
  start-page: 441
  issue: 4
  year: 1995
  ident: 10.1016/j.parco.2013.06.001_b0075
  article-title: Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers
  publication-title: Applied Numerical Mathematics
  doi: 10.1016/0168-9274(95)00079-A
– ident: 10.1016/j.parco.2013.06.001_b0195
– ident: 10.1016/j.parco.2013.06.001_b0065
– start-page: 96
  year: 2012
  ident: 10.1016/j.parco.2013.06.001_b0210
  article-title: Optimization of geometric multigrid for emerging multi-and manycore processors
– year: 2003
  ident: 10.1016/j.parco.2013.06.001_b0170
– ident: 10.1016/j.parco.2013.06.001_b0015
– ident: 10.1016/j.parco.2013.06.001_b0105
– volume: 10
  start-page: 1200
  year: 1989
  ident: 10.1016/j.parco.2013.06.001_b0165
  article-title: Krylov subspace methods on supercomputers
  publication-title: SIAM Journal on Scientific and Statistical Computing
  doi: 10.1137/0910073
– ident: 10.1016/j.parco.2013.06.001_b0030
– volume: 44
  start-page: 243
  issue: 1–4
  year: 1992
  ident: 10.1016/j.parco.2013.06.001_b0140
  article-title: Parallelizable restarted iterative methods for nonsymmetric linear systems. Part I: Theory
  publication-title: International Journal of Computer Mathematics
  doi: 10.1080/00207169208804107
– volume: 5
  start-page: 203
  issue: 1
  year: 1984
  ident: 10.1016/j.parco.2013.06.001_b0160
  article-title: Practical use of some Krylov subspace methods for solving indefinite and nonsymmetric linear systems
  publication-title: SIAM Journal on Scientific and Statistical Computing
  doi: 10.1137/0905015
– start-page: 428
  year: 2012
  ident: 10.1016/j.parco.2013.06.001_b0005
  article-title: The impact of global communication latency at extreme scales on Krylov methods
  publication-title: Algorithms and Architectures for Parallel Processing
  doi: 10.1007/978-3-642-33078-0_31
– ident: 10.1016/j.parco.2013.06.001_b0035
  doi: 10.21236/ADA561766
– volume: 13
  start-page: 56
  year: 2002
  ident: 10.1016/j.parco.2013.06.001_b0190
  article-title: On error estimation in the conjugate gradient method and why it works in finite precision computations
  publication-title: Electronic Transactions on Numerical Analysis
– year: 2003
  ident: 10.1016/j.parco.2013.06.001_b0230
  article-title: The improved parallel BiCG method for large and sparse unsymmetric linear systems on distributed memory architectures
– volume: 6
  start-page: 407
  issue: 4
  year: 1992
  ident: 10.1016/j.parco.2013.06.001_b0145
  article-title: An efficient parallel algorithm for extreme eigenvalues of sparse nonsymmetric matrices
  publication-title: International Journal of High Performance Computing Applications
  doi: 10.1177/109434209200600411
– start-page: 232
  year: 2002
  ident: 10.1016/j.parco.2013.06.001_b0220
  article-title: The improved CGS method for large and sparse linear systems on bulk synchronous parallel architectures
– volume: 13
  year: 2003
  ident: 10.1016/j.parco.2013.06.001_b0200
– ident: 10.1016/j.parco.2013.06.001_b0135
– volume: 14
  start-page: 563
  issue: 4
  year: 1994
  ident: 10.1016/j.parco.2013.06.001_b0010
  article-title: A Newton basis GMRES implementation
  publication-title: IMA Journal of Numerical Analysis
  doi: 10.1093/imanum/14.4.563
– start-page: 676
  year: 2011
  ident: 10.1016/j.parco.2013.06.001_b0045
  article-title: Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures
– start-page: 389
  year: 1997
  ident: 10.1016/j.parco.2013.06.001_b0235
  article-title: The improved quasi-minimal residual method on massively distributed memory computers
– volume: 22
  start-page: 835
  issue: 3
  year: 1999
  ident: 10.1016/j.parco.2013.06.001_b0205
  article-title: Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals
  publication-title: SIAM Journal on Scientific Computing
  doi: 10.1137/S1064827599353865
– volume: 5
  start-page: 267
  issue: 3
  year: 1987
  ident: 10.1016/j.parco.2013.06.001_b0155
  article-title: Multitasking the conjugate gradient method on the CRAY X-MP/48
  publication-title: Parallel Computing
  doi: 10.1016/0167-8191(87)90037-8
– volume: 2
  start-page: 111
  year: 1993
  ident: 10.1016/j.parco.2013.06.001_b0080
  article-title: Parallel numerical linear algebra
  publication-title: Acta Numerica
  doi: 10.1017/S096249290000235X
– ident: 10.1016/j.parco.2013.06.001_b0100
– volume: 22
  start-page: 623
  issue: 5
  year: 1996
  ident: 10.1016/j.parco.2013.06.001_b0060
  article-title: Parallel iterative s-step methods for unsymmetric linear systems
  publication-title: Parallel Computing
  doi: 10.1016/0167-8191(96)00022-1
– ident: 10.1016/j.parco.2013.06.001_b0180
– ident: 10.1016/j.parco.2013.06.001_b0040
  doi: 10.21236/ADA555879
– ident: 10.1016/j.parco.2013.06.001_b0215
– start-page: 285
  year: 2008
  ident: 10.1016/j.parco.2013.06.001_b0175
  article-title: LibGeoDecomp: a grid-enabled library for geometric decomposition codes
SourceID proquest
crossref
elsevier
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 224
SubjectTerms Algorithms
Conjugate gradients
Conjugate residuals
Distributed memory
Global communication
Latency hiding
Mathematical models
Parallelization
Run time (computers)
Synchronism
Synchronization
Title Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm
URI https://dx.doi.org/10.1016/j.parco.2013.06.001
https://www.proquest.com/docview/1559715838
Volume 40
WOSCitedRecordID wos000339598400007
linkProvider Elsevier