Parallel matrix transpose algorithms on distributed memory concurrent computers

This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability....

Bibliographic details
Published in: Parallel computing, Vol. 21, No. 9, pp. 1387-1405
Main authors: Choi, Jaeyoung; Dongarra, Jack J.; Walker, David W.
Format: Journal Article
Language: English
Published: Amsterdam, Elsevier B.V., 1 September 1995
Elsevier
Keywords:
ISSN: 0167-8191, 1872-7336
Online access: Full text
Abstract This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
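The role of the GCD and LCM described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only models block ownership under a block cyclic distribution, assuming block (i, j) lives on processor (i mod P, j mod Q), and counts the distinct processors each processor must send to when A is transposed. The function names `owner`, `destinations`, and `summarize` are hypothetical and chosen for illustration.

```python
from math import gcd


def owner(i, j, P, Q):
    """Processor (row, col) holding block (i, j); assumes a plain block cyclic mapping."""
    return (i % P, j % Q)


def destinations(p, q, P, Q):
    """Set of processors that processor (p, q) sends blocks to during a transpose."""
    lcm = P * Q // gcd(P, Q)
    dests = set()
    # One LCM x LCM tile of blocks already covers every ownership pattern.
    for i in range(p, lcm, P):
        for j in range(q, lcm, Q):
            # Block (i, j) of A becomes block (j, i) of A^T.
            dests.add(owner(j, i, P, Q))
    return dests


def summarize(P, Q):
    """Return (GCD, LCM, LCM // GCD, set of destination counts over all processors)."""
    g = gcd(P, Q)
    lcm = P * Q // g
    counts = {len(destinations(p, q, P, Q)) for p in range(P) for q in range(Q)}
    return g, lcm, lcm // g, counts
```

For a 2 × 3 template, where P and Q are relatively prime, `summarize(2, 3)` returns `(1, 6, 6, {6})`: every processor's blocks are spread over all six processors, which is consistent with the complete exchange mentioned in the abstract. For a 4 × 6 template, `summarize(4, 6)` returns `(2, 12, 6, {6})`, so in this model each processor has LCM/GCD = 6 distinct destinations. That this count equals the step count quoted in the abstract is an observation about the sketch; how the paper actually schedules its steps is not stated in this record.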
Author Choi, Jaeyoung
Walker, David W.
Dongarra, Jack J.
Author_xml – sequence: 1
  givenname: Jaeyoung
  surname: Choi
  fullname: Choi, Jaeyoung
  email: choi@msr.epm.ornl.gov
  organization: School of Computing, Soongsil University, 1-1 Sangdo-Dong, Dongjak-Ku, Seoul 156-743, South Korea
– sequence: 2
  givenname: Jack J.
  surname: Dongarra
  fullname: Dongarra, Jack J.
  organization: Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367, USA
– sequence: 3
  givenname: David W.
  surname: Walker
  fullname: Walker, David W.
  organization: Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367, USA
BackLink http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=3683593$$DView record in Pascal Francis
CODEN PACOEJ
CitedBy_id crossref_primary_10_1016_j_micpro_2018_09_002
crossref_primary_10_1016_j_jocs_2023_101945
crossref_primary_10_1002_cpe_639
crossref_primary_10_1016_j_parco_2019_102597
crossref_primary_10_1109_TPDS_2021_3131657
crossref_primary_10_1177_10943420231205601
crossref_primary_10_1007_s10766_017_0515_0
crossref_primary_10_1016_j_parco_2009_01_003
crossref_primary_10_1016_j_parco_2020_102624
Cites_doi 10.1137/0609037
10.1002/cpe.4330060702
10.1109/T-C.1972.223584
10.1109/TC.1987.5009457
ContentType Journal Article
Copyright 1995
1995 INIST-CNRS
Copyright_xml – notice: 1995
– notice: 1995 INIST-CNRS
DBID AAYXX
CITATION
IQODW
DOI 10.1016/0167-8191(95)00016-H
DatabaseName CrossRef
Pascal-Francis
DatabaseTitle CrossRef
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
Applied Sciences
EISSN 1872-7336
EndPage 1405
ExternalDocumentID 3683593
10_1016_0167_8191_95_00016_H
016781919500016H
ISICitedReferencesCount 24
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=016781919500016H&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 0167-8191
IngestDate Mon Jul 21 09:17:05 EDT 2025
Tue Nov 18 21:26:31 EST 2025
Sat Nov 29 03:58:55 EST 2025
Fri Feb 23 02:30:42 EST 2024
IsPeerReviewed true
IsScholarly true
Issue 9
Keywords Distributed memory multiprocessors
Matrix transpose algorithm
Intel Touchstone Delta
Point-to-point communication
Linear algebra
Matrix diagonalization
Parallel algorithm
Distributed memory multiprocessor system
Matrix inversion
Distributed algorithm
Matrix calculus
Matrix product
Parallelism
Distributed system
Implementation
Point to point communication
Language English
License https://www.elsevier.com/tdm/userlicense/1.0
CC BY 4.0
LinkModel OpenURL
PageCount 19
ParticipantIDs pascalfrancis_primary_3683593
crossref_primary_10_1016_0167_8191_95_00016_H
crossref_citationtrail_10_1016_0167_8191_95_00016_H
elsevier_sciencedirect_doi_10_1016_0167_8191_95_00016_H
PublicationCentury 1900
PublicationDate 1995-09-01
PublicationDateYYYYMMDD 1995-09-01
PublicationDate_xml – month: 09
  year: 1995
  text: 1995-09-01
  day: 01
PublicationDecade 1990
PublicationPlace Amsterdam
PublicationPlace_xml – name: Amsterdam
PublicationTitle Parallel computing
PublicationYear 1995
Publisher Elsevier B.V
Elsevier
Publisher_xml – name: Elsevier B.V
– name: Elsevier
References Azari, Bojanczyk, Lee (BIB1) 1988
Dongarra, van de Geijn, Walker (BIB6) 1992
Choi, Dongarra, Walker (BIB4) 1992
O'Leary (BIB12) 1987; 36
Johnsson, Ho (BIB10) 1988; 9
Littlefield (BIB11) 1992
Bokhari, Berryman (BIB2) 1992
Golub, Van Loan (BIB8) 1989
Intel Corporation (BIB9) 1991
Strang (BIB13) 1988
Takkella, Seidel (BIB14) 1994
Choi, Dongarra, Pozo, Walker (BIB3) 1992
Choi, Dongarra, Walker (BIB5) 1994; 6
Eklundh (BIB7) 1972; 21
References_xml – year: 1991
  ident: BIB9
  publication-title: Touchstone Delta Fortran Calls Reference Manual
– volume: 6
  start-page: 543
  year: 1994
  end-page: 570
  ident: BIB5
  article-title: PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers
  publication-title: Concurrency: Practice and Experience
– volume: 36
  start-page: 117
  year: 1987
  end-page: 122
  ident: BIB12
  article-title: Systolic arrays for matrix transpose and other reorderings
  publication-title: IEEE Trans. Comput.
– start-page: 277
  year: 1988
  end-page: 288
  ident: BIB1
  article-title: Synchronous and asynchronous algorithms for matrix transposition on MCAP
  publication-title: SPIE Vol. 975, Advanced Algorithms and Architecture for Signal Processing III
– volume: 9
  start-page: 419
  year: 1988
  end-page: 454
  ident: BIB10
  article-title: Algorithms for matrix transposition on boolean n-cube configured ensemble architecture
  publication-title: SIAM J. Matrix Anal. Appl.
– start-page: 372
  year: 1992
  end-page: 379
  ident: BIB6
  article-title: A look at scalable linear algebra libraries
  publication-title: Proc. 1992 Scalable High Performance Computing Conf.
– year: 1989
  ident: BIB8
  publication-title: Matrix Computations
– volume: 21
  start-page: 801
  year: 1972
  end-page: 803
  ident: BIB7
  article-title: A fast computer method for matrix transposing
  publication-title: IEEE Trans. Comput.
– start-page: 422
  year: 1994
  end-page: 428
  ident: BIB14
  article-title: Complete exchange and broadcast algorithms for meshes
  publication-title: Proc. Scalable High Performance Computing Conf.
– start-page: 300
  year: 1992
  end-page: 306
  ident: BIB2
  article-title: Complete exchange on a circuit switched mesh
  publication-title: Proc. Scalable High Performance Computing Conf.
– year: 1992
  ident: BIB3
  article-title: ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers
  publication-title: Proc. Fourth Symp. on the Frontiers of Massively Parallel Computation (McLean, Virginia)
– year: 1992
  ident: BIB4
  article-title: The design of scalable software libraries for distributed memory concurrent computers
  publication-title: Proc. Environment and Tools for Parallel Scientific Computing Workshop (Saint Hilaire du Touvet, France)
– start-page: 179
  year: 1992
  end-page: 190
  ident: BIB11
  article-title: Characterizing and tuning communications performance for real applications
  publication-title: Proc. First Intel Delta Application Workshop, CCSF-14-92
– year: 1988
  ident: BIB13
  publication-title: Linear Algebra and Its Applications
– year: 1992
  ident: 10.1016/0167-8191(95)00016-H_BIB4
  article-title: The design of scalable software libraries for distributed memory concurrent computers
– volume: 9
  start-page: 419
  year: 1988
  ident: 10.1016/0167-8191(95)00016-H_BIB10
  article-title: Algorithms for matrix transposition on boolean n-cube configured ensemble architecture
  publication-title: SIAM J. Matrix Anal. Appl.
  doi: 10.1137/0609037
– year: 1989
  ident: 10.1016/0167-8191(95)00016-H_BIB8
– volume: 6
  start-page: 543
  year: 1994
  ident: 10.1016/0167-8191(95)00016-H_BIB5
  article-title: PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers
  publication-title: Concurrency: Practice and Experience
  doi: 10.1002/cpe.4330060702
– year: 1988
  ident: 10.1016/0167-8191(95)00016-H_BIB13
– start-page: 422
  year: 1994
  ident: 10.1016/0167-8191(95)00016-H_BIB14
  article-title: Complete exchange and broadcast algorithms for meshes
– volume: 21
  start-page: 801
  year: 1972
  ident: 10.1016/0167-8191(95)00016-H_BIB7
  article-title: A fast computer method for matrix transposing
  publication-title: IEEE Trans. Comput.
  doi: 10.1109/T-C.1972.223584
– start-page: 372
  year: 1992
  ident: 10.1016/0167-8191(95)00016-H_BIB6
  article-title: A look at scalable linear algebra libraries
– year: 1991
  ident: 10.1016/0167-8191(95)00016-H_BIB9
  publication-title: Touchstone Delta Fortran Calls Reference Manual
– volume: 36
  start-page: 117
  year: 1987
  ident: 10.1016/0167-8191(95)00016-H_BIB12
  article-title: Systolic arrays for matrix transpose and other reorderings
  publication-title: IEEE Trans. Comput.
  doi: 10.1109/TC.1987.5009457
– year: 1992
  ident: 10.1016/0167-8191(95)00016-H_BIB3
  article-title: ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers
– start-page: 179
  year: 1992
  ident: 10.1016/0167-8191(95)00016-H_BIB11
  article-title: Characterizing and tuning communications performance for real applications
– start-page: 277
  year: 1988
  ident: 10.1016/0167-8191(95)00016-H_BIB1
  article-title: Synchronous and asynchronous algorithms for matrix transposition on MCAP
– start-page: 300
  year: 1992
  ident: 10.1016/0167-8191(95)00016-H_BIB2
  article-title: Complete exchange on a circuit switched mesh
SSID ssj0006480
Score 1.533283
SourceID pascalfrancis
crossref
elsevier
SourceType Index Database
Enrichment Source
Publisher
StartPage 1387
SubjectTerms Algorithmics. Computability. Computer arithmetics
Applied sciences
Computer science; control theory; systems
Computer systems and distributed systems. User interface
Distributed memory multiprocessors
Exact sciences and technology
Intel Touchstone Delta
Linear algebra
Matrix transpose algorithm
Memory and file management (including protection and security)
Memory organisation. Data processing
Point-to-point communication
Software
Theoretical computing
Title Parallel matrix transpose algorithms on distributed memory concurrent computers
URI https://dx.doi.org/10.1016/0167-8191(95)00016-H
Volume 21
WOSCitedRecordID wos016781919500016H&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVESC
  databaseName: Elsevier SD Freedom Collection Journals 2021
  customDbUrl:
  eissn: 1872-7336
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0006480
  issn: 0167-8191
  databaseCode: AIEXJ
  dateStart: 19950101
  isFulltext: true
  titleUrlDefault: https://www.sciencedirect.com
  providerName: Elsevier
linkProvider Elsevier