Parallel matrix transpose algorithms on distributed memory concurrent computers
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability....
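The block cyclic layout mentioned above can be illustrated with a small sketch. It is not taken from the paper: the helper names `owner` and `transpose_destinations` are hypothetical, blocks are indexed from zero, and block (i, j) is assumed to live on processor (i mod P, j mod Q). Under those assumptions the sketch lists which processors must receive data from a given processor when the matrix is transposed, which shows why the greatest common divisor of P and Q matters.

```python
from math import gcd


def owner(i, j, P, Q):
    """Processor (row, col) assumed to hold block (i, j) in a block cyclic layout."""
    return (i % P, j % Q)


def transpose_destinations(p, q, P, Q):
    """Processors that receive at least one block from processor (p, q).

    Block (i, j) of A becomes block (j, i) of the transpose, so it moves
    from owner(i, j) to owner(j, i). The ownership pattern repeats with
    period lcm(P, Q) in both directions, so one period is enough.
    """
    lcm = P * Q // gcd(P, Q)
    dests = set()
    for i in range(p, p + lcm, P):      # block rows held by processor row p
        for j in range(q, q + lcm, Q):  # block columns held by processor column q
            dests.add(owner(j, i, P, Q))
    return dests


if __name__ == "__main__":
    for P, Q in [(2, 3), (4, 6)]:
        g = gcd(P, Q)
        lcm = P * Q // g
        d = transpose_destinations(0, 0, P, Q)
        print(f"P={P}, Q={Q}: gcd={g}, lcm/gcd={lcm // g}, destinations={sorted(d)}")
```

In this sketch, relatively prime P and Q make processor (0, 0) send to every processor, which matches the complete exchange case in the abstract. When the GCD is larger than one, the destination set shrinks to (P/GCD)·(Q/GCD) processors, a count that equals LCM/GCD. How the paper actually schedules those transfers into steps is not shown here.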
| Published in: | Parallel Computing, Vol. 21, No. 9, pp. 1387-1405 |
|---|---|
| Main authors: | Choi, Jaeyoung; Dongarra, Jack J.; Walker, David W. |
| Format: | Journal Article |
| Language: | English |
| Published: | Amsterdam: Elsevier B.V., 1 September 1995 |
| Subjects: | |
| ISSN: | 0167-8191, 1872-7336 |
| Online access: | Full text |
| Abstract | This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, $C = A \cdot B$, the algorithms are used to compute parallel multiplications of transposed matrices, $C = A^T \cdot B^T$, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer. |
|---|---|
| Author | Choi, Jaeyoung Walker, David W. Dongarra, Jack J. |
| Author_xml | – sequence: 1 givenname: Jaeyoung surname: Choi fullname: Choi, Jaeyoung email: choi@msr.epm.ornl.gov organization: School of Computing, Soongsil University, 1-1 Sangdo-Dong, Dongjak-Ku, Seoul 156-743, South Korea – sequence: 2 givenname: Jack J. surname: Dongarra fullname: Dongarra, Jack J. organization: Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367, USA – sequence: 3 givenname: David W. surname: Walker fullname: Walker, David W. organization: Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367, USA |
| BackLink | http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=3683593$$DView record in Pascal Francis |
| CODEN | PACOEJ |
| CitedBy_id | crossref_primary_10_1016_j_micpro_2018_09_002 crossref_primary_10_1016_j_jocs_2023_101945 crossref_primary_10_1002_cpe_639 crossref_primary_10_1016_j_parco_2019_102597 crossref_primary_10_1109_TPDS_2021_3131657 crossref_primary_10_1177_10943420231205601 crossref_primary_10_1007_s10766_017_0515_0 crossref_primary_10_1016_j_parco_2009_01_003 crossref_primary_10_1016_j_parco_2020_102624 |
| Cites_doi | 10.1137/0609037 10.1002/cpe.4330060702 10.1109/T-C.1972.223584 10.1109/TC.1987.5009457 |
| ContentType | Journal Article |
| Copyright | 1995 1995 INIST-CNRS |
| Copyright_xml | – notice: 1995 – notice: 1995 INIST-CNRS |
| DBID | AAYXX CITATION IQODW |
| DOI | 10.1016/0167-8191(95)00016-H |
| DatabaseName | CrossRef Pascal-Francis |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science Applied Sciences |
| EISSN | 1872-7336 |
| EndPage | 1405 |
| ExternalDocumentID | 3683593 10_1016_0167_8191_95_00016_H 016781919500016H |
| ID | FETCH-LOGICAL-c333t-a22d47dba5ab31ad6e6c48948a0ea834a0cfb37c54954d823d4668a96b2b3a7a3 |
| ISICitedReferencesCount | 24 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=016781919500016H&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 0167-8191 |
| IngestDate | Mon Jul 21 09:17:05 EDT 2025 Tue Nov 18 21:26:31 EST 2025 Sat Nov 29 03:58:55 EST 2025 Fri Feb 23 02:30:42 EST 2024 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 9 |
| Keywords | Distributed memory multiprocessors; Matrix transpose algorithm; Intel Touchstone Delta; Point-to-point communication; Linear algebra; Matrix diagonalization; Parallel algorithm; Distributed memory multiprocessor system; Matrix inversion; Distributed algorithm; Matrix calculus; Matrix product; Parallelism; Distributed system; Implementation; Point to point communication |
| Language | English |
| License | https://www.elsevier.com/tdm/userlicense/1.0 CC BY 4.0 |
| LinkModel | OpenURL |
| MergedId | FETCHMERGED-LOGICAL-c333t-a22d47dba5ab31ad6e6c48948a0ea834a0cfb37c54954d823d4668a96b2b3a7a3 |
| PageCount | 19 |
| ParticipantIDs | pascalfrancis_primary_3683593 crossref_primary_10_1016_0167_8191_95_00016_H crossref_citationtrail_10_1016_0167_8191_95_00016_H elsevier_sciencedirect_doi_10_1016_0167_8191_95_00016_H |
| PublicationCentury | 1900 |
| PublicationDate | 1995-09-01 |
| PublicationDateYYYYMMDD | 1995-09-01 |
| PublicationDate_xml | – month: 09 year: 1995 text: 1995-09-01 day: 01 |
| PublicationDecade | 1990 |
| PublicationPlace | Amsterdam |
| PublicationPlace_xml | – name: Amsterdam |
| PublicationTitle | Parallel computing |
| PublicationYear | 1995 |
| Publisher | Elsevier B.V Elsevier |
| Publisher_xml | – name: Elsevier B.V – name: Elsevier |
| References | [1] Azari, Bojanczyk, Lee (1988). Synchronous and asynchronous algorithms for matrix transposition on MCAP. SPIE Vol. 975, Advanced Algorithms and Architecture for Signal Processing III, pp. 277-288. [2] Bokhari, Berryman (1992). Complete exchange on a circuit switched mesh. Proc. Scalable High Performance Computing Conf., pp. 300-306. [3] Choi, Dongarra, Pozo, Walker (1992). ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. Proc. Fourth Symp. on the Frontiers of Massively Parallel Computation (McLean, Virginia). [4] Choi, Dongarra, Walker (1992). The design of scalable software libraries for distributed memory concurrent computers. Proc. Environment and Tools for Parallel Scientific Computing Workshop (Saint Hilaire du Touvet, France). [5] Choi, Dongarra, Walker (1994). PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency: Practice and Experience 6, pp. 543-570. doi: 10.1002/cpe.4330060702. [6] Dongarra, van de Geijn, Walker (1992). A look at scalable linear algebra libraries. Proc. 1992 Scalable High Performance Computing Conf., pp. 372-379. [7] Eklundh (1972). A fast computer method for matrix transposing. IEEE Trans. Comput. 21, pp. 801-803. doi: 10.1109/T-C.1972.223584. [8] Golub, Van Loan (1989). Matrix Computations. [9] Intel Corporation (1991). Touchstone Delta Fortran Calls Reference Manual. [10] Johnsson, Ho (1988). Algorithms for matrix transposition on boolean n-cube configured ensemble architecture. SIAM J. Matrix Anal. Appl. 9, pp. 419-454. doi: 10.1137/0609037. [11] Littlefield (1992). Characterizing and tuning communications performance for real applications. Proc. First Intel Delta Application Workshop, CCSF-14-92, pp. 179-190. [12] O'Leary (1987). Systolic arrays for matrix transpose and other reorderings. IEEE Trans. Comput. 36, pp. 117-122. doi: 10.1109/TC.1987.5009457. [13] Strang (1988). Linear Algebra and Its Applications. [14] Takkella, Seidel (1994). Complete exchange and broadcast algorithms for meshes. Proc. Scalable High Performance Computing Conf., pp. 422-428. |
| SSID | ssj0006480 |
| Score | 1.533283 |
| Snippet | This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q... |
| SourceID | pascalfrancis crossref elsevier |
| SourceType | Index Database Enrichment Source Publisher |
| StartPage | 1387 |
| SubjectTerms | Algorithmics. Computability. Computer arithmetics Applied sciences Computer science; control theory; systems Computer systems and distributed systems. User interface Distributed memory multiprocessors Exact sciences and technology Intel Touchstone Delta Linear algebra Matrix transpose algorithm Memory and file management (including protection and security) Memory organisation. Data processing Point-to-point communication Software Theoretical computing |
| Title | Parallel matrix transpose algorithms on distributed memory concurrent computers |
| URI | https://dx.doi.org/10.1016/0167-8191(95)00016-H |
| Volume | 21 |
| WOSCitedRecordID | wos016781919500016H&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVESC databaseName: Elsevier SD Freedom Collection Journals 2021 customDbUrl: eissn: 1872-7336 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0006480 issn: 0167-8191 databaseCode: AIEXJ dateStart: 19950101 isFulltext: true titleUrlDefault: https://www.sciencedirect.com providerName: Elsevier |
| linkProvider | Elsevier |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Parallel+matrix+transpose+algorithms+on+distributed+memory+concurrent+computers&rft.jtitle=Parallel+computing&rft.au=JAYEYOUNG+CHOI&rft.au=DONGARRA%2C+J.+J&rft.au=WALKER%2C+D.+W&rft.date=1995-09-01&rft.pub=Elsevier&rft.issn=0167-8191&rft.volume=21&rft.issue=9&rft.spage=1387&rft.epage=1405&rft_id=info:doi/10.1016%2F0167-8191%2895%2900016-H&rft.externalDBID=n%2Fa&rft.externalDocID=3683593 |