A high-performance matrix transposition for a new MIMD architecture processor PEZY-SC3s


Bibliographic details
Published in: CCF transactions on high performance computing (Online), Volume 7, Issue 4, pp. 323-335
Main authors: Liang, Yaling; Wang, Qinglin; Yang, Shun; Xia, Rui; Guo, Weihao; Liu, Jie
Format: Journal Article
Language: English
Published: Beijing, Springer Nature B.V., 01.08.2025
ISSN: 2524-4922, 2524-4930
Abstract: Matrix transposition is a key kernel used widely across many fields. However, its memory-intensive nature causes significant memory access conflicts, making it a performance bottleneck. Optimizing matrix transposition algorithms around architectural features is therefore crucial for improving the performance of related applications and for raising system resource utilization. The PEZY-SC3s, a new MIMD (Multiple Instruction, Multiple Data) processor, has numerous cores and supports SIMD instructions, giving it considerable potential for high-performance computing. However, no existing matrix transposition algorithm is tailored to the PEZY-SC3s architecture so as to fully exploit that potential. We propose a high-performance matrix transposition algorithm for the PEZY-SC3s. First, at the microkernel level, we block the matrix according to the cache architecture to improve the memory access pattern. Then, we separate read and write operations by using the PEZY-SC3s Local Memory, which resolves cache-line contention. Finally, we design several processor-level parallel strategies and implement a dynamic selection strategy, based on a performance heuristic algorithm, for different matrix shapes, which alleviates bank conflicts and improves performance. Experimental results show that our implementation achieves an average speedup of 17.27 times over the baseline algorithm across 60 matrices, with a maximum bandwidth utilization of 87.7%.
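The first two steps named in the abstract (blocking the matrix, then separating reads from writes through a staging area) can be illustrated with a minimal, generic sketch. This is not the authors' PEZY-SC3s kernel: the function name `transpose_blocked`, the `tile` parameter and its default, and the use of a plain Python list as a stand-in for Local Memory are all assumptions made for illustration only.

```python
def transpose_blocked(src, rows, cols, tile=4):
    """Transpose a row-major rows x cols matrix held in a flat list.

    Illustrative only: `tile` is a hypothetical block size and `staging`
    merely stands in for a fast on-chip buffer such as Local Memory.
    """
    dst = [0] * (rows * cols)  # result is cols x rows, row-major
    for r0 in range(0, rows, tile):
        r1 = min(r0 + tile, rows)
        for c0 in range(0, cols, tile):
            c1 = min(c0 + tile, cols)
            # Read phase: copy the tile into the staging buffer,
            # reading contiguous row segments of the source.
            staging = [src[r * cols + c0:r * cols + c1] for r in range(r0, r1)]
            # Write phase: emit the tile column by column, so each
            # inner loop writes a contiguous segment of the destination.
            for c in range(c0, c1):
                base = c * rows
                for r in range(r0, r1):
                    dst[base + r] = staging[r - r0][c - c0]
    return dst
```

In this sketch both the reads from `src` and the writes to `dst` proceed along contiguous segments, with the strided access confined to the small staging tile. Choosing the tile size for a real device would depend on its cache and memory-bank layout, which the abstract does not specify.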
Copyright: China Computer Federation (CCF) 2025.
DOI: 10.1007/s42514-025-00231-4
Peer reviewed: yes
Scholarly: yes
ORCID (Liang, Yaling): 0009-0001-1207-2170
Subjects: Algorithms; Bandwidths; Computation; Design; Energy efficiency; Heuristic methods; High performance computing; Microkernels; Microprocessors; MIMD (computers); Optimization; Performance enhancement; Resource utilization; Supercomputers
URI: https://www.proquest.com/docview/3241080259