A high-performance matrix transposition for a new MIMD architecture processor PEZY-SC3s

Bibliographic details
Published in:	CCF Transactions on High Performance Computing (Online), Vol. 7, No. 4, pp. 323-335
Main authors:	Liang, Yaling; Wang, Qinglin; Yang, Shun; Xia, Rui; Guo, Weihao; Liu, Jie
Format:	Journal Article
Language:	English
Published:	Beijing: Springer Nature B.V., 1 August 2025
Subjects:	Algorithms; Bandwidths; Computation; Design; Energy efficiency; Heuristic methods; High performance computing; Microkernels; Microprocessors; MIMD (computers); Optimization; Performance enhancement; Resource utilization; Supercomputers
ISSN:	2524-4922, 2524-4930

Abstract	Matrix transposition is a vital kernel widely used in various fields. However, its memory-intensive nature leads to significant memory access conflicts, making it a performance bottleneck. Optimizing matrix transposition algorithms for specific architectural features is therefore crucial for improving the performance of related applications and enhancing system resource utilization. The PEZY-SC3s, a new MIMD (Multiple Instruction Multiple Data) architecture processor, has numerous cores and supports SIMD instructions, demonstrating tremendous potential for high-performance computing. However, no existing matrix transposition algorithm is tailored to the PEZY-SC3s architecture so as to fully leverage its computing potential. We propose a high-performance matrix transposition algorithm for PEZY-SC3s. First, we block the matrix according to the cache architecture at the microkernel level to improve the memory access pattern. Then, we separate read and write operations by using the PEZY-SC3s’ Local Memory, resolving cache line contention. Finally, we design various processor-level parallel strategies and implement a dynamic selection strategy, based on a performance heuristic algorithm, for different matrix shapes, alleviating bank conflicts and enhancing performance. Experimental results show that our implementation achieves an average speedup of 17.27 times across 60 matrices compared to the baseline algorithm, with a maximum bandwidth utilization of 87.7%.
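Illustrative note	The first two steps named in the abstract (blocking the matrix, then separating reads from writes through a local buffer) can be sketched in plain C. This is a generic, single-threaded illustration and not the authors' PEZY-SC3s implementation: the function name `transpose_tiled`, the tile edge `TILE`, and the on-stack array standing in for Local Memory are all assumptions made for this sketch, and the parallel strategies and heuristic selection from the abstract are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge; a real kernel would tune it to the cache line size. */
#define TILE 16

/* Transpose a rows x cols row-major matrix `src` into the cols x rows
 * row-major matrix `dst`. Out-of-place only. */
void transpose_tiled(const float *src, float *dst, size_t rows, size_t cols)
{
    float local[TILE][TILE]; /* stand-in for a per-core scratchpad buffer */

    assert(src != dst); /* in-place transposition is not handled here */

    for (size_t i0 = 0; i0 < rows; i0 += TILE) {
        size_t ih = (rows - i0 < TILE) ? rows - i0 : TILE; /* tile height */
        for (size_t j0 = 0; j0 < cols; j0 += TILE) {
            size_t jw = (cols - j0 < TILE) ? cols - j0 : TILE; /* tile width */

            /* Read phase: contiguous row reads from src into the buffer. */
            for (size_t i = 0; i < ih; ++i)
                for (size_t j = 0; j < jw; ++j)
                    local[i][j] = src[(i0 + i) * cols + (j0 + j)];

            /* Write phase: contiguous row writes to dst; the strided
             * access is confined to the small local buffer. */
            for (size_t j = 0; j < jw; ++j)
                for (size_t i = 0; i < ih; ++i)
                    dst[(j0 + j) * rows + (i0 + i)] = local[i][j];
        }
    }
}
```

The point of the two phases is that both the source and the destination are touched along contiguous rows, while the column-wise access that a naive transpose inflicts on main memory happens only inside the tile buffer.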
Copyright	China Computer Federation (CCF) 2025.
DOI	10.1007/s42514-025-00231-4
IsPeerReviewed	true
IsScholarly	true
ORCID	0009-0001-1207-2170 (Liang, Yaling)
URI	https://www.proquest.com/docview/3241080259