A high-performance matrix transposition for a new MIMD architecture processor PEZY-SC3s

Bibliographic Details
Published in:	CCF transactions on high performance computing (Online), Vol. 7, No. 4, pp. 323-335
Main Authors:	Liang, Yaling; Wang, Qinglin; Yang, Shun; Xia, Rui; Guo, Weihao; Liu, Jie
Format:	Journal Article
Language:	English
Published:	Beijing: Springer Nature B.V., 1 August 2025
Subjects:	Algorithms; Bandwidths; Computation; Design; Energy efficiency; Heuristic methods; High performance computing; Microkernels; Microprocessors; MIMD (computers); Optimization; Performance enhancement; Resource utilization; Supercomputers
ISSN:	2524-4922, 2524-4930
Online Access:	Full text

Description
Abstract:	Matrix transposition is a vital kernel widely used in various fields. However, its memory-intensive nature leads to significant memory access conflicts, making it a performance bottleneck. Optimizing matrix transposition algorithms for specific architectural features is therefore crucial for improving the performance of related applications and enhancing system resource utilization. The PEZY-SC3s, a new MIMD (Multiple Instruction, Multiple Data) architecture processor, has numerous cores and supports SIMD instructions, demonstrating tremendous potential for high-performance computing. However, no existing matrix transposition algorithm is tailored to the PEZY-SC3s architecture to fully exploit its computing potential. We propose a high-performance matrix transposition algorithm for the PEZY-SC3s. First, at the microkernel level, we block the matrix according to the cache architecture to improve the memory access pattern. Then, we separate read and write operations by using the PEZY-SC3s' Local Memory, resolving cache line contention. Finally, we design several processor-level parallel strategies and implement a dynamic selection strategy, based on a performance heuristic algorithm, that chooses among them for different matrix shapes, alleviating bank conflicts and enhancing performance. Experimental results show that our implementation achieves an average speedup of 17.27 times across 60 matrices compared to the baseline algorithm, with a maximum bandwidth utilization of 87.7%.
DOI:	10.1007/s42514-025-00231-4
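
Illustrative note: the abstract's first two steps (blocking the matrix, then separating reads from writes through a staging memory) can be sketched generically as below. This is a minimal sketch under stated assumptions, not the authors' PEZY-SC3s implementation: the function name `transpose_blocked`, the `tile` parameter, and the plain Python list standing in for Local Memory are all hypothetical, and nothing here models the processor-level parallel strategies or the heuristic selection.

```python
def transpose_blocked(a, rows, cols, tile=4):
    """Transpose a row-major rows x cols matrix held in a flat list.

    Hypothetical illustration of tiled transposition with a staging
    buffer; the tile size is an arbitrary assumption, not a value
    taken from the paper.
    """
    out = [0] * (rows * cols)  # result is cols x rows, row-major
    for i0 in range(0, rows, tile):
        i1 = min(i0 + tile, rows)
        for j0 in range(0, cols, tile):
            j1 = min(j0 + tile, cols)
            # Read phase: copy the tile into a small buffer using
            # contiguous row-wise reads (stand-in for local memory).
            buf = [a[i * cols + j0 : i * cols + j1] for i in range(i0, i1)]
            # Write phase: each tile column becomes a contiguous run
            # in one row of the output.
            for j in range(j0, j1):
                base = j * rows + i0
                for k, row in enumerate(buf):
                    out[base + k] = row[j - j0]
    return out


if __name__ == "__main__":
    # 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] stored row-major.
    print(transpose_blocked(list(range(6)), 2, 3, tile=2))
```

Staging each tile means the reads walk the source in contiguous runs and the writes walk the destination in contiguous runs, which is the general motivation behind splitting the two phases; the actual benefit on any given hardware depends on its cache and memory bank layout.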