A high-performance matrix transposition for a new MIMD architecture processor PEZY-SC3s

Bibliographic Details
Published in: CCF Transactions on High Performance Computing (Online), Vol. 7, no. 4, pp. 323-335
Main Authors: Liang, Yaling; Wang, Qinglin; Yang, Shun; Xia, Rui; Guo, Weihao; Liu, Jie
Format: Journal Article
Language: English
Published: Beijing: Springer Nature B.V., 01.08.2025
ISSN: 2524-4922, 2524-4930
Description
Summary: Matrix transposition is a vital kernel used in many fields. Because it is memory-intensive, it causes significant memory access conflicts and becomes a performance bottleneck. Tailoring matrix transposition algorithms to architectural features is therefore crucial for improving the performance of related applications and for raising system resource utilization. The PEZY-SC3s, a new MIMD (Multiple Instruction Multiple Data) processor, has numerous cores and supports SIMD instructions, giving it tremendous potential for high-performance computing. However, no existing matrix transposition algorithm is tailored to the PEZY-SC3s architecture, so its computing potential is not fully exploited. We propose a high-performance matrix transposition algorithm for the PEZY-SC3s. First, at the microkernel level, we block the matrix according to the cache architecture to improve the memory access pattern. Then, we separate read and write operations by using the PEZY-SC3s’ Local Memory, which resolves cache-line contention. Finally, we design several processor-level parallel strategies and select among them dynamically, using a performance heuristic algorithm, according to the matrix shape; this alleviates bank conflicts and improves performance. Experimental results show that our implementation achieves an average speedup of 17.27x over the baseline algorithm across 60 matrices, with a maximum bandwidth utilization of 87.7%.
DOI: 10.1007/s42514-025-00231-4
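
The first two steps named in the summary, blocking the matrix and separating reads from writes through a local buffer, can be illustrated with a minimal sketch. This is not the authors' PEZY-SC3s implementation: it is plain single-threaded Python, the function name `transpose_blocked`, the `tile` parameter, and its default value are hypothetical, and the list `buf` only stands in for on-chip Local Memory. It shows the general idea that each tile is first read row by row from the source and only then written row by row to the destination, so both phases touch memory contiguously.

```python
def transpose_blocked(a, rows, cols, tile=4):
    """Transpose a row-major rows x cols matrix stored as a flat list.

    Returns a new flat list holding the cols x rows transpose.
    `tile` is an illustrative block edge length, not a tuned value.
    """
    out = [0] * (rows * cols)
    # Scratch tile; an assumption standing in for a small on-chip buffer.
    buf = [0] * (tile * tile)

    for i0 in range(0, rows, tile):
        h = min(tile, rows - i0)          # tile height (handles ragged edge)
        for j0 in range(0, cols, tile):
            w = min(tile, cols - j0)      # tile width (handles ragged edge)

            # Read phase: copy each source row segment contiguously.
            for i in range(h):
                src = (i0 + i) * cols + j0
                buf[i * tile : i * tile + w] = a[src : src + w]

            # Write phase: emit each destination row segment contiguously.
            for j in range(w):
                dst = (j0 + j) * rows + i0
                out[dst : dst + h] = [buf[i * tile + j] for i in range(h)]
    return out
```

For example, `transpose_blocked([0, 1, 2, 3, 4, 5], 2, 3, tile=2)` treats the input as a 2 x 3 matrix and returns the flat 3 x 2 transpose. The parallel strategies and the heuristic strategy selection described in the summary are not modeled here.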