FSMM: An Efficient Matrix Multiplication Accelerator Supporting Flexible Sparsity
| Published in: | Digest of Technical Papers - IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9 |
|---|---|
| Main Authors: | |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | ACM, 27.10.2024 |
| ISSN: | 1558-2434 |
| Summary: | Sparse matrix multiplication is a critical operation in deep learning. However, matrix sparsity leads to irregular data flow, which degrades the efficiency of matrix multiplication. Traditional accelerators, which add dedicated hardware units to handle this irregularity, often suffer from low hardware utilization. Furthermore, N:M structured sparsity and the corresponding hardware architectures face challenges such as accuracy degradation, limited flexibility, and restricted applicability. In this paper, we propose a Flexible Sparse Matrix Multiplication Accelerator (FSMM), which improves the efficiency of sparse matrix multiplication through both algorithmic-level and hardware-level optimizations. At the algorithmic level, we propose a matrix-matrix multiplication algorithm based on block-level outer products and fine-grained matrix reordering. The reordering balances the sparsity of each column of a matrix block, which improves matrix compression, balances the load, and speeds up computation, reducing storage by 8.2% to 85.9%. At the hardware level, we introduce a flexible matrix multiplication architecture that selects the most suitable data path based on the sparsity of the reordered matrix. FSMM achieves a speedup of 1.90× to 16.18× over a systolic array and 1.70× to 2.87× over the existing TSTC approach. |
|---|---|
| DOI: | 10.1145/3676536.3676663 |
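
The summary above only sketches the two algorithmic ingredients, block-level outer products and fine-grained reordering that balances per-column sparsity. As a rough, non-authoritative illustration of those ideas (not the paper's actual FSMM algorithm), the NumPy sketch below permutes the columns of a sparse operand by nonzero count so that each K-block receives a balanced mix of dense and sparse columns, then accumulates the product block by block as outer products that skip zero entries. The function names, block size, and round-robin heuristic are assumptions made for this example.

```python
import numpy as np


def balance_reorder(A, block_k):
    """Round-robin permutation of A's columns by nonzero count, so that every
    K-block mixes dense and sparse columns (illustrative heuristic only)."""
    nnz_per_col = np.count_nonzero(A, axis=0)
    order = np.argsort(-nnz_per_col)                  # densest columns first
    n_blocks = -(-A.shape[1] // block_k)              # ceiling division
    buckets = [[] for _ in range(n_blocks)]
    for i, col in enumerate(order):                   # deal columns out like cards
        buckets[i % n_blocks].append(col)
    return np.array([c for b in buckets for c in b])


def block_outer_product_matmul(A, B, block_k=4):
    """Compute C = A @ B as a sum of column(A) x row(B) outer products,
    processed K-block by K-block and skipping zero entries of A."""
    M, K = A.shape
    _, N = B.shape
    perm = balance_reorder(A, block_k)
    A_p, B_p = A[:, perm], B[perm, :]                 # same permutation keeps A @ B unchanged
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for k0 in range(0, K, block_k):
        for k in range(k0, min(k0 + block_k, K)):
            col = A_p[:, k]
            nz = np.nonzero(col)[0]                   # skip zero multiplicands
            if nz.size:
                C[nz, :] += np.outer(col[nz], B_p[k, :])
    return C


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((8, 16)) * (rng.random((8, 16)) < 0.3)   # roughly 70% zeros
    B = rng.random((16, 8))
    assert np.allclose(block_outer_product_matmul(A, B), A @ B)
```

Because the same permutation is applied to the columns of A and the rows of B, the product is unchanged; the reordering only affects how evenly nonzeros are spread across blocks, which is the load-balancing property the abstract attributes to the fine-grained reordering step.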