FSMM: An Efficient Matrix Multiplication Accelerator Supporting Flexible Sparsity
| Published in: | Digest of Technical Papers - IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9 |
|---|---|
| Main Authors: | |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | ACM, 27.10.2024 |
| ISSN: | 1558-2434 |
| Summary: | Sparse matrix multiplication is a critical operation in deep learning. However, matrix sparsity leads to irregular data flow, which degrades the efficiency of matrix multiplication. Traditional accelerators, which add dedicated hardware units to handle this irregularity, often suffer from low hardware utilization. Furthermore, N:M structured sparsity and the corresponding hardware architectures face challenges such as accuracy degradation, limited flexibility, and restricted applicability. In this paper, we propose a Flexible Sparse Matrix Multiplication Accelerator (FSMM), which improves the efficiency of sparse matrix multiplication through both algorithmic-level and hardware-level optimizations. At the algorithmic level, we propose a matrix-matrix multiplication algorithm based on block-level outer products and fine-grained matrix reordering. The reordering balances the sparsity of each column of a matrix block, which improves matrix compression, balances the load, and speeds up computation, reducing storage by 8.2% to 85.9%. At the hardware level, we introduce a flexible matrix multiplication architecture that selects the most suitable data path based on the sparsity of the reordered matrix. FSMM achieves a speedup of 1.90× to 16.18× over a systolic array and 1.70× to 2.87× over the existing TSTC approach. |
|---|---|
| DOI: | 10.1145/3676536.3676663 |
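
The summary above only sketches the two algorithmic ingredients, block-level outer products and fine-grained reordering that balances per-column sparsity. As a rough, non-authoritative illustration of those ideas (not the paper's actual FSMM algorithm), the NumPy sketch below permutes the columns of a sparse operand by nonzero count so that each K-block receives a balanced mix of dense and sparse columns, then accumulates the product block by block as outer products that skip zero entries. The function names, block size, and round-robin heuristic are assumptions made for this example.

```python
import numpy as np


def balance_reorder(A, block_k):
    """Round-robin permutation of A's columns by nonzero count, so that every
    K-block mixes dense and sparse columns (illustrative heuristic only)."""
    nnz_per_col = np.count_nonzero(A, axis=0)
    order = np.argsort(-nnz_per_col)                  # densest columns first
    n_blocks = -(-A.shape[1] // block_k)              # ceiling division
    buckets = [[] for _ in range(n_blocks)]
    for i, col in enumerate(order):                   # deal columns out like cards
        buckets[i % n_blocks].append(col)
    return np.array([c for b in buckets for c in b])


def block_outer_product_matmul(A, B, block_k=4):
    """Compute C = A @ B as a sum of column(A) x row(B) outer products,
    processed K-block by K-block and skipping zero entries of A."""
    M, K = A.shape
    _, N = B.shape
    perm = balance_reorder(A, block_k)
    A_p, B_p = A[:, perm], B[perm, :]                 # same permutation keeps A @ B unchanged
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for k0 in range(0, K, block_k):
        for k in range(k0, min(k0 + block_k, K)):
            col = A_p[:, k]
            nz = np.nonzero(col)[0]                   # skip zero multiplicands
            if nz.size:
                C[nz, :] += np.outer(col[nz], B_p[k, :])
    return C


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((8, 16)) * (rng.random((8, 16)) < 0.3)   # roughly 70% zeros
    B = rng.random((16, 8))
    assert np.allclose(block_outer_product_matmul(A, B), A @ B)
```

Because the same permutation is applied to the columns of A and the rows of B, the product is unchanged; the reordering only affects how evenly nonzeros are spread across blocks, which is the load-balancing property the abstract attributes to the fine-grained reordering step.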