DenSparSA: A Balanced Systolic Array Approach for Dense and Sparse Matrix Multiplication

Bibliographic details
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main authors:	Wang, Ziheng; Sun, Ruiqi; He, Xin; Ma, Tianrui; Zou, An
Format:	Conference paper
Language:	English
Published:	IEEE, 22 June 2025
Subjects:	Acceleration; Computational efficiency; Computer architecture; Distance measurement; Hands; Hardware; Microarchitecture; Neural Network; Neural networks; Sparse matrices; Sparsity; Switching circuits; Systolic arrays

Description
Summary:	Numerous studies have proposed hardware architectures to accelerate sparse matrix multiplication, but these approaches often incur substantial area and power overhead, significantly compromising their usage in dense scenarios. On the other hand, systolic arrays deliver high efficiency for dense matrix operations, but their application to sparse matrices remains challenging. An ideal design should process both dense and sparse matrices with high efficiency to satisfy performance and versatility requirements. In this paper, we introduce DenSparSA, a balanced systolic array centralized architecture that can execute sparse matrix computations with minimal overhead relative to the original dense matrix computations. DenSparSA supports both single-side and dual-side unstructured sparse matrix multiplications with high efficiency. At the same time, the additional hardware required for managing sparsity is compact and decoupled from the conventional systolic array, allowing for minimal power overhead when switched back to dense matrix operations via circuit gating. The proposed design is implemented with Nangate 45 nm. Implementation results show that DenSparSA achieves a speedup ranging from 1.9× to 22× compared to the classic systolic array for sparse workloads, while maintaining relatively low area and power overhead. For dense workloads, the power overhead can be reduced to 12% for BF16 and 5% for FP32. Compared with existing solutions for sparse acceleration, DenSparSA delivers competitive (0.82× to 1.32×) efficiency in sparse scenarios and 1.17× to 2.28× better efficiency in dense scenarios, indicating a better balance between the two.
DOI:	10.1109/DAC63849.2025.11133069