An Efficient Bit-level Sparse MAC-accelerated Architecture with SW/HW Co-design on FPGA
Saved in:
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1 - 7 |
|---|---|
| Main authors: | , , , |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 22.06.2025 |
| Summary: | Exploring bit-level sparsity in the MAC process has been proven to be an important method for improving the efficiency of neural network feedforward processing. Reconfigurable platforms offer the possibility of identifying bit-level unstructured redundancy during inference across different DNN models. Significant progress has been made on value-aware accelerators for ASICs, yet few studies target FPGAs. This paper examines the limitations of implementing bit-level sparsity optimizations on FPGAs and proposes a software/architecture co-design solution. Specifically, by introducing a LUT-friendly encoding with adaptable granularity and a hardware structure that tolerates uncertainty in multiplication latency, we achieve a better trade-off between potential redundancy and accuracy while preserving compatibility and scalability. Experiments show that, under exact computation, PEs are up to 2.2× smaller than bit-parallel ones, and our design boosts performance by 1.04× to 1.74× over bit-parallel designs and by 1.40× to 2.79× over Booth-based designs. |
| DOI: | 10.1109/DAC63849.2025.11133249 |
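The summary above centers on skipping redundant bits during multiply-accumulate. As a rough, hypothetical sketch of that general idea (not the paper's actual encoding or PE architecture), the Python snippet below counts the work done by a bit-serial MAC that processes only the set bits of each weight, in contrast to a bit-parallel multiplier whose latency is fixed by the word length; the function name, word length, and data are assumptions for illustration only.

```python
# Minimal illustration of bit-level sparsity in a MAC (hypothetical, not the paper's design).
# A bit-serial MAC iterates only over the set bits of each weight, so sparse bit patterns
# need fewer add-and-shift steps than the fixed word length of a bit-parallel multiplier.

def bit_serial_mac(activations, weights, weight_bits=8):
    """Accumulate sum(a * w) by adding 'a' shifted once per set bit of 'w' (unsigned weights)."""
    acc = 0
    steps = 0  # add-and-shift operations actually performed
    for a, w in zip(activations, weights):
        for b in range(weight_bits):
            if (w >> b) & 1:       # zero bits are skipped entirely
                acc += a << b      # add the activation shifted to the bit position
                steps += 1
    return acc, steps

acts = [3, 7, 1, 5]
wts = [0b00000001, 0b00010000, 0b00000000, 0b10000010]  # mostly-zero bit patterns

result, work = bit_serial_mac(acts, wts)
assert result == sum(a * w for a, w in zip(acts, wts))  # exact, no approximation
print(f"dot product = {result}, add-shift steps = {work}; "
      f"a bit-parallel design spends {len(acts)} full 8-bit multiplies regardless of sparsity")
```

Because the inner loop length depends on each weight's bit pattern, per-multiplication latency varies from operand to operand, which is the kind of timing uncertainty the summary says the hardware structure has to accommodate.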