APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design


Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Tan, Yonghao, Dong, Pingcheng, Wu, Yongkun, Liu, Yu, Liu, Xuejiao, Luo, Peng, Liu, Shih-Yang, Huang, Xijie, Zhang, Dong, Liang, Luhong, Cheng, Kwang-Ting
Format: Conference paper
Language: English
Published: IEEE, 22 June 2025
Online access: Full text
Description
Summary: DNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have marked considerable progress. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight-stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for 69% of power consumption. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. A grouping strategy that combines APSQ with PSUM quantization, enhanced by a reconfigurable architecture, is further proposed. APSQ is nearly lossless on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8, reducing energy costs by a notable 28-87%. Extended experiments on LLaMA2-7B demonstrate the potential of APSQ for large language models. Code is available at https://github.com/Yonghao-Tan/APSQ.
DOI: 10.1109/DAC63849.2025.11133081
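
The record does not reproduce the algorithm itself, but the core idea named in the summary, keeping the running partial sum in INT8 rather than in a wide (e.g., 32-bit) accumulator during tiled accumulation, can be illustrated with a minimal Python/NumPy sketch. Everything here is an illustrative assumption rather than the paper's actual APSQ scheme: the symmetric uniform quantizer, the fixed psum_scale, the tile size, and the function names are all hypothetical.

import numpy as np

def quantize_int8(x, scale):
    # Symmetric uniform quantization to INT8 (illustrative assumption,
    # not the quantizer defined in the paper).
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def tiled_matmul_psum_int8(A, B, tile_k=32, psum_scale=0.5):
    # Accumulate C = A @ B over K-tiles while storing the running partial
    # sum (PSUM) as INT8, so PSUM memory holds 8-bit values instead of
    # high-precision accumulators.
    M, K = A.shape
    _, N = B.shape
    psum_q = np.zeros((M, N), dtype=np.int8)  # quantized partial sums
    for k0 in range(0, K, tile_k):
        tile = A[:, k0:k0 + tile_k] @ B[k0:k0 + tile_k, :]
        # Dequantize the stored PSUM, add the new tile, re-quantize.
        acc = psum_q.astype(np.float32) * psum_scale + tile
        psum_q = quantize_int8(acc, psum_scale)
    return psum_q.astype(np.float32) * psum_scale

# Example: compare the INT8-PSUM result against the exact float product.
A = (np.random.randn(8, 128) * 0.1).astype(np.float32)
B = (np.random.randn(128, 8) * 0.1).astype(np.float32)
print(np.max(np.abs(tiled_matmul_psum_int8(A, B) - A @ B)))

In a hardware accelerator the dequantize/add/re-quantize step would happen in the PE datapath; the point of the sketch is only that each intermediate PSUM write-back is 8 bits wide, which is where the memory-traffic savings described in the summary come from.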