DuoQ: A DSP Utilization-aware and Outlier-free Quantization for FPGA-based LLMs Acceleration

Detailed bibliography
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main authors:	Yu, Zhuoquan; Ji, Huidong; Cao, Yue; Wu, Junfu; Yan, Xiaoze; Zheng, Lirong; Zou, Zhuo
Format:	Conference paper
Language:	English
Published:	IEEE, 22 June 2025
Subjects:	Accuracy; algorithm-hardware co-design; Computational efficiency; Computer architecture; DSP utilization; Energy efficiency; Field programmable gate arrays; FPGA; Hardware; Heuristic algorithms; Inference algorithms; LLMs; Quantization (signal); Software algorithms; W4A4 quantization

Description
Summary:	Quantization enables efficient deployment of large language models (LLMs) on FPGAs, but the presence of outliers degrades the accuracy of the quantized model. Existing methods mainly deal with outliers through channel-wise or token-wise isolation and encoding, which leads to expensive dynamic quantization. To address this problem, we introduce DuoQ, an FPGA-oriented algorithm-hardware co-design framework. In its quantization scheme, DuoQ effectively eliminates outliers through learnable equivalent transformations and low-semantic token awareness, facilitating 4-bit per-tensor quantization. We co-design the quantization algorithm and the hardware architecture. Specifically, DuoQ accelerates end-to-end LLM inference through a novel DSP-aware PE unit design and encoder design. In addition, two types of post-processing units assist in the realization of nonlinear functions and dynamic token awareness. Experimental results show that, compared with platforms of different architectures, DuoQ improves computational efficiency and energy efficiency by up to 8.8× and 23.45×, respectively. In addition, DuoQ achieves accuracy improvements over other outlier-aware software and hardware works.
DOI:	10.1109/DAC63849.2025.11132816