DuoQ: A DSP Utilization-aware and Outlier-free Quantization for FPGA-based LLMs Acceleration

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Yu, Zhuoquan, Ji, Huidong, Cao, Yue, Wu, Junfu, Yan, Xiaoze, Zheng, Lirong, Zou, Zhuo
Format: Conference Proceeding
Language: English
Published: IEEE, 22 June 2025
Online Access: Full text
Description
Summary: Quantization enables efficient deployment of large language models (LLMs) on FPGAs, but the presence of outliers degrades the accuracy of the quantized model. Existing methods mainly handle outliers through channel-wise or token-wise isolation and encoding, which leads to expensive dynamic quantization. To address this problem, we introduce DuoQ, an FPGA-oriented algorithm-hardware co-design framework. In its quantization scheme, DuoQ effectively eliminates outliers through learnable equivalent transformations and low-semantic token awareness, facilitating per-tensor quantization with 4 bits. We co-design the quantization algorithm and the hardware architecture. Specifically, DuoQ accelerates LLMs end to end through a novel DSP-aware PE unit design and encoder design. In addition, two types of post-processing units assist in realizing nonlinear functions and dynamic token awareness. Experimental results show that, compared with platforms with different architectures, DuoQ improves computational efficiency and energy efficiency by up to 8.8× and 23.45×, respectively. DuoQ also achieves accuracy improvements over other outlier-aware software and hardware works.
DOI: 10.1109/DAC63849.2025.11132816
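
Illustrative note: the summary states that outliers hurt accuracy under low-bit quantization. The sketch below is not DuoQ's algorithm; it is a generic symmetric per-tensor 4-bit quantizer, written only to show why a single outlier is a problem when one scale is shared by the whole tensor. All function and variable names (`quantize_per_tensor`, `smooth`, `with_outlier`) are hypothetical and do not come from the paper.

```python
def quantize_per_tensor(values, bits=4):
    """Symmetric per-tensor quantization: one scale shared by every element."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the signed range.
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to real values."""
    return [x * scale for x in q]


def mean_abs_error(values, bits=4):
    """Average reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_per_tensor(values, bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Assumed toy data: 15 small values, then the same values plus one outlier.
smooth = [0.1 * i for i in range(-7, 8)]
with_outlier = smooth + [20.0]

err_smooth = mean_abs_error(smooth)
err_outlier = mean_abs_error(with_outlier)

print(f"error without outlier: {err_smooth:.4f}")
print(f"error with outlier:    {err_outlier:.4f}")
```

In this toy setup the outlier stretches the shared scale so far that every small value rounds to level 0, which is the kind of accuracy loss the summary refers to. Removing or redistributing outliers before quantization, as the summary describes at a high level, keeps the scale small enough for the 4-bit levels to resolve the remaining values.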