Integer Unit-Based Outlier-Aware LLM Accelerator Preserving Numerical Accuracy of FP-FP GEMM

Detailed bibliography
Published in: Proceedings - Design, Automation, and Test in Europe Conference and Exhibition, pp. 1-7
Main authors: Lee, Jehun; Kim, Jae-Joon
Format: Conference paper
Language: English
Published: EDAA, 31.03.2025
ISSN: 1558-1101
Description
Summary: The proliferation of large language models (LLMs) has significantly heightened the importance of quantization to alleviate the computational burden given the surge in the number of parameters. However, quantization often targets only a subset of an LLM and relies on floating-point (FP) arithmetic for matrix multiplication of specific subsets, leading to performance and energy overhead. Additionally, to compensate for the quality degradation incurred by quantization, retraining methods are frequently employed, demanding significant effort and resources. This paper proposes OwL-P, an outlier-aware LLM inference accelerator which preserves the numerical accuracy of FP arithmetic while enhancing hardware efficiency with an integer (INT)-based arithmetic unit for general matrix multiplication (GEMM), through the use of a shared exponent and efficient management of outlier data. It also mitigates off-chip bandwidth requirements by employing a compressed number format. The proposed number format leverages outliers and shared exponents to facilitate the compression of both model weights and activations. We evaluate this work across 10 different transformer-based benchmarks, and the results demonstrate that the proposed integer-based LLM accelerator achieves an average 2.70x performance gain and 3.57x energy savings while maintaining the numerical accuracy of FP arithmetic.
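
The abstract does not spell out the exact OwL-P number format, so the following Python sketch only illustrates the general idea it names: group values under a shared exponent so that GEMM reduces to integer multiply-accumulates, and route the rare large-magnitude outliers through a separate higher-precision FP path. The group size (here, a full row or column), the outlier criterion (a standard-deviation threshold), and the bit widths are all assumptions for illustration, not the paper's design.

```python
import numpy as np

def encode_shared_exponent(v, mant_bits=8):
    # Block-floating-point encoding: one shared exponent per vector,
    # signed integer mantissas. Returns (mantissas, exponent).
    max_mag = float(np.max(np.abs(v)))
    if max_mag == 0.0:
        return np.zeros(v.shape, dtype=np.int64), 0
    # Pick the exponent so the largest value nearly fills the mantissa range
    exp = int(np.floor(np.log2(max_mag))) - (mant_bits - 2)
    mant = np.clip(np.round(v * 2.0 ** -exp),
                   -(2 ** (mant_bits - 1)),
                   2 ** (mant_bits - 1) - 1).astype(np.int64)
    return mant, exp

def outlier_aware_int_gemm(A, B, mant_bits=8, outlier_sigma=3.0):
    # Split A into "inliers" (sent down the INT path) and a sparse set
    # of large-magnitude "outliers" (kept in FP). The 3-sigma threshold
    # is a placeholder; the paper's actual criterion is not given here.
    thresh = outlier_sigma * float(np.std(A))
    mask = np.abs(A) > thresh
    A_in = np.where(mask, 0.0, A)
    A_out = np.where(mask, A, 0.0)

    # Encode each column of B once under its own shared exponent.
    B_enc = [encode_shared_exponent(B[:, j], mant_bits)
             for j in range(B.shape[1])]

    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        a_m, a_e = encode_shared_exponent(A_in[i], mant_bits)
        for j, (b_m, b_e) in enumerate(B_enc):
            # Pure integer multiply-accumulate; one FP scale by the two
            # shared exponents recovers the magnitude afterwards.
            C[i, j] = int(a_m @ b_m) * 2.0 ** (a_e + b_e)

    # Sparse FP path for the outliers preserves numerical accuracy.
    return C + A_out @ B

# Quick check against a plain FP GEMM, with one injected outlier:
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 64))
A[0, 3] = 40.0  # activation outlier
B = rng.standard_normal((64, 8))
print(np.max(np.abs(outlier_aware_int_gemm(A, B) - A @ B)))
```

Separating the outliers before encoding is what keeps the shared exponent small: without it, one large activation would force every inlier in its group onto a coarse grid and destroy the INT path's accuracy.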
DOI: 10.23919/DATE64628.2025.10992868