An Algorithm-Hardware Co-design Based on Revised Microscaling Format Quantization for Accelerating Large Language Models

Bibliographic Details
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors:	Hao, Yingbo; Chen, Huangxu; Zou, Yi; Yang, Yanfeng
Format:	Conference paper
Language:	English
Published:	IEEE, 22 June 2025
Subjects:	Accelerator; Computer architecture; Deep learning; Design automation; Encoding; Energy efficiency; Friction; Hardware; Inference algorithms; Large Language Model (LLM); Large language models; Quantization (signal); Revised Microscaling (RMX) Format; Two-Level Quantization

Description
Summary:	Narrow-bit-width data formats are crucial for reducing the computation and storage costs of modern deep learning applications, particularly those based on large language models (LLMs). The Microscaling (MX) format has been proven as a drop-in replacement for the baseline FP32 in existing inference frameworks, with low user friction. However, deploying such a new format on existing hardware systems is still challenging, and the dominant solution for low-precision LLM inference is still low-bit quantization. This particularly limits the strategic applications of such LLMs in real deployment on a large scale. In this work, we propose an algorithm-hardware co-design that combines a two-level Revised MX Format Quantization (RMFQ) with a Revised MX Format Accelerator (RMFA) architecture. RMFQ introduces the revised MX (RMX) format and provides a novel quantization framework with innovative group direction. RMFA provides an RMX-adaptive hardware architecture and an RMX encoding scheme. As a result, RMFQ pushes 4-bit and 6-bit quantization to a new state of the art, and RMFA surpasses existing outlier-aware accelerators such as OliVe, achieving a 1.28× speedup and a 1.31× energy reduction.
DOI:	10.1109/DAC63849.2025.11132485