An Algorithm-Hardware Co-design Based on Revised Microscaling Format Quantization for Accelerating Large Language Models
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 22.06.2025 |
| Subjects: | |
| Online access: | Get full text |
| Summary: | The narrow-bit-width data format is crucial for reducing the computation and storage costs of modern deep learning applications, particularly applications based on large language models (LLMs). The Microscaling (MX) format has been proven to be a drop-in replacement for the baseline FP32 in existing inference frameworks, with low user friction. However, deploying such a new format in existing hardware systems is still challenging, and the dominant solution for low-precision LLM inference is still low-bit quantization. This particularly limits the large-scale real-world deployment of such LLMs. In this work, we propose an algorithm-hardware co-design that adopts a two-level Revised MX Format Quantization (RMFQ) and a Revised MX Format Accelerator (RMFA) architecture. RMFQ introduces the revised MX (RMX) format and provides a novel quantization framework with an innovative grouping direction. RMFA provides an RMX-adaptive hardware architecture and an RMX encoding scheme. As a result, RMFQ pushes 4-bit and 6-bit quantization to a new state of the art, and RMFA surpasses existing outlier-aware accelerators such as OliVe, achieving a 1.28× speedup and a 1.31× energy reduction. |
|---|---|
| DOI: | 10.1109/DAC63849.2025.11132485 |
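
To make the MX concept referenced in the abstract concrete, below is a minimal sketch of MX-style block quantization: each small block of elements shares one power-of-two scale, and the elements themselves are stored at narrow bit width. This illustrates only the general MX idea, not the paper's RMFQ algorithm or RMX format; the block size, bit width, function names, and the use of signed-integer element codes (real MX element types include FP4/FP6/INT8 with an E8M0 shared scale) are assumptions made for illustration.

```python
# A minimal sketch of microscaling (MX)-style block quantization.
# Illustration of the general MX idea only; NOT the paper's RMFQ method.
# Block size, bit width, and integer element codes are assumptions.
import numpy as np

def mx_quantize(x, block_size=32, bits=4):
    """Quantize a 1-D array into MX-style blocks.

    Each block of `block_size` elements shares one power-of-two scale;
    elements are stored as signed integers of `bits` bits.
    Returns (per-block scales, element codes, dequantized reconstruction).
    """
    qmax = 2 ** (bits - 1) - 1                # e.g. 7 for signed 4-bit
    pad = (-len(x)) % block_size              # zero-pad to a whole block
    xp = np.pad(x.astype(np.float32), (0, pad))
    blocks = xp.reshape(-1, block_size)

    # Shared scale: smallest power of two that covers the block's max magnitude.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)  # avoid log2(0) on zero blocks
    scales = 2.0 ** np.ceil(np.log2(max_abs / qmax))

    # Narrow per-element codes, then reconstruct for error inspection.
    codes = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    dequant = (codes * scales).reshape(-1)[: len(x)]
    return scales.squeeze(1), codes, dequant

x = np.random.randn(128).astype(np.float32)
scales, codes, x_hat = mx_quantize(x)
print("max abs error:", np.max(np.abs(x - x_hat)))
```

Storing one shared scale per block rather than per tensor is what lets MX-style formats keep per-element values at 4-6 bits while tracking local dynamic range; the paper's RMFQ additionally revises how groups are formed (its "grouping direction") and how the format is encoded for hardware.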