BLOOM: Bit-Slice Framework for DNN Acceleration with Mixed-Precision


Detailed bibliography
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main authors: Liu, Fangxin, Yang, Ning, Wang, Zongwu, Zhu, Xuanpeng, Yao, Haidong, Xiong, Xiankui, Jiang, Li, Guan, Haibing
Format: Conference paper
Language: English
Published: IEEE, 22 June 2025
Description
Summary: Deep neural networks (DNNs) have revolutionized numerous AI applications, but their vast model sizes and limited hardware resources present significant deployment challenges. Model quantization offers a promising solution to bridge the gap between DNN size and hardware capacity. While INT8 quantization has been widely used, recent research has pushed for even lower precision, such as INT4. However, the presence of outliers (values with unusually large magnitudes) limits the effectiveness of current quantization techniques. Previous compression-based acceleration methods that incorporate outlier-aware encoding introduce complex logic. A critical issue we have identified is that serialization and deserialization dominate the encoding/decoding time in these compression workflows, leading to substantial performance penalties during workflow execution. To address this challenge, we introduce a novel computing approach and a compatible architecture design named "BLOOM". BLOOM leverages the strengths of the bit-slicing method, effectively combining structured mixed precision and bit-level sparsity with adaptive dataflow techniques. The key insight of BLOOM is that outliers require higher precision, while normal values can be processed at lower precision. By interleaving 4-bit values, we efficiently exploit the inherent sparsity in the high-precision components. As a result, the BLOOM-based accelerator outperforms existing outlier-aware accelerators with an average 1.2×-4.0× speedup and a 24.6%-71.3% energy reduction, without loss of model accuracy.
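The bit-slicing idea summarized above can be sketched in isolation. The following Python fragment is a generic illustration under assumed semantics (splitting signed INT8 weights into a low 4-bit slice and a high 4-bit slice), not the paper's actual BLOOM encoding or dataflow; the function name and the toy tensor are hypothetical. It shows the property the abstract relies on: for typical weight distributions, normal values fit in the low slice, so the high-precision slice is almost entirely trivial and can be exploited as bit-level sparsity.

```python
import numpy as np

def bit_slice_int8(values):
    """Split signed INT8 values into low and high 4-bit slices.

    Reconstruction invariant: value == high * 16 + low, with low in [0, 15].
    Illustrative only -- this is not BLOOM's actual encoding.
    """
    v = values.astype(np.int16)
    low = v & 0xF      # low nibble, unsigned 0..15
    high = v >> 4      # arithmetic shift keeps the sign in the high slice
    return high, low

# Hypothetical "weight" tensor: mostly small (normal) values plus injected outliers.
rng = np.random.default_rng(0)
w = np.clip(rng.normal(0, 2, 64).round(), -128, 127).astype(np.int8)
w[::16] = 100          # a few outliers of large magnitude

high, low = bit_slice_int8(w)

# The two slices exactly reconstruct the original values.
assert np.array_equal(high * 16 + low, w.astype(np.int16))

# Normal values in [-8, 7] leave only 0 or -1 (sign extension) in the high
# slice, so the high-precision component is highly sparse -- the property an
# outlier-aware accelerator can exploit.
trivial = np.isin(high, (0, -1)).mean()
print(f"fraction of trivial high slices: {trivial:.2f}")
```

In this sketch only the outlier positions carry information in the high slice; an accelerator that skips the trivial high slices processes normal values at 4-bit precision while still handling outliers exactly.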
DOI: 10.1109/DAC63849.2025.11133246