Codeword Decomposition and Recoding Based Lossless Model Compression Algorithm and Its VLSI Decompressor Design

Detailed bibliography
Published in: IEEE International Conference on Artificial Intelligence Circuits and Systems (Online), pp. 1-5
Main authors: Ho, Lyu-Ming; Liang, Chih-Yao; Huang, Juinn-Dar
Format: Conference paper
Language: English
Published: IEEE, 28 April 2025
ISSN: 2834-9857
Description
Summary: Today, generative AI models, including large language models, are extensively used in applications that demand real-time execution and substantial memory capacity, which presents a major challenge for low-power, efficient implementations. Consequently, there is a growing need for model compression techniques that reduce model size. This work presents a new hardware-friendly, lossless model compression algorithm and its VLSI decompressor design to address the issue of limited memory resources. Our approach consists of two key ideas: codeword decomposition and recoding. The codeword decomposition method significantly reduces the complexity of the decompressor and thus increases decompression speed. The recoding method makes the frequency distribution of subwords more uneven, thus improving the compression rate. In addition, we propose a high-speed decompressor architecture design and its efficient hardware implementation. The proposed lossless compression algorithm significantly reduces model size, by 41% and 52% for Llama2-7B and BLOOM-7B, respectively. Moreover, multiple decompressors can be deployed in parallel to boost the overall throughput. Therefore, the proposed lossless compression algorithm and its decompressor design can effectively increase the runtime efficiency of memory-hungry generative AI models.
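
The abstract describes the two ideas only at a high level, so the sketch below is purely illustrative and not the authors' algorithm: it entropy-codes a small block of quantized weights with a Huffman code and then views the coded bitstream as fixed-width subwords, whose skewed frequency histogram is the kind of distribution a recoding stage could exploit for further compression. The weight values, the 4-bit subword width, and all function names are hypothetical choices for this toy example.

```python
# Toy illustration only, NOT the paper's algorithm: Huffman-code quantized
# weights, then chop the coded bitstream into fixed-width subwords whose
# uneven frequency distribution a recoding stage could compress further.
import heapq
from collections import Counter

def huffman_code(freqs):
    """Return a prefix code {symbol: bitstring} for a frequency table."""
    if len(freqs) == 1:                      # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in freqs}
    tie = len(heap)                          # tie-breaker for equal frequencies
    while len(heap) > 1:
        f0, _, s0 = heapq.heappop(heap)
        f1, _, s1 = heapq.heappop(heap)
        for s in s0:
            codes[s] = "0" + codes[s]
        for s in s1:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (f0 + f1, tie, s0 + s1))
        tie += 1
    return codes

def subwords(bitstream, width=4):
    """Chop the coded bitstream into fixed-width subwords (last one padded),
    a rough stand-in for operating on subwords instead of full codewords."""
    padded = bitstream + "0" * (-len(bitstream) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]

# Example: a small block of 4-bit quantized "weights" (hypothetical data).
weights = [0, 0, 1, 0, 2, 0, 1, 3, 0, 0, 1, 0, 0, 2, 0, 0]
codes = huffman_code(Counter(weights))
bitstream = "".join(codes[w] for w in weights)
print("codes:", codes)
print("coded bits:", len(bitstream), "vs raw bits:", 4 * len(weights))
print("subword histogram:", Counter(subwords(bitstream)))
```

In this toy setup the 16 raw weights take 64 bits while the Huffman-coded stream takes 25 bits; the subword histogram it prints is deliberately non-uniform, which is the property the abstract says the recoding step amplifies to improve the compression rate.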
DOI: 10.1109/AICAS64808.2025.11173159