KVO-LLM: Boosting Long-Context Generation Throughput for Batched LLM Inference
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1–7 |
|---|---|
| Main authors: | , , , , , , , , |
| Medium: | Conference paper |
| Language: | English |
| Published: | IEEE, 22 June 2025 |
| Summary: | With the widespread deployment of long-context large language models (LLMs), efficient, high-quality generation is becoming increasingly important. Modern LLMs employ batching and a key-value (KV) cache to improve generation throughput and quality. However, as context lengths and batch sizes grow drastically, the KV cache incurs severe external memory access (EMA) overhead. Recent LLM accelerators suffer substantial processing element (PE) under-utilization due to the low arithmetic intensity of attention with a KV cache, while existing KV cache compression algorithms suffer from hardware inefficiency or significant accuracy degradation. To address these issues, KVO-LLM, an algorithm-architecture co-optimization for long-context batched LLM generation, is proposed. At the algorithm level, we propose a KV cache quantization-aware pruning method that first applies salient-token-aware quantization and then prunes KV channels and tokens via attention-guided pruning based on the salient tokens identified during quantization. While keeping hardware overhead low, our algorithm reduces the EMA of the KV cache by over 91% with significant accuracy advantages over previous KV cache compression algorithms. At the architecture level, we propose a multi-core jointly optimized accelerator that adopts operator fusion and a cross-batch interleaving strategy, maximizing PE and DRAM bandwidth utilization. Compared with state-of-the-art LLM accelerators, KVO-LLM improves generation throughput by up to $7.32\times$ and attains $5.52\sim 8.38\times$ better energy efficiency. (An illustrative sketch of the quantization-aware pruning idea follows this record.) |
| DOI: | 10.1109/DAC63849.2025.11132542 |
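
The abstract describes a two-stage compression algorithm: salient-token-aware quantization followed by attention-guided pruning of KV channels and tokens, with the salient tokens found in the first stage reused to guide the second. As a rough illustration of how such a pipeline might fit together, here is a minimal NumPy sketch; since only the abstract is available here, the function name `compress_kv`, the saliency criterion (accumulated attention weight), the uniform quantizer, and every threshold are assumptions of this sketch, not the authors' actual method.

```python
import numpy as np

def compress_kv(keys, values, attn_scores, n_bits=4, salient_frac=0.05, keep_frac=0.25):
    """Toy two-stage KV cache compression for one attention head.

    keys, values: (T, D) arrays of cached keys/values; attn_scores: (T,)
    accumulated attention weight each cached token has received. All names,
    thresholds, and the quantizer below are illustrative assumptions.
    """
    T, D = keys.shape

    # Stage 1: salient-token-aware quantization. Tokens with the highest
    # accumulated attention are treated as salient and kept in full
    # precision; all other tokens are quantized to n_bits.
    n_salient = max(1, int(T * salient_frac))
    salient = np.argsort(attn_scores)[-n_salient:]
    is_salient = np.zeros(T, dtype=bool)
    is_salient[salient] = True

    def quantize(x):
        # Per-channel symmetric uniform quantization (an assumed scheme).
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(x).max(axis=0, keepdims=True) / qmax
        scale[scale == 0] = 1.0
        return np.round(x / scale).clip(-qmax - 1, qmax) * scale

    k_q = np.where(is_salient[:, None], keys, quantize(keys))
    v_q = np.where(is_salient[:, None], values, quantize(values))

    # Stage 2: attention-guided pruning, reusing the salient tokens found
    # during quantization. First zero out key channels that carry little
    # magnitude on the salient tokens (bottom quartile, an assumption) ...
    ch_energy = np.abs(k_q[salient]).mean(axis=0)
    k_q[:, ch_energy < np.quantile(ch_energy, 0.25)] = 0.0

    # ... then drop the least-attended non-salient tokens entirely.
    n_keep = max(n_salient, int(T * keep_frac))
    keep = np.union1d(np.argsort(attn_scores)[-n_keep:], salient)
    return k_q[keep], v_q[keep], keep

# Usage: 1024 cached tokens compressed to 256, cutting KV memory traffic.
rng = np.random.default_rng(0)
k, v, a = rng.normal(size=(1024, 128)), rng.normal(size=(1024, 128)), rng.random(1024)
k_c, v_c, kept = compress_kv(k, v, a)
print(k_c.shape, v_c.shape)  # (256, 128) (256, 128)
```

Note that this sketch only models the memory-saving logic; the paper's reported gains additionally depend on its accelerator-level techniques (operator fusion and cross-batch interleaving), which have no software analogue here.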