KVO-LLM: Boosting Long-Context Generation Throughput for Batched LLM Inference


Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Li, Zhenyu, Lyu, Dongxu, Wang, Gang, Chen, Yuzhou, Chen, Liyan, Li, Wenjie, Jiang, Jianfei, Sun, Yanan, He, Guanghui
Format: Conference Paper
Language: English
Published: IEEE, 22 June 2025
Description
Summary: With the widespread deployment of long-context large language models (LLMs), efficient and high-quality generation is becoming increasingly important. Modern LLMs employ batching and a key-value (KV) cache to improve generation throughput and quality. However, as the context length and batch size rise drastically, the KV cache incurs severe external memory access (EMA) issues. Recent LLM accelerators suffer substantial processing element (PE) under-utilization due to the low arithmetic intensity of attention with a KV cache, while existing KV cache compression algorithms struggle with hardware inefficiency or significant accuracy degradation. To address these issues, an algorithm-architecture co-optimization, KVO-LLM, is proposed for long-context batched LLM generation. At the algorithm level, we propose a KV cache quantization-aware pruning method that first applies salient-token-aware quantization and then prunes KV channels and tokens via attention-guided pruning based on the salient tokens identified during quantization. With substantial savings in hardware overhead, our algorithm reduces the EMA of the KV cache by over 91% while offering significant accuracy advantages over previous KV cache compression algorithms. At the architecture level, we propose a multi-core jointly optimized accelerator that adopts operator fusion and a cross-batch interleaving strategy, maximizing PE and DRAM bandwidth utilization. Compared to state-of-the-art LLM accelerators, KVO-LLM improves generation throughput by up to 7.32x and attains 5.52x to 8.38x better energy efficiency.
DOI: 10.1109/DAC63849.2025.11132542
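
The paper itself targets a hardware accelerator, but the attention-guided token pruning idea described in the summary can be sketched in software. The snippet below is a minimal, illustrative sketch only, not the authors' algorithm or implementation: it assumes an accumulated per-token attention score for the cached context is available and simply keeps the most-attended (salient) tokens in the KV cache; the function name, parameters, and the keep_ratio value are all hypothetical.

    import numpy as np

    def prune_kv_cache(keys, values, attn_scores, keep_ratio=0.25):
        # keys, values: (num_tokens, head_dim) arrays for one attention head.
        # attn_scores: accumulated attention weight received by each cached
        #              token over recent decode steps, shape (num_tokens,).
        # keep_ratio: hypothetical fraction of cached tokens to retain.
        num_tokens = keys.shape[0]
        num_keep = max(1, int(num_tokens * keep_ratio))
        # Treat the most-attended tokens as salient and keep only those.
        keep_idx = np.argsort(attn_scores)[-num_keep:]
        keep_idx = np.sort(keep_idx)  # preserve original token order
        return keys[keep_idx], values[keep_idx], keep_idx

In the paper, the salient tokens are identified during the salient-token-aware quantization step and the pruning additionally covers KV channels; the sketch above only shows the token-selection aspect in isolation.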