KVO-LLM: Boosting Long-Context Generation Throughput for Batched LLM Inference

With the widespread deployment of long-context large language models (LLMs), efficient and high-quality generation is becoming increasingly important. Modern LLMs employ batching and key-value (KV) cache to improve generation throughput and quality. However, as the context length and batch size rise...

Full Description

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1 - 7
Main Authors: Li, Zhenyu, Lyu, Dongxu, Wang, Gang, Chen, Yuzhou, Chen, Liyan, Li, Wenjie, Jiang, Jianfei, Sun, Yanan, He, Guanghui
Format: Conference Proceeding
Language: English
Published: IEEE 22.06.2025
Subjects:
Online Access: Full text
Abstract With the widespread deployment of long-context large language models (LLMs), efficient and high-quality generation is becoming increasingly important. Modern LLMs employ batching and a key-value (KV) cache to improve generation throughput and quality. However, as context length and batch size rise drastically, the KV cache incurs severe external memory access (EMA) issues. Recent LLM accelerators suffer substantial processing element (PE) under-utilization due to the low arithmetic intensity of attention with a KV cache, while existing KV cache compression algorithms struggle with hardware inefficiency or significant accuracy degradation. To address these issues, an algorithm-architecture co-optimization, KVO-LLM, is proposed for long-context batched LLM generation. At the algorithm level, we propose a KV cache quantization-aware pruning method that first applies salient-token-aware quantization and then prunes KV channels and tokens via attention-guided pruning based on the salient tokens identified during quantization. While achieving substantial savings in hardware overhead, our algorithm reduces the EMA of the KV cache by over 91% and offers significant accuracy advantages over previous KV cache compression algorithms. At the architecture level, we propose a multi-core jointly optimized accelerator that adopts operator fusion and a cross-batch interleaving strategy, maximizing PE and DRAM bandwidth utilization. Compared to state-of-the-art LLM accelerators, KVO-LLM improves generation throughput by up to 7.32× and attains 5.52×–8.38× better energy efficiency.
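As a reading aid, the following minimal Python sketch illustrates the two-stage KV cache compression idea described in the abstract: salient-token-aware quantization followed by attention-guided pruning of cached tokens. It is not the authors' implementation; the bit-width, keep ratios, the per-tensor symmetric quantizer, and the use of accumulated attention mass as the saliency signal are illustrative assumptions, and for brevity the sketch quantizes only keys and prunes only along the token axis (the paper also prunes KV channels).

import numpy as np

def quantize_aware_prune(keys, values, attn_scores,
                         salient_frac=0.05, keep_frac=0.25, n_bits=4):
    """keys, values: (seq_len, head_dim) arrays for one attention head.
    attn_scores: (seq_len,) accumulated attention mass per cached token."""
    seq_len = keys.shape[0]

    # Stage 1: salient-token-aware quantization. Tokens with the largest
    # accumulated attention stay at full precision; all others are quantized
    # with a simple per-tensor symmetric quantizer (an assumption).
    n_salient = max(1, int(salient_frac * seq_len))
    salient = np.argsort(attn_scores)[-n_salient:]
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(keys).max() / qmax
    q_keys = np.clip(np.round(keys / scale), -qmax - 1, qmax) * scale
    q_keys[salient] = keys[salient]  # salient tokens kept unquantized
    # (values would be handled analogously; omitted to keep the sketch short)

    # Stage 2: attention-guided pruning. Keep only the tokens with the
    # highest attention mass, always retaining the salient set from stage 1.
    n_keep = max(n_salient, int(keep_frac * seq_len))
    keep = np.union1d(np.argsort(attn_scores)[-n_keep:], salient)
    return q_keys[keep], values[keep], keep

# Toy usage: 128 cached tokens, head dimension 64.
rng = np.random.default_rng(0)
K = rng.standard_normal((128, 64)).astype(np.float32)
V = rng.standard_normal((128, 64)).astype(np.float32)
scores = rng.random(128)
K_c, V_c, kept = quantize_aware_prune(K, V, scores)
print(K_c.shape, V_c.shape)  # e.g. (32, 64) (32, 64)

In a batched long-context setting, such a routine would run per head and per sequence before attention, shrinking the KV tensors that must be fetched from DRAM; the EMA reduction and accuracy figures quoted in the abstract refer to the authors' full method, not to this sketch.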
Author Li, Zhenyu
Li, Wenjie
Jiang, Jianfei
Chen, Liyan
Wang, Gang
Lyu, Dongxu
Sun, Yanan
Chen, Yuzhou
He, Guanghui
Author_xml – sequence: 1
  givenname: Zhenyu
  surname: Li
  fullname: Li, Zhenyu
  email: ambitious-lzy@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 2
  givenname: Dongxu
  surname: Lyu
  fullname: Lyu, Dongxu
  email: lvdongxu@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 3
  givenname: Gang
  surname: Wang
  fullname: Wang, Gang
  email: wangganginjstu@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 4
  givenname: Yuzhou
  surname: Chen
  fullname: Chen, Yuzhou
  email: Huygens@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 5
  givenname: Liyan
  surname: Chen
  fullname: Chen, Liyan
  email: liyan.chen@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 6
  givenname: Wenjie
  surname: Li
  fullname: Li, Wenjie
  email: wenjieli@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 7
  givenname: Jianfei
  surname: Jiang
  fullname: Jiang, Jianfei
  email: jiangjianfei@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 8
  givenname: Yanan
  surname: Sun
  fullname: Sun, Yanan
  email: sunyanan@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
– sequence: 9
  givenname: Guanghui
  surname: He
  fullname: He, Guanghui
  email: guanghui.he@sjtu.edu.cn
  organization: Shanghai Jiao Tong University,State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering,Shanghai,China
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/DAC63849.2025.11132542
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798331503048
EndPage 7
ExternalDocumentID 11132542
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation
  funderid: 10.13039/100000001
GroupedDBID 6IE
6IH
CBEJK
RIE
RIO
ID FETCH-LOGICAL-a179t-e6b1a55b411589a4f57d6705e7b95114f3e76a17f648cbbba963d1e5144418563
IEDL.DBID RIE
IngestDate Wed Oct 01 07:05:15 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a179t-e6b1a55b411589a4f57d6705e7b95114f3e76a17f648cbbba963d1e5144418563
PageCount 7
ParticipantIDs ieee_primary_11132542
PublicationCentury 2000
PublicationDate 2025-June-22
PublicationDateYYYYMMDD 2025-06-22
PublicationDate_xml – month: 06
  year: 2025
  text: 2025-June-22
  day: 22
PublicationDecade 2020
PublicationTitle 2025 62nd ACM/IEEE Design Automation Conference (DAC)
PublicationTitleAbbrev DAC
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
Score 2.2954183
Snippet With the widespread deployment of long-context large language models (LLMs), efficient and high-quality generation is becoming increasingly important. Modern...
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Accuracy
Algorithm-architecture co-design
Bandwidth
Batching
Compression algorithms
Energy efficiency
Inference algorithms
KV cache
Large language model
Large language models
Long-context
Measurement
Quantization (signal)
Random access memory
Throughput
Title KVO-LLM: Boosting Long-Context Generation Throughput for Batched LLM Inference
URI https://ieeexplore.ieee.org/document/11132542
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE