KVO-LLM: Boosting Long-Context Generation Throughput for Batched LLM Inference
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | Li, Zhenyu; Lyu, Dongxu; Wang, Gang; Chen, Yuzhou; Chen, Liyan; Li, Wenjie; Jiang, Jianfei; Sun, Yanan; He, Guanghui |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 22 June 2025 |
| Online access: | Full text |
| Abstract | With the widespread deployment of long-context large language models (LLMs), efficient and high-quality generation is becoming increasingly important. Modern LLMs employ batching and a key-value (KV) cache to improve generation throughput and quality. However, as context lengths and batch sizes rise drastically, the KV cache incurs severe external memory access (EMA) overhead. Recent LLM accelerators suffer substantial processing element (PE) under-utilization due to the low arithmetic intensity of attention with a KV cache, while existing KV cache compression algorithms struggle with hardware inefficiency or significant accuracy degradation. To address these issues, an algorithm-architecture co-optimization, KVO-LLM, is proposed for long-context batched LLM generation. At the algorithm level, we propose a KV cache quantization-aware pruning method that first applies salient-token-aware quantization and then prunes KV channels and tokens via attention-guided pruning based on the salient tokens identified during quantization. Achieving substantial savings in hardware overhead, our algorithm reduces the EMA of the KV cache by over 91% while offering significant accuracy advantages over previous KV cache compression algorithms. At the architecture level, we propose a multi-core jointly optimized accelerator that adopts operator fusion and a cross-batch interleaving strategy, maximizing PE and DRAM bandwidth utilization. Compared to state-of-the-art LLM accelerators, KVO-LLM improves generation throughput by up to 7.32× and attains 5.52×–8.38× better energy efficiency. |
|---|---|
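For readers who want a concrete picture of the algorithm-level idea, the sketch below illustrates one plausible reading of the abstract's two-stage pipeline: salient-token-aware quantization followed by attention-guided pruning. It is an assumption-laden illustration, not the paper's code: the salience criterion (accumulated attention), the per-token symmetric quantizer, and all names and parameters (`compress_kv_cache`, `n_salient`, `keep_ratio`, `n_bits`) are hypothetical, and the KV *channel* pruning the paper also performs is omitted here.

```python
# A minimal sketch (not the authors' implementation) of KV cache
# quantization-aware pruning as described in the abstract. All concrete
# choices below are illustrative assumptions.
import numpy as np

def compress_kv_cache(keys, values, attn_scores,
                      n_salient=32, keep_ratio=0.5, n_bits=4):
    """keys, values: (seq_len, head_dim) arrays for one head;
    attn_scores: (seq_len,) accumulated attention each cached token
    has received. Returns compressed K/V and the kept positions."""
    seq_len = keys.shape[0]

    # Stage 1: salient-token-aware quantization. The most-attended
    # tokens stay in full precision; all others are quantized to n_bits.
    salient = np.argsort(attn_scores)[-n_salient:]
    is_salient = np.zeros(seq_len, dtype=bool)
    is_salient[salient] = True

    def fake_quant(x, bits):
        # Per-token symmetric uniform quantization (dequantized view,
        # so the accuracy effect can be simulated in float).
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(x).max(axis=1, keepdims=True) / qmax
        scale[scale == 0] = 1.0
        return np.round(x / scale) * scale

    k_c = keys.astype(np.float32).copy()
    v_c = values.astype(np.float32).copy()
    if (~is_salient).any():
        k_c[~is_salient] = fake_quant(k_c[~is_salient], n_bits)
        v_c[~is_salient] = fake_quant(v_c[~is_salient], n_bits)

    # Stage 2: attention-guided token pruning. Dropping the least-
    # attended tokens is what cuts external memory access: pruned
    # tokens are never fetched from DRAM again during decoding.
    n_keep = max(n_salient, int(seq_len * keep_ratio))
    kept = np.sort(np.argsort(attn_scores)[-n_keep:])  # keep positional order
    return k_c[kept], v_c[kept], kept
```

Under this reading, quantizing before pruning is natural: the salient tokens identified for full-precision treatment in stage 1 double as the anchors that guide which tokens (and, in the paper, channels) can be safely pruned in stage 2.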
| Author | Li, Zhenyu (ambitious-lzy@sjtu.edu.cn); Lyu, Dongxu (lvdongxu@sjtu.edu.cn); Wang, Gang (wangganginjstu@sjtu.edu.cn); Chen, Yuzhou (Huygens@sjtu.edu.cn); Chen, Liyan (liyan.chen@sjtu.edu.cn); Li, Wenjie (wenjieli@sjtu.edu.cn); Jiang, Jianfei (jiangjianfei@sjtu.edu.cn); Sun, Yanan (sunyanan@sjtu.edu.cn); He, Guanghui (guanghui.he@sjtu.edu.cn). All authors: Shanghai Jiao Tong University, State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering, Shanghai, China. |
| DOI | 10.1109/DAC63849.2025.11132542 |
| EISBN | 9798331503048 |
| Genre | orig-research |
| GrantInformation | National Science Foundation (funder ID: 10.13039/100000001) |
| IsPeerReviewed | false |
| IsScholarly | true |
| SubjectTerms | Accuracy; Algorithm-architecture co-design; Bandwidth; Batching; Compression algorithms; Energy efficiency; Inference algorithms; KV cache; Large language models; Long-context; Measurement; Quantization (signal); Random access memory; Throughput |
| URI | https://ieeexplore.ieee.org/document/11132542 |