ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

Bibliographic Details
Published in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1005-1017
Main Authors: Youpeng Zhao (youpeng.zhao@ucf.edu), Di Wu (di.wu@ucf.edu), Jun Wang (jun.wang@ucf.edu), all with the University of Central Florida
Format: Conference Proceeding
Language: English
Published: IEEE, 29 June 2024
DOI: 10.1109/ISCA59077.2024.00077
EISBN: 9798350326581
Funding: National Science Foundation (funder ID 10.13039/100000001)
Subjects: Accuracy; Heuristic algorithms; Large language models; Machine Learning Systems; Memory management; Tensors; Throughput; Transformers
Online Access: https://ieeexplore.ieee.org/document/10609626
Abstract: The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenges in practical inference, owing to their compute- and memory-intensive nature. Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation with linear-complexity memory accesses. Yet, this approach requires increasing memory as demand grows for processing longer sequences. The overhead leads to reduced throughput due to I/O bottlenecks and even out-of-memory errors, particularly on resource-constrained systems like a single commodity GPU. In this paper, we propose ALISA, a novel algorithm-system co-design solution to address the challenges imposed by KV caching. On the algorithm level, ALISA prioritizes the tokens that are most important in generating a new token via a Sparse Window Attention (SWA) algorithm. SWA introduces high sparsity in attention layers and reduces the memory footprint of KV caching at negligible accuracy loss. On the system level, ALISA employs three-phase token-level dynamic scheduling and optimizes the trade-off between caching and recomputation, thus maximizing the overall performance in resource-constrained systems. In a single GPU-CPU system, we demonstrate that under varying workloads, ALISA improves the throughput of baseline systems such as FlexGen and vLLM by up to 3× and 1.9×, respectively.
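
Note (added for illustration): the abstract's point that KV caching replaces quadratic-complexity computation with linear-complexity memory accesses can be seen in the minimal Python/NumPy sketch below. The sketch is not taken from the paper; the single-head setup, toy weights, and head dimension are assumptions chosen only to show that, with the cache, each decoding step appends one key/value row and performs a linear-cost attention read instead of recomputing keys and values for the whole history.

# Minimal sketch (not the authors' code): single-head autoregressive decoding
# with a KV cache. Without the cache, every step would recompute K and V for
# all previous tokens, giving quadratic total work; with the cache, each step
# appends one new K/V row and does a linear-cost attention read.
import numpy as np

d = 64                       # head dimension (illustrative)
W_q = np.random.randn(d, d)  # toy projection weights
W_k = np.random.randn(d, d)
W_v = np.random.randn(d, d)

k_cache, v_cache = [], []    # the KV cache grows by one entry per generated token

def decode_step(x_t):
    """Attend the new token embedding x_t (shape [d]) over all cached tokens."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)          # append instead of recomputing history
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)              # [t, d]
    V = np.stack(v_cache)              # [t, d]
    scores = K @ q / np.sqrt(d)        # [t]: cost linear in sequence length
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over cached positions
    return weights @ V                 # attention output for this step

for _ in range(8):                     # toy autoregressive loop
    out = decode_step(np.random.randn(d))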
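
Note (added for illustration): the abstract describes Sparse Window Attention (SWA) only at a high level, as prioritizing the tokens most important for generating a new token. The sketch below shows one plausible selection policy in that spirit: keep a window of recent tokens plus the older tokens that have received the most attention. The function name, budget parameters, and use of accumulated attention weights are hypothetical and do not reproduce the paper's actual SWA algorithm.

# Illustrative sketch only: bounding the KV cache with a sparse-window-style
# policy that keeps the most recent tokens plus the historically most-attended
# older tokens. This is an assumption about the general idea, not the paper's
# exact method; `sparse_window_select`, `recent`, and `top_k` are hypothetical.
import numpy as np

def sparse_window_select(attn_history, recent=4, top_k=4):
    """Return indices of cached tokens to keep.

    attn_history: array of accumulated attention weight each cached token
                  has received so far (one entry per token).
    """
    n = len(attn_history)
    keep = set(range(max(0, n - recent), n))   # local window of recent tokens
    older = [i for i in range(n) if i not in keep]
    # among older tokens, keep those the attention weights deem most important
    ranked = sorted(older, key=lambda i: attn_history[i], reverse=True)
    keep.update(ranked[:top_k])
    return sorted(keep)

# Example: 12 cached tokens; the policy keeps the 4 newest plus the 4 most-
# attended older tokens, so the cache is bounded at 8 entries.
history = np.random.rand(12)
print(sparse_window_select(history))

Under such a policy the cache size stays bounded at recent + top_k entries regardless of sequence length, which illustrates the kind of KV-cache memory-footprint reduction the abstract describes.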