Splitwise: Efficient Generative LLM Inference Using Phase Splitting

Bibliographic Details
Published in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118-132
Main Authors: Patel, Pratyush, Choukse, Esha, Zhang, Chaojie, Shah, Aashaka, Goiri, Inigo, Maleki, Saeed, Bianchini, Ricardo
Format: Conference Proceeding
Language: English
Published: IEEE, 29.06.2024
Online Access: https://ieeexplore.ieee.org/document/10609649
Abstract Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Unlike prompt computation, token generation does not need the compute capability of the latest GPUs and can be run with lower power and cost. Based on these insights, we propose Splitwise, a model deployment and scheduling technique that splits the two phases of LLM inference requests onto separate machines. Splitwise enables phase-specific resource management using hardware that is well suited for each phase. Request state is transferred efficiently between machines using optimized network libraries on the fast back-plane interconnects available in today's GPU clusters. Using Splitwise, we design homogeneous and heterogeneous LLM inference clusters optimized for throughput, cost, and power. Compared to current designs, Splitwise clusters achieve up to 1.4x higher throughput at 20% lower cost. Alternatively, they can deliver 2.35x more throughput under the same power and cost budgets.
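The abstract describes the core mechanism: the compute-bound prompt (prefill) phase and the memory-bound token generation phase run on different machines, with the request state handed off between them over the cluster interconnect. The Python sketch below only illustrates that routing idea under assumed names (PromptWorker, TokenWorker, SplitScheduler, and the placeholder state dictionary are all hypothetical); it is not code from the paper.

from dataclasses import dataclass, field


@dataclass
class PromptWorker:
    # Compute-optimized machine: runs the compute-intensive prompt phase.
    name: str

    def prefill(self, prompt: str) -> dict:
        # Process all prompt tokens in one parallel pass and return the
        # request state (a stand-in for the KV cache) to hand off.
        return {"prompt_tokens": prompt.split(), "kv_cache": "<tensors>"}


@dataclass
class TokenWorker:
    # Cheaper / lower-power machine: runs the memory-intensive generation phase.
    name: str

    def decode(self, state: dict, max_new_tokens: int) -> list:
        # Generate tokens one at a time from the transferred state.
        return ["tok%d" % i for i in range(max_new_tokens)]


@dataclass
class SplitScheduler:
    # Routes each phase of a request to the machine pool suited for it.
    prompt_pool: list
    token_pool: list
    _next: int = field(default=0)

    def serve(self, prompt: str, max_new_tokens: int) -> list:
        pw = self.prompt_pool[self._next % len(self.prompt_pool)]
        tw = self.token_pool[self._next % len(self.token_pool)]
        self._next += 1
        state = pw.prefill(prompt)               # phase 1: latest-generation GPU
        # In a real deployment, the state transfer would happen here, over the
        # fast back-plane interconnect mentioned in the abstract.
        return tw.decode(state, max_new_tokens)  # phase 2: cheaper GPU


if __name__ == "__main__":
    sched = SplitScheduler([PromptWorker("prompt-0")], [TokenWorker("token-0")])
    print(sched.serve("What is phase splitting?", max_new_tokens=4))

A real system would replace the dictionary handoff with an optimized transfer of KV-cache tensors using the network libraries and back-plane interconnects the abstract refers to.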
Author Patel, Pratyush (University of Washington)
Choukse, Esha (Microsoft)
Zhang, Chaojie (Microsoft)
Shah, Aashaka (Microsoft)
Goiri, Inigo (Microsoft)
Maleki, Saeed (Microsoft)
Bianchini, Ricardo (Microsoft)
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ISCA59077.2024.00019
EISBN 9798350326581
EndPage 132
ExternalDocumentID 10609649
Genre orig-research
ISICitedReferencesCount 53
PageCount 15
PublicationDate 2024-June-29
PublicationTitle 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)
PublicationTitleAbbrev ISCA
PublicationYear 2024
Publisher IEEE
StartPage 118
SubjectTerms Cluster deployments
Computational modeling
Computer architecture
Costs
GPUs
Graphics processing units
Inference efficiency
Large language models
Machine learning
Processor scheduling
Resource management
Scheduling
Throughput
Title Splitwise: Efficient Generative LLM Inference Using Phase Splitting
URI https://ieeexplore.ieee.org/document/10609649