Splitwise: Efficient Generative LLM Inference Using Phase Splitting
| Published in: | 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118-132 |
|---|---|
| Main Authors: | Patel, Pratyush (University of Washington); Choukse, Esha (Microsoft); Zhang, Chaojie (Microsoft); Shah, Aashaka (Microsoft); Goiri, Inigo (Microsoft); Maleki, Saeed (Microsoft); Bianchini, Ricardo (Microsoft) |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 29.06.2024 |
| DOI: | 10.1109/ISCA59077.2024.00019 |
| EISBN: | 9798350326581 |
| CODEN: | IEEPAD |
| Subjects: | Cluster deployments; Computational modeling; Computer architecture; Costs; GPUs; Graphics processing units; Inference efficiency; Large language models; Machine learning; Processor scheduling; Resource management; Scheduling; Throughput |
| Online Access: | https://ieeexplore.ieee.org/document/10609649 |
| Abstract: | Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Unlike prompt computation, token generation does not need the compute capability of the latest GPUs and can be run with lower power and cost. Based on these insights, we propose Splitwise, a model deployment and scheduling technique that splits the two phases of LLM inference requests onto separate machines. Splitwise enables phase-specific resource management using hardware that is well suited for each phase. Request state is transferred efficiently between machines using optimized network libraries on the fast back-plane interconnects available in today's GPU clusters. Using Splitwise, we design homogeneous and heterogeneous LLM inference clusters optimized for throughput, cost, and power. Compared to current designs, Splitwise clusters achieve up to 1.4× higher throughput at 20% lower cost. Alternatively, they can deliver 2.35× more throughput under the same power and cost budgets. |
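The abstract outlines the core idea: run the compute-bound prompt phase and the memory-bound token generation phase on machines suited to each, transferring the request's state (its KV cache) between them. Below is a minimal illustrative sketch of that control flow in Python. It is not the paper's implementation; all names (`Request`, `PromptMachine`, `TokenMachine`, `schedule`) are hypothetical placeholders, and the real system would batch requests and move KV-cache tensors over the cluster interconnect rather than passing Python objects.

```python
# Minimal sketch of phase splitting, assuming a pool of prompt-optimized
# machines and a pool of cheaper token-generation machines. All names here
# are hypothetical; they are not from the Splitwise paper.

from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: bytes = b""                       # state produced by the prompt phase
    output: list = field(default_factory=list)  # generated tokens


class PromptMachine:
    """Compute-intensive phase: process the full prompt in one pass,
    producing the KV cache and the first output token."""

    def run_prompt_phase(self, req: Request) -> Request:
        req.kv_cache = f"kv({req.prompt})".encode()  # stand-in for real KV state
        req.output.append("<first-token>")
        return req


class TokenMachine:
    """Memory-intensive phase: generate one token at a time from the
    transferred KV cache; needs less compute, so it can run on older or
    power-capped GPUs."""

    def run_token_phase(self, req: Request) -> Request:
        assert req.kv_cache, "token phase requires the transferred KV cache"
        while len(req.output) < req.max_new_tokens:
            req.output.append("<token>")
        return req


def schedule(req: Request, prompt_pool: list, token_pool: list) -> Request:
    # Phase 1 on a prompt-optimized machine...
    req = prompt_pool[0].run_prompt_phase(req)
    # ...then the request state (KV cache) is handed to a token-generation
    # machine for phase 2. In the real system this transfer happens over the
    # fast back-plane interconnect using optimized network libraries.
    return token_pool[0].run_token_phase(req)


if __name__ == "__main__":
    done = schedule(Request("Hello", max_new_tokens=4),
                    [PromptMachine()], [TokenMachine()])
    print(done.output)  # ['<first-token>', '<token>', '<token>', '<token>']
```

Because the two pools are independent, they can be sized and provisioned separately (e.g., fewer high-end GPUs for prompts, cheaper or lower-power hardware for token generation), which is where the throughput, cost, and power gains reported in the abstract come from.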