Splitwise: Efficient Generative LLM Inference Using Phase Splitting
| Published in: | 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118-132 |
|---|---|
| Main Authors: | Patel, Pratyush (University of Washington); Choukse, Esha (Microsoft); Zhang, Chaojie (Microsoft); Shah, Aashaka (Microsoft); Goiri, Inigo (Microsoft); Maleki, Saeed (Microsoft); Bianchini, Ricardo (Microsoft) |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 29.06.2024 |
| DOI: | 10.1109/ISCA59077.2024.00019 |
| EISBN: | 9798350326581 |
| CODEN: | IEEPAD |
| Subjects: | Cluster deployments; Computational modeling; Computer architecture; Costs; GPUs; Graphics processing units; Inference efficiency; Large language models; Machine learning; Processor scheduling; Resource management; Scheduling; Throughput |
| Online Access: | https://ieeexplore.ieee.org/document/10609649 |
| Abstract: | Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Unlike prompt computation, token generation does not need the compute capability of the latest GPUs and can be run with lower power and cost. Based on these insights, we propose Splitwise, a model deployment and scheduling technique that splits the two phases of LLM inference requests onto separate machines. Splitwise enables phase-specific resource management using hardware that is well suited for each phase. Request state is transferred efficiently between machines using optimized network libraries on the fast back-plane interconnects available in today's GPU clusters. Using Splitwise, we design homogeneous and heterogeneous LLM inference clusters optimized for throughput, cost, and power. Compared to current designs, Splitwise clusters achieve up to 1.4× higher throughput at 20% lower cost. Alternatively, they can deliver 2.35× more throughput under the same power and cost budgets. |
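The abstract outlines the core idea: run the compute-bound prompt phase and the memory-bound token generation phase on machines suited to each, transferring the request's state (its KV cache) between them. Below is a minimal illustrative sketch of that control flow in Python. It is not the paper's implementation; all names (`Request`, `PromptMachine`, `TokenMachine`, `schedule`) are hypothetical placeholders, and the real system would batch requests and move KV-cache tensors over the cluster interconnect rather than passing Python objects.

```python
# Minimal sketch of phase splitting, assuming a pool of prompt-optimized
# machines and a pool of cheaper token-generation machines. All names here
# are hypothetical; they are not from the Splitwise paper.

from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: bytes = b""                       # state produced by the prompt phase
    output: list = field(default_factory=list)  # generated tokens


class PromptMachine:
    """Compute-intensive phase: process the full prompt in one pass,
    producing the KV cache and the first output token."""

    def run_prompt_phase(self, req: Request) -> Request:
        req.kv_cache = f"kv({req.prompt})".encode()  # stand-in for real KV state
        req.output.append("<first-token>")
        return req


class TokenMachine:
    """Memory-intensive phase: generate one token at a time from the
    transferred KV cache; needs less compute, so it can run on older or
    power-capped GPUs."""

    def run_token_phase(self, req: Request) -> Request:
        assert req.kv_cache, "token phase requires the transferred KV cache"
        while len(req.output) < req.max_new_tokens:
            req.output.append("<token>")
        return req


def schedule(req: Request, prompt_pool: list, token_pool: list) -> Request:
    # Phase 1 on a prompt-optimized machine...
    req = prompt_pool[0].run_prompt_phase(req)
    # ...then the request state (KV cache) is handed to a token-generation
    # machine for phase 2. In the real system this transfer happens over the
    # fast back-plane interconnect using optimized network libraries.
    return token_pool[0].run_token_phase(req)


if __name__ == "__main__":
    done = schedule(Request("Hello", max_new_tokens=4),
                    [PromptMachine()], [TokenMachine()])
    print(done.output)  # ['<first-token>', '<token>', '<token>', '<token>']
```

Because the two pools are independent, they can be sized and provisioned separately (e.g., fewer high-end GPUs for prompts, cheaper or lower-power hardware for token generation), which is where the throughput, cost, and power gains reported in the abstract come from.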