Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Bibliographic Details
Published in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1018 - 1031
Main Authors: Hwang, Ranggi; Wei, Jianyu; Cao, Shijie; Hwang, Changho; Tang, Xiaohu; Cao, Ting; Yang, Mao
Format: Conference Proceeding
Language: English
Published: IEEE, 29 June 2024
Online Access: Full text
Abstract Large language models (LLMs) based on transformers have made significant strides in recent years, a success driven by scaling up their model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which scales model size without proportionally scaling up computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures through an algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function, which alleviates the dynamic nature of sparse expert activation and allows the proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs with high performance using just a single GPU.
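The abstract states the mechanism only at a high level, so the following is a minimal, illustrative PyTorch sketch of how a pre-gating function can hide CPU-to-GPU expert migration: the gate evaluated in MoE block N selects the experts that block N+1 is expected to activate, letting their parameters be prefetched while block N still computes. Everything here is an assumption layered on the abstract; the names (PreGatedMoEBlock, fetch_experts, next_block_gate) are hypothetical, and the authors' actual routing, stream management, and buffer handling are not reproduced.

```python
# Hypothetical sketch of the pre-gating idea as read from the abstract: the gate
# inside MoE block N predicts which experts block N+1 will activate, so their
# parameters can be migrated from CPU memory to the GPU while block N computes.
import torch
import torch.nn as nn


class PreGatedMoEBlock(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Expert FFNs stay in CPU memory; only activated ones are copied to the GPU.
        self.cpu_experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts))
        # Pre-gate: scores the experts of the *next* block, not the current one.
        self.next_block_gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor, active_experts: list):
        # 1) Pre-gating: predict the experts the next MoE block will need, so the
        #    host-to-device copy can start early (and, in the real system, overlap
        #    with this block's computation).
        scores = self.next_block_gate(x.mean(dim=1))          # [batch, num_experts]
        next_ids = scores.topk(self.top_k, dim=-1).indices.unique().tolist()
        # 2) Compute this block with the experts prefetched while the previous
        #    block ran (per-token routing collapsed to an unweighted mix for brevity).
        y = torch.stack([expert(x) for expert in active_experts]).mean(dim=0)
        return x + y, next_ids


def fetch_experts(block: PreGatedMoEBlock, ids: list, device) -> list:
    """Copy the selected experts of `block` from CPU memory to `device`.
    A real system would issue these copies on a dedicated CUDA copy stream so
    the transfer overlaps with the preceding block's computation."""
    return [block.cpu_experts[i].to(device, non_blocking=True) for i in ids]


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    blocks = [PreGatedMoEBlock(d_model=64, num_experts=8) for _ in range(3)]
    x = torch.randn(2, 16, 64, device=device)
    for blk in blocks:
        blk.next_block_gate.to(device)   # gates are tiny and stay GPU-resident
    # Bootstrap: the first block has no predecessor, so fetch its experts eagerly.
    active = fetch_experts(blocks[0], [0, 1], device)
    for i, blk in enumerate(blocks):
        x, next_ids = blk(x, active)
        if i + 1 < len(blocks):
            # Prefetch the experts the pre-gate just selected for the next block.
            active = fetch_experts(blocks[i + 1], next_ids, device)
```

The eager fetch for the first block and the unweighted expert mix are simplifications; the point of the sketch is only the ordering, i.e., expert selection for block N+1 happens one block early so that the CPU-to-GPU copy latency can be overlapped with computation.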
Author Tang, Xiaohu
Cao, Shijie
Hwang, Ranggi
Hwang, Changho
Wei, Jianyu
Cao, Ting
Yang, Mao
Author_xml – sequence: 1
  givenname: Ranggi
  surname: Hwang
  fullname: Hwang, Ranggi
  email: ranggi.hwang@kaist.ac.kr
  organization: KAIST
– sequence: 2
  givenname: Jianyu
  surname: Wei
  fullname: Wei, Jianyu
  email: noob@mail.ustc.edu.cn
  organization: USTC / Microsoft Research
– sequence: 3
  givenname: Shijie
  surname: Cao
  fullname: Cao, Shijie
  email: shijiecao@microsoft.com
  organization: Microsoft Research
– sequence: 4
  givenname: Changho
  surname: Hwang
  fullname: Hwang, Changho
  email: changhohwang@microsoft.com
  organization: Microsoft Research
– sequence: 5
  givenname: Xiaohu
  surname: Tang
  fullname: Tang, Xiaohu
  email: v-xiaohutang@microsoft.com
  organization: Microsoft Research
– sequence: 6
  givenname: Ting
  surname: Cao
  fullname: Cao, Ting
  email: ting.cao@microsoft.com
  organization: Microsoft Research
– sequence: 7
  givenname: Mao
  surname: Yang
  fullname: Yang, Mao
  email: maoyang@microsoft.com
  organization: Microsoft Research
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/ISCA59077.2024.00078
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350326581
EndPage 1031
ExternalDocumentID 10609634
Genre orig-research
GroupedDBID 6IE
6IH
ACM
ALMA_UNASSIGNED_HOLDINGS
CBEJK
RIE
RIO
IEDL.DBID RIE
ISICitedReferencesCount 16
IngestDate Wed Aug 27 02:35:15 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 14
ParticipantIDs ieee_primary_10609634
PublicationCentury 2000
PublicationDate 2024-June-29
PublicationDateYYYYMMDD 2024-06-29
PublicationDate_xml – month: 06
  year: 2024
  text: 2024-June-29
  day: 29
PublicationDecade 2020
PublicationTitle 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)
PublicationTitleAbbrev ISCA
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssib060973785
SourceID ieee
SourceType Publisher
StartPage 1018
SubjectTerms Computational modeling
Graphics processing units
Heuristic algorithms
inference system
large language model
Large language models
machine learning
Memory management
memory offloading
Mixture-of-expert
Throughput
Transformers
Title Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
URI https://ieeexplore.ieee.org/document/10609634
hasFullText 1
inHoldings 1