HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li |
| Medium: | Conference paper |
| Language: | English |
| Published: | IEEE, June 22, 2025 |
| Topics: | Computational modeling; Dynamic scheduling; Heuristic algorithms; Inference algorithms; Load modeling; Prefetching; Resource management; Schedules |
| Online access: | Get full text |
| Abstract | The Mixture of Experts (MoE) architecture has demonstrated significant advantages, as it increases model capacity without a proportional increase in computation. However, the large size of MoE models still introduces substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation and reduce expert loading overhead, but it faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, hybrid CPU-GPU scheduling for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, etc. To address these challenges, in this paper we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across the CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of $1.33\times$ in the prefill stage and $1.70\times$ in the decode stage compared to the state-of-the-art hybrid MoE inference framework. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE. |
|---|---|
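The record carries no implementation detail beyond the abstract, so the following is a minimal sketch of the third mechanism it names, score-based expert caching: keep a running activation score per expert and evict the lowest-scoring resident expert when the GPU cache is full. The class name, the exponentially decayed scoring rule, and all parameters are illustrative assumptions, not HybriMoE's published algorithm.

```python
# Minimal sketch of score-based expert caching. The decayed-score policy is
# an assumption for illustration; HybriMoE's actual algorithm may differ.
from collections import defaultdict


class ScoreBasedExpertCache:
    def __init__(self, capacity: int, decay: float = 0.9):
        self.capacity = capacity          # max experts resident on the GPU
        self.decay = decay                # how fast old activations fade
        self.scores = defaultdict(float)  # expert id -> running score
        self.resident = set()             # experts currently on the GPU

    def observe(self, activated: dict) -> None:
        """Fold one token's router output (expert id -> gate weight) into the scores."""
        for expert_id in list(self.scores):
            self.scores[expert_id] *= self.decay
        for expert_id, gate in activated.items():
            self.scores[expert_id] += gate

    def ensure(self, expert_id: int) -> bool:
        """Return True on a cache hit; on a miss, admit the expert, evicting
        the lowest-scoring resident expert if the cache is full."""
        if expert_id in self.resident:
            return True
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda e: self.scores[e])
            self.resident.discard(victim)  # here: free the victim's GPU weights
        self.resident.add(expert_id)       # here: copy weights host -> device
        return False
```

In a hybrid CPU-GPU runtime, a miss returned by ensure() is where an intra-layer scheduler would choose between waiting for the host-to-device copy and computing the expert on the CPU; that trade-off is the workload balancing the abstract's first mechanism refers to.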
| Author | Zhong, Shuzhang; Wang, Runsheng; Liang, Ling; Li, Meng; Sun, Yanfan; Huang, Ru |
| Author affiliations | 1. Zhong, Shuzhang (Institute for Artificial Intelligence, Peking University, Beijing, China); 2. Sun, Yanfan (School of Computer Science and Engineering, Beihang University, Beijing, China); 3. Liang, Ling (School of Integrated Circuits, Peking University, Beijing, China); 4. Wang, Runsheng (School of Integrated Circuits, Peking University, Beijing, China); 5. Huang, Ru (School of Integrated Circuits, Peking University, Beijing, China); 6. Li, Meng, meng.li@pku.edu.cn (Institute for Artificial Intelligence, Peking University, Beijing, China) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/DAC63849.2025.11133274 |
| EISBN | 9798331503048 |
| EndPage | 7 |
| ExternalDocumentID | 11133274 |
| Genre | orig-research |
| Grant information | Funder: Research and Development (funder id: 10.13039/100006190) |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 7 |
| PublicationDate | 2025-June-22 |
| PublicationTitle | 2025 62nd ACM/IEEE Design Automation Conference (DAC) |
| PublicationTitleAbbrev | DAC |
| PublicationYear | 2025 |
| Publisher | IEEE |
| StartPage | 1 |
| SubjectTerms | Computational modeling; Dynamic scheduling; Hands; Heuristic algorithms; Inference algorithms; Load modeling; Prefetching; Rendering (computer graphics); Resource management; Schedules |
| Title | HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference |
| URI | https://ieeexplore.ieee.org/document/11133274 |