Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing
| Published in: | 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 382-395 |
|---|---|
| Main Authors: | Haifeng Liu, Long Zheng, Yu Huang, Jingyi Zhou, Chaoqiang Liu, Runze Wang, Xiaofei Liao, Hai Jin, Jingling Xue |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 29.06.2024 |
| Subjects: | Bandwidth; Data transfer; Degradation; Hardware; Memory management; Prefetching; Recommender systems; Scalability; Training; Web and internet services |
| Online Access: | https://ieeexplore.ieee.org/document/10609725 |
| Abstract | Personalized recommendation systems have become one of the most important Internet services nowadays. A critical challenge of training and deploying recommendation models is their high memory capacity and bandwidth demands, with the embedding layers occupying hundreds of GBs to TBs of storage. The advent of memory disaggregation technology and Compute Express Link (CXL) provides a promising solution for memory capacity scaling. However, relocating memory-intensive embedding layers to CXL memory incurs noticeable performance degradation due to its limited transmission bandwidth, which is significantly lower than the host memory bandwidth. To address this, we introduce ReCXL, a CXL memory disaggregation system that utilizes near-memory processing (NMP) for scalable, efficient recommendation model training. ReCXL features a unified, hardware-efficient NMP architecture that processes the entire embedding training within CXL memory, minimizing data transfers over the bandwidth-limited CXL link and enhancing internal bandwidth. To further improve the performance, ReCXL incorporates software-hardware co-optimizations, including sophisticated dependency-free prefetching and fine-grained update scheduling, to maximize hardware utilization. Evaluation results show that ReCXL outperforms the CPU-GPU baseline and the naïve CXL memory by 7.1×–10.6× (9.4× on average) and 12.7×–31.3× (22.6× on average), respectively. |
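The abstract's core claim concerns the embedding layers: their lookups and updates are sparse, bandwidth-bound gathers and scatters, which is why executing them near the CXL memory instead of shipping embedding rows across the link pays off. The sketch below is illustrative only and is not taken from the paper; the table shape, batch size, learning rate, and function names are hypothetical.

```python
import numpy as np

# Hypothetical embedding-table shape and training hyperparameters,
# chosen only for illustration (production tables reach hundreds of GBs to TBs).
NUM_ROWS, DIM = 100_000, 64
LEARNING_RATE = 0.01

table = np.random.rand(NUM_ROWS, DIM).astype(np.float32)

def embedding_forward(indices: np.ndarray) -> np.ndarray:
    """Gather the embedding rows referenced by one batch of sparse feature IDs."""
    return table[indices]

def embedding_backward(indices: np.ndarray, grads: np.ndarray) -> None:
    """Scatter SGD updates back into the table in place.

    np.subtract.at accumulates correctly when an index appears more than
    once in the batch, which mirrors real embedding training.
    """
    np.subtract.at(table, indices, LEARNING_RATE * grads)

# One training step over a small batch of sparse feature IDs.
batch_ids = np.random.randint(0, NUM_ROWS, size=256)
activations = embedding_forward(batch_ids)       # bandwidth-bound gather
upstream_grads = np.ones_like(activations)       # stand-in for real gradients
embedding_backward(batch_ids, upstream_grads)    # bandwidth-bound scatter
```

In a ReCXL-style design, the gather in embedding_forward and the scatter in embedding_backward would run on processing units inside the CXL memory device, so only the much smaller index batches and results would need to cross the bandwidth-limited CXL link.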
| Author | Liu, Haifeng; Zheng, Long; Huang, Yu; Zhou, Jingyi; Liu, Chaoqiang; Wang, Runze; Liao, Xiaofei; Jin, Hai; Xue, Jingling |
| Author_xml | – sequence: 1 givenname: Haifeng surname: Liu fullname: Liu, Haifeng email: hfliu@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 2 givenname: Long surname: Zheng fullname: Zheng, Long email: longzh@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 3 givenname: Yu surname: Huang fullname: Huang, Yu email: yuh@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 4 givenname: Jingyi surname: Zhou fullname: Zhou, Jingyi email: zjy9695@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 5 givenname: Chaoqiang surname: Liu fullname: Liu, Chaoqiang email: chqliu@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 6 givenname: Runze surname: Wang fullname: Wang, Runze email: rzwang@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 7 givenname: Xiaofei surname: Liao fullname: Liao, Xiaofei email: xfliao@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 8 givenname: Hai surname: Jin fullname: Jin, Hai email: hjin@hust.edu.cn organization: Huazhong University of Science and Technology,National Engineering Research Center for Big Data Technology and System/Services Computing Technology and System Lab/Cluster and Grid Computing Lab,China – sequence: 9 givenname: Jingling surname: Xue fullname: Xue, Jingling email: jingling@cse.unsw.edu.au organization: Zhejiang Lab,Hangzhou,China |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ISCA59077.2024.00036 |
| EISBN | 9798350326581 |
| EndPage | 395 |
| ExternalDocumentID | 10609725 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: National Natural Science Foundation of China funderid: 10.13039/501100001809 – fundername: Research and Development funderid: 10.13039/100006190 |
| ISICitedReferencesCount | 6 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 14 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-June-29 |
| PublicationDateYYYYMMDD | 2024-06-29 |
| PublicationDate_xml | – month: 06 year: 2024 text: 2024-June-29 day: 29 |
| PublicationDecade | 2020 |
| PublicationTitle | 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) |
| PublicationTitleAbbrev | ISCA |
| PublicationYear | 2024 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 382 |
| SubjectTerms | Bandwidth Data transfer Degradation Hardware Memory management Prefetching Recommender systems Scalability Training Web and internet services |
| Title | Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing |
| URI | https://ieeexplore.ieee.org/document/10609725 |