Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing
Saved in:
| Published in: | 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 382-395 |
|---|---|
| Main authors: | , , , , , , , , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 29 June 2024 |
| Subjects: | |
| Online access: | Full text |
| Abstract: | Personalized recommendation systems have become one of the most important Internet services today. A critical challenge in training and deploying recommendation models is their high memory capacity and bandwidth demand, with the embedding layers occupying hundreds of GBs to TBs of storage. The advent of memory disaggregation technology and Compute Express Link (CXL) provides a promising solution for memory capacity scaling. However, relocating the memory-intensive embedding layers to CXL memory incurs noticeable performance degradation due to its limited transmission bandwidth, which is significantly lower than the host memory bandwidth. To address this, we introduce ReCXL, a CXL memory disaggregation system that utilizes near-memory processing (NMP) for scalable, efficient recommendation model training. ReCXL features a unified, hardware-efficient NMP architecture that processes the entire embedding training within CXL memory, minimizing data transfers over the bandwidth-limited CXL link and exploiting the higher internal bandwidth. To further improve performance, ReCXL incorporates software-hardware co-optimizations, including dependency-free prefetching and fine-grained update scheduling, to maximize hardware utilization. Evaluation results show that ReCXL outperforms the CPU-GPU baseline and the naïve CXL memory baseline by 7.1× to 10.6× (9.4× on average) and 12.7× to 31.3× (22.6× on average), respectively. |
| DOI: | 10.1109/ISCA59077.2024.00036 |
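The abstract's central argument, that embedding lookups and updates dominate CXL link traffic unless they execute near memory, can be illustrated with a small sketch. This is not the paper's interface or implementation; the table size, embedding width, and function names (`nmp_forward`, `nmp_sgd_update`) are assumptions made for illustration only.

```python
# Illustrative sketch (hypothetical names, not ReCXL's API): contrasts the data
# volume of naive CXL embedding training against a near-memory formulation in
# which only indices and pooled vectors cross the bandwidth-limited CXL link.
import numpy as np

EMB_DIM = 128                                                       # assumed embedding width
emb_table = np.random.rand(100_000, EMB_DIM).astype(np.float32)     # table resident in CXL memory

def nmp_forward(indices):
    """Near-memory pooling: gather and sum the selected rows inside the CXL
    device; only the indices and one pooled vector cross the link."""
    return emb_table[indices].sum(axis=0)

def nmp_sgd_update(indices, pooled_grad, lr=0.01):
    """Near-memory SGD: apply the pooled gradient to every touched row inside
    the device, so no embedding rows travel back over the link."""
    np.add.at(emb_table, indices, -lr * pooled_grad)

# Host-side view of one training step for a single sparse feature:
idx = np.array([3, 17, 42_000])             # lookup indices from one mini-batch sample
pooled = nmp_forward(idx)                   # link traffic: len(idx) ints + EMB_DIM floats
grad = np.ones(EMB_DIM, dtype=np.float32)   # gradient w.r.t. the pooled output
nmp_sgd_update(idx, grad)                   # link traffic: len(idx) ints + EMB_DIM floats
# A naive CXL design would instead move len(idx) * EMB_DIM floats in each
# direction per feature, which is what the limited CXL bandwidth penalizes.
```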