Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing
Saved in:
| Published in: | 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 382-395 |
|---|---|
| Main authors: | , , , , , , , , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 29 June 2024 |
| Subjects: | |
| Online access: | Full text |
| Abstract: | Personalized recommendation systems have become one of the most important Internet services today. A critical challenge in training and deploying recommendation models is their high memory capacity and bandwidth demand, with the embedding layers occupying hundreds of GBs to TBs of storage. The advent of memory disaggregation technology and Compute Express Link (CXL) provides a promising solution for memory capacity scaling. However, relocating the memory-intensive embedding layers to CXL memory incurs noticeable performance degradation due to its limited transmission bandwidth, which is significantly lower than the host memory bandwidth. To address this, we introduce ReCXL, a CXL memory disaggregation system that utilizes near-memory processing (NMP) for scalable, efficient recommendation model training. ReCXL features a unified, hardware-efficient NMP architecture that processes the entire embedding training within CXL memory, minimizing data transfers over the bandwidth-limited CXL link and exploiting the higher internal bandwidth. To further improve performance, ReCXL incorporates software-hardware co-optimizations, including dependency-free prefetching and fine-grained update scheduling, to maximize hardware utilization. Evaluation results show that ReCXL outperforms the CPU-GPU baseline and the naïve CXL memory baseline by 7.1× to 10.6× (9.4× on average) and 12.7× to 31.3× (22.6× on average), respectively. |
| DOI: | 10.1109/ISCA59077.2024.00036 |
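The abstract's central argument, that embedding lookups and updates dominate CXL link traffic unless they execute near memory, can be illustrated with a small sketch. This is not the paper's interface or implementation; the table size, embedding width, and function names (`nmp_forward`, `nmp_sgd_update`) are assumptions made for illustration only.

```python
# Illustrative sketch (hypothetical names, not ReCXL's API): contrasts the data
# volume of naive CXL embedding training against a near-memory formulation in
# which only indices and pooled vectors cross the bandwidth-limited CXL link.
import numpy as np

EMB_DIM = 128                                                       # assumed embedding width
emb_table = np.random.rand(100_000, EMB_DIM).astype(np.float32)     # table resident in CXL memory

def nmp_forward(indices):
    """Near-memory pooling: gather and sum the selected rows inside the CXL
    device; only the indices and one pooled vector cross the link."""
    return emb_table[indices].sum(axis=0)

def nmp_sgd_update(indices, pooled_grad, lr=0.01):
    """Near-memory SGD: apply the pooled gradient to every touched row inside
    the device, so no embedding rows travel back over the link."""
    np.add.at(emb_table, indices, -lr * pooled_grad)

# Host-side view of one training step for a single sparse feature:
idx = np.array([3, 17, 42_000])             # lookup indices from one mini-batch sample
pooled = nmp_forward(idx)                   # link traffic: len(idx) ints + EMB_DIM floats
grad = np.ones(EMB_DIM, dtype=np.float32)   # gradient w.r.t. the pooled output
nmp_sgd_update(idx, grad)                   # link traffic: len(idx) ints + EMB_DIM floats
# A naive CXL design would instead move len(idx) * EMB_DIM floats in each
# direction per feature, which is what the limited CXL bandwidth penalizes.
```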