Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing

Bibliographic details
Published in:	2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 382-395
Main authors:	Liu, Haifeng; Zheng, Long; Huang, Yu; Zhou, Jingyi; Liu, Chaoqiang; Wang, Runze; Liao, Xiaofei; Jin, Hai; Xue, Jingling
Format:	Conference paper
Language:	English
Published:	IEEE, 29 June 2024
Subjects:	Bandwidth; Data transfer; Degradation; Hardware; Memory management; Prefetching; Recommender systems; Scalability; Training; Web and internet services

Description
Summary:	Personalized recommendation systems have become one of the most important Internet services nowadays. A critical challenge of training and deploying the recommendation models is their high memory capacity and bandwidth demands, with the embedding layers occupying hundreds of GBs to TBs of storage. The advent of memory disaggregation technology and Compute Express Link (CXL) provides a promising solution for memory capacity scaling. However, relocating memory-intensive embedding layers to CXL memory incurs noticeable performance degradation due to its limited transmission bandwidth, which is significantly lower than the host memory bandwidth. To address this, we introduce ReCXL, a CXL memory disaggregation system that utilizes near-memory processing (NMP) for scalable, efficient recommendation model training. ReCXL features a unified, hardware-efficient NMP architecture that processes the entire embedding training within CXL memory, minimizing data transfers over the bandwidth-limited CXL and enhancing internal bandwidth. To further improve the performance, ReCXL incorporates software-hardware co-optimizations, including sophisticated dependency-free prefetching and fine-grained update scheduling, to maximize hardware utilization. Evaluation results show that ReCXL outperforms the CPU-GPU baseline and the naïve CXL memory by 7.1× to 10.6× (9.4× on average) and 12.7× to 31.3× (22.6× on average), respectively.
DOI:	10.1109/ISCA59077.2024.00036