Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching

Detailed bibliography
Published in: SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15
Main authors: Wang, Weihu; Xia, Yaqi; Yang, Donglin; Zhou, Xiaobo; Cheng, Dazhao
Format: Conference paper
Language: English
Published: IEEE, 17 November 2024
Description
Summary: Deep Learning Recommendation Models (DLRMs) are pivotal in various sectors, yet they are hindered by the high memory demands of embedding tables and the significant communication overhead in distributed training environments. Traditional approaches, like Tensor-Train (TT) decomposition, although effective for compressing these tables, introduce substantial computational burdens. Furthermore, existing frameworks for distributed training are inadequate due to excessive data-exchange requirements.

This paper proposes EcoRec, an advanced library designed to expedite the training of DLRMs through a synergistic integration of TT decomposition and distributed training. EcoRec introduces a novel computation pattern that eliminates redundancy in TT operations, alongside an efficient multiplication pathway, significantly reducing computation time. Additionally, it provides a unique micro-batching technique with sorted indices to decrease memory demands without additional computational cost. EcoRec also features a novel pipeline training system for embedding layers, ensuring balanced data distribution and enhanced communication efficiency. EcoRec, built on PyTorch and CUDA, has been evaluated on a 32-GPU cluster. The results show that EcoRec significantly outperforms the existing ELRec system, achieving up to a 3.1× speedup and a 38.5% reduction in memory requirements. EcoRec marks a notable advancement in high-performance DLRM training.
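As background for the abstract's central technique, the following is a minimal, illustrative PyTorch sketch of how a TT-format embedding lookup works in general. The factor shapes, rank, and the `tt_lookup` name are assumptions chosen for exposition; they do not reflect EcoRec's actual kernels or its redundancy-eliminating computation pattern.

```python
import torch

# Illustrative sketch of a generic TT-format embedding lookup (not EcoRec's
# actual implementation). Shapes and rank below are assumptions.
n_factors = (200, 220, 250)   # row count N = 200 * 220 * 250 = 11,000,000
d_factors = (4, 4, 8)         # embedding dim D = 4 * 4 * 8 = 128
ranks = (1, 16, 16, 1)        # TT-ranks; boundary ranks are always 1

# Three small TT cores replace the dense (N, D) table: ~270K parameters
# instead of ~1.4B at these shapes (a compression of roughly 5,000x).
cores = [
    torch.randn(ranks[k], n_factors[k], d_factors[k], ranks[k + 1])
    for k in range(3)
]

def tt_lookup(idx: int) -> torch.Tensor:
    """Reconstruct one embedding row of length D from the TT cores."""
    # Mixed-radix factorization of the flat row index into (i1, i2, i3).
    i1 = idx // (n_factors[1] * n_factors[2])
    i2 = (idx // n_factors[2]) % n_factors[1]
    i3 = idx % n_factors[2]
    row = cores[0][:, i1]                          # shape (1, d1, r1)
    for core, i in ((cores[1], i2), (cores[2], i3)):
        s = core[:, i]                             # shape (r_prev, d_k, r_k)
        # Contract the shared rank dimension, merging the embedding factors.
        row = torch.einsum('apr,rqs->apqs', row, s).reshape(1, -1, s.shape[-1])
    return row.reshape(-1)                         # shape (D,) = (128,)
```

Each lookup thus replaces a single memory read with a chain of small tensor contractions, which is exactly the computational burden the abstract says TT decomposition introduces and that EcoRec's optimized computation pattern targets.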
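The micro-batching with sorted indices mentioned in the abstract can likewise be pictured as deduplicating lookups within each bounded micro-batch so that every unique row is reconstructed once. The sketch below is a hypothetical illustration reusing the `tt_lookup` helper above; the micro-batch size and function structure are assumptions, not EcoRec's API.

```python
def microbatched_lookup(indices: torch.Tensor,
                        micro_batch: int = 4096) -> torch.Tensor:
    """Batch lookup in bounded micro-batches with per-chunk deduplication."""
    outputs = []
    for chunk in indices.split(micro_batch):
        # torch.unique sorts the chunk, grouping duplicate indices so each
        # unique row is reconstructed from the TT cores only once.
        uniq, inverse = torch.unique(chunk, sorted=True, return_inverse=True)
        rows = torch.stack([tt_lookup(int(i)) for i in uniq])  # (U, D)
        outputs.append(rows[inverse])                # scatter back to chunk order
    return torch.cat(outputs)                        # (len(indices), D)
```

Processing the batch in bounded micro-batches caps the intermediate activations held at any one time, consistent with the abstract's claim of reduced memory demands without additional computation.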
DOI: 10.1109/SC41406.2024.00055