High-Speed Data Communication With Advanced Networks in Large Language Model Training

Bibliographic details
Published in:	IEEE Micro, vol. 44, no. 2, pp. 31-40
Main authors:	Dai, Liuyao; Qi, Hao; Chen, Weicong; Lu, Xiaoyi
Format:	Journal Article
Language:	English
Published:	Los Alamitos: IEEE, 1 March 2024; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Communication; Computational modeling; Data communication; Data models; Decoding; High speed; Interconnections; IP (Internet Protocol); Large language models; Natural language processing; Parallel processing; Synchronization; TCP/IP (protocol); TCPIP; Training; Transformers
ISSN:	0272-1732, 1937-4143

Description
Summary:	Large language models (LLMs) like Generative Pre-trained Transformer, Bidirectional Encoder Representations from Transformers, and T5 are pivotal in natural language processing. Their distributed training is influenced by high-speed interconnects. This article characterizes their training performance across various interconnects and communication protocols: TCP/IP, Internet Protocol over InfiniBand (IPoIB), and Remote Direct Memory Access (RDMA), using data and model parallelism. RDMA-100 Gbps outperforms IPoIB-100 Gbps and TCP/IP-10 Gbps, with average gains of 2.5x and 4.8x in data parallelism, while in model parallelism the gains are 1.1x and 1.2x. RDMA achieves the highest interconnect utilization (up to 60 Gbps), compared to IPoIB with up to 20 Gbps and TCP/IP with up to 9 Gbps. Larger models demand increased communication bandwidth, with AllReduce in data parallelism consuming up to 91% of training time, and forward receive and back-embedding AllReduce in model parallelism taking up to 90%. The larger-scale experiment confirms that communication predominates iterations. Our findings underscore the significance of communication in distributed LLM training and present opportunities for optimization.
DOI:	10.1109/MM.2024.3360081
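
Illustrative note: the Summary attributes much of data-parallel training time to AllReduce and quotes peak interconnect utilization for each protocol. The sketch below is not the article's method. It is a bandwidth-only estimate that assumes a ring AllReduce, in which each worker transfers 2(N-1)/N times the gradient payload. The model size, parameter width, and worker count are hypothetical values chosen for illustration; only the three Gbps figures come from the Summary. The estimate ignores latency, protocol overhead, and any overlap with computation, so it should not be expected to reproduce the reported end-to-end gains.

```python
# Bandwidth-only estimate of one gradient AllReduce per training iteration.
# Assumption: ring AllReduce; the record does not say which algorithm was used.

# Peak interconnect utilization quoted in the Summary, in Gbps.
PEAK_GBPS = {"RDMA": 60, "IPoIB": 20, "TCP/IP": 9}


def ring_allreduce_seconds(num_params, bytes_per_param, num_workers, gbps):
    """Time for one ring AllReduce if the link sustains `gbps` throughout."""
    payload_bits = num_params * bytes_per_param * 8
    # In a ring AllReduce each worker sends 2*(N-1)/N of the payload.
    traffic_bits = 2 * (num_workers - 1) / num_workers * payload_bits
    return traffic_bits / (gbps * 1e9)


def estimate(num_params=1_000_000_000, bytes_per_param=4, num_workers=8):
    """Hypothetical defaults: 1e9 parameters, 4-byte gradients, 8 workers."""
    return {
        name: ring_allreduce_seconds(num_params, bytes_per_param, num_workers, gbps)
        for name, gbps in PEAK_GBPS.items()
    }


if __name__ == "__main__":
    for name, seconds in estimate().items():
        print(f"{name}: {seconds:.2f} s per AllReduce")
```

Because the estimate is inversely proportional to sustained bandwidth, the ratio between two protocols depends only on their Gbps figures, not on the hypothetical model size or worker count.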