High-Speed Data Communication With Advanced Networks in Large Language Model Training

Detailed Description

Bibliographic Details
Published in: IEEE Micro, Vol. 44, No. 2, pp. 31-40
Main authors: Dai, Liuyao; Qi, Hao; Chen, Weicong; Lu, Xiaoyi
Format: Journal Article
Language: English
Published: Los Alamitos: IEEE, 01.03.2024 (The Institute of Electrical and Electronics Engineers, Inc.)
ISSN: 0272-1732; EISSN: 1937-4143
Online access: Full text
Abstract Large language models (LLMs) like Generative Pre-trained Transformer, Bidirectional Encoder Representations from Transformers, and T5 are pivotal in natural language processing. Their distributed training is influenced by high-speed interconnects. This article characterizes their training performance across various interconnects and communication protocols: TCP/IP, Internet Protocol over InfiniBand (IPoIB), and Remote Direct Memory Access (RDMA), using data and model parallelism. RDMA-100 Gbps outperforms IPoIB-100 Gbps and TCP/IP-10 Gbps, with average gains of 2.5x and 4.8x in data parallelism, while in model parallelism the gains are 1.1x and 1.2x. RDMA achieves the highest interconnect utilization (up to 60 Gbps), compared to IPoIB with up to 20 Gbps and TCP/IP with up to 9 Gbps. Larger models demand increased communication bandwidth, with AllReduce in data parallelism consuming up to 91% of training time, and forward-receive and back-embedding AllReduce in model parallelism taking up to 90%. A larger-scale experiment confirms that communication dominates iteration time. Our findings underscore the significance of communication in distributed LLM training and present opportunities for optimization.
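The AllReduce share reported above comes from the authors' instrumented training runs; the record does not include their measurement harness. As a rough, hypothetical illustration of the kind of microbenchmark involved, the sketch below times a standalone AllReduce with PyTorch's torch.distributed. The BACKEND variable and tensor size are our assumptions, not the paper's: the nccl backend can use RDMA over InfiniBand when host channel adapters are present, while gloo communicates over TCP/IP sockets.

```python
# Minimal sketch (ours, not the authors' harness): average time per AllReduce
# on one gradient-sized tensor. Launch with, e.g.:
#   BACKEND=nccl torchrun --nproc_per_node=2 allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

def time_allreduce(numel: int = 64 * 1024 * 1024, iters: int = 20) -> float:
    """Return average seconds per AllReduce over `numel` fp32 elements."""
    backend = os.environ.get("BACKEND", "gloo")  # BACKEND is our convention
    dist.init_process_group(backend=backend)     # torchrun supplies RANK etc.

    if backend == "nccl":
        device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", "0")))
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    grad = torch.randn(numel, device=device)  # stands in for a gradient bucket

    for _ in range(5):                         # warm-up: exclude connection setup
        dist.all_reduce(grad)
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(grad)                  # sums the tensor across all ranks
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    dist.destroy_process_group()
    return elapsed

if __name__ == "__main__":
    secs = time_allreduce()
    print(f"{os.environ.get('BACKEND', 'gloo')}: {secs * 1e3:.1f} ms per AllReduce")
```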
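Reproducing the protocol comparison (RDMA vs. IPoIB vs. TCP/IP) with NCCL-based training typically relies on NCCL's documented environment variables rather than code changes. The helper below is a hedged sketch of that configuration, not the authors' actual setup; interface names such as "ib0" and "eth0" are placeholders for a specific cluster.

```python
import os

def select_transport(mode: str, ifname: str = "eth0") -> None:
    """Steer NCCL's transport; call before init_process_group(backend="nccl").

    mode="rdma": allow the InfiniBand verbs (RDMA) transport, NCCL's
                 default when HCAs are present.
    mode="tcp":  disable IB verbs so NCCL falls back to TCP sockets.
                 Pinning NCCL_SOCKET_IFNAME to an IPoIB interface
                 (e.g., "ib0") approximates the IPoIB configuration;
                 an Ethernet NIC (e.g., "eth0") gives plain TCP/IP.
    """
    if mode == "rdma":
        os.environ.pop("NCCL_IB_DISABLE", None)
    elif mode == "tcp":
        os.environ["NCCL_IB_DISABLE"] = "1"        # turn off the RDMA path
        os.environ["NCCL_SOCKET_IFNAME"] = ifname  # pin the socket interface
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    os.environ.setdefault("NCCL_DEBUG", "INFO")    # log which transport NCCL picks
```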
Authors
1. Dai, Liuyao (ORCID: 0000-0002-0907-6920), ldai8@ucmerced.edu, University of California Merced, Merced, CA 95343, USA
2. Qi, Hao, hqi6@ucmerced.edu, University of California Merced, Merced, CA 95343, USA
3. Chen, Weicong (ORCID: 0000-0003-0573-8808), wchen97@ucmerced.edu, University of California Merced, Merced, CA 95343, USA
4. Lu, Xiaoyi (ORCID: 0000-0001-7581-8905), xiaoyi.lu@ucmerced.edu, University of California Merced, Merced, CA 95343, USA
CODEN IEMIDZ
Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
DOI 10.1109/MM.2024.3360081
Discipline Computer Science
Genre orig-research
GrantInformation
- MRI, grant 2019144 (funder ID: 10.13039/100011612)
- Office of Advanced Cyberinfrastructure, grants 2321123 and 2340982 (funder ID: 10.13039/100000105)
IsPeerReviewed true
IsScholarly true
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
SubjectTerms Communication
Computational modeling
Data communication
Data models
Decoding
High speed
Interconnections
IP (Internet Protocol)
Large language models
Natural language processing
Parallel processing
Synchronization
TCP/IP (protocol)
Training
Transformers
URI https://ieeexplore.ieee.org/document/10417069
https://www.proquest.com/docview/3033620838