Communication Algorithm-Architecture Co-Design for Distributed Deep Learning

Detailed Bibliography
Published in: Proceedings - International Symposium on Computer Architecture, pp. 181-194
Main authors: Huang, Jiayi, Majumder, Pritam, Kim, Sungkeun, Muzahid, Abdullah, Yum, Ki Hwan, Kim, Eun Jung
Format: Conference paper
Language: English
Published: IEEE, 01.06.2021
Subjects: algorithm-architecture co-design, all-reduce, data-parallel training, deep learning, distributed deep learning, interconnection network
ISSN: 2575-713X
Online access: Get full text
Abstract Large-scale distributed deep learning training has enabled the development of more complex deep neural network models that learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency of widely used all-reduce algorithms and the opportunity for algorithm-architecture co-design. We propose the MultiTree all-reduce algorithm, with topology and resource-utilization awareness, for efficient and scalable all-reduce operations applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate the all-reduce messages for contention-free communication, working in synergy with the algorithm. Flow control is also simplified to exploit the bulk data transfer of large gradient exchanges. We evaluate the co-design with synthetic studies across different all-reduce data sizes, demonstrating its effectiveness on various interconnection network topologies, as well as with state-of-the-art deep neural networks as real workloads. The results show that MultiTree achieves 2.3× and 1.56× communication speedup, as well as up to 81% and 30% training-time reduction, compared to ring all-reduce and state-of-the-art approaches, respectively.
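The baseline the abstract compares against, ring all-reduce, splits each worker's gradient into one chunk per worker and circulates the chunks around a logical ring: a reduce-scatter phase sums the chunks, then an all-gather phase distributes the summed chunks to every worker. The following is a minimal, single-process NumPy sketch of that baseline for orientation only; it is not the paper's MultiTree algorithm, and the function name ring_allreduce and the simulation structure are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def ring_allreduce(grads):
    """Single-process simulation of ring all-reduce (illustrative only).

    `grads` is a list with one gradient vector per worker. Each vector is
    split into n chunks; after n-1 reduce-scatter steps every worker owns
    one fully summed chunk, and after n-1 all-gather steps every worker
    holds the complete summed gradient.
    """
    n = len(grads)
    # Each worker's local view of its gradient, split into n chunks.
    chunks = [np.array_split(g.astype(np.float64), n) for g in grads]

    # Reduce-scatter: at step s, worker r sends chunk (r - s) mod n to
    # worker (r + 1) mod n, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = [chunks[r][(r - step) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[(r + 1) % n][(r - step) % n] += outgoing[r]

    # All-gather: the fully reduced chunk owned by worker r, at index
    # (r + 1) mod n, circulates around the ring and overwrites the
    # partially summed copies held by the other workers.
    for step in range(n - 1):
        incoming = [chunks[r][(r + 1 - step) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[(r + 1) % n][(r + 1 - step) % n] = incoming[r]

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    workers = [np.random.rand(16) for _ in range(4)]
    reduced = ring_allreduce(workers)
    total = np.sum(workers, axis=0)
    assert all(np.allclose(r, total) for r in reduced)
    print("all 4 workers hold the summed gradient")
```

Each of the 2(n-1) steps moves only 1/n of the gradient per worker, which is why ring all-reduce is bandwidth-efficient but still dominates iteration time for large gradients; the paper's co-design targets exactly this communication phase.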
Author Yum, Ki Hwan
Huang, Jiayi
Kim, Sungkeun
Kim, Eun Jung
Muzahid, Abdullah
Majumder, Pritam
Author_xml – sequence: 1
  givenname: Jiayi
  surname: Huang
  fullname: Huang, Jiayi
  email: jyhuang@ucsb.edu
  organization: UC Santa Barbara
– sequence: 2
  givenname: Pritam
  surname: Majumder
  fullname: Majumder, Pritam
  email: pritam2309@tamu.edu
  organization: Texas A&M University
– sequence: 3
  givenname: Sungkeun
  surname: Kim
  fullname: Kim, Sungkeun
  email: ksungkeun84@tamu.edu
  organization: Texas A&M University
– sequence: 4
  givenname: Abdullah
  surname: Muzahid
  fullname: Muzahid, Abdullah
  email: abdullah.muzahid@cse.tamu.edu
  organization: Texas A&M University
– sequence: 5
  givenname: Ki Hwan
  surname: Yum
  fullname: Yum, Ki Hwan
  email: yum@cse.tamu.edu
  organization: Texas A&M University
– sequence: 6
  givenname: Eun Jung
  surname: Kim
  fullname: Kim, Eun Jung
  email: ejkim@cse.tamu.edu
  organization: Texas A&M University
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ISCA52012.2021.00023
Discipline Computer Science
EISBN 1665433337
9781665433334
EISSN 2575-713X
EndPage 194
ExternalDocumentID 9499917
Genre orig-research
ISICitedReferencesCount 29
IsPeerReviewed false
IsScholarly true
Language English
PageCount 14
PublicationCentury 2000
PublicationDate 2021-06
PublicationDateYYYYMMDD 2021-06-01
PublicationDate_xml – month: 06
  year: 2021
  text: 2021-06
PublicationDecade 2020
PublicationTitle Proceedings - International Symposium on Computer Architecture
PublicationTitleAbbrev ISCA
PublicationYear 2021
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 181
SubjectTerms algorithm-architecture co-design
all-reduce
data-parallel training
Deep learning
distributed deep learning
interconnection network
Network topology
Schedules
Scheduling
Stochastic processes
Topology
Training
Title Communication Algorithm-Architecture Co-Design for Distributed Deep Learning
URI https://ieeexplore.ieee.org/document/9499917