Communication Algorithm-Architecture Co-Design for Distributed Deep Learning
Large-scale distributed deep learning training has enabled the development of more complex deep neural network models that learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency in widely used all-reduce algorithms and the opportunity for algorithm-architecture co-design. We propose the MultiTree all-reduce algorithm with topology and resource-utilization awareness for efficient and scalable all-reduce operations, applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate the all-reduce messages for contention-free communication, working in synergy with the algorithm. The flow control is also simplified to exploit the bulk data transfer of large gradient exchanges. We evaluate the co-design using different all-reduce data sizes in a synthetic study, demonstrating its effectiveness on various interconnection network topologies, in addition to state-of-the-art deep neural networks for real-workload experiments. The results show that MultiTree achieves 2.3× and 1.56× communication speedup, as well as up to 81% and 30% training time reduction, compared to ring all-reduce and state-of-the-art approaches, respectively.
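For context on the communication pattern the abstract refers to, the following is a minimal Python sketch of the ring all-reduce baseline that MultiTree is compared against (it is not the paper's algorithm). The function name `ring_allreduce`, the NumPy-based single-process simulation, and the per-chunk message schedule are illustrative assumptions; a real data-parallel training job would call a collective library such as NCCL or MPI instead.

```python
import numpy as np

def ring_allreduce(node_grads):
    """Simulate ring all-reduce over a list of per-node gradient vectors.

    Each node's gradient is split into one chunk per node. A reduce-scatter
    phase circulates and accumulates chunks around the ring, then an
    all-gather phase circulates the fully reduced chunks so that every node
    ends with the element-wise sum of all gradients.
    """
    n = len(node_grads)
    chunks = [np.array_split(np.asarray(g, dtype=float), n) for g in node_grads]

    # Reduce-scatter: in step s, node i sends chunk (i - s) mod n to its
    # right neighbour, which adds it to its own copy. Messages of one step
    # are buffered first so the simulation matches simultaneous sends.
    for s in range(n - 1):
        msgs = [(i, (i - s) % n, chunks[i][(i - s) % n].copy()) for i in range(n)]
        for src, c, data in msgs:
            chunks[(src + 1) % n][c] += data

    # After reduce-scatter, node i owns the fully reduced chunk (i + 1) mod n.
    # All-gather: forward the reduced chunks around the ring unchanged.
    for s in range(n - 1):
        msgs = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy()) for i in range(n)]
        for src, c, data in msgs:
            chunks[(src + 1) % n][c] = data

    return [np.concatenate(c) for c in chunks]

# Four simulated workers, each holding a different local gradient.
grads = [np.arange(10.0) * (rank + 1) for rank in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```

With N nodes, each node transfers roughly 2(N-1)/N of its gradient volume, but over 2(N-1) serialized steps, so the step latency grows linearly with the node count; this scaling behaviour is the kind of inefficiency that topology-aware schemes such as the proposed MultiTree target.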
| Published in: | Proceedings - International Symposium on Computer Architecture, pp. 181-194 |
|---|---|
| Main authors: | Huang, Jiayi; Majumder, Pritam; Kim, Sungkeun; Muzahid, Abdullah; Yum, Ki Hwan; Kim, Eun Jung |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 01.06.2021 |
| Subjects: | algorithm-architecture co-design; all-reduce; data-parallel training; Deep learning; distributed deep learning; interconnection network; Network topology; Schedules; Scheduling; Stochastic processes; Topology; Training |
| ISSN: | 2575-713X |
| Online access: | Get full text |
| Author | Huang, Jiayi (UC Santa Barbara, jyhuang@ucsb.edu); Majumder, Pritam (Texas A&M University, pritam2309@tamu.edu); Kim, Sungkeun (Texas A&M University, ksungkeun84@tamu.edu); Muzahid, Abdullah (Texas A&M University, abdullah.muzahid@cse.tamu.edu); Yum, Ki Hwan (Texas A&M University, yum@cse.tamu.edu); Kim, Eun Jung (Texas A&M University, ejkim@cse.tamu.edu) |
|---|---|
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ISCA52012.2021.00023 |
| Discipline | Computer Science |
| EISBN | 1665433337 9781665433334 |
| EISSN | 2575-713X |
| EndPage | 194 |
| ExternalDocumentID | 9499917 |
| Genre | orig-research |
| ISICitedReferencesCount | 29 |
| Language | English |
| PageCount | 14 |
| PublicationDate | 2021-06 |
| PublicationTitle | Proceedings - International Symposium on Computer Architecture |
| PublicationTitleAbbrev | ISCA |
| PublicationYear | 2021 |
| Publisher | IEEE |
| StartPage | 181 |
| SubjectTerms | algorithm-architecture co-design; all-reduce; data-parallel training; Deep learning; distributed deep learning; interconnection network; Network topology; Schedules; Scheduling; Stochastic processes; Topology; Training |
| Title | Communication Algorithm-Architecture Co-Design for Distributed Deep Learning |
| URI | https://ieeexplore.ieee.org/document/9499917 |