AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive

Detailed bibliography
Published in: Proceedings of the International Conference on Distributed Computing Systems, pp. 25-35
Main authors: Zhao, Xiaoyang; Zhang, Zhe; Wu, Chuan
Medium: Conference paper
Language: English
Published: IEEE, July 23, 2024
Subjects: collective communication; distributed training; Graphics processing units; Libraries; Performance evaluation; Pressing; Rendering (computer graphics); Throughput; Training
ISSN: 2575-8411
Online access: https://ieeexplore.ieee.org/document/10631011
Abstract As deep learning (DL) models continue to grow in size, there is a pressing need for distributed model learning using a large number of devices (e.g., GPUs) and servers. Collective communication among devices/servers (for gradient synchronization, intermediate data exchange, etc.) introduces significant overheads, rendering major performance bottlenecks in distributed learning. A number of communication libraries, such as NCCL, Gloo and MPI, have been developed to optimize collective communication. Predefined communication strategies (e.g., ring or tree) are largely adopted, which may not be efficient or adaptive enough for inter-machine communication, especially in cloud-based scenarios where instance configurations and network performance can vary substantially. We propose AdapCC, a novel communication library that dynamically adapts to resource heterogeneity and network variability for optimized communication and training performance. AdapCC generates communication strategies based on run-time profiling, mitigates resource waste in waiting for computation stragglers, and executes efficient data transfers among DL workers. Experimental results under various settings demonstrate 2x communication speed-up and 31% training throughput improvement with AdapCC, as compared to NCCL and other representative communication backends.
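The central operation the abstract refers to, synchronizing gradients across workers through a collective all-reduce over a backend such as NCCL or Gloo, can be illustrated with a minimal sketch. The snippet below uses torch.distributed and is purely illustrative, not AdapCC's own implementation; the helper names (init_worker, synchronize_gradients) are hypothetical.

# Minimal sketch (not AdapCC): gradient synchronization via a collective
# all-reduce, the operation whose cost AdapCC aims to reduce.
# Assumes one process per GPU, launched e.g. with torchrun.
import os
import torch
import torch.distributed as dist

def init_worker():
    # NCCL is the usual backend for inter-GPU collectives; Gloo is a
    # CPU-capable alternative also named in the abstract.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def synchronize_gradients(model):
    # Average every gradient tensor across all data-parallel workers.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size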
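The "predefined communication strategies (e.g., ring or tree)" mentioned in the abstract can also be made concrete. Below is a pure-Python simulation of the classic ring all-reduce schedule (reduce-scatter followed by all-gather); it is a toy model of the fixed plan that AdapCC replaces with profiling-driven, adaptive plans, not code from the paper.

from typing import List

def ring_allreduce(chunks_per_rank: List[List[float]]) -> List[List[float]]:
    # chunks_per_rank[r][c] is rank r's value for chunk c; on return every
    # rank holds the fully reduced (summed) value of every chunk.
    n = len(chunks_per_rank)
    data = [list(row) for row in chunks_per_rank]

    # Reduce-scatter: in step s, rank r sends chunk (r - s) mod n to its
    # ring neighbour (r + 1) mod n, which accumulates it.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            data[(r + 1) % n][c] += data[r][c]

    # All-gather: in step s, rank r forwards its fully reduced chunk
    # (r + 1 - s) mod n to rank (r + 1) mod n, which overwrites its copy.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

# Example: 3 ranks, 3 chunks each; every rank ends with [3.0, 6.0, 9.0].
print(ring_allreduce([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]))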
Author Wu, Chuan
Zhao, Xiaoyang
Zhang, Zhe
Author_xml – sequence: 1
  givenname: Xiaoyang
  surname: Zhao
  fullname: Zhao, Xiaoyang
  email: xyzhao@cs.hku.hk
  organization: The University of Hong Kong, Department of Computer Science
– sequence: 2
  givenname: Zhe
  surname: Zhang
  fullname: Zhang, Zhe
  email: zzhang2@cs.hku.hk
  organization: The University of Hong Kong, Department of Computer Science
– sequence: 3
  givenname: Chuan
  surname: Wu
  fullname: Wu, Chuan
  email: cwu@cs.hku.hk
  organization: The University of Hong Kong, Department of Computer Science
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ICDCS60910.2024.00012
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798350386059
EISSN 2575-8411
EndPage 35
ExternalDocumentID 10631011
Genre orig-research
ISICitedReferencesCount 16
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001304430200003&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:32:38 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 11
ParticipantIDs ieee_primary_10631011
PublicationCentury 2000
PublicationDate 2024-July-23
PublicationDateYYYYMMDD 2024-07-23
PublicationDate_xml – month: 07
  year: 2024
  text: 2024-July-23
  day: 23
PublicationDecade 2020
PublicationTitle Proceedings of the International Conference on Distributed Computing Systems
PublicationTitleAbbrev ICDCS
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0005863
SourceID ieee
SourceType Publisher
StartPage 25
SubjectTerms collective communication
distributed training
Graphics processing units
Libraries
Performance evaluation
Pressing
Rendering (computer graphics)
Throughput
Training
Title AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive
URI https://ieeexplore.ieee.org/document/10631011
WOSCitedRecordID wos001304430200003
hasFullText 1
inHoldings 1
linkProvider IEEE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+of+the+International+Conference+on+Distributed+Computing+Systems&rft.atitle=AdapCC%3A+Making+Collective+Communication+in+Distributed+Machine+Learning+Adaptive&rft.au=Zhao%2C+Xiaoyang&rft.au=Zhang%2C+Zhe&rft.au=Wu%2C+Chuan&rft.date=2024-07-23&rft.pub=IEEE&rft.eissn=2575-8411&rft.spage=25&rft.epage=35&rft_id=info:doi/10.1109%2FICDCS60910.2024.00012&rft.externalDocID=10631011