AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive
| Published in: | Proceedings of the International Conference on Distributed Computing Systems (ICDCS 2024), pp. 25-35 |
|---|---|
| Main authors: | Xiaoyang Zhao, Zhe Zhang, Chuan Wu (The University of Hong Kong, Department of Computer Science) |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 23 July 2024 |
| Subjects: | Collective communication; distributed training; graphics processing units; libraries; performance evaluation; throughput; training |
| ISSN (electronic): | 2575-8411 |
| EISBN: | 9798350386059 |
| DOI: | 10.1109/ICDCS60910.2024.00012 |
| Online access: | Get full text: https://ieeexplore.ieee.org/document/10631011 |
| Abstract: | As deep learning (DL) models continue to grow in size, there is a pressing need for distributed model learning using a large number of devices (e.g., GPUs) and servers. Collective communication among devices/servers (for gradient synchronization, intermediate data exchange, etc.) introduces significant overheads, rendering major performance bottlenecks in distributed learning. A number of communication libraries, such as NCCL, Gloo and MPI, have been developed to optimize collective communication. Predefined communication strategies (e.g., ring or tree) are largely adopted, which may not be efficient or adaptive enough for inter-machine communication, especially in cloud-based scenarios where instance configurations and network performance can vary substantially. We propose AdapCC, a novel communication library that dynamically adapts to resource heterogeneity and network variability for optimized communication and training performance. AdapCC generates communication strategies based on run-time profiling, mitigates resource waste in waiting for computation stragglers, and executes efficient data transfers among DL workers. Experimental results under various settings demonstrate 2x communication speed-up and 31% training throughput improvement with AdapCC, as compared to NCCL and other representative communication backends. |
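
The abstract describes AdapCC as generating communication strategies from run-time profiling rather than using a fixed ring or tree layout. As a rough illustration of that general idea only (this is not AdapCC's actual API, and every name below is hypothetical), the sketch probes link bandwidth and picks an allreduce layout from the measurements.

```python
# Minimal sketch of profiling-driven strategy selection (hypothetical names,
# not AdapCC's actual interface): measure link speed at run time, then choose
# a collective layout that suits the observed intra-/inter-node gap.
import time


def probe_bandwidth_gbps(transfer, nbytes=64 * 1024 * 1024):
    """Time one point-to-point transfer of `nbytes` and return GB/s.
    `transfer` is a backend-specific callable (e.g. a send/recv pair)."""
    start = time.perf_counter()
    transfer(nbytes)
    return nbytes / (time.perf_counter() - start) / 1e9


def choose_allreduce_strategy(intra_node_bw, inter_node_bw, num_nodes):
    """Pick an allreduce layout from profiled bandwidths (GB/s)."""
    if num_nodes == 1:
        return "ring"            # one machine: plain ring over NVLink/PCIe
    if inter_node_bw < 0.25 * intra_node_bw:
        return "hierarchical"    # slow cross-machine links: reduce inside each
                                 # node first, exchange once between nodes
    return "tree"                # comparable links: favor lower-latency tree


if __name__ == "__main__":
    # Made-up measurements: ~12.5 GB/s inside a node, ~1.2 GB/s between
    # cloud instances, four machines -> prefer a hierarchical layout.
    print(choose_allreduce_strategy(12.5, 1.2, 4))
```

In a real system the probe would wrap the backend's point-to-point primitives and the chosen layout would be re-evaluated when network conditions change; the thresholds here are illustrative placeholders.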