Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles....
| Published in: | SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17 |
|---|---|
| Main authors: | Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 17 November 2024 |
| Abstract | In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves $2\times$ traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links. |
|---|---|
| Author | Bloch, Gil; Khalilov, Mikhail; Nudelman, Rami; Hoefler, Torsten; Girolamo, Salvatore Di; Chrapek, Marcin |
| Author_xml | 1. Mikhail Khalilov (mikhail.khalilov@inf.ethz.ch), ETH Zurich, Department of Computer Science, Zurich, Switzerland; 2. Salvatore Di Girolamo (sdigirolamo@nvidia.com), NVIDIA Corporation, Zurich, Switzerland; 3. Marcin Chrapek (marcin.chrapek@inf.ethz.ch), ETH Zurich, Department of Computer Science, Zurich, Switzerland; 4. Rami Nudelman (ramin@nvidia.com), NVIDIA Corporation, Santa Clara, United States; 5. Gil Bloch (gil@nvidia.com), NVIDIA Corporation, Yokne'am Illit, Israel; 6. Torsten Hoefler (torsten.hoefler@inf.ethz.ch), ETH Zurich, Department of Computer Science, Zurich, Switzerland |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/SC41406.2024.00109 |
| EISBN | 9798350352917 |
| EndPage | 17 |
| ExternalDocumentID | 10793060 |
| Genre | orig-research |
| ISICitedReferencesCount | 1 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 17 |
| ParticipantIDs | ieee_primary_10793060 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-Nov.-17 |
| PublicationDateYYYYMMDD | 2024-11-17 |
| PublicationDate_xml | – month: 11 year: 2024 text: 2024-Nov.-17 day: 17 |
| PublicationDecade | 2020 |
| PublicationTitle | SC24: International Conference for High Performance Computing, Networking, Storage and Analysis |
| PublicationTitleAbbrev | SC |
| PublicationYear | 2024 |
| Publisher | IEEE |
| StartPage | 1 |
| SubjectTerms | AI accelerators; Artificial intelligence; Clusters; Engines; Multicast algorithms; Networking; Next generation networking; Pipelines; Protocols; Substrates; Supercomputers; Training |
| Title | Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI |
| URI | https://ieeexplore.ieee.org/document/10793060 |