Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Bibliographic details
Published in: SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17
Main authors: Khalilov, Mikhail; Girolamo, Salvatore Di; Chrapek, Marcin; Nudelman, Rami; Bloch, Gil; Hoefler, Torsten
Format: Conference proceeding
Language: English
Published: IEEE, 17 Nov. 2024
Abstract In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves a 2× traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
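As a rough illustration of why hardware multicast can cut Allgather traffic, the sketch below compares per-node NIC traffic under a textbook ring Allgather with an idealized multicast-based schedule in which every node injects its shard exactly once. This is a back-of-the-envelope model, not the paper's algorithm or measurement methodology: the function names, the `NicTraffic` type, and the assumption of lossless multicast with no retransmissions or protocol overhead are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NicTraffic:
    """Bytes crossing one node's NIC during a single Allgather (hypothetical type)."""
    sent: int       # bytes the node injects into the network
    received: int   # bytes delivered to the node

    @property
    def total(self) -> int:
        return self.sent + self.received


def ring_allgather(num_nodes: int, shard_bytes: int) -> NicTraffic:
    # Textbook ring schedule: in each of the P-1 steps a node forwards
    # one shard to its neighbor and receives one shard from the other.
    steps = num_nodes - 1
    return NicTraffic(sent=steps * shard_bytes, received=steps * shard_bytes)


def multicast_allgather(num_nodes: int, shard_bytes: int) -> NicTraffic:
    # Idealized assumption: the switch fabric replicates each shard, so a
    # node injects its own shard once and still receives the other P-1.
    return NicTraffic(sent=shard_bytes, received=(num_nodes - 1) * shard_bytes)


if __name__ == "__main__":
    nodes, shard = 188, 1 << 20  # 188 nodes as in the abstract; 1 MiB shard is an assumption
    ring = ring_allgather(nodes, shard)
    mcast = multicast_allgather(nodes, shard)
    print(f"ring      sent={ring.sent} total={ring.total}")
    print(f"multicast sent={mcast.sent} total={mcast.total}")
    print(f"total-traffic ratio = {ring.total / mcast.total:.3f}")
```

Under these assumptions the combined send and receive traffic per NIC approaches a factor of two as the node count grows, while the injected bytes drop from (P-1) shards to one. The smaller injection footprint is what would leave headroom for a concurrent Reduce-Scatter in the FSDP scenario the abstract describes.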
Author Bloch, Gil
Khalilov, Mikhail
Nudelman, Rami
Hoefler, Torsten
Girolamo, Salvatore Di
Chrapek, Marcin
Author_xml – sequence: 1
  givenname: Mikhail
  surname: Khalilov
  fullname: Khalilov, Mikhail
  email: mikhail.khalilov@inf.ethz.ch
  organization: ETH Zurich,Department of Computer Science,Zurich,Switzerland
– sequence: 2
  givenname: Salvatore Di
  surname: Girolamo
  fullname: Girolamo, Salvatore Di
  email: sdigirolamo@nvidia.com
  organization: NVIDIA Corporation,Zurich,Switzerland
– sequence: 3
  givenname: Marcin
  surname: Chrapek
  fullname: Chrapek, Marcin
  email: marcin.chrapek@inf.ethz.ch
  organization: ETH Zurich,Department of Computer Science,Zurich,Switzerland
– sequence: 4
  givenname: Rami
  surname: Nudelman
  fullname: Nudelman, Rami
  email: ramin@nvidia.com
  organization: NVIDIA Corporation,Santa Clara,United States
– sequence: 5
  givenname: Gil
  surname: Bloch
  fullname: Bloch, Gil
  email: gil@nvidia.com
  organization: NVIDIA Corporation,Yokne'am Illit,Israel
– sequence: 6
  givenname: Torsten
  surname: Hoefler
  fullname: Hoefler, Torsten
  email: torsten.hoefler@inf.ethz.ch
  organization: ETH Zurich,Department of Computer Science,Zurich,Switzerland
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/SC41406.2024.00109
EISBN 9798350352917
EndPage 17
ExternalDocumentID 10793060
Genre orig-research
ISICitedReferencesCount 1
IngestDate Wed Jan 01 06:01:57 EST 2025
IsPeerReviewed false
IsScholarly false
Language English
PageCount 17
ParticipantIDs ieee_primary_10793060
PublicationCentury 2000
PublicationDate 2024-Nov.-17
PublicationDateYYYYMMDD 2024-11-17
PublicationDate_xml – month: 11
  year: 2024
  text: 2024-Nov.-17
  day: 17
PublicationDecade 2020
PublicationTitle SC24: International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev SC
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms AI accelerators
Artificial intelligence
Clusters
Engines
Multicast algorithms
Networking
Next generation networking
Pipelines
Protocols
Substrates
Supercomputers
Training
Title Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
URI https://ieeexplore.ieee.org/document/10793060