Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Bibliographic Details
Published in:	SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17
Main Authors:	Khalilov, Mikhail; Girolamo, Salvatore Di; Chrapek, Marcin; Nudelman, Rami; Bloch, Gil; Hoefler, Torsten
Format:	Conference Proceeding
Language:	English
Published:	IEEE 17.11.2024
Subjects:	AI accelerators; Artificial intelligence; Clusters; Engines; Multicast algorithms; Networking; Next generation networking; Pipelines; Protocols; Substrates; Supercomputers; Training

Abstract	In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2× traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
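
As a rough illustration of why hardware multicast can relieve injection bandwidth, the sketch below compares the bytes each host injects under a point-to-point ring Allgather with those under an idealized multicast-based Allgather. It is a toy model under stated assumptions, not the authors' algorithm or measurement: the function names are hypothetical, switches are assumed to replicate multicast packets perfectly, and reliability traffic and packet headers are ignored. The 2× traffic reduction quoted in the abstract is a testbed result and is not derived from this model.

```python
# Toy model (an assumption for illustration, not the paper's method):
# count the bytes each host injects to complete an Allgather in which
# every one of num_nodes hosts contributes one block of block_bytes.


def ring_injected_bytes(num_nodes: int, block_bytes: int) -> int:
    """Point-to-point ring Allgather: each host forwards one block per
    step for num_nodes - 1 steps."""
    return (num_nodes - 1) * block_bytes


def multicast_injected_bytes(num_nodes: int, block_bytes: int) -> int:
    """Idealized multicast Allgather: each host injects its own block
    once and the network replicates it to the other hosts."""
    return block_bytes if num_nodes > 1 else 0


def received_bytes(num_nodes: int, block_bytes: int) -> int:
    """Either way, each host must receive every other host's block."""
    return (num_nodes - 1) * block_bytes


if __name__ == "__main__":
    nodes = 188          # testbed size mentioned in the abstract
    block = 1 << 20      # 1 MiB per host, an arbitrary example value
    print("ring injected bytes/host:     ", ring_injected_bytes(nodes, block))
    print("multicast injected bytes/host:", multicast_injected_bytes(nodes, block))
    print("received bytes/host:          ", received_bytes(nodes, block))
```

In this model the receive side is identical for both schedules, while the send side of the multicast variant shrinks to a single block per host. Under these assumptions, that is the injection bandwidth a concurrent Reduce-Scatter could then use.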
Author_xml	1. Khalilov, Mikhail (mikhail.khalilov@inf.ethz.ch), ETH Zurich, Department of Computer Science, Zurich, Switzerland; 2. Girolamo, Salvatore Di (sdigirolamo@nvidia.com), NVIDIA Corporation, Zurich, Switzerland; 3. Chrapek, Marcin (marcin.chrapek@inf.ethz.ch), ETH Zurich, Department of Computer Science, Zurich, Switzerland; 4. Nudelman, Rami (ramin@nvidia.com), NVIDIA Corporation, Santa Clara, United States; 5. Bloch, Gil (gil@nvidia.com), NVIDIA Corporation, Yokne'am Illit, Israel; 6. Hoefler, Torsten (torsten.hoefler@inf.ethz.ch), ETH Zurich, Department of Computer Science, Zurich, Switzerland
CODEN	IEEPAD
ContentType	Conference Proceeding
DBID	6IE 6IL CBEJK RIE RIL
DOI	10.1109/SC41406.2024.00109
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml	– sequence: 1 dbid: RIE name: IEEE/IET Electronic Library (IEL) (UW System Shared) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
EISBN	9798350352917
EndPage	17
ExternalDocumentID	10793060
Genre	orig-research
GroupedDBID	6IE 6IL ACM ALMA_UNASSIGNED_HOLDINGS APO CBEJK LHSKQ RIE RIL
ID	FETCH-LOGICAL-a254t-b0cb0064d9c7d3cb0ed119b5ba69d8e58fe6f8e2382b1e790f32090b0c16ef373
IEDL.DBID	RIE
ISICitedReferencesCount	1
ISICitedReferencesURI	http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001414891300002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate	Wed Jan 01 06:01:57 EST 2025
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-a254t-b0cb0064d9c7d3cb0ed119b5ba69d8e58fe6f8e2382b1e790f32090b0c16ef373
PageCount	17
ParticipantIDs	ieee_primary_10793060
PublicationCentury	2000
PublicationDate	2024-Nov.-17
PublicationDateYYYYMMDD	2024-11-17
PublicationDate_xml	– month: 11 year: 2024 text: 2024-Nov.-17 day: 17
PublicationDecade	2020
PublicationTitle	SC24: International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev	SC
PublicationYear	2024
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssib057737303
SourceID	ieee
SourceType	Publisher
StartPage	1
URI	https://ieeexplore.ieee.org/document/10793060