MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems


Bibliographic Details
Published in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 818-833
Main Authors: Hsia, Samuel, Golden, Alicia, Acun, Bilge, Ardalani, Newsha, DeVito, Zachary, Wei, Gu-Yeon, Brooks, David, Wu, Carole-Jean
Format: Conference Proceeding
Language: English
Published: IEEE, 29 June 2024
Online Access: https://ieeexplore.ieee.org/document/10609708
Abstract Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. Through the application of MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput enhancements of up to 2.24x for pretraining and up to 5.27x for inference scenarios.
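Note on the modeling approach: the abstract's exposed-communication figure and throughput gains come from analytical performance modeling of parallelization strategies. As a rough illustration only (a minimal Python sketch; every constant and name below is hypothetical and not taken from the paper), one can estimate step time as compute time plus the non-overlapped share of communication, then sweep parallelization degrees under a fixed GPU budget:

# Hypothetical analytical model of one training step on a fixed 64-GPU
# budget; all constants below are illustrative assumptions.
NUM_GPUS = 64

def step_time(tp,
              total_flops=4e15,   # FLOPs per global batch (assumed)
              model_bytes=140e9,  # fp16 weights/grads of a ~70B model
              act_bytes=50e9,     # activation bytes exchanged per step (toy)
              peak_flops=300e12,  # sustained per-GPU throughput, FLOP/s
              dp_bw=150e9,        # inter-node all-reduce bandwidth, B/s
              tp_bw=600e9,        # intra-node (NVLink-class) bandwidth, B/s
              overlap=0.7):       # fraction of comm hidden behind compute
    dp = NUM_GPUS // tp
    t_compute = total_flops / NUM_GPUS / peak_flops
    # Data-parallel gradient all-reduce: a ring moves ~2*(dp-1)/dp of the
    # tensor-sharded gradient bytes over the slower cross-node links.
    t_dp = 2 * (dp - 1) / dp * (model_bytes / tp) / dp_bw
    # Tensor-parallel activation traffic grows with the TP degree.
    t_tp = (tp - 1) * act_bytes / tp_bw
    # Only the non-overlapped fraction of communication adds latency.
    return t_compute + (1.0 - overlap) * (t_dp + t_tp)

# Sweep tensor-parallel degrees and keep the fastest configuration.
best_tp = min((1, 2, 4, 8), key=step_time)
print(f"best: tp={best_tp}, dp={NUM_GPUS // best_tp}, "
      f"step={step_time(best_tp):.3f}s")

In this toy the optimum balances cross-node gradient all-reduces against intra-node tensor-parallel traffic; MAD-Max performs this kind of search over far richer hardware and parallelization spaces.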
Authors and Affiliations:
  1. Samuel Hsia, FAIR at Meta (shsia@g.harvard.edu)
  2. Alicia Golden, FAIR at Meta
  3. Bilge Acun, FAIR at Meta
  4. Newsha Ardalani, FAIR at Meta
  5. Zachary DeVito, FAIR at Meta
  6. Gu-Yeon Wei, Harvard University
  7. David Brooks, Harvard University
  8. Carole-Jean Wu, FAIR at Meta (carolejeanwu@meta.com)
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ISCA59077.2024.00064
EISBN 9798350326581
EndPage 833
ExternalDocumentID 10609708
Genre orig-research
ISICitedReferencesCount 3
PageCount 16
PublicationDate 2024-06-29
PublicationTitleAbbrev ISCA
Publisher IEEE
StartPage 818
SubjectTerms Analytical models
Computational modeling
Computer architecture
Costs
Data Center
Distributed Inference
Distributed Training
GPU
Graphics processing units
Hardware-Software Co-Design
Machine learning
Parallelization
Performance Model
Simulator
Training
URI https://ieeexplore.ieee.org/document/10609708