MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructure, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructure, reveals that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. The framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. By applying MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput improvements of up to 2.24× for pretraining and up to 5.27× for inference.
| Published in: | 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 818-833 |
|---|---|
| Main Authors: | Samuel Hsia (FAIR at Meta), Alicia Golden (FAIR at Meta), Bilge Acun (FAIR at Meta), Newsha Ardalani (FAIR at Meta), Zachary DeVito (FAIR at Meta), Gu-Yeon Wei (Harvard University), David Brooks (Harvard University), Carole-Jean Wu (FAIR at Meta) |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 29.06.2024 |
| Subjects: | Analytical models; Computational modeling; Computer architecture; Costs; Data Center; Distributed Inference; Distributed Training; GPU; Graphics processing units; Hardware-Software Co-Design; Machine learning; Parallelization; Performance Model; Simulator; Training |
| Online Access: | https://ieeexplore.ieee.org/document/10609708 |
| DOI: | 10.1109/ISCA59077.2024.00064 |
| EISBN: | 9798350326581 |
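The abstract describes MAD-Max as an agile, analytical performance model used to search over parallelization strategies and to expose communication time that is not overlapped with computation. As a rough illustration of that idea only (not the authors' implementation), the sketch below builds a toy roofline-style cost model in Python and sweeps data-/tensor-parallel factorizations to pick the highest-throughput configuration. All names (`Cluster`, `Model`, `step_time`, `best_strategy`), constants, and the `overlap` fraction are hypothetical assumptions chosen for clarity.

```python
# A minimal, hypothetical sketch of an analytical performance model in the
# spirit of MAD-Max; it is NOT the paper's implementation, and every constant
# and formula below is an assumption made for illustration.
from dataclasses import dataclass


@dataclass
class Cluster:
    num_gpus: int
    flops_per_gpu: float  # sustained FLOP/s per GPU (assumed)
    link_bw: float        # effective all-reduce bandwidth per GPU, bytes/s (assumed)


@dataclass
class Model:
    params: float           # parameter count
    flops_per_token: float  # training FLOPs per token (~6 * params for dense LMs)


def step_time(cluster: Cluster, model: Model, dp: int, tp: int,
              tokens_per_batch: int, overlap: float = 0.5) -> float:
    """Per-step time = compute time + communication not hidden under compute.

    dp, tp: data- and tensor-parallel degrees; dp * tp must equal num_gpus.
    overlap: assumed fraction of communication overlapped with computation.
    """
    assert dp * tp == cluster.num_gpus
    compute = (tokens_per_batch * model.flops_per_token
               / (cluster.num_gpus * cluster.flops_per_gpu))
    # Gradient all-reduce over the data-parallel group: ~2 bytes/param (bf16),
    # and a ring all-reduce moves ~2 * (dp - 1) / dp of each GPU's shard.
    grad_bytes = 2 * model.params / tp
    comm = 2 * (dp - 1) / dp * grad_bytes / cluster.link_bw
    exposed = (1.0 - overlap) * comm  # the non-overlapped communication time
    return compute + exposed


def best_strategy(cluster: Cluster, model: Model, tokens_per_batch: int):
    """Sweep (dp, tp) factorizations and return the highest-throughput one."""
    best = None
    for tp in (1, 2, 4, 8):
        if cluster.num_gpus % tp:
            continue
        dp = cluster.num_gpus // tp
        throughput = tokens_per_batch / step_time(
            cluster, model, dp, tp, tokens_per_batch)
        if best is None or throughput > best[2]:
            best = (dp, tp, throughput)
    return best


if __name__ == "__main__":
    cluster = Cluster(num_gpus=128, flops_per_gpu=300e12, link_bw=100e9)
    model = Model(params=70e9, flops_per_token=6 * 70e9)
    dp, tp, tput = best_strategy(cluster, model, tokens_per_batch=4_000_000)
    print(f"best: dp={dp}, tp={tp}, ~{tput:.3g} tokens/s")
```

Counting only the exposed (non-hidden) communication as added latency mirrors the abstract's observation that 14-32% of GPU hours are spent on communication with no overlapping computation; a real framework would also model pipeline parallelism, memory capacity limits, and inference-serving scenarios.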