MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Bibliographic Details
Published in:	2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 818-833
Main Authors:	Hsia, Samuel; Golden, Alicia; Acun, Bilge; Ardalani, Newsha; DeVito, Zachary; Wei, Gu-Yeon; Brooks, David; Wu, Carole-Jean
Format:	Conference Paper
Language:	English
Published:	IEEE, 29 June 2024
Subjects:	Analytical models; Computational modeling; Computer architecture; Costs; Data Center; Distributed Inference; Distributed Training; GPU; Graphics processing units; Hardware-Software Co-Design; Machine learning; Parallelization; Performance Model; Simulator; Training
Online Access:	Full text

Description
Summary:	Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. Through the application of MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput enhancements of up to 2.24× for pretraining and up to 5.27× for inference scenarios.
DOI:	10.1109/ISCA59077.2024.00064
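
Note: The summary reports that 14-32% of GPU hours go to communication that is not overlapped with computation. As a rough illustration of what that share means, the sketch below computes the ideal speedup if that exposed communication alone were fully hidden, using the simple bound 1 / (1 - f). The function name and the bound are illustrative assumptions and are not taken from MAD-Max or the paper. The throughput gains quoted in the summary also reflect changes to parallelization strategy and hardware-software co-design, so they are not limited by this figure.

```python
def exposed_comm_speedup_bound(exposed_fraction: float) -> float:
    """Ideal speedup if all exposed (non-overlapped) communication time were hidden.

    exposed_fraction is the share of total GPU time spent on communication
    with no overlapping computation. Hypothetical helper, not from MAD-Max.
    """
    if not 0.0 <= exposed_fraction < 1.0:
        raise ValueError("exposed_fraction must be in [0, 1)")
    # Remaining time after hiding communication is (1 - f) of the original.
    return 1.0 / (1.0 - exposed_fraction)


if __name__ == "__main__":
    # The two ends of the range quoted in the summary.
    for f in (0.14, 0.32):
        bound = exposed_comm_speedup_bound(f)
        print(f"{f:.0%} exposed communication -> at most {bound:.2f}x")
```

For the quoted range, this gives about 1.16x at 14% and about 1.47x at 32%.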