MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14~32% of all GPU h...

Bibliographic Details
Published in:	2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 818-833
Main Authors:	Hsia, Samuel; Golden, Alicia; Acun, Bilge; Ardalani, Newsha; DeVito, Zachary; Wei, Gu-Yeon; Brooks, David; Wu, Carole-Jean
Format:	Conference Proceeding
Language:	English
Published:	IEEE 29.06.2024
Subjects:	Analytical models; Computational modeling; Computer architecture; Costs; Data Center; Distributed Inference; Distributed Training; GPU; Graphics processing units; Hardware-Software Co-Design; Machine learning; Parallelization; Performance Model; Simulator; Training