Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Machine learning models are distributed across multiple nodes using numerous parallelism strategies. The resulting collective communication is often on the critical path due to a lack of independent coarse-grain computation kernels available to execute. In this work, we propose fusing computation wi...

Bibliographic Details
Published in:	SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17
Main authors:	Punniyamurthy, Kishore; Hamidouche, Khaled; Beckmann, Bradford M.
Format:	Conference proceeding
Language:	English
Published:	IEEE, 17 November 2024
Subjects:	Collective communication; Computational modeling; distributed ML; DLRM; GPU; Graphics processing units; Instruction sets; Kernel; MoE; Parallel processing; Processor scheduling; Prototypes; Streams; Training; Transformers
Online access:	Full text