Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Machine learning models are distributed across multiple nodes using numerous parallelism strategies. The resulting collective communication is often on the critical path due to a lack of independent coarse-grain computation kernels available to execute. In this work, we propose fusing computation wi...

Bibliographic details
Published in:	SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17
Main authors:	Punniyamurthy, Kishore; Hamidouche, Khaled; Beckmann, Bradford M.
Format:	Conference paper
Language:	English
Published:	IEEE, 17 November 2024
Subjects:	Collective communication; Computational modeling; distributed ML; DLRM; GPU; Graphics processing units; Instruction sets; Kernel; MoE; Parallel processing; Processor scheduling; Prototypes; Streams; Training; Transformers