Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters


Bibliographic Details
Published in: CCF Transactions on High Performance Computing (Online), Vol. 3, No. 2, pp. 171-185
Main authors: Qi, Qiang; Xu, Fei; Chen, Li; Zhou, Zhi
Format: Journal Article
Language: English
Published: Singapore: Springer Singapore, 01.06.2021
Springer Nature B.V.
ISSN: 2524-4922, 2524-4930
Online access: Full text
Description
Abstract: Distributed deep neural network (DDNN) training becomes increasingly compelling as DNN models grow more complex and datasets grow larger. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in production DDNN training clusters, which inevitably causes intense network resource contention among the co-located PS and worker tasks. Our motivation experiments on Amazon EC2 further show that such network resource contention brings severe performance variation to DDNN training jobs. While existing works largely mitigate inter-job network resource contention, the intra-job (i.e., task-level) network resource contention among co-located PS and worker tasks has received comparatively little attention. To tackle such performance issues, in this paper we design and implement Nebula, a Network bandwidth resource allocation strategy for DDNN training tasks, in order to mitigate the network resource contention and alleviate the performance variation of DDNN training jobs. Nebula monitors the weights of the co-located PS and worker tasks and rations the network bandwidth between the two tasks by comparing their corresponding task weights. We implement a prototype of Nebula and conduct extensive prototype experiments with representative DNN models trained on Amazon EC2. Our experiment results demonstrate that Nebula can reduce the iteration time of a DDNN training job by up to 25% and improve cluster resource utilization by up to 30% in comparison to MXNet, yet with practically acceptable runtime overhead.
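The abstract describes the core mechanism only at a high level: bandwidth is rationed between the co-located PS and worker tasks in proportion to their observed task weights. The following is a minimal Python sketch of that idea under stated assumptions; the task names, the LINK_CAPACITY_MBPS constant, and the proportional-split formula are illustrative choices, not the authors' actual Nebula implementation.

# Sketch: weight-proportional bandwidth rationing between a co-located
# parameter-server (PS) task and a worker task sharing one NIC.
# All names and constants below are assumptions for illustration.

from dataclasses import dataclass

LINK_CAPACITY_MBPS = 10_000  # assumed capacity of the shared NIC


@dataclass
class Task:
    name: str      # e.g. "ps" or "worker"
    weight: float  # relative weight observed for the task


def ration_bandwidth(ps: Task, worker: Task, capacity: float = LINK_CAPACITY_MBPS) -> dict:
    """Split the shared link capacity in proportion to the two task weights."""
    total = ps.weight + worker.weight
    if total <= 0:
        # Degenerate case: fall back to an even split.
        return {ps.name: capacity / 2, worker.name: capacity / 2}
    return {
        ps.name: capacity * ps.weight / total,
        worker.name: capacity * worker.weight / total,
    }


if __name__ == "__main__":
    # Example: the PS task is observed to be twice as "heavy" as the worker.
    shares = ration_bandwidth(Task("ps", weight=2.0), Task("worker", weight=1.0))
    for name, mbps in shares.items():
        print(f"{name}: cap at {mbps:.0f} Mbps")
        # In a real deployment the cap would be enforced by a rate limiter
        # (e.g. Linux traffic control); that enforcement step is omitted here.

Proportional sharing is used here simply because it matches the abstract's description of "comparing the corresponding task weights"; how Nebula measures those weights and enforces the resulting shares is detailed in the paper itself.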
DOI: 10.1007/s42514-021-00064-x