Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters
| Published in: | CCF Transactions on High Performance Computing (Online), Volume 3, Issue 2, pp. 171-185 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Singapore: Springer Singapore, 01.06.2021 (Springer Nature B.V.) |
| Subjects: | |
| ISSN: | 2524-4922, 2524-4930 |
| Online access: | Get full text |
| Summary: | Distributed deep neural network (DDNN) training becomes increasingly compelling as the DNN model gets complex and the dataset grows large. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the *co-located* Parameter Server (PS) configuration is not uncommon in production DDNN training clusters, which inevitably causes intense network resource contention among the co-located PS and worker tasks. Our motivation experiments on Amazon EC2 further show that such network resource contention brings severe performance variation to DDNN training jobs. While existing works largely mitigate the *inter-job* network resource contention, the *intra-job* (i.e., task-level) network resource contention among the co-located PS and worker tasks has received comparably little attention. To tackle such performance issues, in this paper, we design and implement Nebula, a **Ne**twork **b**andwidth reso**u**rce al**l**oc**a**tion strategy for DDNN training tasks, in order to mitigate the network resource contention and alleviate the performance variation of DDNN training jobs. Nebula monitors the *weights* of the co-located PS and worker tasks and rations the network bandwidth resources for the two tasks by comparing the corresponding task weights. We implement a prototype of Nebula and conduct extensive prototype experiments with representative DNN models trained on Amazon EC2. Our experiment results demonstrate that Nebula can reduce the iteration time of a DDNN training job by up to 25% and improve the cluster resource utilization by up to 30% in comparison to MXNet, yet with practically acceptable runtime overhead. |
| DOI: | 10.1007/s42514-021-00064-x |
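
The abstract gives only a high-level picture of the mechanism: a monitor observes per-task *weights* for the co-located PS and worker tasks and rations the host's bandwidth by comparing them. The Python sketch below illustrates one plausible reading, weight-proportional splitting of a fixed NIC capacity; the `Task` and `ration_bandwidth` names, the example weight values, and the note about `tc`-based enforcement are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: weight-proportional bandwidth rationing between a
# co-located parameter-server (PS) task and a worker task on one host.
# The weight definition and any enforcement mechanism are assumptions;
# the abstract only states that per-task weights are compared.

from dataclasses import dataclass


@dataclass
class Task:
    name: str      # e.g. "ps" or "worker"
    weight: float  # relative demand/importance reported by a monitor


def ration_bandwidth(tasks, link_capacity_mbps):
    """Split the NIC capacity across co-located tasks in proportion to
    their weights, returning a per-task rate cap in Mbit/s."""
    total = sum(t.weight for t in tasks)
    if total == 0:
        # No demand information yet: fall back to an even split.
        return {t.name: link_capacity_mbps / len(tasks) for t in tasks}
    return {t.name: link_capacity_mbps * t.weight / total for t in tasks}


if __name__ == "__main__":
    # Example: the PS task currently moves twice as much traffic as the
    # local worker, so it receives two thirds of a 10 Gbit/s link.
    tasks = [Task("ps", weight=2.0), Task("worker", weight=1.0)]
    caps = ration_bandwidth(tasks, link_capacity_mbps=10_000)
    for name, cap in caps.items():
        # A real system could enforce these caps with, e.g., Linux tc HTB
        # classes keyed on each task's ports (not shown here).
        print(f"{name}: cap at {cap:.0f} Mbit/s")
```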