Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters
Distributed deep neural network (DDNN) training becomes increasingly compelling as DNN models grow more complex and datasets grow larger. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in produc...
| Published in: | CCF transactions on high performance computing (Online) Vol. 3; no. 2; pp. 171 - 185 |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | Singapore: Springer Singapore, 01.06.2021; Springer Nature B.V |
| ISSN: | 2524-4922, 2524-4930 |