Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters

Distributed deep neural network (DDNN) training has become increasingly compelling as DNN models grow more complex and datasets grow larger. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in produc...


Bibliographic Details
Published in: CCF Transactions on High Performance Computing (Online), Vol. 3, No. 2, pp. 171-185
Main Authors: Qi, Qiang; Xu, Fei; Chen, Li; Zhou, Zhi
Format: Journal Article
Language: English
Published: Singapore: Springer Singapore (Springer Nature B.V.), 01.06.2021
ISSN: 2524-4922, 2524-4930