Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters

Distributed deep neural network (DDNN) training has become increasingly compelling as DNN models grow more complex and datasets grow larger. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in produc...


Bibliographic Details
Published in: CCF Transactions on High Performance Computing (Online), Vol. 3, No. 2, pp. 171-185
Main Authors: Qi, Qiang; Xu, Fei; Chen, Li; Zhou, Zhi
Format: Journal Article
Language: English
Published: Singapore: Springer Singapore (Springer Nature B.V.), 01.06.2021
ISSN: 2524-4922, 2524-4930