Communication Algorithm-Architecture Co-Design for Distributed Deep Learning

Bibliographic Details
Published in: Proceedings - International Symposium on Computer Architecture, pp. 181-194
Main Authors: Huang, Jiayi; Majumder, Pritam; Kim, Sungkeun; Muzahid, Abdullah; Yum, Ki Hwan; Kim, Eun Jung
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2021
ISSN: 2575-713X
Description
Summary: Large-scale distributed deep learning training has enabled the development of more complex deep neural network models that learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency of widely used all-reduce algorithms and the opportunity for algorithm-architecture co-design. We propose the MultiTree all-reduce algorithm with topology and resource-utilization awareness for efficient and scalable all-reduce operations, applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate all-reduce messages for contention-free communication, working in synergy with the algorithm. The flow control is also simplified to exploit the bulk data transfer of large gradient exchanges. We evaluate the co-design with different all-reduce data sizes in a synthetic study, demonstrating its effectiveness on various interconnection network topologies, as well as with state-of-the-art deep neural networks in real-workload experiments. The results show that MultiTree achieves 2.3× and 1.56× communication speedup, as well as up to 81% and 30% training-time reduction, compared to ring all-reduce and state-of-the-art approaches, respectively.
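
For context, the ring all-reduce that the abstract uses as the baseline runs in two phases, a reduce-scatter followed by an all-gather, each taking P-1 steps on a ring of P workers. The sketch below is a minimal NumPy simulation of that baseline only; the worker count, chunking scheme, and the ring_allreduce helper are illustrative assumptions, and this is not the paper's MultiTree algorithm or its network-interface co-design.

```python
# Minimal simulation of the ring all-reduce baseline (illustrative sketch, not MultiTree).
import numpy as np

def ring_allreduce(grads):
    """Sum-reduce per-worker gradient vectors via simulated ring all-reduce."""
    P = len(grads)
    # Split each worker's gradient into P chunks (indices aligned across workers).
    chunks = [np.array_split(np.asarray(g, dtype=np.float64), P) for g in grads]

    # Phase 1: reduce-scatter (P-1 steps). At step t, worker i sends chunk
    # (i - t) mod P to its right neighbour, which accumulates it.
    for t in range(P - 1):
        sends = [chunks[i][(i - t) % P].copy() for i in range(P)]  # snapshot before updates
        for i in range(P):
            src = (i - 1) % P               # left neighbour on the ring
            c = (i - t - 1) % P             # the chunk src just sent
            chunks[i][c] += sends[src]
    # Now worker i holds the fully reduced chunk (i + 1) mod P.

    # Phase 2: all-gather (P-1 steps). Reduced chunks circulate and overwrite
    # stale local copies until every worker has every reduced chunk.
    for t in range(P - 1):
        sends = [chunks[i][(i + 1 - t) % P].copy() for i in range(P)]
        for i in range(P):
            src = (i - 1) % P
            c = (i - t) % P                 # the chunk src just sent
            chunks[i][c] = sends[src]

    return [np.concatenate(c) for c in chunks]

# Usage: 4 simulated workers; each ends up with the elementwise sum of all gradients.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    local = [rng.standard_normal(10) for _ in range(4)]
    reduced = ring_allreduce(local)
    assert all(np.allclose(r, sum(local)) for r in reduced)
```

Each worker sends and receives 2(P-1) chunks in sequence, so the ring's latency grows linearly with the number of workers, which is the scaling behaviour the co-designed MultiTree approach targets.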
DOI: 10.1109/ISCA52012.2021.00023