A Dynamic Sliding Window Based Tensor Communication Scheduling Framework for Distributed Deep Learning

Bibliographic Details
Published in: IEEE Transactions on Network Science and Engineering, Vol. 12, No. 2, pp. 1080-1095
Main Authors: Gao, Yunqi, Hu, Bing, Mashhadi, Mahdi Boloursaz, Wang, Wei, Tafazolli, Rahim, Debbah, Merouane
Format: Journal Article
Language:English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.03.2025
ISSN: 2327-4697, 2334-329X
Description
Summary: Simultaneous tensor communication can effectively improve the scalability of distributed deep learning on large clusters. However, communicating a fixed number of tensor blocks concurrently violates the priority-based scheduling strategy and cannot minimize communication overheads. In this paper, we propose a novel simultaneous tensor communication framework, namely D-Credit, which transmits tensor blocks based on dynamic sliding windows to minimize per-iteration time in distributed DNN training. We build the mathematical model of D-Credit in two phases: (1) the overlap of gradient communication and backward propagation, and (2) the overlap of gradient communication and forward computation. We derive the optimal window sizes for the second phase analytically, and develop a greedy algorithm to efficiently determine the dynamic window sizes for the first phase of D-Credit. We implement the D-Credit architecture on the PyTorch framework. Experimental results on two different GPU clusters demonstrate that, in terms of training speed, D-Credit achieves up to 1.26x, 1.21x, 1.48x and 1.53x speedup over ByteScheduler, DeAR, PyTorch-DDP and WFBP, respectively. In terms of energy consumption, D-Credit saves up to 17.8% and 25.1% of the training energy compared to ByteScheduler and WFBP, respectively.
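To make the idea of dynamic-sliding-window tensor scheduling concrete, the following is a minimal, hedged Python sketch: it only illustrates transmitting gradient blocks in priority order while capping the number of blocks in flight with a per-step window size. All names (TensorBlock, schedule_blocks, the window schedule) are illustrative assumptions and do not reproduce the D-Credit algorithm or its PyTorch implementation described in the paper.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class TensorBlock:
    priority: int                     # lower value = needed earlier in the next forward pass
    name: str = field(compare=False)
    size_mb: float = field(compare=False)

def schedule_blocks(blocks, window_sizes):
    """Toy dynamic-sliding-window scheduler (illustrative only, not D-Credit).
    At scheduling step t, at most window_sizes[t] blocks may be in flight."""
    ready = list(blocks)
    heapq.heapify(ready)              # priority queue of blocks awaiting transmission
    in_flight, step, order = [], 0, []
    while ready:
        window = window_sizes[min(step, len(window_sizes) - 1)]
        # Launch transmissions until the current window is full.
        while ready and len(in_flight) < window:
            blk = heapq.heappop(ready)
            in_flight.append(blk)
            order.append((step, blk.name))
        in_flight.clear()             # simplifying assumption: in-flight blocks finish each step
        step += 1
    return order

if __name__ == "__main__":
    grads = [TensorBlock(priority=i, name=f"layer{i}.grad", size_mb=4.0)
             for i in range(8)]
    # Hypothetical window schedule: small windows early (respecting priority),
    # larger windows later (amortizing per-message overhead).
    print(schedule_blocks(grads, window_sizes=[1, 2, 4]))

Under these assumptions, growing the window over time mirrors the intuition in the abstract: strict priority matters most for the blocks needed first by the next forward pass, while larger windows later help hide per-message overhead.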
DOI: 10.1109/TNSE.2024.3523320