A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 35, No. 8, pp. 1415-1428
Main Authors: Li, Shengwei, Lu, Kai, Lai, Zhiquan, Liu, Weijie, Ge, Keshi, Li, Dongsheng
Format: Journal Article
Language: English
Published: New York: IEEE, 01.08.2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1045-9219, 1558-2183
Description
Summary: Transformer-based deep neural network (DNN) models have achieved considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. As model parameter counts grow, hybrid parallel training becomes imperative for scaling, and the primary bottleneck remains communication overhead. Communication scheduling, which emphasizes overlapping communication with computation, has demonstrated its benefits for scaling. However, most existing works focus on data parallelism and overlook the nuances of hybrid parallel training. In this paper, we propose TriRace, an efficient communication scheduling framework for accelerating communication in hybrid parallel training that combines asynchronous pipeline parallelism with data parallelism. To achieve effective computation-communication overlap, TriRace introduces 3D communication scheduling, which leverages the data dependencies between communication and computation to efficiently schedule AllReduce communication, sparse communication, and peer-to-peer communication in hybrid parallel training. To avoid possible communication contention, TriRace also incorporates a topology-aware runtime that optimizes the execution of communication operations by considering ongoing communication operations and real-time network status. We have implemented a prototype of TriRace based on PyTorch and Pipedream-2BW, and conducted comprehensive evaluations against three representative baselines. Experimental results show that TriRace achieves a 1.07-1.45× speedup over the state-of-the-art pipeline parallelism training baseline Pipedream-2BW, and a 1.24-1.81× speedup over Megatron.
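
Illustrative note (not part of the record's abstract): the computation-communication overlap principle the abstract describes can be sketched with PyTorch's asynchronous collectives, where communication is launched early and synchronized only at the point its result is consumed. This is a minimal sketch of the general technique, not TriRace's scheduler or API; the function name train_step_with_overlap and its parameters (model_chunk, microbatches, grad_bucket, next_stage) are hypothetical names introduced for illustration.

import torch
import torch.distributed as dist

def train_step_with_overlap(model_chunk, microbatches, grad_bucket, next_stage=None):
    # Illustrative sketch only; assumes dist.init_process_group(...) was called.
    # 1. Launch the gradient AllReduce without blocking: async_op=True returns a
    #    Work handle immediately while the collective runs in the background.
    allreduce_work = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

    outputs = []
    p2p_work = None
    for x in microbatches:
        # 2. Overlap: this forward computation does not depend on grad_bucket,
        #    so it proceeds while the AllReduce is in flight.
        y = model_chunk(x)
        outputs.append(y)
        # 3. Pipeline peer-to-peer: send the activation to the next stage
        #    asynchronously, overlapping it with the next microbatch's compute.
        if next_stage is not None:
            if p2p_work is not None:
                p2p_work.wait()
            p2p_work = dist.isend(y, dst=next_stage)

    # 4. Synchronize only where the communication results are actually needed.
    if p2p_work is not None:
        p2p_work.wait()
    allreduce_work.wait()
    grad_bucket.div_(dist.get_world_size())
    return outputs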
DOI: 10.1109/TPDS.2024.3406420