Cross-Search With Improved Multi-Dimensional Dichotomy-Based Joint Optimization for Distributed Parallel Training of DNN

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 36, No. 8, pp. 1680-1694
Main Authors: Zhou, Guangyao; Fu, Yiqin; Lan, Haocheng; Xie, Yuanlun; Tian, Wenhong; Buyya, Rajkumar; Qian, Jiahong; Su, Teng
Format: Journal Article
Language: English
Published: IEEE, 01.08.2025
ISSN: 1045-9219, 1558-2183
Description
Summary: Distributed parallel training of large-scale deep neural networks (DNN) has attracted attention from both the artificial intelligence and high-performance distributed computing communities. One efficient approach is micro-batch-based pipeline parallelism (MBPP), e.g., GPipe and TeraPipe. Based on MBPP, we establish a time-cost model built on a basic time function of layers, which accounts for computing time and communication time simultaneously and models both as nonlinear functions of the amount of input data. Focusing on jointly optimal solutions for network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that improved multi-dimensional dichotomy (IMD) achieves appreciable theoretical optimality with linear computational complexity, significantly faster than state-of-the-art methods including dynamic programming and recursive algorithms. Extensive experiments on both CNN-based and transformer-based neural networks demonstrate that our proposed CSIMD obtains optimal network division and data partition schemes under MBPP. On average, the training speeds of CSIMD on CNN- and transformer-based DNNs are $(2.0, 2.5)\times$ and $(2.66, 5.48)\times$ those of (MBPP-R, MBPP-E), respectively.
DOI: 10.1109/TPDS.2025.3580098
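
Illustrative sketch: The paper's IMD jointly searches network division and data partition under nonlinear compute and communication costs; the abstract alone does not specify its procedure. As a minimal sketch of the underlying dichotomy idea in one dimension only, the Python below binary-searches the smallest achievable bottleneck stage time when dividing a layer sequence into contiguous pipeline stages. The function names (feasible, divide_by_dichotomy), the greedy feasibility test, and the per-layer costs are all hypothetical illustrations, not the authors' CSIMD implementation.

# One-dimensional dichotomy (binary search) for pipeline stage division.
# Illustration of the dichotomy idea only -- NOT the paper's CSIMD/IMD,
# which jointly optimizes network division and micro-batch data partition.

from typing import List

def feasible(costs: List[float], stages: int, limit: float) -> bool:
    """Greedily check whether `costs` can be split into at most `stages`
    contiguous groups, each with total cost <= `limit`."""
    used, acc = 1, 0.0
    for c in costs:
        if c > limit:
            return False          # a single layer already exceeds the limit
        if acc + c > limit:
            used += 1             # close the current stage, open a new one
            acc = c
            if used > stages:
                return False
        else:
            acc += c
    return True

def divide_by_dichotomy(costs: List[float], stages: int, eps: float = 1e-6) -> float:
    """Binary-search the smallest per-stage time bound T such that the layers
    can be divided into `stages` contiguous pipeline stages whose bottleneck
    (maximum stage time) does not exceed T."""
    lo, hi = max(costs), sum(costs)   # T is bracketed by these two bounds
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if feasible(costs, stages, mid):
            hi = mid              # bound is achievable; try a tighter one
        else:
            lo = mid              # infeasible; relax the bound
    return hi

if __name__ == "__main__":
    # Hypothetical per-layer time costs (compute + communication), in ms.
    layer_costs = [3.0, 1.5, 2.2, 4.1, 0.9, 2.7, 3.3, 1.1]
    print(divide_by_dichotomy(layer_costs, stages=4))  # bottleneck stage time

Each feasibility check is a single linear pass over the layers, which is the flavor of per-step cost behind dichotomy-style searches; the paper's IMD presumably extends this search across multiple dimensions to cover network division and data partition jointly.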