Cross-Search With Improved Multi-Dimensional Dichotomy-Based Joint Optimization for Distributed Parallel Training of DNN

Detailed Bibliography
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 36, No. 8, pp. 1680-1694
Main Authors: Zhou, Guangyao; Fu, Yiqin; Lan, Haocheng; Xie, Yuanlun; Tian, Wenhong; Buyya, Rajkumar; Qian, Jiahong; Su, Teng
Format: Journal Article
Language: English
Published: IEEE, 01.08.2025
ISSN: 1045-9219, 1558-2183
Description
Summary: Distributed parallel training of large-scale deep neural networks (DNNs) has attracted attention from both the artificial intelligence and high-performance distributed computing communities. One efficient approach is micro-batch-based pipeline parallelism (MBPP), e.g., GPipe and Terapipe. Based on MBPP, we establish a time-cost model built on a basic time function of layers, which considers computing time and communication time simultaneously and treats both as nonlinear in the amount of input data. Focusing on jointly optimal solutions for network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that the improved multi-dimensional dichotomy (IMD) achieves appreciable theoretical optimality with linear computational complexity, significantly faster than state-of-the-art methods including dynamic programming and recursive algorithms. Extensive experiments on both CNN-based and transformer-based neural networks demonstrate that our proposed CSIMD can obtain optimal network division and data partition schemes under MBPP. On average, the training speeds of CSIMD on CNN- and transformer-based DNNs are respectively $(2.0, 2.5)\times$ and $(2.66, 5.48)\times$ those of (MBPP-R, MBPP-E).
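The core idea named in the abstract is a dichotomy (binary-search) procedure over the joint space of network division and data partition. The following Python sketch is a deliberately simplified, one-dimensional illustration of that idea, not the paper's actual IMD/CSIMD algorithm: it binary-searches the pipeline bottleneck time to split a model with assumed per-layer time costs into a fixed number of contiguous stages. All function names and cost values here are hypothetical.

from typing import List

def can_partition(costs: List[float], num_stages: int, limit: float) -> bool:
    """Greedy feasibility check: can the layers be split into at most
    num_stages contiguous stages, each with total cost <= limit?"""
    stages, current = 1, 0.0
    for c in costs:
        if c > limit:
            return False  # a single layer already exceeds the limit
        if current + c > limit:
            stages += 1   # close the current stage, open a new one
            current = c
            if stages > num_stages:
                return False
        else:
            current += c
    return True

def min_bottleneck(costs: List[float], num_stages: int, eps: float = 1e-6) -> float:
    """Dichotomy on the bottleneck time T: the smallest T admitting a
    feasible partition is the optimal pipeline bottleneck."""
    lo, hi = max(costs), sum(costs)  # bottleneck lies in [max cost, total cost]
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if can_partition(costs, num_stages, mid):
            hi = mid  # feasible: try a tighter bottleneck
        else:
            lo = mid  # infeasible: relax the bottleneck
    return hi

if __name__ == "__main__":
    # Assumed per-layer time costs (compute + communication) for a toy 8-layer model.
    layer_costs = [3.0, 1.5, 2.0, 4.0, 2.5, 1.0, 3.5, 2.0]
    print(min_bottleneck(layer_costs, num_stages=4))  # converges to ~6.0 here

Because the greedy feasibility check is exact for contiguous partitions, the dichotomy converges to the minimum achievable bottleneck time for this simplified setting; per the abstract, the paper's IMD generalizes this kind of search to multiple dimensions so that network division and data partition are optimized jointly, while keeping linear complexity.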
DOI: 10.1109/TPDS.2025.3580098