Cross-Search With Improved Multi-Dimensional Dichotomy-Based Joint Optimization for Distributed Parallel Training of DNN

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 36, No. 8, pp. 1680-1694
Main Authors: Zhou, Guangyao; Fu, Yiqin; Lan, Haocheng; Xie, Yuanlun; Tian, Wenhong; Buyya, Rajkumar; Qian, Jiahong; Su, Teng
Format: Journal Article
Language: English
Published: IEEE, 1 August 2025
ISSN: 1045-9219, 1558-2183
Abstract: Distributed parallel training of large-scale deep neural networks (DNN) has attracted the attention of both the artificial intelligence and high-performance distributed computing communities. One of the most efficient approaches is micro-batch-based pipeline parallelism (MBPP), e.g., GPipe and TeraPipe. Based on MBPP, we establish a time-cost model with a basic time function of layers, which considers computing time and communication time simultaneously and accounts for their nonlinear dependence on the amount of input data. Focusing on jointly optimal solutions for network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that improved multi-dimensional dichotomy (IMD) has appreciable theoretical optimality and linear computational complexity, significantly faster than state-of-the-art methods including dynamic programming and recursive algorithms. Extensive experiments on both CNN-based and transformer-based neural networks demonstrate that our proposed CSIMD can obtain optimal network division and data partition schemes under MBPP. On average, the training speeds of CSIMD on CNN- and transformer-based DNNs are $(2.0, 2.5)\times$ and $(2.66, 5.48)\times$ those of (MBPP-R, MBPP-E), respectively.
DOI:10.1109/TPDS.2025.3580098
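
For illustration only: the abstract describes dichotomy (binary search) over candidate bottleneck times as the core of dividing a network across pipeline stages. The minimal Python sketch below shows a one-dimensional version of that idea, binary search on the maximum per-stage time with a greedy feasibility check. All per-layer times here are hypothetical, and each layer's time is assumed fixed for a given micro-batch size; the paper's IMD additionally searches jointly over data-partition dimensions and models nonlinear time functions of the input data amount, which this sketch omits.

```python
from typing import List


def feasible(times: List[float], num_stages: int, limit: float) -> bool:
    """Greedily pack consecutive layers into stages whose summed time
    stays within `limit`; feasible if at most `num_stages` stages suffice."""
    stages, current = 1, 0.0
    for t in times:
        if t > limit:
            return False  # a single layer already exceeds the limit
        if current + t > limit:
            stages += 1   # open a new pipeline stage
            current = t
        else:
            current += t
    return stages <= num_stages


def dichotomy_partition(times: List[float], num_stages: int,
                        tol: float = 1e-6) -> float:
    """Binary search (dichotomy) on the bottleneck stage time: the answer
    lies between the largest single-layer time and the total time."""
    lo, hi = max(times), sum(times)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(times, num_stages, mid):
            hi = mid  # limit is achievable; try a tighter one
        else:
            lo = mid  # limit is too tight; relax it
    return hi


if __name__ == "__main__":
    # Hypothetical per-layer times (compute + communication) for one micro-batch.
    layer_times = [3.0, 1.5, 2.2, 4.1, 0.9, 2.7, 1.3, 3.6]
    print(dichotomy_partition(layer_times, num_stages=4))
```

Because each dichotomy iteration halves the search interval and the feasibility check is a single linear scan over the layers, this search runs in near-linear time, which is consistent with the complexity advantage the abstract claims for IMD over dynamic programming and recursive alternatives.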