Optimal distributed parallel algorithms for deep learning framework Tensorflow
| Published in: | Applied Intelligence (Dordrecht, Netherlands), Vol. 52, Issue 4, pp. 3880-3900 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: Springer US, 01.03.2022 (Springer Nature B.V.) |
| Subjects: | |
| ISSN: | 0924-669X, 1573-7497 |
| Online access: | Full text |
| Abstract: | Since its release, the Tensorflow framework has been widely used in various fields due to its advantages in deep learning. However, it is still in an early stage. Its native distributed implementation scales poorly to large models because of low utilization of multiple GPUs and slow distributed training compared with running on a single machine. Reducing training time through parallel models is therefore of great significance. In view of this, we first provide an in-depth analysis of the implementation principles of Tensorflow and identify the bottlenecks of its native distributed parallel models. Then, two optimized algorithms are designed and implemented based on Tensorflow's data parallelism and model parallelism modes. For data parallelism, the proposed algorithm replaces the native linear execution mode with a pipeline execution mode. For model parallelism, the native random partitioning mode is replaced by our proposed greedy algorithm. Finally, we built a homogeneous distributed cluster and a heterogeneous distributed cluster to verify the effectiveness of the proposed algorithms. Through a number of comparative experiments, we show that the proposed parallel algorithms reduce model training time by an average of 26.5% (an average 1.5x speedup over the native distributed algorithms) and improve cluster utilization while keeping the same accuracy level as native Tensorflow. |
| DOI: | 10.1007/s10489-021-02588-9 |
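
The abstract contrasts the native linear execution mode with a pipeline execution mode for data parallelism, but this record gives no implementation detail. The sketch below is not the paper's algorithm; it only illustrates the general pipelining idea in TensorFlow 2.x, overlapping the input stage with replica computation via `tf.data` prefetching under a synchronous `MirroredStrategy`. The model, synthetic data, and hyperparameters are placeholders.

```python
# Illustrative sketch only: pipelined input feeding + synchronous data parallelism.
# NOT the paper's implementation; it shows the idea of overlapping the input
# stage with computation instead of running the two stages serially.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

def make_dataset(batch_size):
    # Synthetic data stands in for the real training set (assumption).
    features = tf.random.normal([1024, 32])
    labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    ds = ds.shuffle(1024)
    # Stand-in for decoding/augmentation, run in parallel with training steps.
    ds = ds.map(lambda x, y: (x + tf.random.normal(tf.shape(x), stddev=0.01), y),
                num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # Prefetch the next batches while the replicas compute the current step.
    return ds.prefetch(tf.data.AUTOTUNE)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

model.fit(make_dataset(batch_size=64 * strategy.num_replicas_in_sync), epochs=2)
```

Without the `map(..., num_parallel_calls=...)` and `prefetch` calls, input preparation and training alternate serially, which is the kind of linear execution the abstract says its pipeline mode avoids.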
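
For model parallelism, the abstract only states that random partitioning is replaced by a greedy algorithm; the cost model and partitioning criterion are not given here. The sketch below therefore assumes one common greedy heuristic, assigning the most expensive layer first to the currently least-loaded device, with hypothetical per-layer costs, and shows how such a placement could be applied with `tf.device`.

```python
# Illustrative sketch only: a greedy load-balancing placement for model
# parallelism. The per-layer costs, largest-first ordering, and
# least-loaded-device rule are assumptions, not the paper's actual algorithm.
import tensorflow as tf

def greedy_partition(layer_costs, devices):
    """Assign each layer to the device with the smallest accumulated cost."""
    load = {d: 0.0 for d in devices}
    placement = {}
    # Place expensive layers first so the load stays balanced (LPT-style greedy).
    for name, cost in sorted(layer_costs.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)
        placement[name] = target
        load[target] += cost
    return placement

# Hypothetical per-layer costs (e.g., estimated FLOPs or measured step times).
layer_costs = {"embedding": 4.0, "lstm_1": 9.0, "lstm_2": 9.0, "classifier": 2.0}

# Use whatever accelerators are visible; fall back to the CPU so the sketch runs anywhere.
devices = [d.name for d in tf.config.list_logical_devices("GPU")] or ["/device:CPU:0"]
placement = greedy_partition(layer_costs, devices)
print(placement)

# Ops belonging to a layer can then be pinned to its assigned device.
with tf.device(placement["classifier"]):
    logits = tf.keras.layers.Dense(10)(tf.random.normal([8, 64]))
```

Compared with random partitioning, a greedy assignment of this kind tends to even out per-device load, which is consistent with the utilization improvement the abstract reports, though the paper's actual criterion may differ.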