Optimal distributed parallel algorithms for deep learning framework Tensorflow

Bibliographic Details
Published in: Applied Intelligence (Dordrecht, Netherlands), Vol. 52, No. 4, pp. 3880-3900
Main Authors: Xie, Yuanlun; He, Majun; Ma, Tingsong; Tian, Wenhong
Format: Journal Article
Language: English
Published: New York: Springer US, 01.03.2022 (Springer Nature B.V.)
ISSN: 0924-669X, 1573-7497
Description
Summary: Since its release, the TensorFlow framework has been widely used in many fields due to its advantages in deep learning. However, it is still at an early stage: its native distributed implementation scales poorly to large models because of low utilization of multiple GPUs and slow distributed execution compared with running on a single machine. Reducing training time through parallel models is therefore of great significance. In view of this, we first provide an in-depth analysis of the implementation principles of TensorFlow and identify the bottlenecks in its native distributed parallel modes that can be improved. Then, two optimized algorithms are designed and implemented based on TensorFlow's data parallelism and model parallelism modes. For data parallelism, the proposed algorithm replaces the native linear execution mode with a pipelined execution mode. For model parallelism, the native random partitioning mode is replaced by our proposed greedy algorithm. Finally, we built a homogeneous distributed cluster and a heterogeneous distributed cluster to verify the effectiveness of the proposed algorithms. A number of comparative experiments show that the proposed parallel algorithms reduce model training time by an average of 26.5% (an average 1.5x speedup over the native distributed algorithms) and improve cluster utilization while keeping the same accuracy level as native TensorFlow.
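The data-parallel optimization summarized above replaces linear (prepare-then-compute) execution with pipelined execution. The paper's own implementation is not reproduced in this record; the snippet below is only a minimal sketch of the general idea, using TensorFlow's `tf.data` prefetching so that the input pipeline prepares the next batch while the device trains on the current one. The synthetic dataset and toy model are placeholders invented for illustration.

```python
import tensorflow as tf

def make_dataset(batch_size=32):
    # Synthetic stand-in for a real training set.
    images = tf.random.uniform((1024, 28, 28, 1))
    labels = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(1024).batch(batch_size)
    # prefetch() overlaps input preparation with training: batch N+1 is
    # produced while the accelerator computes on batch N, i.e. pipelined
    # rather than strictly linear execution.
    return ds.prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(make_dataset(), epochs=1)
```

In a multi-worker setting the same dataset would be consumed under a `tf.distribute` strategy; the pipelining principle is unchanged.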
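For the model-parallel side, the abstract states only that random partitioning is replaced by a greedy algorithm. One plausible reading, assumed here rather than taken from the paper, is load-balancing greedy placement: assign each layer, heaviest first, to the currently least-loaded device (the classic LPT heuristic). The cost model and the `greedy_partition` helper below are hypothetical.

```python
import heapq

def greedy_partition(layer_costs, num_devices):
    """Greedily map layers to devices, balancing estimated load.

    layer_costs: list of (layer_name, cost) pairs; how cost is estimated
    (FLOPs, memory, measured step time) is an assumption of this sketch.
    Returns a dict layer_name -> device index.
    """
    # Min-heap of (accumulated load, device index): the root is always
    # the least-loaded device.
    loads = [(0.0, d) for d in range(num_devices)]
    heapq.heapify(loads)
    placement = {}
    # Placing heavier layers first (LPT) tends to give tighter balance.
    for name, cost in sorted(layer_costs, key=lambda lc: -lc[1]):
        load, dev = heapq.heappop(loads)
        placement[name] = dev
        heapq.heappush(loads, (load + cost, dev))
    return placement

# Toy example: four layers spread over two devices.
print(greedy_partition(
    [("conv1", 4.0), ("conv2", 3.0), ("fc1", 2.0), ("fc2", 1.0)],
    num_devices=2,
))  # -> {'conv1': 0, 'conv2': 1, 'fc1': 1, 'fc2': 0}
```

The resulting placement would typically be applied by constructing each layer inside the corresponding `tf.device` scope.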
DOI: 10.1007/s10489-021-02588-9