Falcon: Addressing Stragglers in Heterogeneous Parameter Server Via Multiple Parallelism

Published in: IEEE Transactions on Computers, Volume 70, Issue 1, pp. 139-155
Main authors: Zhou, Qihua; Guo, Song; Lu, Haodong; Li, Li; Guo, Minyi; Sun, Yanfei; Wang, Kun
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2021
ISSN: 0018-9340, 1557-9956
Online access: Get full text
Description
Summary: The parameter server architecture has shown promising performance advantages when handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retards DL training progress. Previous solutions for handling stragglers may not fully exploit the computation resources of the cluster, as evidenced by our experiments, especially in heterogeneous environments. This motivates us to design a heterogeneity-aware parameter server paradigm that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for solving this problem in two aspects: (1) controlling each worker's training speed via elastic training parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully utilize the computation resources. Following these guidelines, we propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented into a prototype called Falcon, which effectively accelerates DL training in the presence of stragglers. Evaluation under various benchmarks with baseline comparisons demonstrates the superiority of our system. Specifically, Falcon reduces training convergence time by up to 61.83, 55.19, 38.92, and 23.68 percent compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
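To make the two guidelines in the abstract concrete, the sketch below illustrates, in a hedged and simplified form, how a slack-synchronization (bounded-staleness) barrier and an elastic per-worker parallelism policy could be combined in a toy parameter server. All class names, thresholds, and the adjustment policy are illustrative assumptions, not the actual Falcon or EPSP implementation.

```python
import threading
import time


class ToyParameterServer:
    """Illustrative sketch (not the Falcon/EPSP code): combines a
    slack-synchronization barrier with an elastic parallelism policy."""

    def __init__(self, num_workers, slack=2):
        self.slack = slack                     # max iteration gap; 0 would mean enforced (fully synchronous) mode
        self.clock = [0] * num_workers         # per-worker iteration counters
        self.parallelism = [4] * num_workers   # assumed per-worker training parallelism (task shares)
        self.cond = threading.Condition()

    def push_and_sync(self, worker_id):
        """Called by a worker after one local iteration: advance its clock,
        then block pioneers that are more than `slack` iterations ahead."""
        with self.cond:
            self.clock[worker_id] += 1
            self.cond.notify_all()
            while self.clock[worker_id] - min(self.clock) > self.slack:
                self.cond.wait()

    def adjust_parallelism(self):
        """Illustrative elastic control: shrink the straggler's task share and
        grow the pioneer's, mimicking task transfer from stragglers to pioneers."""
        with self.cond:
            fastest, slowest = max(self.clock), min(self.clock)
            for w, c in enumerate(self.clock):
                if c == slowest and self.parallelism[w] > 1:
                    self.parallelism[w] -= 1   # straggler: fewer tasks next round
                elif c == fastest:
                    self.parallelism[w] += 1   # pioneer: absorbs transferred tasks


def worker(ps, worker_id, iterations, step_time):
    for _ in range(iterations):
        time.sleep(step_time)          # stand-in for local gradient computation
        ps.push_and_sync(worker_id)    # push update, then respect the slack bound


if __name__ == "__main__":
    ps = ToyParameterServer(num_workers=3, slack=2)
    speeds = [0.01, 0.01, 0.05]        # third worker simulates a straggler
    threads = [threading.Thread(target=worker, args=(ps, i, 20, s))
               for i, s in enumerate(speeds)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ps.adjust_parallelism()
    print("clocks:", ps.clock, "parallelism:", ps.parallelism)
```

In this sketch the slowest worker never blocks (its clock equals the minimum), so progress is always possible, while pioneers wait once they run more than `slack` iterations ahead; setting `slack=0` degenerates to an enforced-synchronization barrier.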
DOI: 10.1109/TC.2020.2974461