Optimal Load Allocation for Coded Distributed Computation in Heterogeneous Clusters


Bibliographic Details
Published in: IEEE Transactions on Communications, Vol. 69, No. 1, pp. 44-58
Main authors: Kim, Daejin; Park, Hyegyeong; Choi, Jun Kyun
Format: Journal Article
Language: English
Published: New York, IEEE, 01.01.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN:0090-6778, 1558-0857
Description
Abstract: Recently, coding has proven a useful technique for mitigating the effect of stragglers in distributed computing. However, coding in this context has mainly been explored under the assumption of homogeneous workers, although real-world clusters often consist of heterogeneous workers with different computing capabilities. Uniform load allocation that ignores this heterogeneity can cause a significant loss in latency. In this article, we propose the optimal load allocation for coded distributed computing with heterogeneous workers. Specifically, we focus on the scenario in which some workers have the same computing capability and can therefore be regarded as a group for analysis. We rely on a lower bound on the expected latency and obtain the optimal load allocation by showing that our load allocation achieves the minimum of the lower bound for a sufficiently large number of workers. Given the proposed optimal load allocation, we derive the optimal code rate that achieves the minimum expected latency. In numerical simulations under the group heterogeneity assumption, our load allocation reduces the expected latency by orders of magnitude compared with the existing scheme. Furthermore, in experiments on Amazon EC2 for scenarios with distinct straggler/heterogeneity patterns, we observe that our scheme outperforms the competing schemes, reducing the total finishing time by up to 52%.
DOI:10.1109/TCOMM.2020.3030667
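
Illustrative sketch (not part of the catalog record): the abstract's point that uniform load allocation can cost latency in a heterogeneous cluster can be explored with a small Monte Carlo simulation. Everything below is an assumption made for illustration and is not taken from the article: the shifted-exponential runtime model, decoding as soon as any k row results have arrived (an MDS-style code), the two-group parameters, fractional row counts, and the speed-proportional heuristic, which is only a stand-in and is not the optimal allocation derived by the authors.

```python
import random


def simulate_latency(groups, loads, k, trials=2000, seed=0):
    """Monte Carlo estimate of the expected latency.

    groups: list of (num_workers, shift, rate), one tuple per group (hypothetical model)
    loads:  rows assigned to each worker of the matching group
    k:      number of row results needed to decode (MDS-style assumption)
    """
    rng = random.Random(seed)
    total = 0.0
    for _trial in range(trials):
        finish = []
        for (num_workers, shift, rate), load in zip(groups, loads):
            for _worker in range(num_workers):
                # Assumed model: each row costs shift + Exp(rate) time units,
                # with a single random draw per worker.
                t = load * (shift + rng.expovariate(rate))
                finish.append((t, load))
        finish.sort()
        done = 0.0
        for t, load in finish:
            done += load
            if done >= k:
                total += t  # the job ends once k row results are in
                break
        else:
            raise ValueError("total assigned load is below k")
    return total / trials


def uniform_loads(groups, n_rows):
    """Same load for every worker, ignoring heterogeneity."""
    workers = sum(num for num, _, _ in groups)
    return [n_rows / workers for _ in groups]


def speed_proportional_loads(groups, n_rows):
    """Heuristic: load proportional to 1 / (mean time per row)."""
    speeds = [1.0 / (shift + 1.0 / rate) for _, shift, rate in groups]
    norm = sum(num * s for (num, _, _), s in zip(groups, speeds))
    return [n_rows * s / norm for s in speeds]


# Hypothetical cluster: 10 fast and 10 slow workers.
GROUPS = [(10, 1.0, 10.0), (10, 10.0, 1.0)]
K = 1000        # row results needed to decode
N_ROWS = 1200   # coded rows distributed in total (code rate K / N_ROWS)

if __name__ == "__main__":
    for name, alloc in (("uniform", uniform_loads),
                        ("speed-proportional", speed_proportional_loads)):
        loads = alloc(GROUPS, N_ROWS)
        print(name, loads, simulate_latency(GROUPS, loads, K))
```

With these assumed parameters, the fast group alone cannot supply K rows under uniform allocation, so the job always waits for slow workers. The heuristic shifts load toward the fast group so that waiting for slow workers becomes unlikely. Different parameters or a lower code rate can reverse the comparison, which is why a principled allocation such as the one studied in the article is needed.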