Optimal Load Allocation for Coded Distributed Computation in Heterogeneous Clusters

Bibliographic Details
Published in: IEEE Transactions on Communications, Vol. 69, no. 1, pp. 44-58
Main Authors: Kim, Daejin; Park, Hyegyeong; Choi, Jun Kyun
Format: Journal Article
Language: English
Published: New York IEEE 01.01.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 0090-6778, 1558-0857
Description
Summary: Coding has recently emerged as a useful technique for mitigating the effect of stragglers in distributed computing. However, coding in this context has mainly been explored under the assumption of homogeneous workers, although real-world clusters often consist of heterogeneous workers with different computing capabilities. Uniform load allocation that ignores this heterogeneity can significantly degrade latency. In this article, we propose the optimal load allocation for coded distributed computing with heterogeneous workers. Specifically, we focus on the scenario in which some workers share the same computing capability and can therefore be regarded as a group for analysis. We rely on a lower bound on the expected latency and obtain the optimal load allocation by showing that our load allocation achieves the minimum of this lower bound for a sufficiently large number of workers. Given the proposed optimal load allocation, we derive the optimal code rate that achieves the minimum expected latency. In numerical simulations under the group heterogeneity assumption, our load allocation reduces the expected latency by orders of magnitude compared with the existing scheme. Furthermore, in experiments on Amazon EC2 for scenarios with distinct straggler/heterogeneity patterns, our scheme outperforms the competing schemes, reducing the total finishing time by up to 52%.
DOI: 10.1109/TCOMM.2020.3030667