Improving Data Locality of Tasks by Executor Allocation in Spark Computing Environment
| Published in: | IEEE Transactions on Cloud Computing, Vol. 12, No. 3, pp. 876-888 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Piscataway: IEEE, 1 July 2024; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| ISSN: | 2168-7161, 2372-0018 |
| Online access: | Full text |

| Abstract: | Data locality is crucial for distributed systems (e.g., Spark and Hadoop) that process Big Data. Most existing research optimizes data locality through task scheduling. However, executors are the execution containers of Spark's tasks, and the nodes on which they are launched directly affect the data locality that the tasks can achieve. This article improves the data locality of tasks through executor allocation in the Spark framework. First, because communication modes differ across stages, we separately model the communication cost that tasks incur when transferring input data to the executors. We then formalize an optimal executor allocation problem that minimizes the total communication cost of transferring all input data. This problem is proven to be NP-hard. Finally, we present a greedy dropping heuristic algorithm to solve the executor allocation problem. Our proposal is implemented in Spark 3.4.0, and its performance is evaluated with representative micro-benchmarks (i.e., WordCount, Join, Sort) and macro-benchmarks (i.e., PageRank and LDA). Extensive experiments show that the proposed executor allocation strategy decreases network traffic and data access time by improving data locality during task scheduling. Its performance benefits are particularly significant for iterative applications. |
|---|---|
| DOI: | 10.1109/TCC.2024.3406041 |
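
The abstract names a dropping-style greedy heuristic for executor allocation but gives no details of it. The following Python sketch illustrates only the general idea of a dropping heuristic and is not the article's algorithm. The cost model (a block is free to read if an executor runs on a node holding a replica, otherwise it costs its size), the function names `total_cost` and `greedy_drop`, and the sample data are all assumptions made for illustration.

```python
from typing import List, Set


def total_cost(executors: Set[str], blocks: List[dict]) -> float:
    """Hypothetical cost model (an assumption, not taken from the article):
    reading a block costs nothing if some executor runs on a node that
    stores a replica, otherwise it costs the block size in MB."""
    return sum(
        0.0 if executors & b["replicas"] else b["size_mb"]
        for b in blocks
    )


def greedy_drop(nodes: Set[str], blocks: List[dict], k: int) -> Set[str]:
    """Start with an executor on every node, then repeatedly drop the node
    whose removal increases the total cost the least, until k remain."""
    chosen = set(nodes)
    while len(chosen) > k:
        # sorted() makes tie-breaking deterministic (first name wins).
        victim = min(
            sorted(chosen),
            key=lambda n: total_cost(chosen - {n}, blocks),
        )
        chosen.remove(victim)
    return chosen


# Illustrative input: three blocks replicated across four nodes.
blocks = [
    {"replicas": {"n1", "n2"}, "size_mb": 128},
    {"replicas": {"n2", "n3"}, "size_mb": 128},
    {"replicas": {"n4"}, "size_mb": 64},
]
nodes = {"n1", "n2", "n3", "n4"}

placement = greedy_drop(nodes, blocks, k=2)
print(sorted(placement), total_cost(placement, blocks))
```

With this sample input, two executors placed on `n2` and `n4` cover a replica of every block, so the assumed cost drops to zero. A real allocation would also need the stage-dependent communication costs that the abstract mentions, which this sketch does not model.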