Improving Data Locality of Tasks by Executor Allocation in Spark Computing Environment

Published in: IEEE Transactions on Cloud Computing, Vol. 12, No. 3, pp. 876-888
Main authors: Fu, Zhongming; He, Mengsi; Yi, Yang; Tang, Zhuo
Format: Journal Article
Language: English
Publication details: Piscataway, IEEE, 01.07.2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2168-7161, 2372-0018
Description
Summary: Data locality is crucial for distributed systems (e.g., Spark and Hadoop) that process Big Data. Most existing research optimizes data locality through task scheduling. However, as the execution container of Spark's tasks, the executors launched on different nodes directly affect the data locality that the tasks can achieve. This article aims to improve the data locality of tasks through executor allocation in the Spark framework. First, because stages use different communication modes, we separately model the communication cost that tasks incur when transferring input data to the executors. We then formalize an optimal executor allocation problem that minimizes the total communication cost of transferring all input data. This problem is proven to be NP-hard. Finally, we present a greed dropping heuristic algorithm to solve the executor allocation problem. Our proposal is implemented in Spark 3.4.0, and its performance is evaluated with representative micro-benchmarks (i.e., WordCount, Join, Sort) and macro-benchmarks (i.e., PageRank and LDA). Extensive experiments show that the proposed executor allocation strategy decreases network traffic and data access time by improving data locality during task scheduling. The performance benefits are particularly significant for iterative applications.
DOI: 10.1109/TCC.2024.3406041
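
The summary names a dropping heuristic for executor allocation but gives no detail, so the following is only a minimal illustrative sketch of one way such a heuristic could look, not the authors' algorithm. It assumes a deliberately simplified cost model that is not taken from the article: an input block costs its size in network traffic when none of its replicas sits on a node that hosts an executor, and costs nothing otherwise. The names `total_cost`, `greedy_drop`, `block_sizes`, and `replicas` are hypothetical.

```python
from typing import Dict, Iterable, Set


def total_cost(block_sizes: Dict[str, int],
               replicas: Dict[str, Set[str]],
               chosen: Set[str]) -> int:
    """Assumed cost model: a block must cross the network (cost = its size)
    when no chosen executor node holds one of its replicas."""
    return sum(size for block, size in block_sizes.items()
               if not (replicas[block] & chosen))


def greedy_drop(block_sizes: Dict[str, int],
                replicas: Dict[str, Set[str]],
                nodes: Iterable[str],
                k: int) -> Set[str]:
    """Start with an executor on every candidate node, then repeatedly drop
    the node whose removal leaves the lowest total cost, until k remain."""
    chosen = set(nodes)
    while len(chosen) > k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        victim = min(sorted(chosen),
                     key=lambda n: total_cost(block_sizes, replicas, chosen - {n}))
        chosen.remove(victim)
    return chosen


if __name__ == "__main__":
    # Hypothetical toy cluster: three nodes, four input blocks (sizes in MB).
    sizes = {"b1": 100, "b2": 50, "b3": 10, "b4": 20}
    where = {"b1": {"A"}, "b2": {"B"}, "b3": {"C"}, "b4": {"A", "B"}}
    picked = greedy_drop(sizes, where, ["A", "B", "C"], k=2)
    print(sorted(picked), total_cost(sizes, where, picked))
```

Each round evaluates every remaining node once, so the sketch runs in polynomial time, which is the usual reason to accept a heuristic for a problem the summary states is NP-hard. It does not model the stage-dependent communication modes that the article treats separately.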