Improving Data Locality of Tasks by Executor Allocation in Spark Computing Environment

Bibliographic Details
Published in: IEEE Transactions on Cloud Computing, Vol. 12, no. 3, pp. 876-888
Main Authors: Fu, Zhongming; He, Mengsi; Yi, Yang; Tang, Zhuo
Format: Journal Article
Language: English
Published: Piscataway IEEE 01.07.2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2168-7161, 2372-0018
Description
Summary: The concept of data locality is crucial for distributed systems (e.g., Spark and Hadoop) that process Big Data. Most existing research optimizes data locality from the perspective of task scheduling. However, because executors are the execution containers of Spark's tasks, the nodes on which they are launched directly affect the data locality those tasks can achieve. This article aims to improve the data locality of tasks through executor allocation in the Spark framework. First, because communication modes differ across stages, we separately model the communication cost of transferring tasks' input data to the executors. We then formalize an optimal executor allocation problem that minimizes the total communication cost of transferring all input data. This problem is proven to be NP-hard. Finally, we present a greedy dropping heuristic algorithm to solve the executor allocation problem. Our proposal is implemented in Spark-3.4.0, and its performance is evaluated through representative micro-benchmarks (i.e., WordCount, Join, Sort) and macro-benchmarks (i.e., PageRank and LDA). Extensive experiments show that the proposed executor allocation strategy decreases network traffic and data access time by improving data locality during task scheduling. The performance benefits are particularly significant for iterative applications.
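Illustration: the summary names a dropping-style greedy heuristic for choosing executor nodes but does not give its details. The Python sketch below is only a generic example of that family of heuristics and is not the algorithm from the article. It assumes a deliberately simplified cost model (a block costs nothing if some executor node holds a replica, and costs its full size otherwise), and every name in it (`Block`, `total_cost`, `greedy_drop`, the node labels) is hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical input block: (size in MB, nodes that hold a replica).
Block = Tuple[float, FrozenSet[str]]


def total_cost(blocks: List[Block], executors: Set[str]) -> float:
    """Assumed cost model: a block is free to read if any executor node
    holds a replica, otherwise its whole size crosses the network."""
    return sum(size for size, replicas in blocks if not (replicas & executors))


def greedy_drop(nodes: List[str], blocks: List[Block], k: int) -> Set[str]:
    """Start with an executor on every candidate node, then repeatedly drop
    the node whose removal leaves the lowest total cost, until k remain."""
    if not 0 < k <= len(nodes):
        raise ValueError("k must be between 1 and the number of nodes")
    chosen = set(nodes)
    while len(chosen) > k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        victim = min(sorted(chosen),
                     key=lambda n: total_cost(blocks, chosen - {n}))
        chosen.remove(victim)
    return chosen


# Toy cluster: three candidate nodes, four input blocks.
example_nodes = ["n1", "n2", "n3"]
example_blocks: List[Block] = [
    (128.0, frozenset({"n1"})),
    (128.0, frozenset({"n1", "n2"})),
    (64.0, frozenset({"n3"})),
    (128.0, frozenset({"n2"})),
]
placement = greedy_drop(example_nodes, example_blocks, k=2)
print(sorted(placement), total_cost(example_blocks, placement))
```

Because the underlying allocation problem is NP-hard, a greedy pass of this kind trades optimality for speed; each round re-evaluates the cost once per remaining node.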
DOI: 10.1109/TCC.2024.3406041