An efficient theta-join query processing in distributed environment

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, Vol. 121, pp. 42-52
Main Authors: Liu, Wenjie; Li, Zhanhuai
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.11.2018
ISSN: 0743-7315, 1096-0848
Description
Summary: Theta-join queries are useful in many data analysis tasks, but they are not processed efficiently in distributed environments, especially on large-scale data. Although there has been much progress in handling theta-joins with the MapReduce paradigm, existing methods are either complex, requiring fundamental changes to the MapReduce framework, or consider only the overhead of load balancing in the network; when the data scale is large, they incur high computation cost and induce OOM (Out of Memory) errors. In this work, we propose a filter method for theta-join with the aim of reducing the computation cost and achieving the minimum execution time in a distributed environment. We consider not only the load balance in the cluster but also the memory cost in the parallel framework. We also propose a keys-based join solution for multi-way theta-join that reduces the amount of data entering the cross product and thereby improves join performance. We implement our methods in a popular general-purpose data processing framework, Spark. The experimental results demonstrate that our methods can significantly improve the performance of theta-joins compared with state-of-the-art solutions.
Highlights:
• Effective filter strategy based on Max and Min values for theta-join computation in a distributed environment.
• Divide and Merge method for theta-join that greatly reduces network overhead.
• Extensive experiments using real-world and synthetic data sets.
DOI: 10.1016/j.jpdc.2018.07.007
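
Illustration: The summary mentions a filter strategy based on Max and Min values but this record does not describe the algorithm itself. The sketch below is therefore not the authors' method; it only shows the general idea of min/max pruning for a theta-join with the condition r < s, written in plain Python rather than Spark. The function names `chunk` and `theta_join_less_than`, the chunk size, and the choice to sort the inputs first are all assumptions made for the example.

```python
from itertools import product


def chunk(values, size):
    """Split a list into consecutive chunks, each tagged with its (min, max)."""
    chunks = []
    for i in range(0, len(values), size):
        part = values[i:i + size]
        chunks.append((min(part), max(part), part))
    return chunks


def theta_join_less_than(r_values, s_values, size=2):
    """Return (pairs, skipped) for the join condition r < s.

    `skipped` counts the chunk pairs that were pruned without comparing rows.
    Sorting is an assumption of this sketch: it keeps each chunk's value
    range narrow so that more chunk pairs can be pruned.
    """
    r_chunks = chunk(sorted(r_values), size)
    s_chunks = chunk(sorted(s_values), size)
    pairs, skipped = [], 0
    for (r_min, _, r_part), (_, s_max, s_part) in product(r_chunks, s_chunks):
        # If even the smallest r is not below the largest s,
        # no row pair in these two chunks can satisfy r < s.
        if r_min >= s_max:
            skipped += 1
            continue
        pairs.extend((r, s) for r in r_part for s in s_part if r < s)
    return pairs, skipped


if __name__ == "__main__":
    result, pruned = theta_join_less_than([1, 5, 9, 12], [2, 3, 10, 11])
    print(len(result), "matching pairs;", pruned, "chunk pair(s) pruned")
```

Pruned chunk pairs never reach the row-level comparison, which is where a filter of this kind would save computation and memory in a cross-product style join.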