An efficient theta-join query processing in distributed environment

Theta-join query is very useful in many data analysis tasks, but it is not efficiently processed in distributed environment, especially in large scale data. Although there is much progress in dealing theta-join with MapReduce paradigm, the methods are either complex which require fundamental changes...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Journal of parallel and distributed computing Ročník 121; s. 42 - 52
Hlavní autoři: Liu, Wenjie, Li, Zhanhuai
Médium: Journal Article
Jazyk:angličtina
Vydáno: Elsevier Inc 01.11.2018
Témata:
ISSN:0743-7315, 1096-0848
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:Theta-join query is very useful in many data analysis tasks, but it is not efficiently processed in distributed environment, especially in large scale data. Although there is much progress in dealing theta-join with MapReduce paradigm, the methods are either complex which require fundamental changes to MapReduce framework or only consider the overheads of load balance in the network, when data scale is large, they will make much computation cost and induce OOM (Out of Memory) errors. In this work, we propose a filter method for theta-join on the purpose of reducing the computation cost and achieving the minimum execution time in distributed environment. We consider not only the load balance in the cluster, but also the memory cost in parallel framework. We also propose a keys-based join solution for multi-way theta-join to reduce the data amount for cross product, then improve the performance of join efficiency. We implement our methods in a popular general-purpose data processing framework, Spark. The experimental results demonstrate that our methods can significantly improve the performance of theta-joins comparing with the state-of-art solutions. •Effective Max and Min values based filter strategy for theta-join computing in distributed environment.•Divide and Merge method for theta-join which reduces network overheads greatly.•Extensive experiments using real world and synthetic data sets.
ISSN:0743-7315
1096-0848
DOI:10.1016/j.jpdc.2018.07.007