An Efficient Filter Strategy for Theta-Join Query in Distributed Environment

Bibliographic details
Published in: Proceedings - International Workshops on Parallel Processing, pp. 77-84
Main authors: Wenjie Liu, Zhanhuai Li, Yuntao Zhou
Format: Conference proceeding
Language: English
Published: IEEE, 1 August 2017
ISSN: 1530-2016
Online access: Full text
Abstract Theta-join queries are common in traditional databases, but in a distributed environment their high computation and communication costs make them inefficient to process on big data. Current research focuses on processing theta-joins with the MapReduce framework and mainly considers the overhead of load balancing in the network; as data sets grow, massive intermediate results lead to high communication cost. In this work, we propose a filter method for theta-joins that reduces computation and communication cost in a distributed environment and thereby improves theta-join query performance. We consider both load balance in the cluster and memory cost in the parallel framework. We have implemented our method in Spark, a popular general-purpose data processing framework. The experimental results demonstrate that our method significantly improves theta-join performance compared with state-of-the-art solutions.
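Note: the abstract does not describe how the filter works. As a purely illustrative sketch, and not the authors' method, the following Python example shows one generic way a filter can cut theta-join work. For the assumed predicate x < y, it partitions both inputs, compares each partition pair's minimum and maximum, and skips pairs that cannot produce a match. The function names, the predicate, and the partition size are all hypothetical.

```python
from itertools import product


def chunk(values, size):
    """Split a list into consecutive partitions of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def naive_theta_join(r, s):
    """Reference join: test the predicate x < y on every pair."""
    return [(x, y) for x in r for y in s if x < y]


def filtered_theta_join(r, s, size=2):
    """Join r and s on x < y, skipping partition pairs that cannot match.

    Returns the matching pairs and the number of partition pairs that
    were actually compared (hypothetical cost measure for illustration).
    """
    r_parts = chunk(sorted(r), size)
    s_parts = chunk(sorted(s), size)
    results, compared = [], 0
    for rp, sp in product(r_parts, s_parts):
        # Filter step: if the smallest x is not below the largest y,
        # no pair in this block can satisfy x < y, so skip it entirely.
        if min(rp) >= max(sp):
            continue
        compared += 1
        results.extend((x, y) for x in rp for y in sp if x < y)
    return results, compared


if __name__ == "__main__":
    pairs, compared = filtered_theta_join([5, 1, 9, 7], [2, 8, 3, 6])
    print(len(pairs), "matches from", compared, "compared partition pairs")
```

In a distributed setting, a skipped partition pair would never need to be shipped to a worker, which is the kind of communication saving the abstract refers to; how the paper achieves this is not stated in this record.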
Author Wenjie Liu
Yuntao Zhou
Zhanhuai Li
Author_xml – sequence: 1
  surname: Wenjie Liu
  fullname: Wenjie Liu
  email: liuwenjie@nwpu.edu.cn
  organization: Sch. of Comput., Northwestern Polytech. Univ., Xi'an, China
– sequence: 2
  surname: Zhanhuai Li
  fullname: Zhanhuai Li
  email: lizhh@nwpu.edu.cn
  organization: Sch. of Comput., Northwestern Polytech. Univ., Xi'an, China
– sequence: 3
  surname: Yuntao Zhou
  fullname: Yuntao Zhou
  email: zhouyuntao@nwpu.edu.cn
  organization: Div. of Sci. & Technol. Res. Manage., Northwestern Polytech. Univ., Xi'an, China
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ICPPW.2017.24
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library (IEL) (UW System Shared)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 1538610442
9781538610442
EndPage 84
ExternalDocumentID 8026072
Genre orig-research
ISICitedReferencesCount 4
ISSN 1530-2016
IngestDate Wed Aug 27 02:23:41 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 8
ParticipantIDs ieee_primary_8026072
PublicationCentury 2000
PublicationDate 2017-Aug.
PublicationDateYYYYMMDD 2017-08-01
PublicationDate_xml – month: 08
  year: 2017
  text: 2017-Aug.
PublicationDecade 2010
PublicationTitle Proceedings - International Workshops on Parallel Processing
PublicationTitleAbbrev ICPPW
PublicationYear 2017
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 77
SubjectTerms Big Data
big data query
distributed computing
Distributed databases
Electronic mail
filter strategy
Filtering algorithms
Sparks
theta-join
Transforms
Title An Efficient Filter Strategy for Theta-Join Query in Distributed Environment
URI https://ieeexplore.ieee.org/document/8026072