A Scalable Similarity Join Algorithm Based on MapReduce and LSH


Bibliographic Details
Published in:	International Journal of Parallel Programming, Volume 50, Issue 3-4, pp. 360-380
Main Authors:	Rivault, Sébastien; Bamha, Mostafa; Limet, Sébastien; Robert, Sophie
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 01.08.2022; Springer Nature B.V; Springer Verlag
Subjects:	Algorithms; Cognitive science; Computer Science; Cost analysis; Data processing; Datasets; Histograms; Processor Architectures; Similarity; Software Engineering/Programming and Operating Systems; Special Issue on High-Level Parallel Programming and Applications 2021; Theory of Computation; Time series; Data skew; Similarity join operations; Local sensitive hashing (LSH); Hadoop framework; MapReduce model
ISSN:	0885-7458, 1573-7640

Description
Summary:	Similarity joins are recognized to be among the most useful data processing and analysis operations. A similarity join is used to retrieve all data pairs whose distances are smaller than a predefined threshold λ. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized local sensitive hashing keys redistribution approach are used to balance load among processing nodes while reducing communications and computations to almost all relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large scale systems, during all stages of similarity join computations. These performances have been confirmed by a series of experiments using the Fréchet distance on large datasets of trajectories from real world and synthetic data benchmarks.
DOI:	10.1007/s10766-022-00733-6
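
Illustration:	The summary defines a similarity join as retrieving all pairs whose distance is smaller than a threshold λ. The sketch below shows only that definition, as a naive single-machine nested-loop baseline over two small trajectory sets. It is not the MRS-join algorithm and uses neither MapReduce, LSH, nor histograms. The discrete variant of the Fréchet distance is an assumption (the summary says only "Fréchet distance"), and the names `discrete_frechet`, `similarity_join`, and `lam` are hypothetical.

```python
from itertools import product
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty point sequences.

    Assumption: the discrete variant is used here for simplicity.
    Dynamic programming over coupling costs, keeping one row at a time.
    """
    prev = []
    for i, a in enumerate(p):
        cur = []
        for j, b in enumerate(q):
            d = dist(a, b)  # Euclidean distance between the two points
            if i == 0 and j == 0:
                best = d
            elif i == 0:
                best = max(cur[j - 1], d)
            elif j == 0:
                best = max(prev[0], d)
            else:
                best = max(min(prev[j], prev[j - 1], cur[j - 1]), d)
            cur.append(best)
        prev = cur
    return prev[-1]


def similarity_join(r, s, lam):
    """Naive baseline: index pairs (i, j) with distance(r[i], s[j]) < lam."""
    return [
        (i, j)
        for (i, t), (j, u) in product(enumerate(r), enumerate(s))
        if discrete_frechet(t, u) < lam
    ]


# Hypothetical toy data: each trajectory is a list of 2-D points.
R = [[(0, 0), (1, 0), (2, 0)], [(0, 5), (1, 5), (2, 5)]]
S = [[(0, 1), (1, 1), (2, 1)], [(0, 10), (1, 10)]]
pairs = similarity_join(R, S, lam=4.5)
```

This baseline compares every trajectory in `R` with every trajectory in `S`, which is the quadratic cost that a distributed, hashing-based approach such as the one described in the summary aims to avoid.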