A Scalable Similarity Join Algorithm Based on MapReduce and LSH

Bibliographic Details
Published in:	International Journal of Parallel Programming, vol. 50, no. 3-4, pp. 360-380
Main Authors:	Rivault, Sébastien; Bamha, Mostafa; Limet, Sébastien; Robert, Sophie
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 1 August 2022 (Springer Nature B.V.; Springer Verlag)
Subjects:	Algorithms; Cognitive science; Computer Science; Cost analysis; Data processing; Datasets; Histograms; Processor Architectures; Similarity; Software Engineering/Programming and Operating Systems; Special Issue on High-Level Parallel Programming and Applications 2021; Theory of Computation; Time series; Data skew; Similarity join operations; Local sensitive hashing (LSH); Hadoop framework; MapReduce model
ISSN:	0885-7458, 1573-7640

Description
Abstract:	Similarity joins are recognized to be among the most useful data processing and analysis operations. A similarity join is used to retrieve all data pairs whose distances are smaller than a predefined threshold λ. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized local sensitive hashing keys redistribution approach are used to balance load among processing nodes while reducing communications and computations to almost all relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large scale systems, during all stages of similarity join computations. These performances have been confirmed by a series of experiments using the Fréchet distance on large datasets of trajectories from real world and synthetic data benchmarks.
DOI:	10.1007/s10766-022-00733-6
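
Note:	The abstract defines a similarity join as retrieving all pairs whose distance is smaller than a threshold λ. The following is a minimal single-machine sketch of that definition only; it is not the MRS-join algorithm and uses neither MapReduce nor LSH. It assumes the discrete variant of the Fréchet distance (the record does not say which variant the paper uses), and the names `discrete_frechet`, `naive_similarity_join`, and `lam` are hypothetical, chosen for illustration.

```python
from itertools import product
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty point sequences.

    Assumption: the discrete variant is used here for simplicity.
    """
    n, m = len(p), len(q)
    # ca[i][j] holds the discrete Fréchet distance between p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


def naive_similarity_join(r, s, lam):
    """Return all index pairs (i, j) with distance(r[i], s[j]) < lam.

    Brute force: compares every trajectory in r with every trajectory in s.
    """
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(r), enumerate(s))
        if discrete_frechet(a, b) < lam
    ]


# Two small hypothetical trajectory sets in the plane.
r = [[(0, 0), (1, 0), (2, 0)]]
s = [[(0, 1), (1, 1), (2, 1)], [(0, 5), (1, 5), (2, 5)]]
pairs = naive_similarity_join(r, s, lam=1.5)
```

This baseline performs |R| × |S| distance computations, which is the cost that, according to the abstract, the paper's hashing and histogram-based redistribution aims to avoid on large datasets.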