A Scalable Similarity Join Algorithm Based on MapReduce and LSH

Bibliographic Details
Published in:	International journal of parallel programming Vol. 50, no. 3-4, pp. 360-380
Main Authors:	Rivault, Sébastien; Bamha, Mostafa; Limet, Sébastien; Robert, Sophie
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 01.08.2022; Springer Nature B.V; Springer Verlag
Subjects:	Algorithms; Cognitive science; Computer Science; Cost analysis; Data processing; Datasets; Histograms; Processor Architectures; Similarity; Software Engineering/Programming and Operating Systems; Special Issue on High-Level Parallel Programming and Applications 2021; Theory of Computation; Time series; Data skew; Similarity join operations; Local sensitive hashing (LSH); Hadoop framework; MapReduce model
ISSN:	0885-7458, 1573-7640

Description
Summary:	Similarity joins are recognized as among the most useful data processing and analysis operations. A similarity join retrieves all pairs of data items whose distance is smaller than a predefined threshold λ. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized redistribution of locality-sensitive hashing (LSH) keys are used to balance the load among processing nodes, while distributed histograms restrict communication and computation almost entirely to relevant data. A cost analysis of the MRS-join algorithm shows that the approach is insensitive to data skew and guarantees perfect balancing properties, in large-scale systems, during all stages of the similarity join computation. These results have been confirmed by a series of experiments using the Fréchet distance on large trajectory datasets from real-world and synthetic benchmarks.
DOI:	10.1007/s10766-022-00733-6
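
The definition of a similarity join given in the summary can be illustrated with a small brute-force sketch. This is not the MRS-join algorithm: it uses no MapReduce, no LSH and no histograms, and simply compares every pair. It assumes the discrete variant of the Fréchet distance and trajectories stored as lists of 2D points; the function names `discrete_frechet` and `similarity_join` are hypothetical and chosen for illustration only.

```python
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty trajectories.

    Assumption: each trajectory is a list of coordinate tuples.
    """
    n, m = len(p), len(q)
    # ca[i][j] holds the discrete Fréchet distance between p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]


def similarity_join(r, s, lam):
    """Return all index pairs (i, j) with distance(r[i], s[j]) < lam.

    Brute-force baseline: compares every trajectory of r with every one of s.
    """
    return [
        (i, j)
        for i, t in enumerate(r)
        for j, u in enumerate(s)
        if discrete_frechet(t, u) < lam
    ]
```

Calling `similarity_join(r, s, lam)` on two lists of trajectories returns the index pairs whose distance is strictly below `lam`. The quadratic number of comparisons in this baseline is the cost that a distributed, hash-based approach such as the one described in the summary aims to avoid.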