Algorithms for processing the group K nearest-neighbor query on distributed frameworks

Bibliographic Details
Published in: Distributed and parallel databases : an international journal Vol. 39; no. 3; pp. 733-784
Main Authors: Moutafis, Panagiotis, García-García, Francisco, Mavrommatis, George, Vassilakopoulos, Michael, Corral, Antonio, Iribarne, Luis
Format: Journal Article
Language: English
Published: New York Springer US 01.09.2021
Springer Nature B.V
ISSN: 0926-8782, 1573-7578
Description
Summary: Given two datasets of points (called Query and Training), the Group K Nearest-Neighbor (GKNN) query retrieves the K points of Training with the smallest sum of distances to every point of Query. This spatial query has been studied in recent years, and several performance-improving techniques and pruning heuristics have been proposed. In previous work, we presented the first MapReduce algorithm, consisting of alternating local and parallel phases, which can be used to effectively process the GKNN query when the Query dataset fits in memory while the Training dataset belongs to the Big Data category. In this paper, we present a significantly improved algorithm that incorporates a new high-performance refining method, a fast way to calculate distance sums for pruning purposes, and several other minor coding and algorithmic improvements. Moreover, we port this algorithm (which has been implemented in the Hadoop framework) to SpatialHadoop (a popular distributed framework dedicated to spatial processing), using a novel two-level partitioning method. Using real-world and synthetic datasets, we also present a thorough experimental study of the Hadoop and SpatialHadoop versions of the algorithm, including a behind-the-scenes analysis of the algorithm's performance, using metrics that highlight its internal functioning. Finally, we present an experimental comparison of the Hadoop version, the SpatialHadoop version, and the version from our previous work, showing that the improved versions clearly outperform the earlier one, with the SpatialHadoop version being faster than its Hadoop counterpart.
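To make the query definition in the summary concrete, the following is a minimal single-machine sketch in Python. It is a brute-force baseline only, not the MapReduce or SpatialHadoop algorithm of the paper: it uses no pruning, partitioning, or refining step. The function name `gknn_brute_force`, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def gknn_brute_force(query, training, k):
    """Return the k Training points with the smallest sum of distances
    to every Query point (hypothetical helper, brute-force baseline)."""

    def sum_of_distances(point):
        # Euclidean distance is an assumption; the summary says only "distances".
        return sum(math.dist(point, q) for q in query)

    # nsmallest returns the k best candidates, ordered by ascending distance sum.
    return heapq.nsmallest(k, training, key=sum_of_distances)


# Made-up sample data: a small Query group and a small Training set.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (5.0, 5.0), (1.0, 1.0), (-3.0, 0.0)]

result = gknn_brute_force(query, training, 2)
print(result)
```

This baseline computes one distance for every pair of a Training point and a Query point, which is the cost that the pruning heuristics and distributed processing described in the summary aim to reduce when Training is very large.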
DOI: 10.1007/s10619-020-07317-8