Algorithms for processing the group K nearest-neighbor query on distributed frameworks
Given two datasets of points (called Query and Training), the Group K Nearest-Neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been studied in recent years and several performance improvi...
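The query definition above can be made concrete with a minimal brute-force sketch. This is not the algorithm of the paper, only a baseline that follows the definition directly. It assumes 2-D points and Euclidean distance, and the function names (`distance_sum`, `gknn_brute_force`, `gknn_partitioned`) are hypothetical. The second function illustrates, under the same assumptions, why the query lends itself to partitioned processing: the global top K is always contained in the union of the per-partition top-K lists.

```python
import heapq
import math
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]


def distance_sum(p: Point, query: Sequence[Point]) -> float:
    """Sum of Euclidean distances from p to every Query point (assumed metric)."""
    return sum(math.dist(p, q) for q in query)


def gknn_brute_force(
    query: Sequence[Point], training: Iterable[Point], k: int
) -> List[Tuple[float, Point]]:
    """Return the k Training points with the smallest distance sums.

    Each result is a (distance_sum, point) pair, sorted by ascending sum.
    """
    scored = ((distance_sum(p, query), p) for p in training)
    return heapq.nsmallest(k, scored)


def gknn_partitioned(
    query: Sequence[Point], partitions: Iterable[Sequence[Point]], k: int
) -> List[Tuple[float, Point]]:
    """Illustrative split/merge: local top-k per partition, then a global merge.

    This mirrors the general map/reduce shape only; it includes none of the
    pruning or refining techniques mentioned in the abstract.
    """
    local_results = (gknn_brute_force(query, part, k) for part in partitions)
    candidates = (pair for local in local_results for pair in local)
    return heapq.nsmallest(k, candidates)


if __name__ == "__main__":
    query_points = [(0.0, 0.0), (2.0, 0.0)]
    training_points = [(1.0, 0.0), (5.0, 5.0), (1.0, 1.0), (10.0, 0.0)]
    for total, point in gknn_brute_force(query_points, training_points, k=2):
        print(point, round(total, 3))
```

The brute-force version computes one distance per (Training point, Query point) pair, which is the cost that pruning heuristics aim to reduce when the Training dataset is very large.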
| Published in: | Distributed and parallel databases : an international journal, Volume 39, Issue 3, pp. 733-784 |
|---|---|
| Main authors: | , , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York, Springer US, 01.09.2021, Springer Nature B.V |
| Subjects: | |
| ISSN: | 0926-8782, 1573-7578 |
| Online access: | Get full text |
| Summary: | Given two datasets of points (called Query and Training), the Group K Nearest-Neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been studied in recent years and several performance-improving techniques and pruning heuristics have been proposed. In previous work, we presented the first MapReduce algorithm, consisting of alternating local and parallel phases, which can be used to effectively process the GKNN query when the Query dataset fits in memory, while the Training dataset belongs to the Big Data category. In this paper, we present a significantly improved algorithm that incorporates a new high-performance refining method, a fast way to calculate distance sums for pruning purposes and several other minor coding and algorithmic improvements. Moreover, we port this algorithm (which has been implemented in the Hadoop framework) to SpatialHadoop (a popular distributed framework that is dedicated to spatial processing), using a novel two-level partitioning method. Using real-world and synthetic datasets, we also present a thorough experimental study of the Hadoop and SpatialHadoop versions of the algorithm, including a behind-the-scenes analysis of the algorithm’s performance, using metrics that highlight its internal functioning. Finally, we present an experimental comparison of the Hadoop version, the SpatialHadoop version and the version from our previous work, showing that both improved versions clearly outperform the earlier one, with the SpatialHadoop version being faster than its Hadoop counterpart. |
|---|---|
| DOI: | 10.1007/s10619-020-07317-8 |