A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms


Bibliographic Details
Published in: Journal of Computer Science and Technology, Vol. 39, no. 3, pp. 610-636
Main Authors: Challa, Jagat Sesh, Goyal, Navneet, Sharma, Amogh, Sreekumar, Nikhil, Balasubramaniam, Sundar, Goyal, Poonam
Format: Journal Article
Language: English
Published: Singapore: Springer Nature Singapore; Springer Nature B.V., 01.05.2024
Affiliations: Advanced Data Analytics and Parallel Technologies Laboratory, Birla Institute of Technology and Science, Pilani 333031, India; Uber, New York 11101, U.S.A.; Computer Science and Engineering Department, University of Minnesota, Minneapolis 55455, U.S.A.
ISSN: 1000-9000, 1860-4749
Description
Summary: The advent of Big Data has led to rapid growth in the use of parallel clustering algorithms that run over distributed computing frameworks such as MPI, MapReduce, and Spark. An important step in any parallel clustering algorithm is the distribution of data among the cluster nodes. This step governs the methodology and performance of the entire algorithm. Researchers typically use either a random strategy or a spatial/geometric distribution strategy, such as kd-tree-based partitioning or grid-based partitioning, as the algorithm requires. However, these strategies are generic and are not tailored to any specific parallel clustering algorithm. In this paper, we give a comprehensive literature survey of MPI-based parallel clustering algorithms, with special reference to the specific data distribution strategies they employ. We also propose three new data distribution strategies: Parameterized Dimensional Split, for parallel density-based clustering algorithms like DBSCAN and OPTICS; Cell-Based Dimensional Split, for dGridSLINK, a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution; and Projection-Based Split, a generic distribution strategy. All of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing. The experimental analysis shows the benefits of using the proposed data distribution strategies for the algorithms they are designed for, based on which we give appropriate recommendations for their usage.
DOI: 10.1007/s11390-024-2700-0