MapReduce algorithms for robust center-based clustering in doubling metrics

Bibliographic Details
Published in:	Journal of parallel and distributed computing Vol. 194; p. 104966
Main Authors:	Dandolo, Enrico; Mazzetto, Alessio; Pietracaprina, Andrea; Pucci, Geppino
Format:	Journal Article
Language:	English
Published:	Elsevier Inc 01.12.2024
Subjects:	Clustering; Coreset; Distributed algorithm; k-means; k-median; MapReduce; Outliers
ISSN:	0743-7315

Description
Summary:	Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k,ℓ)-clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the ℓ-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (ℓ=1) and k-means (ℓ=2) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k,ℓ)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. Remarkably, for D=O(1), our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.
• First distributed algorithm for center-based clustering with outliers.
• Coreset-based approach featuring approximation close to best sequential one.
• Scalable algorithm suitable for very large datasets of constant doubling dimension.
• Analysis parametric in the intrinsic dimensionality of the input pointset.
DOI:	10.1016/j.jpdc.2024.104966
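
The objective described in the summary can be illustrated with a minimal Python sketch. It only evaluates the (k,ℓ)-clustering cost with z outliers for a fixed set of centers; it is not the paper's MapReduce algorithm. The function name `clustering_cost` and the use of Euclidean distance are assumptions made for this illustration (the summary allows any metric space).

```python
import math


def clustering_cost(points, centers, ell=2, z=0):
    """Sum of the ell-th powers of the distances from each point to its
    closest center, disregarding the z largest terms (the outliers).

    Assumption for illustration: points and centers are coordinate tuples
    and the metric is Euclidean.
    """
    if not centers:
        raise ValueError("at least one center is required")
    # Distance to the closest center, raised to the ell-th power, per point.
    terms = sorted(
        min(math.dist(p, c) for c in centers) ** ell for p in points
    )
    # For fixed centers, discarding the z largest terms minimizes the sum.
    keep = max(len(terms) - z, 0)
    return sum(terms[:keep])


points = [(0, 0), (1, 0), (10, 0)]
centers = [(0, 0)]
k_median_cost = clustering_cost(points, centers, ell=1)       # k-median, no outliers
k_means_cost = clustering_cost(points, centers, ell=2)        # k-means, no outliers
robust_cost = clustering_cost(points, centers, ell=2, z=1)    # k-means, one outlier dropped
```

With ℓ=1 the function gives the k-median cost and with ℓ=2 the k-means cost; setting z=1 drops the far point at (10, 0), which shows how the outlier parameter limits the influence of noisy points.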