MapReduce algorithms for robust center-based clustering in doubling metrics

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, Vol. 194; p. 104966
Main Authors: Dandolo, Enrico; Mazzetto, Alessio; Pietracaprina, Andrea; Pucci, Geppino
Format: Journal Article
Language: English
Published: Elsevier Inc 01.12.2024
ISSN: 0743-7315
Description
Summary: Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k,ℓ)-clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the ℓ-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (ℓ=1) and k-means (ℓ=2) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k,ℓ)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. Remarkably, for D=O(1), our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.

Highlights:
• First distributed algorithm for center-based clustering with outliers.
• Coreset-based approach featuring approximation close to best sequential one.
• Scalable algorithm suitable for very large datasets of constant doubling dimension.
• Analysis parametric in the intrinsic dimensionality of the input pointset.
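
The objective described in the summary can be illustrated with a minimal sketch. It only evaluates the (k,ℓ)-clustering cost with z outliers for a given set of centers; it is not the paper's MapReduce or coreset algorithm. The function name `clustering_cost`, its parameters, and the use of Euclidean distance (`math.dist`) as the metric are assumptions made for illustration.

```python
import math


def clustering_cost(points, centers, ell=1, z=0, dist=math.dist):
    """Cost of `centers` on `points` for the (k, ell) objective with z outliers.

    Hypothetical helper: sums the ell-th powers of the distances from each
    point to its closest center, disregarding the z points that are farthest
    from their closest center. Euclidean distance is assumed by default.
    """
    # Distance from every point to its nearest center, in ascending order.
    nearest = sorted(min(dist(p, c) for c in centers) for p in points)
    # For fixed centers, dropping the z largest distances minimizes the sum,
    # because d ** ell is increasing in d for ell > 0.
    kept = nearest[:max(len(nearest) - z, 0)]
    return sum(d ** ell for d in kept)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0)]
    ctr = [(0, 0)]
    print(clustering_cost(pts, ctr, ell=1))       # k-median-style cost
    print(clustering_cost(pts, ctr, ell=2))       # k-means-style cost
    print(clustering_cost(pts, ctr, ell=2, z=1))  # one outlier disregarded
```

With ℓ=1 the function gives the k-median cost and with ℓ=2 the k-means cost; setting z>0 shows how a single far-away point stops dominating the sum once it is treated as an outlier.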
DOI: 10.1016/j.jpdc.2024.104966