Streaming Euclidean k-median and k-means with o(log n) Space

Bibliographic Details
Published in: Proceedings / annual Symposium on Foundations of Computer Science, pp. 883-908
Main Authors: Cohen-Addad, Vincent; Woodruff, David P.; Zhou, Samson
Format: Conference Proceeding
Language: English
Published: IEEE 06.11.2023
ISSN: 2575-8454
Description
Summary: We consider the classic Euclidean k-median and k-means objectives on data streams, where the goal is to provide a (1+\varepsilon)-approximation to the optimal k-median or k-means solution while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, and sensitivity sampling. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least \Omega(\log (n \Delta)) words of memory, where n is the size of the input and \Delta is the aspect ratio. A natural question is whether one can beat this logarithmic dependence on n and \Delta. In this paper, we break this barrier by first giving an insertion-only streaming algorithm that achieves a (1+\varepsilon)-approximation to the more general (k, z)-clustering problem, using \tilde{\mathcal{O}}\left(\frac{d k}{\varepsilon^{2}}\right) \cdot\left(2^{z \log z}\right) \cdot \min \left(\frac{1}{\varepsilon^{z}}, k\right) \cdot \operatorname{poly}(\log \log (n \Delta)) words of memory. Our techniques can also be used to achieve two-pass algorithms for k-median and k-means clustering on dynamic streams using \tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{2}}\right) \cdot \operatorname{poly}(d, k, \log \log (n \Delta)) words of memory.
DOI: 10.1109/FOCS57990.2023.00057