Bi-criteria sublinear time algorithms for clustering with outliers in high dimensions


Bibliographic details
Published in: Theoretical computer science, vol. 1057, p. 115538
Main authors: Huang, Jiawei; Liu, Wenjie; Ding, Hu
Format: Journal Article
Language: English
Published: Elsevier B.V., 6 December 2025
ISSN:0304-3975
Online access: Full text
Description
Summary:
• Research highlight 1: This paper introduces a novel uniform sampling framework for solving k-means/median clustering with outliers problems. The key theoretical innovation is that the sample complexity is independent of both input size and dimensionality, which makes the framework particularly effective for large-scale and high-dimensional datasets. The analysis introduces a “star-shaped transformation” technique that enables rigorous analysis of clustering quality despite the presence of outliers.
• Research highlight 2: The paper proposes a practical sublinear time algorithm that combines uniform sampling with an “augmented sandwich lemma” technique to boost the success probability. The experimental results show that this approach achieves clustering quality comparable to state-of-the-art methods while running significantly faster, especially on large datasets. The algorithm’s simple implementation and theoretical guarantees make it particularly suitable for real-world applications with limited data access or computational resources.

Real-world datasets often contain outliers, and their presence can make clustering problems much more challenging. Existing algorithms for clustering with outliers often have high computational complexity. In this paper, we propose a simple yet effective sublinear framework for solving the representative center-based clustering with outliers problems: k-median/means clustering with outliers. Our analysis is fundamentally different from previous (uniform and non-uniform) sampling-based ideas. In particular, our sample complexity is independent of the input size and dimensionality, and thus it is suitable for dealing with large-scale and high-dimensional datasets. We also conduct a set of experiments to evaluate the effectiveness of our proposed method on both synthetic and real datasets.
DOI:10.1016/j.tcs.2025.115538
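
The summary describes the approach only at a high level: draw a uniform sample, then cluster the sample while tolerating outliers. The following is a minimal Python sketch of that general idea, not the authors' algorithm. It assumes a trimmed Lloyd-style heuristic as the solver on the sample, and the names `trimmed_kmeans`, `sampled_clustering`, `sample_size`, and `slack` are hypothetical. The `slack` factor, which discards somewhat more than the nominal outlier fraction, is only an illustration of a bi-criteria style relaxation and is likewise an assumption.

```python
import numpy as np


def trimmed_kmeans(X, k, z, n_iter=20, rng=None):
    """Lloyd-style iterations that ignore the z farthest points each round.

    Illustrative stand-in solver (an assumption, not the paper's method).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(rng)
    n = len(X)
    z = max(0, min(z, n - k))  # always keep at least k points as inliers
    centers = X[rng.choice(n, size=k, replace=False)]  # fancy indexing copies
    for _ in range(n_iter):
        # squared distance from every point to every center, shape (n, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        nearest = d2.min(axis=1)
        inliers = np.argsort(nearest)[: n - z]  # drop the z farthest points
        for j in range(k):
            members = inliers[labels[inliers] == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = X[members].mean(axis=0)
    return centers


def sampled_clustering(X, k, outlier_frac, sample_size, slack=1.5, rng=None):
    """Cluster a uniform sample of X instead of the full dataset.

    outlier_frac is the assumed fraction of outliers in X; slack > 1 discards
    proportionally more points in the sample (hypothetical relaxation).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(rng)
    m = min(sample_size, len(X))
    S = X[rng.choice(len(X), size=m, replace=False)]  # uniform sample
    z_sample = int(np.ceil(slack * outlier_frac * m))
    return trimmed_kmeans(S, k, z_sample, rng=rng)
```

A call such as `sampled_clustering(X, k=3, outlier_frac=0.01, sample_size=500)` returns a `(k, d)` array of centers. In this sketch the clustering work after sampling depends on `sample_size` rather than on the number of input points, which is the property the summary emphasizes; the sample sizes required for the paper's guarantees are not reproduced here.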