Bi-criteria sublinear time algorithms for clustering with outliers in high dimensions

Bibliographic details
Published in: Theoretical Computer Science, Volume 1057, p. 115538
Main authors: Huang, Jiawei; Liu, Wenjie; Ding, Hu
Format: Journal Article
Language: English
Published: Elsevier B.V., 06.12.2025
ISSN:0304-3975
Description
Summary:
• Research highlight 1: This paper introduces a novel uniform sampling framework for solving k-means/median clustering with outliers problems. The key theoretical innovation is that the sample complexity is independent of both input size and dimensionality, making it particularly effective for large-scale and high-dimensional datasets. The analysis introduces a “star-shaped transformation” technique that enables rigorous analysis of clustering quality despite the presence of outliers.
• Research highlight 2: The paper proposes a practical sublinear time algorithm by combining uniform sampling with an “augmented sandwich lemma” technique to boost success probability. The experimental results demonstrate that this approach achieves clustering quality comparable to state-of-the-art methods while running significantly faster, especially on large datasets. The algorithm’s simple implementation and theoretical guarantees make it particularly suitable for real-world applications with limited data access or computational resources.

Real-world datasets often contain outliers, and the presence of outliers can make clustering problems much more challenging. Existing algorithms for clustering with outliers often have high computational complexity. In this paper, we propose a simple yet effective sublinear framework for solving the representative center-based clustering with outliers problems: k-median/means clustering with outliers. Our analysis is fundamentally different from previous (uniform and non-uniform) sampling-based ideas. In particular, our sample complexity is independent of the input size and dimensionality, and thus it is suitable for dealing with large-scale and high-dimensional datasets. We also conduct a set of experiments to evaluate the effectiveness of our proposed method on both synthetic and real datasets.
DOI:10.1016/j.tcs.2025.115538