Bi-criteria sublinear time algorithms for clustering with outliers in high dimensions

Bibliographic details
Published in: Theoretical Computer Science, vol. 1057, p. 115538
Main authors: Huang, Jiawei; Liu, Wenjie; Ding, Hu
Format: Journal Article
Language: English
Published: Elsevier B.V., 6 December 2025
ISSN: 0304-3975
Online access: Full text
Abstract
• Research highlight 1: This paper introduces a novel uniform sampling framework for solving k-means/k-median clustering with outliers. The key theoretical innovation is that the sample complexity is independent of both the input size and the dimensionality, which makes the framework particularly effective for large-scale, high-dimensional datasets. The analysis introduces a “star-shaped transformation” technique that enables a rigorous analysis of clustering quality despite the presence of outliers.
• Research highlight 2: The paper proposes a practical sublinear-time algorithm that combines uniform sampling with an “augmented sandwich lemma” technique to boost the success probability. The experimental results demonstrate that this approach achieves clustering quality comparable to state-of-the-art methods while running significantly faster, especially on large datasets. The algorithm’s simple implementation and theoretical guarantees make it particularly suitable for real-world applications with limited data access or computational resources.

Real-world datasets often contain outliers, and their presence can make clustering problems much more challenging. Existing algorithms for clustering with outliers often have high computational complexity. In this paper, we propose a simple yet effective sublinear framework for solving the representative center-based clustering with outliers problems: k-median/means clustering with outliers. Our analysis is fundamentally different from previous (uniform and non-uniform) sampling-based ideas. In particular, our sample complexity is independent of the input size and dimensionality, and thus the framework is suitable for large-scale and high-dimensional datasets. We also conduct a set of experiments to evaluate the effectiveness of our proposed method on both synthetic and real datasets.
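
The sampling idea described in the abstract can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the authors' algorithm: the function names, the `eps` slack parameter, and the user-chosen `sample_size` are all hypothetical, and the trimmed Lloyd-style heuristic stands in for whatever clustering subroutine the paper actually uses. The sketch draws a uniform sample, clusters it while ignoring slightly more than the sample's proportional share of the z outliers (the bi-criteria relaxation named in the title), and evaluates the resulting centers on the full data.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def trimmed_kmeans(points, k, num_outliers, iters=20, rng=None):
    """Lloyd-style heuristic that ignores the num_outliers farthest points
    in every round. Assumed stand-in black box, not the paper's subroutine."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Rank points by distance to the nearest center; drop the farthest.
        ranked = sorted(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        inliers = ranked[:len(ranked) - num_outliers]
        # Assign each remaining point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in inliers:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers


def sampled_kmeans_with_outliers(points, k, z, sample_size, eps=0.5, seed=0):
    """Cluster a uniform sample, discarding (1 + eps) times the expected
    number of sampled outliers. sample_size is chosen by the caller here;
    it is NOT derived from the paper's sample-complexity bound."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    budget = int((1 + eps) * z * len(sample) / len(points)) + 1
    budget = min(budget, len(sample) - k)
    return trimmed_kmeans(sample, k, budget, rng=rng)


def cost_with_outliers(points, centers, num_excluded):
    """k-means cost after excluding the num_excluded farthest points."""
    dists = sorted(min(sq_dist(p, c) for c in centers) for p in points)
    return sum(dists[:len(dists) - num_excluded])


if __name__ == "__main__":
    gen = random.Random(1)
    data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(1000)]
    data += [(gen.gauss(20, 1), gen.gauss(20, 1)) for _ in range(1000)]
    data += [(gen.uniform(200, 300), gen.uniform(200, 300)) for _ in range(20)]
    centers = sampled_kmeans_with_outliers(data, k=2, z=20, sample_size=200)
    print(centers, cost_with_outliers(data, centers, 30))
```

Only the sample is ever clustered, so the clustering cost depends on `sample_size` rather than on the number of input points. Because the heuristic starts from random centers, it can settle in a poor local optimum; the sketch carries none of the approximation guarantees claimed in the abstract.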
Author details
1. Huang, Jiawei (ORCID: 0000-0003-4819-2585), University of Science and Technology of China, Hefei, 230000, Anhui, China
2. Liu, Wenjie, University of Science and Technology of China, Hefei, 230000, Anhui, China
3. Ding, Hu (ORCID: 0000-0002-1307-6077; email: huding@ustc.edu.cn, huding@buffalo.edu), University of Science and Technology of China, Hefei, 230000, Anhui, China
Copyright 2025 Elsevier B.V.
DOI 10.1016/j.tcs.2025.115538
Discipline Mathematics
Computer Science
IsPeerReviewed true
IsScholarly true
Keywords Outliers
Clustering
Sampling
Approximation algorithm
Sublinear