Accelerating the k-Means++ Algorithm by Using Geometric Information

Bibliographic Details
Published in: IEEE Access, Vol. 13, pp. 67693-67717
Main Authors: Rodriguez Corominas, Guillem; Blesa, Maria J.; Blum, Christian
Format: Journal Article
Language: English
Published: IEEE, 2025
ISSN: 2169-3536
Abstract: Clustering is a fundamental task in data analysis with applications across a wide range of fields, such as computer vision, pattern recognition, and data mining. Real-world use cases include social network analysis, medical imaging, market segmentation, and anomaly detection, to name a few. In this paper, we propose an acceleration of the exact k-means++ algorithm using geometric information, specifically the Triangle Inequality and additional norm filters, along with a two-step sampling procedure. Our experiments demonstrate that the accelerated version outperforms the standard k-means++ version in terms of the number of visited points and distance calculations, achieving greater speedup as the number of clusters increases. The version utilizing the Triangle Inequality is particularly effective for low-dimensional data, while the additional norm-based filter enhances performance in high-dimensional instances with greater norm variance among points. Additional experiments show the behavior of our algorithms when executed concurrently across multiple jobs and examine how memory performance impacts practical speedup.
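The two geometric filters named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: this record does not describe their data structures or their two-step sampling procedure, so the sketch uses plain D² sampling, and the function names (`dist`, `kmeanspp_filtered`) and the order of the tests are assumptions made for illustration. It relies only on two standard facts. If a point x lies at distance d(x) from its nearest center c_x and a new center c satisfies d(c, c_x) >= 2 d(x), then the triangle inequality gives d(x, c) >= d(x), so x cannot move closer. Likewise, | ||x|| - ||c|| | <= d(x, c), so a norm gap of at least d(x) rules out an improvement.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeanspp_filtered(points, k, seed=0):
    """k-means++ seeding with two skip tests (illustrative sketch only).

    Returns (center_indices, nearest_center_distances, computed), where
    `computed` counts point-to-center distance evaluations.
    """
    rng = random.Random(seed)
    n = len(points)
    # Precompute norms once; they drive the norm filter below.
    norms = [math.sqrt(sum(x * x for x in p)) for p in points]

    first = rng.randrange(n)
    centers = [first]
    d = [dist(p, points[first]) for p in points]  # distance to nearest center
    assign = [0] * n  # position in `centers` of each point's nearest center
    computed = n

    while len(centers) < k:
        weights = [x * x for x in d]  # plain D^2 sampling weights
        if sum(weights) == 0.0:
            break  # every point coincides with a chosen center
        new = rng.choices(range(n), weights=weights, k=1)[0]
        c = points[new]
        # Distances from the new center to each existing center.
        cc = [dist(c, points[j]) for j in centers]
        centers.append(new)
        new_pos = len(centers) - 1
        for i in range(n):
            # Triangle inequality filter: d(x, c) >= d(c, c_x) - d(x) >= d(x).
            if cc[assign[i]] >= 2.0 * d[i]:
                continue
            # Norm filter: d(x, c) >= | ||x|| - ||c|| | >= d(x).
            if abs(norms[i] - norms[new]) >= d[i]:
                continue
            computed += 1
            dn = dist(points[i], c)
            if dn < d[i]:
                d[i] = dn
                assign[i] = new_pos
    return centers, d, computed
```

Because both tests skip a point only when the new center provably cannot be closer, the nearest-center distances, and therefore the sampling distribution, match those of unfiltered k-means++ up to floating-point rounding; only the number of distance evaluations changes.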
Author Details:
1. Guillem Rodriguez Corominas (ORCID: 0000-0002-3863-2017; email: guillem.rodriguez.corominas@upc.edu), Department of Computer Science, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
2. Maria J. Blesa, Department of Computer Science, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
3. Christian Blum (ORCID: 0000-0002-1736-3559), Artificial Intelligence Research Institute (IIIA-CSIC), Bellaterra, Spain
DOI: 10.1109/ACCESS.2025.3561293
Funding:
- Agencia Estatal de Investigación (AEI), grant PID-2020-112581GB-C21 (MOTION)
- Department of Research and Universities of the Government of Catalonia, by means of a European Social Fund (ESF)-funded pre-doctoral grant of the Catalan Agency for Management of University and Research Grants (AGAUR), grant 2022 FI_B 00903
- Ministerio de Ciencia e Innovación (MCIN)/AEI/10.13039/501100011033, grants TED2021-129319B-I00 and PID2022-136787NB-I00
- European Commission–NextGenerationEU, through Momentum CSIC Programme: Develop Your Digital Talent, grant MMT24-IIIA-02
Open Access: Yes
Peer Reviewed: Yes
License: https://creativecommons.org/licenses/by/4.0/legalcode
Open Access Link: https://ieeexplore.ieee.org/document/10966875
Subjects: Approximation algorithms; Clustering; Clustering algorithms; Computational efficiency; D² sampling; Filters; k-means; Machine learning algorithms; Measurement; norm; Prediction algorithms; Scalability; triangle inequality; Vectors