Accelerating the k-Means++ Algorithm by Using Geometric Information
| Published in: | IEEE Access, Vol. 13, pp. 67693-67717 |
|---|---|
| Main Authors: | Guillem Rodriguez Corominas, Maria J. Blesa, Christian Blum |
| Format: | Journal Article |
| Language: | English |
| Published: | IEEE, 2025 |
| Subjects: | |
| ISSN: | 2169-3536 |
| Abstract | Clustering is a fundamental task in data analysis with applications across a wide range of fields, such as computer vision, pattern recognition, and data mining. Real-world use cases include social network analysis, medical imaging, market segmentation, and anomaly detection, to name a few. In this paper, we propose an acceleration of the exact k-means++ algorithm using geometric information, specifically the Triangle Inequality and additional norm filters, along with a two-step sampling procedure. Our experiments demonstrate that the accelerated version outperforms the standard k-means++ version in terms of the number of visited points and distance calculations, achieving greater speedup as the number of clusters increases. The version utilizing the Triangle Inequality is particularly effective for low-dimensional data, while the additional norm-based filter enhances performance in high-dimensional instances with greater norm variance among points. Additional experiments show the behavior of our algorithms when executed concurrently across multiple jobs and examine how memory performance impacts practical speedup. |
|---|---|
| Author | Blesa, Maria J.; Rodriguez Corominas, Guillem; Blum, Christian |
| Author_xml | 1. Rodriguez Corominas, Guillem (ORCID: 0000-0002-3863-2017; email: guillem.rodriguez.corominas@upc.edu), Department of Computer Science, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain. 2. Blesa, Maria J., Department of Computer Science, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain. 3. Blum, Christian (ORCID: 0000-0002-1736-3559), Artificial Intelligence Research Institute (IIIA-CSIC), Bellaterra, Spain. |
| CODEN | IAECCG |
| ContentType | Journal Article |
| DOI | 10.1109/ACCESS.2025.3561293 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present; IEEE Xplore Open Access Journals; IEEE All-Society Periodicals Package (ASPP) 1998–Present; IEEE Electronic Library (IEL); CrossRef; DOAJ Directory of Open Access Journals |
| DatabaseTitle | CrossRef |
| Discipline | Engineering |
| EISSN | 2169-3536 |
| EndPage | 67717 |
| ExternalDocumentID | oai_doaj_org_article_112872ae03da4fb59a0aa0c8470bed7e 10_1109_ACCESS_2025_3561293 10966875 |
| Genre | orig-research |
| GrantInformation_xml | 1. Agencia Estatal de Investigación (AEI), grant PID-2020-112581GB-C21 (MOTION). 2. Department of Research and Universities of the Government of Catalonia, by means of a European Social Fund (ESF)-funded pre-doctoral grant of the Catalan Agency for Management of University and Research Grants (AGAUR), grant 2022 FI_B 00903. 3. Ministerio de Ciencia e Innovación (MCIN)/AEI/10.13039/501100011033, grants TED2021-129319B-I00 and PID2022-136787NB-I00. 4. European Commission–NextGenerationEU, through the Momentum CSIC Programme: Develop Your Digital Talent, grant MMT24-IIIA-02 (funder ID: 10.13039/501100000780). |
| IEDL.DBID | DOA |
| ISSN | 2169-3536 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| License | https://creativecommons.org/licenses/by/4.0/legalcode |
| ORCID | 0000-0002-1736-3559 0000-0002-3863-2017 |
| OpenAccessLink | https://doaj.org/article/112872ae03da4fb59a0aa0c8470bed7e |
| PageCount | 25 |
| PublicationCentury | 2000 |
| PublicationDate | 2025 |
| PublicationDateYYYYMMDD | 2025-01-01 |
| PublicationDate_xml | – year: 2025 text: 20250000 |
| PublicationDecade | 2020 |
| PublicationTitle | IEEE Access |
| PublicationTitleAbbrev | Access |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SourceID | doaj crossref ieee |
| SourceType | Open Website Index Database Publisher |
| StartPage | 67693 |
| SubjectTerms | Approximation algorithms; Clustering; Clustering algorithms; Computational efficiency; D² sampling; Filters; k-means; Machine learning algorithms; Measurement; norm; Prediction algorithms; Scalability; triangle inequality; Vectors |
| Title | Accelerating the k-Means++ Algorithm by Using Geometric Information |
| URI | https://ieeexplore.ieee.org/document/10966875 https://doaj.org/article/112872ae03da4fb59a0aa0c8470bed7e |
| Volume | 13 |
| hasFullText | 1 |
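
The abstract describes speeding up exact k-means++ seeding with the Triangle Inequality and a norm-based filter, but this record does not give the algorithm itself. The following is only a minimal sketch of the general idea, assuming the standard Elkan-style bounds rather than the authors' actual method or their two-step sampling procedure. If a point p is currently closest to center c at distance d(p), a newly sampled center c' cannot be closer whenever d(c, c') >= 2 d(p), because d(p, c') >= d(c, c') - d(p, c). Likewise, the reverse triangle inequality gives d(p, c') >= | ||p|| - ||c'|| |, so the point can be skipped when that norm gap is at least d(p). The names `dist` and `kmeanspp_filtered` are hypothetical and chosen for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeanspp_filtered(points, k, rng):
    """k-means++ seeding that skips provably unnecessary distance computations.

    Illustrative sketch only (not the paper's algorithm).
    Returns (center_indices, nearest_center_distances, computed), where
    `computed` counts point-to-center distance evaluations only.
    """
    n = len(points)
    norms = [math.sqrt(sum(x * x for x in p)) for p in points]

    first = rng.randrange(n)              # first center: uniform at random
    centers = [first]
    nearest = [0] * n                     # position in `centers` of each point's closest center
    d = [dist(p, points[first]) for p in points]
    computed = n

    while len(centers) < k:
        # D^2 sampling: next center drawn with probability proportional to d(p)^2.
        new = rng.choices(range(n), weights=[x * x for x in d])[0]
        # Distance from each existing center to the new one (one per existing center).
        cc = [dist(points[c], points[new]) for c in centers]
        centers.append(new)
        new_pos = len(centers) - 1
        for i in range(n):
            # Triangle inequality filter: d(c, new) >= 2 d(p, c) implies d(p, new) >= d(p, c).
            if cc[nearest[i]] >= 2 * d[i]:
                continue
            # Norm filter: d(p, new) >= | ||p|| - ||new|| |.
            if abs(norms[i] - norms[new]) >= d[i]:
                continue
            computed += 1
            dn = dist(points[i], points[new])
            if dn < d[i]:
                d[i] = dn
                nearest[i] = new_pos
    return centers, d, computed
```

Calling `kmeanspp_filtered(points, k, random.Random(seed))` on a list of coordinate tuples returns the chosen center indices together with a count that can be compared against the `n * k` evaluations of the unfiltered procedure. Both filters only skip points whose nearest-center distance cannot improve, so the sampling distribution is unchanged up to floating-point rounding. The sketch assumes at least k distinct points; otherwise all sampling weights eventually become zero.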