Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling
• Spherical k-means for document clustering is improved to overcome its weaknesses. • Our method ensures dispersed initial points with faster computation time. • Our method preserves sparsity of centroid vectors for better interpretability. • We provide an unsupervised document cluster labeling method. Due to...
| Published in: | Expert systems with applications Vol. 150; p. 113288 |
|---|---|
| Main authors: | Kim, Hyunjoong; Kim, Han Kyul; Cho, Sungzoon |
| Format: | Journal Article |
| Language: | English |
| Published: | New York, Elsevier Ltd / Elsevier BV, 15 July 2020 |
| Subjects: | |
| ISSN: | 0957-4174, 1873-6793 |
| Online access: | Full text |
| Abstract | • Spherical k-means for document clustering is improved to overcome its weaknesses. • Our method ensures dispersed initial points with faster computation time. • Our method preserves sparsity of centroid vectors for better interpretability. • We provide an unsupervised document cluster labeling method.
Due to its simplicity and intuitive interpretability, spherical k-means is often used for clustering a large number of documents. However, it has a number of drawbacks that need to be addressed for more effective document clustering. Without well-dispersed initial points, spherical k-means fails to converge quickly, and fast convergence is critical when clustering a large number of documents. Furthermore, its dense centroid vectors needlessly incorporate the impact of infrequent and less informative words, thereby distorting the distance calculation between the document vectors.
In this paper, we propose practical improvements to spherical k-means to overcome these issues during document clustering. Our proposed initialization method not only guarantees dispersed initial points, but is also up to 1000 times faster than well-known initialization methods such as k-means++. Furthermore, we enforce sparsity on the centroid vectors by using a data-driven threshold that dynamically adjusts its value depending on the clusters. Additionally, we propose an unsupervised cluster labeling method that effectively extracts meaningful keywords to describe each cluster.
We have tested our improvements on seven different text datasets, including both new and publicly available datasets. Based on our experiments on these datasets, we have found that our proposed improvements successfully overcome the drawbacks of spherical k-means with significantly reduced computation time. Furthermore, we have qualitatively verified the performance of the proposed cluster labeling method by extracting descriptive keywords of the clusters from these datasets. |
|---|---|
| ArticleNumber | 113288 |
| Author | Cho, Sungzoon Kim, Han Kyul Kim, Hyunjoong |
| Author_xml | 1. Kim, Hyunjoong (hyunjoong@dm.snu.ac.kr), Department of Industrial Engineering, Seoul National University, South Korea; 2. Kim, Han Kyul (hank@dm.snu.ac.kr, ORCID 0000-0002-4854-7211), Department of Industrial Engineering, Seoul National University, South Korea; 3. Cho, Sungzoon (zoon@snu.ac.kr, ORCID 0000-0002-1695-1973), Department of Industrial Engineering, Seoul National University, South Korea |
| ContentType | Journal Article |
| Copyright | 2020 Copyright Elsevier BV Jul 15, 2020 |
| Copyright_xml | – notice: 2020 – notice: Copyright Elsevier BV Jul 15, 2020 |
| DOI | 10.1016/j.eswa.2020.113288 |
| DatabaseName | CrossRef Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
| DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
| DatabaseTitleList | Computer and Information Systems Abstracts |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| EISSN | 1873-6793 |
| ExternalDocumentID | 10_1016_j_eswa_2020_113288 S0957417420301135 |
| ISICitedReferencesCount | 37 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000528193700020&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 0957-4174 |
| IngestDate | Sat Jul 26 02:32:53 EDT 2025 Tue Nov 18 21:32:28 EST 2025 Sat Nov 29 07:09:12 EST 2025 Fri Feb 23 02:48:37 EST 2024 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Spherical k-means; Document clustering; Sparse vector projection; Clustering labeling; k-means initialization |
| Language | English |
| LinkModel | OpenURL |
| ORCID | 0000-0002-1695-1973 0000-0002-4854-7211 |
| PQID | 2438220948 |
| PQPubID | 2045477 |
| ParticipantIDs | proquest_journals_2438220948 crossref_primary_10_1016_j_eswa_2020_113288 crossref_citationtrail_10_1016_j_eswa_2020_113288 elsevier_sciencedirect_doi_10_1016_j_eswa_2020_113288 |
| PublicationCentury | 2000 |
| PublicationDate | 2020-07-15 |
| PublicationDateYYYYMMDD | 2020-07-15 |
| PublicationDate_xml | – month: 07 year: 2020 text: 2020-07-15 day: 15 |
| PublicationDecade | 2020 |
| PublicationPlace | New York |
| PublicationPlace_xml | – name: New York |
| PublicationTitle | Expert systems with applications |
| PublicationYear | 2020 |
| Publisher | Elsevier Ltd Elsevier BV |
| Publisher_xml | – name: Elsevier Ltd – name: Elsevier BV |
| StartPage | 113288 |
| SubjectTerms | Centroids; Clustering; Clustering labeling; Datasets; Dispersion; Document clustering; k-means initialization; Labeling; Sparse vector projection; Spherical k-means |
| Title | Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling |
| URI | https://dx.doi.org/10.1016/j.eswa.2020.113288 https://www.proquest.com/docview/2438220948 |
| Volume | 150 |