Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling

•Spherical k-means for document clustering is improved to overcome its weaknesses.
•Our method ensures dispersed initial points with faster computation time.
•Our method preserves sparsity of centroid vectors for better interpretability.
•We provide an unsupervised document cluster labeling method.
Due to...

Detailed description

Bibliographic details
Published in: Expert Systems with Applications, vol. 150, p. 113288
Main authors: Kim, Hyunjoong; Kim, Han Kyul; Cho, Sungzoon
Format: Journal Article
Language: English
Published: New York, Elsevier Ltd, 15 July 2020
Elsevier BV
Subjects:
ISSN: 0957-4174, 1873-6793
Online access: Full text
Abstract Due to its simplicity and intuitive interpretability, spherical k-means is often used for clustering a large number of documents. However, it has a number of drawbacks that need to be addressed for more effective document clustering. Without well-dispersed initial points, spherical k-means fails to converge quickly, which is critical when clustering a large number of documents. Furthermore, its dense centroid vectors needlessly incorporate the impact of infrequent and less informative words, thereby distorting the distance calculation between the document vectors. In this paper, we propose practical improvements to spherical k-means that overcome these issues in document clustering. Our proposed initialization method not only guarantees dispersed initial points, but is also up to 1000 times faster than well-known initialization methods such as k-means++. Furthermore, we enforce sparsity on the centroid vectors by using a data-driven threshold that dynamically adjusts its value depending on the clusters. Additionally, we propose an unsupervised cluster labeling method that effectively extracts meaningful keywords to describe each cluster. We have tested our improvements on seven different text datasets, including both new and publicly available datasets. Based on our experiments on these datasets, we have found that our proposed improvements successfully overcome the drawbacks of spherical k-means with significantly reduced computation time. Furthermore, we have qualitatively verified the performance of the proposed cluster labeling method by extracting descriptive keywords of the clusters from these datasets.
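For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of standard spherical k-means (unit-normalized vectors, assignment by cosine similarity, centroids re-normalized after each update). It is not the authors' implementation: the random initialization stands in for their initialization method, and the `min_weight` cutoff is a hypothetical fixed threshold used only to illustrate what a sparse centroid looks like, whereas the paper describes a data-driven threshold whose details are not given in this record. The function name and all parameters are assumptions.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20, min_weight=0.0, seed=0):
    """Baseline spherical k-means on non-negative row vectors X (docs x terms).

    min_weight is a HYPOTHETICAL fixed cutoff (fraction of the largest
    centroid entry); it is an illustration, not the paper's threshold.
    """
    rng = np.random.default_rng(seed)

    # Unit-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)

    # Random initial centroids (a stand-in, not the paper's initialization).
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].copy()

    for _ in range(n_iter):
        # Assign each document to the most similar centroid.
        labels = np.argmax(X @ centroids.T, axis=1)
        for c in range(k):
            members = X[labels == c]
            if members.shape[0] == 0:
                continue  # keep the previous centroid for an empty cluster
            centroid = members.sum(axis=0)
            # Zero out small entries so the centroid stays sparse.
            centroid[centroid < min_weight * centroid.max()] = 0.0
            norm = np.linalg.norm(centroid)
            if norm > 0:
                centroids[c] = centroid / norm

    # Final assignment against the updated centroids.
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids


# Toy term-count matrix: two documents per topic.
X = np.array([[1, 0, 0, 0],
              [2, 0.1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 3, 0.1]], dtype=float)
labels, centroids = spherical_kmeans(X, k=2, min_weight=0.5)
```

With a non-zero `min_weight`, low-weight terms drop out of each centroid, which is the general effect the abstract attributes to sparse centroid vectors; the highest-weight terms that remain give a rough sense of how keywords could describe a cluster.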
ArticleNumber 113288
Author Cho, Sungzoon
Kim, Han Kyul
Kim, Hyunjoong
Author_xml – sequence: 1
  givenname: Hyunjoong
  surname: Kim
  fullname: Kim, Hyunjoong
  email: hyunjoong@dm.snu.ac.kr
  organization: Department of Industrial Engineering, Seoul National University, South Korea
– sequence: 2
  givenname: Han Kyul
  orcidid: 0000-0002-4854-7211
  surname: Kim
  fullname: Kim, Han Kyul
  email: hank@dm.snu.ac.kr
  organization: Department of Industrial Engineering, Seoul National University, South Korea
– sequence: 3
  givenname: Sungzoon
  orcidid: 0000-0002-1695-1973
  surname: Cho
  fullname: Cho, Sungzoon
  email: zoon@snu.ac.kr
  organization: Department of Industrial Engineering, Seoul National University, South Korea
CitedBy_id crossref_primary_10_1016_j_aei_2022_101805
crossref_primary_10_1016_j_patrec_2025_04_019
crossref_primary_10_1177_01655515231165230
crossref_primary_10_1016_j_eswa_2021_114652
crossref_primary_10_52080_rvgluz_30_110_17
crossref_primary_10_5585_2024_23974
crossref_primary_10_1016_j_eswa_2021_115560
crossref_primary_10_1016_j_aei_2023_102277
crossref_primary_10_1371_journal_pone_0313238
crossref_primary_10_1145_3665324
crossref_primary_10_1080_01969722_2023_2175135
crossref_primary_10_2166_ws_2022_273
crossref_primary_10_3390_su16114639
crossref_primary_10_1109_TFUZZ_2023_3235384
crossref_primary_10_1155_2023_4181159
crossref_primary_10_1007_s11390_021_0102_0
crossref_primary_10_1016_j_inffus_2024_102886
crossref_primary_10_1007_s12626_020_00063_4
crossref_primary_10_1016_j_knosys_2021_107591
crossref_primary_10_1016_j_eswa_2020_113598
crossref_primary_10_3390_su15086748
crossref_primary_10_1007_s11432_021_3316_x
crossref_primary_10_1145_3588685
crossref_primary_10_1016_j_asoc_2025_113699
crossref_primary_10_1016_j_swevo_2024_101720
crossref_primary_10_3233_JIFS_202079
crossref_primary_10_3389_fmed_2023_1076794
Cites_doi 10.1093/comjnl/16.1.30
10.1016/j.jcss.2012.05.004
10.1016/j.patcog.2008.04.004
10.1109/TIT.1982.1056489
10.1016/j.eswa.2017.05.002
10.1108/eb026526
10.1109/TCYB.2013.2283497
10.1016/0377-0427(87)90125-7
10.1007/s10107-010-0420-4
10.1007/s40745-015-0040-1
10.1016/j.is.2016.02.007
10.1103/PhysRevE.70.066111
10.1016/j.knosys.2016.06.031
10.1016/j.neucom.2017.05.046
10.1016/S0031-3203(02)00060-2
10.14778/2180912.2180915
10.1016/j.eswa.2016.03.045
10.1023/A:1007612920971
ContentType Journal Article
Copyright 2020
Copyright Elsevier BV Jul 15, 2020
DBID AAYXX
CITATION
7SC
8FD
JQ2
L7M
L~C
L~D
DOI 10.1016/j.eswa.2020.113288
DatabaseName CrossRef
Computer and Information Systems Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Computer and Information Systems Abstracts
Technology Research Database
Computer and Information Systems Abstracts – Academic
Advanced Technologies Database with Aerospace
ProQuest Computer Science Collection
Computer and Information Systems Abstracts Professional
DatabaseTitleList Computer and Information Systems Abstracts

DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 1873-6793
ExternalDocumentID 10_1016_j_eswa_2020_113288
S0957417420301135
ID FETCH-LOGICAL-c328t-b182855ee8ba3bb02cbf642716005671df11a42a50d819e3dc651dcc24a6caea3
ISICitedReferencesCount 37
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000528193700020&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 0957-4174
IngestDate Sat Jul 26 02:32:53 EDT 2025
Tue Nov 18 21:32:28 EST 2025
Sat Nov 29 07:09:12 EST 2025
Fri Feb 23 02:48:37 EST 2024
IsPeerReviewed true
IsScholarly true
Keywords Spherical k-means
Document clustering
Sparse vector projection
Clustering labeling
k-means initialization
Language English
LinkModel OpenURL
MergedId FETCHMERGED-LOGICAL-c328t-b182855ee8ba3bb02cbf642716005671df11a42a50d819e3dc651dcc24a6caea3
ORCID 0000-0002-1695-1973
0000-0002-4854-7211
PQID 2438220948
PQPubID 2045477
ParticipantIDs proquest_journals_2438220948
crossref_primary_10_1016_j_eswa_2020_113288
crossref_citationtrail_10_1016_j_eswa_2020_113288
elsevier_sciencedirect_doi_10_1016_j_eswa_2020_113288
PublicationCentury 2000
PublicationDate 2020-07-15
PublicationDateYYYYMMDD 2020-07-15
PublicationDate_xml – month: 07
  year: 2020
  text: 2020-07-15
  day: 15
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle Expert systems with applications
PublicationYear 2020
Publisher Elsevier Ltd
Elsevier BV
Publisher_xml – name: Elsevier Ltd
– name: Elsevier BV
References Xie, Girshick, Farhadi (bib0039) 2016
Chuang, Ramage, Manning, Heer (bib0011) 2012
Likas, Vlassis, Verbeek (bib0026) 2003; 36
Shen, Liu, Tsang, Shen, Sun (bib0034) 2017
Lewis, Ackerman, de Sa (bib0024) 2012; 34
Yang, Fu, Sidiropoulos, Hong (bib0041) 2017
Bagirov (bib0005) 2008; 41
Buchta, Kober, Feinerer, Hornik (bib0008) 2012; 50
Lloyd (bib0027) 1982; 28
Arthur, Vassilvitskii (bib0003) 2007
Li, Zhao, Chu, Liu (bib0025) 2013; 79
Abualigah, Khader, Al-Betar, Alomari (bib0001) 2017; 84
Onan, Korukoğlu, Bulut (bib0029) 2016; 57
Jurafsky (bib0022) 2000
Ester, Kriegel, Sander, Xu (bib0018) 1996
Snyder, Knowles, Dredze, Gormley, Wolfe (bib0037) 2013
Dhillon, Guan, Kogan (bib0015) 2002
Zhang, Xu, Tang, Li (bib0042) 2006
Kim, Kim, Cho (bib0023) 2017; 266
Bachem, Lucic, Hassani, Krause (bib0004) 2016
Capó, Pérez, Lozano (bib0009) 2017; 117
Jin, Li, Lin, Cai (bib0021) 2013; 44
He, Wen, Sun (bib0019) 2013
Almeida, Guedes, Meira, Zaki (bib0002) 2011
Sievert, Shirley (bib0036) 2014
Sparck Jones (bib0038) 1972; 28
Chuang, Manning, Heer (bib0010) 2012
Blei, Ng, Jordan (bib0007) 2003; 3
Huang (bib0020) 2008
Coates, Ng, Lee (bib0013) 2011
Newman, Noh, Talley, Karimi, Baldwin (bib0028) 2010
Rousseeuw (bib0030) 1987; 20
Shalev-Shwartz, Singer, Srebro, Cotter (bib0033) 2011; 127
Duchi, Shalev-Shwartz, Singer, Chandra (bib0017) 2008
Xu, Tian (bib0040) 2015; 2
Bahmani, Moseley, Vattani, Kumar, Vassilvitskii (bib0006) 2012; 5
Clauset, Newman, Moore (bib0012) 2004; 70
Dhillon, Modha (bib0016) 2001; 42
Sibson (bib0035) 1973; 16
Coates, Ng (bib0014) 2012
Sculley (bib0031) 2010
Shahrivari, Jalili (bib0032) 2016; 60
References_xml – start-page: 44
  year: 2011
  end-page: 59
  ident: bib0002
  article-title: Is there a best quality metric for graph clusters?
– volume: 84
  start-page: 24
  year: 2017
  end-page: 36
  ident: bib0001
  article-title: Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering
  publication-title: Expert Systems with Applications
– start-page: 55
  year: 2016
  end-page: 63
  ident: bib0004
  article-title: Fast and provably good seedings for k-means
  publication-title: Advances in neural information processing systems
– volume: 79
  start-page: 216
  year: 2013
  end-page: 229
  ident: bib0025
  article-title: Speeding up k-means algorithm by gpus
  publication-title: Journal of Computer and System Sciences
– volume: 41
  start-page: 3192
  year: 2008
  end-page: 3199
  ident: bib0005
  article-title: Modified global k-means algorithm for minimum sum-of-squares clustering problems
  publication-title: Pattern Recognition
– volume: 44
  start-page: 1362
  year: 2013
  end-page: 1371
  ident: bib0021
  article-title: Density sensitive hashing
– volume: 60
  start-page: 1
  year: 2016
  end-page: 12
  ident: bib0032
  article-title: Single-pass and linear-time k-means clustering based on mapreduce
  publication-title: Information Systems
– start-page: 85
  year: 2006
  end-page: 96
  ident: bib0042
  article-title: Keyword extraction using support vector machine
  publication-title: International conference on web-age information management
– volume: 42
  start-page: 143
  year: 2001
  end-page: 175
  ident: bib0016
  article-title: Concept decompositions for large sparse text data using clustering
– volume: 50
  start-page: 1
  year: 2012
  end-page: 22
  ident: bib0008
  article-title: Spherical k-means clustering
  publication-title: Journal of Statistical Software
– volume: 3
  start-page: 993
  year: 2003
  end-page: 1022
  ident: bib0007
  article-title: Latent dirichlet allocation
  publication-title: Journal of machine Learning research
– start-page: 226
  year: 1996
  end-page: 231
  ident: bib0018
  article-title: A density-based algorithm for discovering clusters in large spatial databases with noise.
  publication-title: Proceeding of the 2nd international conference of knowledge discovery and data mining
– start-page: 1177
  year: 2010
  end-page: 1178
  ident: bib0031
  article-title: Web-scale k-means clustering
  publication-title: Proceedings of the 19th international conference on world wide web
– start-page: 3861
  year: 2017
  end-page: 3870
  ident: bib0041
  article-title: Towards k-means-friendly spaces: Simultaneous deep learning and clustering
  publication-title: Proceedings of the 34th international conference on machine learning-volume 70
– volume: 266
  start-page: 336
  year: 2017
  end-page: 352
  ident: bib0023
  article-title: Bag-of-concepts: Comprehending document representation through clustering words in distributed representation
  publication-title: Neurocomputing
– start-page: 131
  year: 2002
  end-page: 138
  ident: bib0015
  article-title: Iterative clustering of high dimensional text data augmented by local search
– start-page: 49
  year: 2008
  end-page: 56
  ident: bib0020
  article-title: Similarity measures for text document clustering
– volume: 36
  start-page: 451
  year: 2003
  end-page: 461
  ident: bib0026
  article-title: The global k-means clustering algorithm
– start-page: 5
  year: 2013
  end-page: 9
  ident: bib0037
  article-title: Topic models and metadata for visualizing text corpora
  publication-title: Proceedings of the 2013 NAACL HLT Demonstration Session
– start-page: 561
  year: 2012
  end-page: 580
  ident: bib0014
  article-title: Learning feature representations with k-means
  publication-title: Neural networks: Tricks of the trade
– volume: 2
  start-page: 165
  year: 2015
  end-page: 193
  ident: bib0040
  article-title: A comprehensive survey of clustering algorithms
  publication-title: Annals of Data Science
– start-page: 2938
  year: 2013
  end-page: 2945
  ident: bib0019
  article-title: K-means hashing: An affinity-preserving quantization method for learning binary compact codes
  publication-title: Proceedings of the ieee conference on computer vision and pattern recognition
– start-page: 443
  year: 2012
  end-page: 452
  ident: bib0011
  article-title: Interpretation and trust: Designing model-driven visualizations for text analysis
  publication-title: Proceedings of the sigchi conference on human factors in computing systems
– volume: 57
  start-page: 232
  year: 2016
  end-page: 247
  ident: bib0029
  article-title: Ensemble of keyword extraction methods and classifiers in text classification
  publication-title: Expert Systems with Applications
– start-page: 272
  year: 2008
  end-page: 279
  ident: bib0017
  article-title: Efficient projections onto the l 1-ball for learning in high dimensions
  publication-title: Proceedings of the 25th international conference on machine learning
– start-page: 215
  year: 2011
  end-page: 223
  ident: bib0013
  article-title: An analysis of single-layer networks in unsupervised feature learning
  publication-title: Proceedings of the fourteenth international conference on artificial intelligence and statistics
– volume: 20
  start-page: 53
  year: 1987
  end-page: 65
  ident: bib0030
  article-title: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
– volume: 5
  start-page: 622
  year: 2012
  end-page: 633
  ident: bib0006
  article-title: Scalable k-means++
  publication-title: Proceedings of the VLDB Endowment
– volume: 28
  start-page: 11
  year: 1972
  end-page: 21
  ident: bib0038
  article-title: A statistical interpretation of term specificity and its application in retrieval
– volume: 34
  year: 2012
  ident: bib0024
  article-title: Human cluster evaluation and formal quality measures: A comparative study
  publication-title: Proceedings of the annual meeting of the cognitive science society
– volume: 28
  start-page: 129
  year: 1982
  end-page: 137
  ident: bib0027
  article-title: Least squares quantization in pcm
– start-page: 478
  year: 2016
  end-page: 487
  ident: bib0039
  article-title: Unsupervised deep embedding for clustering analysis
  publication-title: International conference on machine learning
– volume: 70
  start-page: 66111
  year: 2004
  ident: bib0012
  article-title: Finding community structure in very large networks
– volume: 16
  start-page: 30
  year: 1973
  end-page: 34
  ident: bib0035
  article-title: Slink: an optimally efficient algorithm for the single-link cluster method
– start-page: 63
  year: 2014
  end-page: 70
  ident: bib0036
  article-title: Ldavis: A method for visualizing and interpreting topics
  publication-title: Proceedings of the workshop on interactive language learning, visualization, and interfaces
– start-page: 215
  year: 2010
  end-page: 224
  ident: bib0028
  article-title: Evaluating topic models for digital libraries
  publication-title: Proceedings of the 10th annual joint conference on digital libraries
– volume: 117
  start-page: 56
  year: 2017
  end-page: 69
  ident: bib0009
  article-title: An efficient approximation to the k-means clustering for massive data
  publication-title: Knowledge-Based Systems
– start-page: 1027
  year: 2007
  end-page: 1035
  ident: bib0003
  article-title: k-means++: The advantages of careful seeding
– year: 2000
  ident: bib0022
  article-title: Speech & language processing
– start-page: 74
  year: 2012
  end-page: 77
  ident: bib0010
  article-title: Termite: Visualization techniques for assessing textual topic models
  publication-title: Proceedings of the international working conference on advanced visual interfaces
– volume: 127
  start-page: 3
  year: 2011
  end-page: 30
  ident: bib0033
  article-title: Pegasos: Primal estimated sub-gradient solver for svm
– year: 2017
  ident: bib0034
  article-title: Compressed k-means for large-scale clustering
  publication-title: Thirty-first aaai conference on artificial intelligence
– volume: 16
  start-page: 30
  issue: 1
  year: 1973
  ident: 10.1016/j.eswa.2020.113288_bib0035
  article-title: Slink: an optimally efficient algorithm for the single-link cluster method
  publication-title: The Computer Journal
  doi: 10.1093/comjnl/16.1.30
– volume: 79
  start-page: 216
  issue: 2
  year: 2013
  ident: 10.1016/j.eswa.2020.113288_bib0025
  article-title: Speeding up k-means algorithm by gpus
  publication-title: Journal of Computer and System Sciences
  doi: 10.1016/j.jcss.2012.05.004
– volume: 3
  start-page: 993
  issue: Jan
  year: 2003
  ident: 10.1016/j.eswa.2020.113288_bib0007
  article-title: Latent dirichlet allocation
  publication-title: Journal of machine Learning research
– volume: 41
  start-page: 3192
  issue: 10
  year: 2008
  ident: 10.1016/j.eswa.2020.113288_bib0005
  article-title: Modified global k-means algorithm for minimum sum-of-squares clustering problems
  publication-title: Pattern Recognition
  doi: 10.1016/j.patcog.2008.04.004
– start-page: 85
  year: 2006
  ident: 10.1016/j.eswa.2020.113288_bib0042
  article-title: Keyword extraction using support vector machine
– volume: 28
  start-page: 129
  issue: 2
  year: 1982
  ident: 10.1016/j.eswa.2020.113288_bib0027
  article-title: Least squares quantization in pcm
  publication-title: IEEE Transactions on Information Theory
  doi: 10.1109/TIT.1982.1056489
– volume: 84
  start-page: 24
  year: 2017
  ident: 10.1016/j.eswa.2020.113288_bib0001
  article-title: Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering
  publication-title: Expert Systems with Applications
  doi: 10.1016/j.eswa.2017.05.002
– volume: 50
  start-page: 1
  issue: 10
  year: 2012
  ident: 10.1016/j.eswa.2020.113288_bib0008
  article-title: Spherical k-means clustering
  publication-title: Journal of Statistical Software
– volume: 28
  start-page: 11
  issue: 1
  year: 1972
  ident: 10.1016/j.eswa.2020.113288_bib0038
  article-title: A statistical interpretation of term specificity and its application in retrieval
  publication-title: Journal of Documentation
  doi: 10.1108/eb026526
– volume: 44
  start-page: 1362
  issue: 8
  year: 2013
  ident: 10.1016/j.eswa.2020.113288_bib0021
  article-title: Density sensitive hashing
  publication-title: IEEE Transactions on Cybernetics
  doi: 10.1109/TCYB.2013.2283497
– volume: 20
  start-page: 53
  year: 1987
  ident: 10.1016/j.eswa.2020.113288_bib0030
  article-title: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
  publication-title: Journal of Computational and Applied Mathematics
  doi: 10.1016/0377-0427(87)90125-7
– volume: 127
  start-page: 3
  issue: 1
  year: 2011
  ident: 10.1016/j.eswa.2020.113288_bib0033
  article-title: Pegasos: Primal estimated sub-gradient solver for svm
  publication-title: Mathematical Programming
  doi: 10.1007/s10107-010-0420-4
– start-page: 74
  year: 2012
  ident: 10.1016/j.eswa.2020.113288_bib0010
  article-title: Termite: Visualization techniques for assessing textual topic models
– start-page: 131
  year: 2002
  ident: 10.1016/j.eswa.2020.113288_bib0015
  article-title: Iterative clustering of high dimensional text data augmented by local search
– start-page: 478
  year: 2016
  ident: 10.1016/j.eswa.2020.113288_bib0039
  article-title: Unsupervised deep embedding for clustering analysis
– start-page: 55
  year: 2016
  ident: 10.1016/j.eswa.2020.113288_bib0004
  article-title: Fast and provably good seedings for k-means
– volume: 2
  start-page: 165
  issue: 2
  year: 2015
  ident: 10.1016/j.eswa.2020.113288_bib0040
  article-title: A comprehensive survey of clustering algorithms
  publication-title: Annals of Data Science
  doi: 10.1007/s40745-015-0040-1
– start-page: 2938
  year: 2013
  ident: 10.1016/j.eswa.2020.113288_bib0019
  article-title: K-means hashing: An affinity-preserving quantization method for learning binary compact codes
– volume: 60
  start-page: 1
  year: 2016
  ident: 10.1016/j.eswa.2020.113288_bib0032
  article-title: Single-pass and linear-time k-means clustering based on mapreduce
  publication-title: Information Systems
  doi: 10.1016/j.is.2016.02.007
– start-page: 226
  year: 1996
  ident: 10.1016/j.eswa.2020.113288_bib0018
  article-title: A density-based algorithm for discovering clusters in large spatial databases with noise.
– volume: 70
  start-page: 66111
  issue: 6
  year: 2004
  ident: 10.1016/j.eswa.2020.113288_bib0012
  article-title: Finding community structure in very large networks
  publication-title: Physical Review E
  doi: 10.1103/PhysRevE.70.066111
– year: 2000
  ident: 10.1016/j.eswa.2020.113288_bib0022
– start-page: 1027
  year: 2007
  ident: 10.1016/j.eswa.2020.113288_bib0003
  article-title: k-means++: The advantages of careful seeding
– volume: 117
  start-page: 56
  year: 2017
  ident: 10.1016/j.eswa.2020.113288_bib0009
  article-title: An efficient approximation to the k-means clustering for massive data
  publication-title: Knowledge-Based Systems
  doi: 10.1016/j.knosys.2016.06.031
– volume: 266
  start-page: 336
  year: 2017
  ident: 10.1016/j.eswa.2020.113288_bib0023
  article-title: Bag-of-concepts: Comprehending document representation through clustering words in distributed representation
  publication-title: Neurocomputing
  doi: 10.1016/j.neucom.2017.05.046
– volume: 34
  year: 2012
  ident: 10.1016/j.eswa.2020.113288_bib0024
  article-title: Human cluster evaluation and formal quality measures: A comparative study
– start-page: 1177
  year: 2010
  ident: 10.1016/j.eswa.2020.113288_bib0031
  article-title: Web-scale k-means clustering
– start-page: 3861
  year: 2017
  ident: 10.1016/j.eswa.2020.113288_bib0041
  article-title: Towards k-means-friendly spaces: Simultaneous deep learning and clustering
– start-page: 215
  year: 2010
  ident: 10.1016/j.eswa.2020.113288_bib0028
  article-title: Evaluating topic models for digital libraries
– year: 2017
  ident: 10.1016/j.eswa.2020.113288_bib0034
  article-title: Compressed k-means for large-scale clustering
– volume: 36
  start-page: 451
  issue: 2
  year: 2003
  ident: 10.1016/j.eswa.2020.113288_bib0026
  article-title: The global k-means clustering algorithm
  publication-title: Pattern Recognition
  doi: 10.1016/S0031-3203(02)00060-2
– volume: 5
  start-page: 622
  issue: 7
  year: 2012
  ident: 10.1016/j.eswa.2020.113288_bib0006
  article-title: Scalable k-means++
  publication-title: Proceedings of the VLDB Endowment
  doi: 10.14778/2180912.2180915
– start-page: 215
  year: 2011
  ident: 10.1016/j.eswa.2020.113288_bib0013
  article-title: An analysis of single-layer networks in unsupervised feature learning
– volume: 57
  start-page: 232
  year: 2016
  ident: 10.1016/j.eswa.2020.113288_bib0029
  article-title: Ensemble of keyword extraction methods and classifiers in text classification
  publication-title: Expert Systems with Applications
  doi: 10.1016/j.eswa.2016.03.045
– start-page: 561
  year: 2012
  ident: 10.1016/j.eswa.2020.113288_bib0014
  article-title: Learning feature representations with k-means
– start-page: 44
  year: 2011
  ident: 10.1016/j.eswa.2020.113288_bib0002
  article-title: Is there a best quality metric for graph clusters?
– start-page: 49
  year: 2008
  ident: 10.1016/j.eswa.2020.113288_bib0020
  article-title: Similarity measures for text document clustering
– start-page: 63
  year: 2014
  ident: 10.1016/j.eswa.2020.113288_bib0036
  article-title: LDAvis: A method for visualizing and interpreting topics
– start-page: 5
  year: 2013
  ident: 10.1016/j.eswa.2020.113288_bib0037
  article-title: Topic models and metadata for visualizing text corpora
  publication-title: Proceedings of the 2013 NAACL HLT Demonstration Session
– start-page: 443
  year: 2012
  ident: 10.1016/j.eswa.2020.113288_bib0011
  article-title: Interpretation and trust: Designing model-driven visualizations for text analysis
– start-page: 272
  year: 2008
  ident: 10.1016/j.eswa.2020.113288_bib0017
  article-title: Efficient projections onto the ℓ1-ball for learning in high dimensions
– volume: 42
  start-page: 143
  issue: 1
  year: 2001
  ident: 10.1016/j.eswa.2020.113288_bib0016
  article-title: Concept decompositions for large sparse text data using clustering
  publication-title: Machine Learning
  doi: 10.1023/A:1007612920971
StartPage 113288
SubjectTerms Centroids
Clustering
Clustering labeling
Datasets
Dispersion
Document clustering
k-means initialization
Labeling
Sparse vector projection
Spherical k-means
Title Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling
URI https://dx.doi.org/10.1016/j.eswa.2020.113288
https://www.proquest.com/docview/2438220948
Volume 150