Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling

•Spherical k-means for document clustering is improved to overcome its weaknesses.•Our method ensures dispersed initial points with faster computation time.•Our method preserves sparsity of centroid vectors for better interpretability.•We provide unsupervised document cluster labeling method. Due to...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Expert systems with applications Ročník 150; s. 113288
Hlavní autoři: Kim, Hyunjoong, Kim, Han Kyul, Cho, Sungzoon
Médium: Journal Article
Jazyk:angličtina
Vydáno: New York Elsevier Ltd 15.07.2020
Elsevier BV
Témata:
ISSN:0957-4174, 1873-6793
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:•Spherical k-means for document clustering is improved to overcome its weaknesses.•Our method ensures dispersed initial points with faster computation time.•Our method preserves sparsity of centroid vectors for better interpretability.•We provide unsupervised document cluster labeling method. Due to its simplicity and intuitive interpretability, spherical k-means is often used for clustering a large number of documents. However, there exist a number of drawbacks that need to be addressed for much effective document clustering. Without well-dispersed initial points, spherical k-means fails to converge quickly, which is critical for clustering a large number of documents. Furthermore, its dense centroid vectors needlessly incorporate the impact of infrequent and less-informative words, thereby distorting the distance calculation between the document vectors. In this paper, we propose practical improvements on spherical k-means to overcome these issues during document clustering. Our proposed initialization method not only guarantees dispersed initial points, but is also up to 1000 times faster than previously well-known initialization method such as k-means++. Furthermore, we enforce sparsity on the centroid vectors by using a data-driven threshold that is capable of dynamically adjusting its value depending on the clusters. Additionally, we propose an unsupervised cluster labeling method that effectively extracts meaningful keywords to describe each cluster. We have tested our improvements on seven different text datasets that include both new and publicly available datasets. Based on our experiments on these datasets, we have found that our proposed improvements successfully overcome the drawbacks of spherical k-means in significantly reduced computation time. Furthermore, we have qualitatively verified the performance of the proposed cluster labeling method by extracting descriptive keywords of the clusters from these datasets.
Bibliografie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2020.113288