Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling
| Published in: | Expert Systems with Applications, Vol. 150, p. 113288 |
|---|---|
| Main authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: Elsevier Ltd, 15 July 2020; Elsevier BV |
| Subjects: | |
| ISSN: | 0957-4174, 1873-6793 |
| Summary: | • Spherical k-means for document clustering is improved to overcome its weaknesses.
• Our method ensures dispersed initial points with faster computation time.
• Our method preserves the sparsity of centroid vectors for better interpretability.
• We provide an unsupervised document cluster labeling method.
Due to its simplicity and intuitive interpretability, spherical k-means is often used for clustering a large number of documents. However, it has several drawbacks that must be addressed for more effective document clustering. Without well-dispersed initial points, spherical k-means converges slowly, a critical problem when clustering a large number of documents. Furthermore, its dense centroid vectors needlessly incorporate the impact of infrequent and less-informative words, thereby distorting the distance calculation between the document vectors.
In this paper, we propose practical improvements to spherical k-means that overcome these issues in document clustering. Our proposed initialization method not only guarantees dispersed initial points, but is also up to 1000 times faster than well-known initialization methods such as k-means++. Furthermore, we enforce sparsity on the centroid vectors by using a data-driven threshold that dynamically adjusts its value depending on the clusters. Additionally, we propose an unsupervised cluster labeling method that effectively extracts meaningful keywords to describe each cluster.
We have tested our improvements on seven different text datasets that include both new and publicly available datasets. Based on our experiments on these datasets, we have found that our proposed improvements successfully overcome the drawbacks of spherical k-means in significantly reduced computation time. Furthermore, we have qualitatively verified the performance of the proposed cluster labeling method by extracting descriptive keywords of the clusters from these datasets. |
|---|---|
| DOI: | 10.1016/j.eswa.2020.113288 |
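
The summary describes three ingredients: spherical k-means itself, sparse centroids, and keyword-based cluster labels. The record does not give the authors' algorithms, so the following is only a minimal sketch of the general idea under stated assumptions. It uses plain random initialization (not the paper's initializer), a fixed cutoff `min_weight` as a stand-in for the paper's data-driven threshold, and the highest-weighted centroid terms as a naive label. All function and parameter names are hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def sparsify(centroid, min_weight):
    """Drop centroid entries below a fixed cutoff, then re-normalize.

    Assumption: a fixed cutoff replaces the paper's data-driven threshold.
    """
    kept = {t: w for t, w in centroid.items() if w >= min_weight}
    return normalize(kept or centroid)


def spherical_kmeans(docs, k, min_weight=0.1, n_iter=20, seed=0):
    """Cluster sparse document vectors by cosine similarity."""
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Assumption: plain random initialization, not the paper's method.
    centroids = [dict(d) for d in rng.sample(docs, k)]
    labels = [-1] * len(docs)
    for _ in range(n_iter):
        # Assignment step: each document goes to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: sum member vectors, project to the unit sphere, sparsify.
        for c in range(k):
            total = {}
            for d, lab in zip(docs, labels):
                if lab == c:
                    for t, w in d.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the previous centroid if the cluster is empty
                centroids[c] = sparsify(normalize(total), min_weight)
    return labels, centroids


def top_terms(centroid, n=3):
    """Naive cluster label: the n highest-weighted terms of a centroid."""
    return sorted(centroid, key=centroid.get, reverse=True)[:n]
```

As a usage illustration, calling `spherical_kmeans(docs, k=2, min_weight=0.3)` on a handful of term-count dictionaries returns one label per document plus the sparse centroids, and `top_terms(centroids[i])` gives a rough keyword description of cluster `i`. Because every vector has unit length, maximizing the dot product is equivalent to maximizing cosine similarity, which is why no explicit distance function is needed.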