Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling
| Published in: | Expert systems with applications Vol. 150; p. 113288 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York, Elsevier Ltd, 15.07.2020, Elsevier BV |
| Subjects: | |
| ISSN: | 0957-4174, 1873-6793 |
| Summary: | • Spherical k-means for document clustering is improved to overcome its weaknesses. • Our method ensures dispersed initial points with faster computation time. • Our method preserves sparsity of centroid vectors for better interpretability. • We provide an unsupervised document cluster labeling method.
Due to its simplicity and intuitive interpretability, spherical k-means is often used for clustering a large number of documents. However, it has a number of drawbacks that need to be addressed for more effective document clustering. Without well-dispersed initial points, spherical k-means converges slowly, which is a critical problem when clustering a large number of documents. Furthermore, its dense centroid vectors needlessly incorporate the impact of infrequent and less-informative words, thereby distorting the distance calculation between the document vectors.
In this paper, we propose practical improvements on spherical k-means to overcome these issues during document clustering. Our proposed initialization method not only guarantees dispersed initial points, but is also up to 1000 times faster than well-known initialization methods such as k-means++. Furthermore, we enforce sparsity on the centroid vectors by using a data-driven threshold that dynamically adjusts its value depending on the clusters. Additionally, we propose an unsupervised cluster labeling method that effectively extracts meaningful keywords to describe each cluster.
We have tested our improvements on seven different text datasets, including both new and publicly available datasets. Based on our experiments on these datasets, we have found that our proposed improvements successfully overcome the drawbacks of spherical k-means with significantly reduced computation time. Furthermore, we have qualitatively verified the performance of the proposed cluster labeling method by extracting descriptive keywords of the clusters from these datasets. |
|---|---|
| DOI: | 10.1016/j.eswa.2020.113288 |
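
For readers unfamiliar with the baseline, the following is a minimal sketch of plain spherical k-means with a simple centroid sparsification step. It is not the authors' method: this record does not specify their initialization, threshold, or labeling procedure. Everything named here is an assumption made for illustration, including the random initialization, the `ratio` cutoff (weights below a fraction of the largest weight in a centroid are zeroed), and the `top_terms` helper, which simply lists the highest-weight terms of a centroid.

```python
import numpy as np


def normalize_rows(X):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms


def sparsify(centroids, ratio=0.1):
    """Hypothetical rule: zero out weights below ratio * (row maximum).

    Assumes non-negative term weights (e.g. TF or TF-IDF vectors).
    """
    cutoff = ratio * centroids.max(axis=1, keepdims=True)
    return np.where(centroids >= cutoff, centroids, 0.0)


def spherical_kmeans(X, k, n_iter=20, ratio=0.1, seed=0):
    """Cluster rows of X by cosine similarity; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    X = normalize_rows(np.asarray(X, dtype=float))
    # Plain random initialization (assumption; not the paper's initializer).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # On unit vectors, the dot product equals cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.sum(axis=0)
        # Sparsify, then project back onto the unit sphere.
        centroids = normalize_rows(sparsify(centroids, ratio))
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids


def top_terms(centroid, vocab, n=5):
    """Naive labeling: the n highest-weight terms with non-zero weight."""
    order = np.argsort(-centroid)[:n]
    return [vocab[i] for i in order if centroid[i] > 0]
```

Calling `spherical_kmeans(X, k)` on a document-term matrix `X` returns one label per document and one unit-length centroid per cluster; `top_terms(centroids[j], vocab)` then gives a rough keyword description of cluster `j`.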