How much can k-means be improved by using better initialization and repeats?

Bibliographic details
Published in:	Pattern Recognition, Vol. 93, pp. 95-112
Main authors:	Fränti, Pasi; Sieranoja, Sami
Format:	Journal Article
Language:	English
Publisher:	Elsevier Ltd, 1 September 2019
Subjects:	Clustering accuracy; Clustering algorithms; Initialization; K-means; Prototype selection
ISSN:	0031-3203, 1873-5142

Description
Summary:
• The k-means clustering algorithm can be significantly improved by using a better initialization technique, and by repeating (restarting) the algorithm.
• When the data has overlapping clusters, k-means can improve the results of the initialization technique.
• When the data has well-separated clusters, the performance of k-means depends completely on the goodness of the initialization.
• Initialization using the simple furthest point heuristic (Maxmin) reduces the clustering error of k-means from 15% to 6%, on average.

In this paper, we study which factors most deteriorate the performance of the k-means algorithm, and how much of this deterioration can be overcome either by using a better initialization technique or by repeating (restarting) the algorithm. Our main finding is that when the clusters overlap, k-means can be significantly improved using these two tricks. The simple furthest point heuristic (Maxmin) reduces the number of erroneous clusters from 15% to 6%, on average, with our clustering benchmark. Repeating the algorithm 100 times reduces it further, down to 1%. This accuracy is more than enough for most pattern recognition applications. However, when the data has well-separated clusters, the performance of k-means depends completely on the goodness of the initialization. Therefore, if high clustering accuracy is needed, a better algorithm should be used instead.
DOI:	10.1016/j.patcog.2019.04.014
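
The two techniques named in the summary, furthest point (Maxmin) initialization and repeating k-means, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`maxmin_init`, `kmeans`, `repeated_kmeans`), the random choice of the first centroid, and the use of the sum of squared errors (SSE) to pick the best repeat are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def maxmin_init(points, k, rng):
    """Furthest point heuristic (assumed variant: random first centroid).

    Each next centroid is the point whose distance to its nearest
    already-chosen centroid is largest.
    """
    first = points[rng.randrange(len(points))]
    centroids = [first]
    # nearest[i]: squared distance from points[i] to its closest centroid so far
    nearest = [sq_dist(p, first) for p in points]
    while len(centroids) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centroids.append(points[idx])
        nearest = [min(d, sq_dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centroids


def kmeans(points, centroids, max_iter=100):
    """Plain k-means iterations; returns (centroids, sse)."""
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        # Update step: move each centroid to the mean of its group;
        # an empty group keeps its previous centroid.
        updated = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def repeated_kmeans(points, k, repeats=100, seed=0):
    """Run Maxmin + k-means `repeats` times and keep the lowest-SSE result."""
    rng = random.Random(seed)
    best = None
    for _ in range(repeats):
        result = kmeans(points, maxmin_init(points, k, rng))
        if best is None or result[1] < best[1]:
            best = result
    return best
```

Calling `repeated_kmeans(points, k)` with a list of coordinate tuples returns the best centroids found together with their SSE. With this variant, repeats differ only through the random first centroid, so the benefit of repeating depends on how much that choice changes the final result.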