WeDIV – An improved k-means clustering algorithm with a weighted distance and a novel internal validation index

Bibliographic details
Published in: Egyptian Informatics Journal, Volume 23, Issue 4, pp. 133-144
Main authors: Ning, Zilan; Chen, Jin; Huang, Jianjun; Sabo, Umar Jlbrilla; Yuan, Zheming; Dai, Zhijun
Format: Journal Article
Language: English
Publication details: Elsevier B.V., 01.12.2022
Elsevier
Subjects:
ISSN: 1110-8665, 2090-4754
Online access: Get full text
Description
Summary: Designing appropriate similarity metrics (distances) and estimating the optimal number of clusters are two important issues in cluster analysis. This study proposed an improved k-means clustering algorithm involving a Weighted Distance and a novel Internal Validation index (WeDIV). The weighted distance, EP_dis, combines the Euclidean and Pearson distances, weighting them by their relative contribution. This strategy captures the global spatial correlation and the local variation trend simultaneously in high-dimensional space. The new internal validation index, RCH, inspired by the Calinski-Harabasz (CH) index and the analysis of variance, was developed to automatically estimate the optimal number of clusters. EP_dis was shown mathematically to be reliable and was validated on two simulated datasets. Four simulated datasets with different properties were used to validate the effectiveness of RCH. Furthermore, we compared the clustering performance of WeDIV with 12 prevailing clustering algorithms on 16 UCI datasets. The results demonstrated that WeDIV outperforms the others whether or not the number of clusters is specified.
DOI: 10.1016/j.eij.2022.09.002
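
Illustrative sketch: the summary describes a distance that weights Euclidean against Pearson distance, and a validation index inspired by the Calinski-Harabasz (CH) index. The record does not give the formulas for EP_dis or RCH, so the Python below is only a rough illustration under stated assumptions. The function `weighted_distance`, its fixed weight `w`, and the max-scaling step are hypothetical and are not the paper's EP_dis; `calinski_harabasz` is the standard CH index, not RCH.

```python
import numpy as np


def euclidean_matrix(X):
    """Pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pearson_distance_matrix(X):
    """Pairwise Pearson distance (1 - r) between the rows of X.

    np.corrcoef treats each row as one variable. Rows with zero
    variance would yield NaN, so they are assumed not to occur here.
    """
    return 1.0 - np.corrcoef(X)


def weighted_distance(X, w=0.5):
    """ASSUMPTION: a convex combination of two rescaled distances.

    This stands in for the idea of weighting Euclidean against Pearson
    distance; the weighting scheme of the paper's EP_dis is not known
    from this record.
    """
    d_e = euclidean_matrix(X)
    d_p = pearson_distance_matrix(X)
    if d_e.max() > 0:
        d_e = d_e / d_e.max()  # scale Euclidean distances to [0, 1]
    d_p = d_p / 2.0            # Pearson distance lies in [0, 2]
    return w * d_e + (1.0 - w) * d_p


def calinski_harabasz(X, labels):
    """Standard CH index: between-cluster over within-cluster dispersion."""
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * ((center - overall_mean) ** 2).sum()
        within += ((members - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))


# Rows 0 and 1 share the same trend; row 2 is closer in space to row 0
# but follows the opposite trend.
X = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
D_mixed = weighted_distance(X, w=0.5)
D_euclid_only = weighted_distance(X, w=1.0)
```

With `w=1.0` only spatial closeness counts, so row 0 is nearer to row 2 than to row 1. With `w=0.5` the shared trend of rows 0 and 1 pulls them together, which is the kind of effect the summary attributes to combining the two distances.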