An Improved K-means Clustering Algorithm Towards an Efficient Data-Driven Modeling

Detailed bibliography
Published in:	Annals of Data Science, Volume 11, Issue 5, pp. 1525-1544
Main authors:	Zubair, Md; Iqbal, MD. Asif; Shil, Avijeet; Chowdhury, M. J. M.; Moni, Mohammad Ali; Sarker, Iqbal H.
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg: Springer Berlin Heidelberg, 1 October 2024; Springer Nature B.V
Subjects:	Algorithms; Artificial Intelligence; Business and Management; Centroids; Cluster analysis; Clustering; Datasets; Economics; Finance; Insurance; Machine learning; Management; Optimization; Statistics for Business; Subgroups; Synthetic data; Unsupervised learning; Vector quantization; Data Science; Percentile; Principal Component Analysis; Machine Learning; Unsupervised Algorithm; K-means Clustering
ISSN:	2198-5804, 2198-5812

Description
Summary:	K-means is one of the best-known unsupervised machine learning algorithms. It typically finds distinct, non-overlapping clusters, with each point assigned to exactly one group. Using the minimum squared distance, it assigns each point to the nearest cluster or subgroup. One of the main concerns with K-means is finding optimal initial centroids: determining the optimum position of the initial cluster centroids at the very first iteration is the most challenging task. This paper proposes an approach to find the optimal initial centroids efficiently, reducing both the number of iterations and the execution time. To analyze the effectiveness of the proposed method, we conducted experiments on several real-world datasets. We first analyzed COVID-19 and patient datasets to show the method's efficiency. A synthetic dataset of 10M instances with 8 dimensions is also used to estimate the performance of the proposed algorithm. Experimental results show that the proposed method outperforms the traditional k-means++ and random centroid initialization methods in computation time and number of iterations.
DOI:	10.1007/s40745-022-00428-2
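
The summary above compares against two baseline initializations, random and k-means++, and measures the number of iterations to convergence. The following is a minimal pure-Python sketch of those two baselines plugged into standard K-means (Lloyd's iterations with minimum-squared-distance assignment). It is not the paper's proposed method, which this record does not describe in detail. All function names (`squared_distance`, `random_init`, `kmeans_pp_init`, `kmeans`) and the toy data are illustrative assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_init(points, k, rng):
    """Baseline: pick k distinct data points uniformly at random."""
    return [list(p) for p in rng.sample(points, k)]


def kmeans_pp_init(points, k, rng):
    """Baseline: k-means++ seeding, sampling each new centroid with
    probability proportional to its squared distance to the nearest
    centroid chosen so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


def kmeans(points, initial_centroids, max_iter=100):
    """Standard K-means. Returns (centroids, labels, iterations_used)."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to the nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            return centroids, labels, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels, max_iter


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy data (an assumption for illustration): three Gaussian blobs in 2-D.
    blob_centers = [(0.0, 0.0), (8.0, 8.0), (0.0, 8.0)]
    data = [
        (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in blob_centers
        for _ in range(200)
    ]
    for name, init in (("random", random_init), ("k-means++", kmeans_pp_init)):
        _, _, iterations = kmeans(data, init(data, 3, rng))
        print(f"{name}: converged after {iterations} iterations")
```

The initializer is passed in as a plain list of starting centroids, so any other seeding strategy can be compared on the same data by iteration count without changing `kmeans` itself.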