Research on Improvement and Parallelization of Canopy-KMeans Clustering Algorithm

Bibliographic details
Published in: 2021 International Conference on Electronic Information Engineering and Computer Science (EIECS), pp. 455-458
Main author: Zhao, Huiling
Format: Conference paper
Language: English
Published: IEEE, 23 September 2021
Description
Summary: When applied to large-scale data, the traditional K-Means algorithm suffers from local optima and slow clustering because its initial cluster centers are chosen at random. To address these problems, this paper presents an improved algorithm, Canopy-KMeans. First, the "minimum-maximum principle" is used to optimize the selection of center points in the Canopy algorithm, yielding a more accurate set of cluster centers and the number of clusters k. Then, the "triangle inequality principle" is used to optimize the K-Means algorithm by reducing unnecessary distance calculations during iteration. Finally, the algorithm is designed and implemented in parallel with the MapReduce computing framework on the Hadoop platform. Experimental tests show that, compared with the algorithm before optimization, the optimized K-Means converges quickly, achieves better clustering quality, and is suitable for clustering large-scale data.
DOI: 10.1109/EIECS53707.2021.9588045
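
The two optimizations named in the summary can be illustrated with a small single-machine sketch. It is not the paper's implementation: the record gives no details, so everything here is an assumption. The function names `max_min_centers` and `assign_with_pruning`, the distance `threshold` used as a stopping rule, and the choice of the first point as the initial seed are all hypothetical. The first function picks each new center as the point whose distance to its nearest existing center is largest, which is one common reading of a "minimum-maximum" rule. The second uses the standard triangle-inequality bound: if d(c, c') >= 2 d(x, c), then c' cannot be closer to x than c, so the distance d(x, c') need not be computed. The MapReduce parallelization is not shown.

```python
import math


def max_min_centers(points, threshold):
    """Pick seeds farthest-first: each new center is the point whose distance
    to its nearest already-chosen center is largest. Stops once that distance
    is <= threshold (the threshold rule is an assumption, not from the paper)."""
    centers = [points[0]]  # assumption: start from the first point
    nearest = [math.dist(p, centers[0]) for p in points]
    for _ in range(len(points) - 1):
        idx = max(range(len(points)), key=nearest.__getitem__)
        if nearest[idx] <= threshold:
            break
        centers.append(points[idx])
        # Update each point's distance to its nearest chosen center.
        nearest = [min(d, math.dist(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return centers


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping any center c' with
    d(best, c') >= 2 * d(point, best); by the triangle inequality such a
    center cannot be strictly closer. Returns (labels, skipped_count)."""
    cc = [[math.dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, math.dist(p, centers[0])
        for j in range(1, len(centers)):
            if cc[best][j] >= 2 * best_d:
                skipped += 1  # distance computation avoided
                continue
            d = math.dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


# Usage on two well-separated toy groups (illustrative data only).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = max_min_centers(pts, threshold=5.0)
labels, skipped = assign_with_pruning(pts, centers)
```

The number of centers returned by the seeding step plays the role of k, so the assignment step needs no separate k parameter. The pruning only skips distance computations that cannot change the result, so the labels match a brute-force nearest-center assignment.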