Research on Improvement and Parallelization of Canopy-KMeans Clustering Algorithm
| Published in: | 2021 International Conference on Electronic Information Engineering and Computer Science (EIECS), pp. 455-458 |
|---|---|
| Main author: | |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 23 September 2021 |
| Subjects: | |
| Summary: | When applied to large-scale data, the traditional K-Means algorithm suffers from convergence to local optima and slow clustering because its initial cluster centers are chosen at random. To address these problems, this paper presents an improved algorithm, Canopy-KMeans. First, the "minimum maximum principle" is used to optimize the selection of center points in the Canopy algorithm, yielding a more accurate set of cluster centers and the number of clusters k. Next, the "triangle inequality principle" is used to optimize the K-Means algorithm by reducing unnecessary distance calculations during iteration. Finally, the algorithm is designed and implemented in parallel on the MapReduce computing framework of the Hadoop platform. Experimental results show that, compared with the algorithm before optimization, the optimized K-Means converges faster and achieves better clustering quality, making it suitable for clustering large-scale data. |
|---|---|
| DOI: | 10.1109/EIECS53707.2021.9588045 |
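
The two serial optimizations named in the summary can be illustrated with a minimal Python sketch. This is not the paper's implementation: the record does not give the exact rules, so the farthest-point seeding below is an assumed stand-in for the "minimum maximum principle", and the pruning test is the standard triangle-inequality bound for nearest-center search. The function names `dist`, `max_min_centers`, and `assign_pruned` and the synthetic data are hypothetical, and the MapReduce parallelization is not shown.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k):
    """Pick k seeds: each new seed is the point farthest from its nearest seed.

    Assumption: this farthest-point rule stands in for the paper's
    "minimum maximum principle"; the paper's exact criterion is not given here.
    """
    centers = [points[0]]  # assumed starting point: the first point
    nearest = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, dist(p, points[idx])) for d, p in zip(nearest, points)]
    return centers


def assign_pruned(points, centers):
    """Assign each point to its nearest center, skipping provably farther centers.

    If d(best, c) >= 2 * d(x, best), the triangle inequality gives
    d(x, c) >= d(x, best), so d(x, c) need not be computed.
    Returns (labels, number of point-to-center distances computed).
    """
    k = len(centers)
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, computed = [], 0
    for x in points:
        best, best_d = 0, dist(x, centers[0])
        computed += 1
        for c in range(1, k):
            if cc[best][c] >= 2 * best_d:
                continue  # center c cannot be closer than the current best
            d = dist(x, centers[c])
            computed += 1
            if d < best_d:
                best, best_d = c, d
        labels.append(best)
    return labels, computed


# Usage on three synthetic, well-separated blobs (hypothetical data).
rng = random.Random(0)
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (10, 10), (-10, 10)] for _ in range(50)]
centers = max_min_centers(points, 3)
labels, computed = assign_pruned(points, centers)
print(f"{computed} distances computed instead of {len(points) * len(centers)}")
```

The pruned assignment returns the same labels as an exhaustive nearest-center search; only the number of distance evaluations changes, which is the saving the summary attributes to the triangle inequality.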