Research on Distributed Parallelization of Improved Spectral Clustering Algorithm for Big Data

Bibliographic details
Published in:	2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), pp. 544-549
Main author:	Yang, Han
Format:	Conference paper
Language:	English
Publication details:	IEEE, 27.02.2024
Subjects:	Big Data; Clustering algorithms; Clustering methods; data partitioning; density-sensitive similarity; distributed parallelization; Electrical engineering; Euclidean distance; Learning systems; Partitioning algorithms; Spectral clustering

Description
Summary:	In the field of data mining, clustering algorithms play a key role in extracting valuable insights from vast datasets without incorporating learning mechanisms. One such classical clustering approach is the spectral clustering algorithm. This algorithm effectively converts a clustering challenge into the segmentation of an undirected graph, enabling it to handle intricate non-convex datasets adeptly and avoid getting trapped in local optimization pitfalls. Nevertheless, the conventional spectral clustering technique relies on the Gaussian kernel function, which uses Euclidean distance to determine sample similarities. This method proves overly sensitive to the Gaussian kernel's parameters and fails to accurately represent inter-sample relationships. To address the drawbacks related to similarity measurement and the computational inefficiencies inherent in the traditional spectral clustering method, enhancements have been made to refine the clustering outcomes. The enhanced spectral clustering algorithm has been redesigned to be distributed and parallelized, a strategic move intended to bolster the processing ability when handling enormous datasets.
DOI:	10.1109/EEBDA60612.2024.10485912
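
Illustrative note: the summary states that conventional spectral clustering builds its similarities from a Gaussian kernel over Euclidean distance and that the result is sensitive to the kernel parameter. The sketch below is not taken from the paper; it only illustrates that baseline similarity under stated assumptions. The function name `gaussian_affinity`, the `2 * sigma**2` scaling, the zero diagonal, and the toy points are all assumptions chosen for illustration.

```python
import math


def gaussian_affinity(points, sigma):
    """Dense affinity matrix W with W[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    Assumptions (not from the paper): this common form of the kernel,
    and a zero diagonal so that a point has no similarity edge to itself.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Squared Euclidean distance between points i and j.
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W


if __name__ == "__main__":
    # Hypothetical toy data: two nearby points and one distant point.
    points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    for sigma in (0.5, 1.0, 5.0):
        W = gaussian_affinity(points, sigma)
        # Print the near-pair and far-pair similarities for each sigma.
        print(sigma, W[0][1], W[0][2])
```

With a small `sigma` the near and far pairs both receive similarities close to zero, while a large `sigma` pushes both toward one, which is one way to see the parameter sensitivity the summary mentions.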