Initial Seed Selection for Mixed Data Using Modified K-means Clustering Algorithm

Bibliographic Details
Published in:	Arabian journal for science and engineering (2011) Vol. 45, No. 4; pp. 2685-2703
Main Authors:	Sajidha, S. A., Desikan, Kalyani, Chodnekar, Siddha Prabhu
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg Springer Berlin Heidelberg 01.04.2020 Springer Nature B.V
Subjects:	Algorithms; Cluster analysis; Clustering; Datasets; Distance measurement; Engineering; Humanities and Social Sciences; multidisciplinary; Research Article - Computer Engineering and Computer Science; Science; Statistical tests; Vector quantization; K-means; Mixed attributes; Initial seed points; prototypes
ISSN:	2193-567X, 1319-8025, 2191-4281

Description
Summary:	Data sets to which clustering is applied may be homogeneous (numerical or categorical) or heterogeneous (numerical and categorical) in nature. Handling homogeneous data is easier than handling heterogeneous data. We propose a novel technique for identifying initial seeds for heterogeneous data clustering, through the introduction of a unique distance measure in which the distance between numerical attributes is scaled so that it is comparable to that between categorical attributes. The proposed initial seed selection algorithm ensures that the initial seed points come from different clusters of the clustering solution; these seeds are then given as input to the modified K-means clustering algorithm along with the data set. This technique is independent of any user-defined parameter and thus can be easily applied to clusterable data sets with mixed attributes. We have also modified the K-means clustering algorithm to handle mixed attributes by using our novel distance measure for numerical data and assigning a distance of one when categorical values differ and zero when they match. Finally, a comparison has been made with existing algorithms to bring out the significance of our approach. We also perform a statistical test to evaluate the statistical significance of our proposed technique.
DOI:	10.1007/s13369-019-04121-0
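
The summary describes a distance measure in which numerical differences are scaled to be comparable with a 0/1 categorical mismatch. The record does not give the authors' scaling formula, so the following is only a minimal sketch under an assumption: each numerical difference is divided by that attribute's range (a hypothetical `num_ranges` argument), so every attribute contributes a value in [0, 1]. The function name `mixed_distance` and its parameters are illustrative, not taken from the paper.

```python
from typing import Sequence


def mixed_distance(
    a_num: Sequence[float],
    b_num: Sequence[float],
    a_cat: Sequence[str],
    b_cat: Sequence[str],
    num_ranges: Sequence[float],
) -> float:
    """Distance between two mixed-attribute records.

    ASSUMPTION: numerical differences are divided by the attribute range
    (max - min over the data set). The paper's actual scaling may differ.
    """
    # Numerical part: range-scaled absolute difference, each term in [0, 1].
    num_part = 0.0
    for x, y, r in zip(a_num, b_num, num_ranges):
        num_part += abs(x - y) / r if r > 0 else 0.0

    # Categorical part: 1 when the values differ, 0 when they match,
    # as stated in the summary.
    cat_part = sum(1 for p, q in zip(a_cat, b_cat) if p != q)

    return num_part + cat_part


# Usage: two records with two numerical and one categorical attribute.
d = mixed_distance([0.0, 5.0], [10.0, 5.0], ["red"], ["blue"], [10.0, 4.0])
print(d)  # 2.0
```

With this scaling, a numerical attribute at opposite ends of its range contributes the same amount as one categorical mismatch, which is one plausible reading of "comparable" in the summary.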