Identifying Effective Algorithms and Measures for Enhanced Clustering Quality: A Comprehensive Examination of Arbitrary Decisions in Hierarchical Clustering Algorithms

Bibliographic Details
Published in: Journal of Classification, Vol. 42, No. 2, pp. 457-489
Main Authors: Behzadidoost, Rashid; Izadkhah, Habib
Format: Journal Article
Language: English
Published: New York: Springer US, 1 July 2025
Springer Nature B.V.
ISSN: 0176-4268, 1432-1343
Description
Summary: Hierarchical clustering algorithms are widely used in various applications to group similar samples. However, a common challenge arises during the merging process when two or more pairs of clusters have equal proximity values, with no clear criterion to determine which clusters should be merged next. This leads to arbitrary decisions, which can negatively impact the quality of clustering results. The issue of arbitrary decisions has been highlighted in previous studies, emphasizing the need for algorithms and measures that minimize their occurrence. This study provides a comprehensive analysis of arbitrary decisions generated by nine popular hierarchical clustering algorithms across 100 measures, including similarities, distances, and entropy. In total, 737 unique combinations of clustering algorithms and measures were evaluated, many of which are novel and have not been previously explored. For binary data, the following combinations exhibited minimal arbitrary decisions: the agglomerative information bottleneck algorithm with measures such as cross-entropy and Jensen difference; the combined algorithm with Soergel and fidelity measures; the weighted combined algorithm with cosine similarity and fidelity measures; and the median algorithm with covariance similarity and squared chord distance measures. For non-binary data, fewer arbitrary decisions were observed for the agglomerative information bottleneck algorithm with cross-entropy and Kullback-Leibler measures, the centroid algorithm with Hellinger and squared chord distance measures, and the median algorithm with Hellinger and Jeffries-Matusita distance measures. This study provides valuable insights for researchers and practitioners by identifying specific clustering algorithms and measures that are less prone to arbitrary decisions, thereby enhancing the quality of clustering outcomes.
Overall, this paper contributes to the field of clustering by evaluating the effectiveness of new combinations of algorithms and measures in reducing arbitrary decisions.
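
As an illustration of the tie problem described in the summary, the following minimal sketch counts merge steps at which more than one pair of clusters attains the minimum distance. It is not the authors' method: the function names `hamming` and `count_arbitrary_decisions`, the choice of single linkage with Hamming distance, and the definition of an "arbitrary decision" as a tied merge step are all assumptions made for this example.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def count_arbitrary_decisions(samples, dist=hamming):
    """Run naive single-linkage agglomeration and count tied merge steps.

    Assumption: a step counts as an "arbitrary decision" when more than one
    pair of clusters attains the minimum inter-cluster distance.
    """
    clusters = [[i] for i in range(len(samples))]
    arbitrary = 0
    while len(clusters) > 1:
        # Single linkage: distance between clusters = closest pair of members.
        scored = [
            (
                min(dist(samples[i], samples[j])
                    for i in clusters[a] for j in clusters[b]),
                a,
                b,
            )
            for a, b in combinations(range(len(clusters)), 2)
        ]
        best = min(score for score, _, _ in scored)
        tied = [(a, b) for score, a, b in scored if score == best]
        if len(tied) > 1:
            arbitrary += 1
        # The tie (if any) is broken by enumeration order, i.e. arbitrarily.
        a, b = tied[0]
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return arbitrary


if __name__ == "__main__":
    binary_samples = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
    print(count_arbitrary_decisions(binary_samples))
```

Passing a different `dist` function to the same routine gives a rough way to compare how often different measures produce ties on a given data set, which is the kind of comparison the summary describes at a much larger scale.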
DOI: 10.1007/s00357-025-09506-5