Identifying Effective Algorithms and Measures for Enhanced Clustering Quality: A Comprehensive Examination of Arbitrary Decisions in Hierarchical Clustering Algorithms
| Published in: | Journal of Classification, Vol. 42, No. 2, pp. 457-489 |
|---|---|
| Main authors: | , |
| Format: | Journal Article |
| Language: | English |
| Publication details: | New York, Springer US, 1 July 2025, Springer Nature B.V. |
| Subjects: | |
| ISSN: | 0176-4268, 1432-1343 |
| Summary: | Hierarchical clustering algorithms are widely used in various applications to group similar samples. However, a common challenge arises during the merging process when two or more clusters have equal values, with no clear criterion to determine which clusters should be merged next. This leads to arbitrary decisions, which can negatively impact the quality of clustering results. The issue of arbitrary decisions has been highlighted in previous studies, emphasizing the need for algorithms and measures that minimize their occurrence. This study provides a comprehensive analysis of arbitrary decisions generated by nine popular hierarchical clustering algorithms across 100 measures, including similarities, distances, and entropy. In total, 737 unique combinations of clustering algorithms and measures were evaluated, many of which are novel and have not been previously explored. The results show that the agglomerative information bottleneck algorithm, when paired with measures such as cross-entropy and Jensen difference, the combined algorithm with Soergel and fidelity measures, the weighted combined algorithm with cosine similarity and fidelity measures, and the median algorithm with covariance similarity and squared chord distance measures, exhibited minimal arbitrary decisions for binary data. For non-binary data, the agglomerative information bottleneck algorithm with cross-entropy and Kullback-Leibler measures, the centroid algorithm with Hellinger and squared chord distance measures, and the median algorithm with Hellinger and Jeffries-Matusita distance measures showed fewer arbitrary decisions. This study provides valuable insights for researchers and practitioners by identifying specific clustering algorithms and measures that are less prone to arbitrary decisions, thereby enhancing the quality of clustering outcomes. Overall, this paper contributes to the field of clustering by evaluating the effectiveness of new combinations of algorithms and measures in reducing arbitrary decisions. |
| DOI: | 10.1007/s00357-025-09506-5 |