k-Means NANI: An Improved Clustering Algorithm for Molecular Dynamics Simulations
| Published in: | Journal of Chemical Theory and Computation, Vol. 20, No. 13, p. 5583 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 9 July 2024 |
| ISSN: | 1549-9626 |
| Summary: | One of the key challenges of k-means clustering is the seed selection, or initial centroid estimation, since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex data sets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ lead to a lack of reproducibility. k-means N-Ary Natural Initiation (NANI) is presented as an alternative that tackles this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. In peptide and protein folding molecular simulations, NANI created compact and well-separated clusters and accurately found metastable states that agree with the literature. NANI can cluster diverse data sets and can be used as a standalone tool or as part of our MDANCE clustering package. |
|---|---|
| DOI: | 10.1021/acs.jctc.4c00308 |
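
The summary describes seeding k-means deterministically by first locating high-density regions and then choosing a diverse set of starting points. The following is a minimal sketch of that general idea only. It is not the NANI algorithm and not the MDANCE API: it assumes pairwise Euclidean distances as a stand-in for the n-ary comparisons used in the paper, and the function name `deterministic_seeds` and the parameters `dense_fraction` and `m` are hypothetical.

```python
import numpy as np

def deterministic_seeds(X, k, dense_fraction=0.5, m=5):
    """Hypothetical helper: choose k initial centroids with no random draws."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # All pairwise squared Euclidean distances (O(n^2) memory; fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Density proxy (an assumption): squared distance to the m-th nearest
    # neighbour, where a small value means a dense neighbourhood.
    m = min(m, n - 1)
    knn = np.sort(d2, axis=1)[:, m]
    # Keep the densest fraction; a stable sort makes tie-breaking reproducible.
    n_keep = min(n, max(k, int(np.ceil(dense_fraction * n))))
    candidates = np.argsort(knn, kind="stable")[:n_keep]
    # Start from the densest point, then repeatedly add the candidate that is
    # farthest from the seeds chosen so far (max-min diversity selection).
    chosen = [candidates[0]]
    min_d2 = d2[candidates, chosen[0]].copy()
    for _ in range(1, k):
        nxt = candidates[np.argmax(min_d2)]
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, d2[candidates, nxt])
    return X[chosen]
```

Because every step is a sort or an argmax over fixed data, repeated calls on the same input return the same seeds, which is the reproducibility property the summary contrasts with k-means++. The returned array could be passed as explicit initial centroids to any k-means implementation that accepts them.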