k-Means NANI: An Improved Clustering Algorithm for Molecular Dynamics Simulations

Bibliographic Details
Published in: Journal of Chemical Theory and Computation, Vol. 20, No. 13, p. 5583
Main Authors: Chen, Lexin; Roe, Daniel R; Kochert, Matthew; Simmerling, Carlos; Miranda-Quintana, Ramón Alain
Format: Journal Article
Language: English
Published: United States, 9 July 2024
ISSN: 1549-9626
Description
Summary: One of the key challenges of k-means clustering is the seed selection or the initial centroid estimation, since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex data sets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ will lead to a lack of reproducibility. k-means N-Ary Natural Initiation (NANI) is presented as an alternative to tackle this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. From peptide and protein folding molecular simulations, NANI was able to create compact and well-separated clusters as well as accurately find the metastable states that agree with the literature. NANI can cluster diverse data sets and be used as a standalone tool or as part of our MDANCE clustering package.
DOI: 10.1021/acs.jctc.4c00308
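
Illustrative sketch: the summary's central point is that k-means results depend on the initial centroids, and that a deterministic, diverse seed selection gives reproducible clusters. The Python below is a minimal sketch of that idea only. It is not the NANI procedure (the record does not describe the n-ary comparisons in enough detail to reproduce them); as a stand-in it takes the point closest to the global mean as a crude proxy for a dense region, then adds seeds by a farthest-point (maximin) rule. The function names `deterministic_seeds` and `kmeans` and the toy data are hypothetical, and the sketch assumes `k` does not exceed the number of distinct points.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def deterministic_seeds(points, k):
    """Pick k seeds without randomness (a stand-in, NOT the NANI method).

    First seed: the point closest to the global mean.
    Later seeds: the point farthest from its nearest already-chosen seed.
    Ties go to the lowest index, so the result depends only on the data.
    """
    center = mean(points)
    first = min(range(len(points)), key=lambda i: dist2(points[i], center))
    chosen = [first]
    # nearest[i] = squared distance from point i to its closest chosen seed
    nearest = [dist2(p, points[first]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(d, dist2(p, points[nxt])) for d, p in zip(nearest, points)]
    return [points[i] for i in chosen]


def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd iterations starting from the given initial centroids."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return labels, centroids


# Toy 2D data with three well-separated groups (made up for illustration).
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.0),
    (10.0, 0.0), (10.1, 0.2), (10.2, 0.0),
]
seeds = deterministic_seeds(points, 3)
labels, centroids = kmeans(points, seeds)
print(seeds)
print(labels)
```

Because nothing in the seeding step is random, repeated runs on the same data return the same seeds and therefore the same cluster assignments, which is the reproducibility property the summary attributes to deterministic initialization.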