Accurate Molecular-Orbital-Based Machine Learning Energies via Unsupervised Clustering of Chemical Space

Bibliographic details
Published in: Journal of Chemical Theory and Computation, Volume 18, Issue 8, p. 4826
Main authors: Cheng, Lixue; Sun, Jiace; Miller, Thomas F
Format: Journal Article
Language: English
Publication date: 2022-08-09
ISSN: 1549-9626
Abstract We introduce an unsupervised clustering algorithm to improve training efficiency and accuracy in predicting energies using molecular-orbital-based machine learning (MOB-ML). This work determines clusters via the Gaussian mixture model (GMM) in an entirely automatic manner and simplifies an earlier supervised clustering approach [J. Chem. Theory Comput. 2019, 15, 6668] by eliminating both the need for user-specified parameters and the training of an additional classifier. Unsupervised clustering with GMM accurately reproduces chemically intuitive groupings of frontier molecular orbitals and performs better as the number of training examples increases. The clusters obtained from supervised or unsupervised clustering are then combined with scalable Gaussian process regression (GPR) or linear regression (LR) to learn molecular energies accurately by generating a local regression model in each cluster. Among all four combinations of regressors and clustering methods, GMM combined with scalable exact GPR (GMM/GPR) is the most efficient training protocol for MOB-ML. Numerical tests of molecular energy learning on thermalized data sets of drug-like molecules demonstrate the improved accuracy, transferability, and learning efficiency of GMM/GPR over other training protocols for MOB-ML, i.e., supervised regression clustering combined with GPR (RC/GPR) and GPR without clustering. GMM/GPR also provides the best molecular energy predictions compared with those from the literature on the same benchmark data sets. With a lower scaling, GMM/GPR has a 10.4-fold speedup in wall-clock training time compared with scalable exact GPR at a training size of 6500 QM7b-T molecules.
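The cluster-then-regress idea in the abstract can be illustrated with a toy sketch. The following is not the authors' implementation: it assumes a single scalar feature instead of MOB-ML feature vectors, uses a hand-written EM fit of a 1-D Gaussian mixture, and uses linear regression (one of the two regressors the abstract mentions) as the local model rather than scalable exact GPR. All function names (`fit_gmm_1d`, `assign`, `fit_line`, `train`, `predict`) and the synthetic data are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(xs, k, n_iter=50):
    """Fit a k-component 1-D Gaussian mixture with EM (toy, assumption-laden)."""
    n = len(xs)
    srt = sorted(xs)
    # Deterministic initialisation: means at evenly spaced quantiles.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    mu = sum(xs) / n
    var0 = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: update weights, means and variances (variance floored at 1e-6).
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj, 1e-6
            )
    return weights, means, variances

def assign(x, gmm):
    """Hard-assign x to the component with the largest weighted density."""
    weights, means, variances = gmm
    scores = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

def fit_line(xs, ys):
    """Closed-form least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def train(xs, ys, k):
    """Cluster the features without labels, then fit one local model per cluster."""
    gmm = fit_gmm_1d(xs, k)
    labels = [assign(x, gmm) for x in xs]
    models = {}
    for j in set(labels):
        cx = [x for x, lab in zip(xs, labels) if lab == j]
        cy = [y for y, lab in zip(ys, labels) if lab == j]
        models[j] = fit_line(cx, cy)
    return gmm, models

def predict(x, gmm, models):
    """Route x to its cluster and evaluate that cluster's local model."""
    slope, intercept = models[assign(x, gmm)]
    return slope * x + intercept

# Synthetic data: two well-separated groups with different linear trends.
xs = [-1 + 0.1 * i for i in range(21)] + [9 + 0.1 * i for i in range(21)]
ys = [2 * x + 1 for x in xs[:21]] + [-3 * x + 5 for x in xs[21:]]
gmm, models = train(xs, ys, k=2)
print(predict(0.5, gmm, models), predict(10.5, gmm, models))
```

Because the mixture itself routes a new point to a cluster, no separate classifier has to be trained, which mirrors the simplification the abstract describes. In this sketch the number of components `k` is still passed by hand; how the paper selects it automatically is not stated in this record.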
Author Miller, Thomas F
Sun, Jiace
Cheng, Lixue
Author_xml – sequence: 1
  givenname: Lixue
  surname: Cheng
  fullname: Cheng, Lixue
– sequence: 2
  givenname: Jiace
  surname: Sun
  fullname: Sun, Jiace
– sequence: 3
  givenname: Thomas F
  surname: Miller
  fullname: Miller, Thomas F
CitedBy_id crossref_primary_10_1016_j_chroma_2025_465816
crossref_primary_10_1002_wcms_1706
crossref_primary_10_1021_acs_jctc_4c01261
crossref_primary_10_1021_acs_jpca_4c05718
ContentType Journal Article
DOI 10.1021/acs.jctc.2c00396
Discipline Chemistry
EISSN 1549-9626
ISICitedReferencesCount 13
ISSN 1549-9626
IsPeerReviewed true
IsScholarly true
Issue 8
Language English
PublicationDate 2022-08-09
PublicationTitle Journal of chemical theory and computation
PublicationYear 2022
StartPage 4826
Title Accurate Molecular-Orbital-Based Machine Learning Energies via Unsupervised Clustering of Chemical Space
URI https://www.proquest.com/docview/2692755913
Volume 18