Toward Assay-Aware Bioactivity Model(er)s: Getting a Grip on Biological Context
| Published in: | Journal of chemical information and modeling, Volume 65, Issue 13, p. 7013 |
|---|---|
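| Main authors: | Schoenmaker, Linde; Sastrokarijo, Enzo G; Heitman, Laura H; Beltman, Joost B; Jespers, Willem; van Westen, Gerard J P |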
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 14 July 2025 |
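| Subjects: | Biological Assay; Drug Discovery - methods; Humans; Ligands; Proteins - chemistry; Proteins - metabolism |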
| ISSN: | 1549-960X |
| Abstract | Protein-ligand interaction prediction with proteochemometric (PCM) models can provide valuable insights during early drug discovery and chemical safety assessment. These models have benefitted from the large amount of data available in bioactivity databases. However, an issue that is often overlooked when using these data is the broad diversity of the biological assays present. The effect of small molecules on a protein can be measured in various ways, and this can influence the outcome. Yet standardized, specific assay metadata are currently lacking, even though such metadata could help increase understanding of the origin of data points, improve data curation, and lead to better models that are both more accurate and make predictions specific to the readout of interest. To make use of the existing information on the biological context, we set out to create and validate multiple assay descriptors and test their use in protein-ligand interaction models. Dimensionality reduction of embedded free-text assay descriptions from ChEMBL showed that text embeddings capture relevant features. Additionally, clustering of these embedded descriptions groups the assays in a way that enriches purity, matches manually categorized assays, and yields sensible topic-describing words. From ligand-protein combinations with multiple measurements, it becomes apparent that the deviation between different measurements in general is higher than the deviation of measurements within assay categories, with a logarithmic mean absolute deviation of 0.83 and 0.66, respectively. Incorporating this biological context into the PCM models in the form of BioBERT-based embeddings improved the average R² from 0.67 to 0.69 across different data sets and splits. Conversely, with simpler methods such as bag-of-words (in which frequently used words are used as features), no improvement was seen (average R² 0.66). Overall, models that integrate assay embeddings yield more accurate predictions and give the user the option to train their model on all available data yet still predict specific end points. In addition, the novel method for assay categorization described here facilitates data curation and provides a useful overview of the biological context of studied targets. In conclusion, biological assay context is important for bioactivity modeling, and the approach presented here provides a means to easily gain insight into this context. |
|---|---|
| Author | Sastrokarijo, Enzo G; van Westen, Gerard J P; Schoenmaker, Linde; Jespers, Willem; Beltman, Joost B; Heitman, Laura H |
| Author_xml | – sequence: 1 givenname: Linde orcidid: 0000-0001-9879-1004 surname: Schoenmaker fullname: Schoenmaker, Linde organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands – sequence: 2 givenname: Enzo G surname: Sastrokarijo fullname: Sastrokarijo, Enzo G organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands – sequence: 3 givenname: Laura H orcidid: 0000-0002-1381-8464 surname: Heitman fullname: Heitman, Laura H organization: Oncode Institute, 2333 CC Leiden, The Netherlands – sequence: 4 givenname: Joost B surname: Beltman fullname: Beltman, Joost B organization: Division of Cell Systems and Drug Safety, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands – sequence: 5 givenname: Willem orcidid: 0000-0002-4951-9220 surname: Jespers fullname: Jespers, Willem organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands – sequence: 6 givenname: Gerard J P orcidid: 0000-0003-0717-1817 surname: van Westen fullname: van Westen, Gerard J P organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/40586820$$D View this record in MEDLINE/PubMed |
| CitedBy_id | crossref_primary_10_1021_acs_jcim_5c01203 |
| ContentType | Journal Article |
| DBID | CGR CUY CVF ECM EIF NPM 7X8 |
| DOI | 10.1021/acs.jcim.5c00603 |
| DatabaseName | Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic |
| DatabaseTitle | MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic |
| DatabaseTitleList | MEDLINE - Academic MEDLINE |
| Database_xml | – sequence: 1 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | no_fulltext_linktorsrc |
| Discipline | Chemistry |
| EISSN | 1549-960X |
| ExternalDocumentID | 40586820 |
| Genre | Journal Article |
| ISICitedReferencesCount | 2 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001520221300001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 1549-960X |
| IngestDate | Wed Jul 02 01:45:09 EDT 2025 Sat Jul 19 01:30:24 EDT 2025 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 13 |
| Language | English |
| ORCID | 0000-0003-0717-1817 0000-0002-1381-8464 0000-0002-4951-9220 0000-0001-9879-1004 |
| OpenAccessLink | https://doi.org/10.1021/acs.jcim.5c00603 |
| PMID | 40586820 |
| PQID | 3225874428 |
| PQPubID | 23479 |
| ParticipantIDs | proquest_miscellaneous_3225874428 pubmed_primary_40586820 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-07-14 |
| PublicationDateYYYYMMDD | 2025-07-14 |
| PublicationDate_xml | – month: 07 year: 2025 text: 2025-07-14 day: 14 |
| PublicationDecade | 2020 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States |
| PublicationTitle | Journal of chemical information and modeling |
| PublicationTitleAlternate | J Chem Inf Model |
| PublicationYear | 2025 |
| SSID | ssj0033962 |
| Score | 2.4785776 |
| SourceID | proquest pubmed |
| SourceType | Aggregation Database Index Database |
| StartPage | 7013 |
| SubjectTerms | Biological Assay; Drug Discovery - methods; Humans; Ligands; Proteins - chemistry; Proteins - metabolism |
| Title | Toward Assay-Aware Bioactivity Model(er)s: Getting a Grip on Biological Context |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/40586820 https://www.proquest.com/docview/3225874428 |
| Volume | 65 |
| WOSCitedRecordID | wos001520221300001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
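
The abstract compares the deviation between repeated measurements of the same ligand-protein pair in general (0.83) with the deviation within assay categories (0.66). The sketch below illustrates one way such a comparison could be computed. It is not the authors' code: the record layout, the category labels, the example values, and the use of mean absolute pairwise difference on log-scale activities as the "logarithmic mean absolute deviation" are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical records: (ligand, protein, assay_category, log-scale activity).
# None of these values come from the paper.
measurements = [
    ("lig1", "protA", "binding", 7.0),
    ("lig1", "protA", "binding", 7.4),
    ("lig1", "protA", "functional", 6.0),
    ("lig2", "protA", "binding", 5.0),
    ("lig2", "protA", "functional", 6.2),
    ("lig2", "protA", "functional", 6.0),
]


def pairwise_mad(records, by_category):
    """Mean absolute difference over all pairs of repeated measurements.

    Measurements are grouped per ligand-protein pair, and additionally
    per assay category when by_category is True (an assumed definition).
    """
    groups = defaultdict(list)
    for ligand, protein, category, value in records:
        key = (ligand, protein, category) if by_category else (ligand, protein)
        groups[key].append(value)
    # Groups with a single measurement contribute no pairs.
    diffs = [
        abs(a - b)
        for values in groups.values()
        for a, b in combinations(values, 2)
    ]
    return mean(diffs) if diffs else float("nan")


overall = pairwise_mad(measurements, by_category=False)
within = pairwise_mad(measurements, by_category=True)
print(f"overall: {overall:.2f}, within categories: {within:.2f}")
```

In this toy data set the within-category deviation is smaller than the overall deviation, which mirrors the direction of the effect reported in the abstract; the magnitudes carry no meaning.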