Toward Assay-Aware Bioactivity Model(er)s: Getting a Grip on Biological Context

Bibliographic Details
Published in: Journal of chemical information and modeling Volume 65; Issue 13; p. 7013
Main authors: Schoenmaker, Linde; Sastrokarijo, Enzo G; Heitman, Laura H; Beltman, Joost B; Jespers, Willem; van Westen, Gerard J P
Format: Journal Article
Language: English
Published: United States 14.07.2025
ISSN: 1549-960X
Abstract Protein-ligand interaction prediction with proteochemometric (PCM) models can provide valuable insights during early drug discovery and chemical safety assessment. These models have benefitted from the large amount of data available in bioactivity databases. However, an issue that is often overlooked when using these data is the broad diversity of the biological assays they contain. The effect of small molecules on a protein can be measured in various ways, and this can influence the outcome. Yet there is currently a lack of standardized, specific assay metadata, although such metadata could help increase understanding of the origin of data points, improve data curation, and lead to better models that are both more accurate and make predictions specific to the readout of interest. To make use of the existing information on the biological context, we set out to create and validate multiple assay descriptors and test their use in protein-ligand interaction models. Dimensionality reduction of embedded free-text assay descriptions from ChEMBL showed that text embeddings capture relevant features. Additionally, clustering of these embedded descriptions groups the assays in a way that enriches purity, matches manually categorized assays, and yields sensible topic-describing words. From ligand-protein combinations with multiple measurements, it becomes apparent that the deviation between different measurements in general is higher than the deviation of measurements within assay categories, with a logarithmic mean absolute deviation of 0.83 and 0.66, respectively. Incorporating this biological context into the PCM models in the form of BioBERT-based embeddings improved the average R² from 0.67 to 0.69 across different data sets and splits. Conversely, with simpler methods such as bag-of-words (in which frequently used words are used as features), no improvement was seen (average R² 0.66).
Overall, models that integrate assay embeddings yield more accurate predictions and give the user the option to train their model on all available data yet still predict specific end points. In addition, the novel method for assay categorization described here facilitates data curation and provides a useful overview of the biological context of studied targets. In conclusion, biological assay context is important for bioactivity modeling and provides a means to easily get insight into this context.
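Illustrative note: the abstract's comparison of deviation across all measurements with deviation within assay categories can be sketched as follows. This is a minimal sketch and not the authors' code. It assumes that "logarithmic mean absolute deviation" means the mean absolute deviation of log-scale activity values around their group mean, averaged over groups with at least two measurements. The record fields (`ligand`, `protein`, `assay_category`, `p_activity`) and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_abs_deviation(values):
    """Mean absolute deviation of the values around their own mean."""
    center = mean(values)
    return mean(abs(v - center) for v in values)


def average_mad(records, key):
    """Group records by key(record) and average the per-group deviation.

    Only groups with at least two measurements contribute, because a
    single measurement has no spread.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec["p_activity"])
    mads = [mean_abs_deviation(v) for v in groups.values() if len(v) >= 2]
    return mean(mads) if mads else float("nan")


# Hypothetical toy data: one ligand-protein pair measured in two assay types.
records = [
    {"ligand": "L1", "protein": "P1", "assay_category": "binding", "p_activity": 7.0},
    {"ligand": "L1", "protein": "P1", "assay_category": "binding", "p_activity": 7.2},
    {"ligand": "L1", "protein": "P1", "assay_category": "functional", "p_activity": 6.0},
    {"ligand": "L1", "protein": "P1", "assay_category": "functional", "p_activity": 6.2},
]

# Deviation when all measurements of a ligand-protein pair are pooled.
overall = average_mad(records, key=lambda r: (r["ligand"], r["protein"]))

# Deviation when measurements are also split by assay category.
within = average_mad(
    records, key=lambda r: (r["ligand"], r["protein"], r["assay_category"])
)

print(f"overall: {overall:.2f}, within category: {within:.2f}")
```

In this toy case the pooled deviation is larger than the within-category deviation, which is the qualitative pattern the abstract reports (0.83 versus 0.66); the exact aggregation used in the article may differ.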
Author_xml – sequence: 1
  givenname: Linde
  orcidid: 0000-0001-9879-1004
  surname: Schoenmaker
  fullname: Schoenmaker, Linde
  organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
– sequence: 2
  givenname: Enzo G
  surname: Sastrokarijo
  fullname: Sastrokarijo, Enzo G
  organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
– sequence: 3
  givenname: Laura H
  orcidid: 0000-0002-1381-8464
  surname: Heitman
  fullname: Heitman, Laura H
  organization: Oncode Institute, 2333 CC Leiden, The Netherlands
– sequence: 4
  givenname: Joost B
  surname: Beltman
  fullname: Beltman, Joost B
  organization: Division of Cell Systems and Drug Safety, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
– sequence: 5
  givenname: Willem
  orcidid: 0000-0002-4951-9220
  surname: Jespers
  fullname: Jespers, Willem
  organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
– sequence: 6
  givenname: Gerard J P
  orcidid: 0000-0003-0717-1817
  surname: van Westen
  fullname: van Westen, Gerard J P
  organization: Division of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
BackLink https://www.ncbi.nlm.nih.gov/pubmed/40586820$$D View this record in MEDLINE/PubMed
CitedBy_id crossref_primary_10_1021_acs_jcim_5c01203
ContentType Journal Article
DOI 10.1021/acs.jcim.5c00603
Discipline Chemistry
EISSN 1549-960X
ExternalDocumentID 40586820
Genre Journal Article
ISICitedReferencesCount 2
ISSN 1549-960X
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 13
Language English
OpenAccessLink https://doi.org/10.1021/acs.jcim.5c00603
PMID 40586820
PublicationCentury 2000
PublicationDate 2025-07-14
PublicationDateYYYYMMDD 2025-07-14
PublicationDate_xml – month: 07
  year: 2025
  text: 2025-07-14
  day: 14
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Journal of chemical information and modeling
PublicationTitleAlternate J Chem Inf Model
PublicationYear 2025
StartPage 7013
SubjectTerms Biological Assay
Drug Discovery - methods
Humans
Ligands
Proteins - chemistry
Proteins - metabolism
Title Toward Assay-Aware Bioactivity Model(er)s: Getting a Grip on Biological Context
URI https://www.ncbi.nlm.nih.gov/pubmed/40586820
https://www.proquest.com/docview/3225874428
Volume 65