Toward Assay-Aware Bioactivity Model(er)s: Getting a Grip on Biological Context

Bibliographic Details
Published in:Journal of chemical information and modeling Vol. 65; no. 13; p. 7013
Main Authors: Schoenmaker, Linde, Sastrokarijo, Enzo G, Heitman, Laura H, Beltman, Joost B, Jespers, Willem, van Westen, Gerard J P
Format: Journal Article
Language:English
Published: United States 14.07.2025
ISSN:1549-960X
Description
Summary:Protein-ligand interaction prediction with proteochemometric (PCM) models can provide valuable insights during early drug discovery and chemical safety assessment. These models have benefited from the large amount of data available in bioactivity databases. However, an issue that is often overlooked when using these data is the broad diversity of the biological assays they come from. The effect of a small molecule on a protein can be measured in various ways, and the choice of assay can influence the outcome. Yet standardized, specific assay metadata are currently lacking, even though such metadata could help clarify the origin of data points, improve data curation, and lead to models that are both more accurate and able to make predictions specific to the readout of interest. To make use of the existing information on biological context, we set out to create and validate multiple assay descriptors and test their use in protein-ligand interaction models. Dimensionality reduction of embedded free-text assay descriptions from ChEMBL showed that text embeddings capture relevant features. Additionally, clustering these embedded descriptions groups the assays in a way that enriches purity, matches manually categorized assays, and yields sensible topic-describing words. For ligand-protein combinations with multiple measurements, the deviation between measurements in general is higher than the deviation between measurements within the same assay category, with logarithmic mean absolute deviations of 0.83 and 0.66, respectively. Incorporating this biological context into the PCM models in the form of BioBERT-based embeddings improved average model performance from 0.67 to 0.69 across different data sets and splits. Conversely, simpler methods such as bag-of-words (in which frequently used words serve as features) yielded no improvement (average 0.66).
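The comparison of measurement deviations described above can be illustrated with a minimal sketch. It assumes that the deviation is the mean absolute difference between each log-scale measurement and the mean of its group, and that groups with a single measurement are skipped; the record layout, the function name `mean_abs_deviation`, and all values are hypothetical toy choices, not the authors' data or code.

```python
from collections import defaultdict
from statistics import mean

# Toy records: (ligand, protein, assay_category, log_activity).
# All identifiers and values are made up for illustration.
records = [
    ("lig1", "protA", "binding", 6.0),
    ("lig1", "protA", "binding", 6.4),
    ("lig1", "protA", "functional", 7.6),
    ("lig1", "protA", "functional", 8.0),
    ("lig2", "protA", "binding", 5.0),
    ("lig2", "protA", "functional", 5.2),
]


def mean_abs_deviation(records, key):
    """Mean absolute deviation from the group mean, pooled over all groups.

    Assumption: groups with fewer than two measurements carry no
    information about spread and are skipped.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec[3])

    deviations = []
    for values in groups.values():
        if len(values) < 2:
            continue
        centre = mean(values)
        deviations.extend(abs(v - centre) for v in values)
    return mean(deviations)


# Group by ligand-protein pair only: all assay categories are mixed.
mad_overall = mean_abs_deviation(records, key=lambda r: (r[0], r[1]))

# Group by ligand-protein pair and assay category.
mad_within = mean_abs_deviation(records, key=lambda r: (r[0], r[1], r[2]))

print(f"overall: {mad_overall:.2f}, within category: {mad_within:.2f}")
```

In this toy data the binding and functional measurements of the same pair differ systematically, so grouping by assay category lowers the deviation by construction. It mirrors the direction of the reported 0.83 versus 0.66 but says nothing about the size of that effect in real data.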
Overall, models that integrate assay embeddings yield more accurate predictions and let users train on all available data while still predicting specific end points. In addition, the novel method for assay categorization described here facilitates data curation and provides a useful overview of the biological context of studied targets. In conclusion, biological assay context is important for bioactivity modeling, and the approach described here provides a means to easily gain insight into this context.
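As a rough illustration of the bag-of-words baseline mentioned in the summary, the sketch below turns free-text assay descriptions into count vectors over the most frequent words and appends one such vector to ligand and protein features. The descriptions, the vocabulary size, the placeholder feature values, and the plain concatenation step are all assumptions made for this example; the record does not specify how the authors built or combined their features.

```python
import re
from collections import Counter

# Hypothetical assay descriptions written for this example (not from ChEMBL).
descriptions = [
    "Binding affinity to human adenosine A2A receptor by radioligand displacement",
    "Agonist activity at human adenosine A2A receptor by cAMP accumulation assay",
    "Inhibition of human kinase activity by fluorescence assay",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts, size):
    """Return the `size` most frequent words across all texts.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tok for text in texts for tok in tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(descriptions, size=5)
assay_features = [bag_of_words(d, vocabulary) for d in descriptions]

# Placeholder ligand and protein descriptors (assumed values).
ligand_features = [0.1, 0.9]
protein_features = [0.3, 0.7]

# Assumed combination step: concatenate all three feature blocks.
model_input = ligand_features + protein_features + assay_features[0]
```

With so few descriptions, function words such as "by" rank among the most frequent tokens; a practical pipeline would likely filter stop words. This also hints at why raw word counts may carry less assay-specific signal than learned text embeddings.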
DOI:10.1021/acs.jcim.5c00603