Toward Assay-Aware Bioactivity Model(er)s: Getting a Grip on Biological Context
| Published in: | Journal of Chemical Information and Modeling, Vol. 65, No. 13, p. 7013 |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 14 July 2025 |
| ISSN: | 1549-960X |
| Summary: | Protein-ligand interaction prediction with proteochemometric (PCM) models can provide valuable insights during early drug discovery and chemical safety assessment. These models have benefitted from the large amount of data available in bioactivity databases. However, an issue that is often overlooked when using these data is the broad diversity of the biological assays they contain. The effect of small molecules on a protein can be measured in various ways, and the choice of assay can influence the outcome. Yet standardized, specific assay metadata are currently lacking, even though such metadata could help increase understanding of the origin of data points, improve data curation, and lead to models that are both more accurate and make predictions specific to the readout of interest. To make use of the existing information on the biological context, we set out to create and validate multiple assay descriptors and test their use in protein-ligand interaction models. Dimensionality reduction of embedded free-text assay descriptions from ChEMBL showed that text embeddings capture relevant features. Additionally, clustering of these embedded descriptions groups the assays in a way that enriches purity, matches manually categorized assays, and yields sensible topic-describing words. For ligand-protein combinations with multiple measurements, the deviation between measurements in general is higher than the deviation between measurements within assay categories, with logarithmic mean absolute deviations of 0.83 and 0.66, respectively. Incorporating this biological context into the PCM models in the form of BioBERT-based embeddings improved the average model performance from 0.67 to 0.69 across different data sets and splits. Conversely, simpler methods such as bag-of-words (in which frequently used words serve as features) yielded no improvement (average 0.66). Overall, models that integrate assay embeddings yield more accurate predictions and give users the option to train their model on all available data yet still predict specific end points. In addition, the novel method for assay categorization described here facilitates data curation and provides a useful overview of the biological context of studied targets. In conclusion, biological assay context is important for bioactivity modeling, and we provide a means to easily gain insight into this context. |
|---|---|
| DOI: | 10.1021/acs.jcim.5c00603 |
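
The summary compares how much repeated measurements of the same ligand-protein pair deviate overall with how much they deviate within a single assay category. The following is a minimal sketch of that kind of comparison, not the authors' implementation. It assumes the deviation is the mean absolute distance of each log-scale activity value from its group mean, pooled over all groups with at least two measurements. The record fields (`ligand`, `protein`, `category`, `pvalue`) and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_abs_deviation(records, key):
    """Pool |value - group mean| over all groups that hold >= 2 measurements.

    `key` maps a record to its group identifier. This definition of the
    deviation is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec["pvalue"])

    deviations = []
    for values in groups.values():
        if len(values) < 2:
            continue  # a lone measurement cannot deviate from itself
        center = mean(values)
        deviations.extend(abs(v - center) for v in values)
    return mean(deviations) if deviations else float("nan")


# Hypothetical toy data: one ligand-protein pair measured in two assay categories.
records = [
    {"ligand": "L1", "protein": "P1", "category": "binding", "pvalue": 7.0},
    {"ligand": "L1", "protein": "P1", "category": "binding", "pvalue": 7.2},
    {"ligand": "L1", "protein": "P1", "category": "functional", "pvalue": 6.0},
    {"ligand": "L1", "protein": "P1", "category": "functional", "pvalue": 6.2},
]

# Deviation across all measurements of a pair, ignoring assay category.
overall = mean_abs_deviation(
    records, key=lambda r: (r["ligand"], r["protein"])
)
# Deviation when measurements are only compared within the same category.
within = mean_abs_deviation(
    records, key=lambda r: (r["ligand"], r["protein"], r["category"])
)

print(f"overall: {overall:.2f}, within category: {within:.2f}")
```

In this toy example the within-category deviation is smaller than the overall one, which mirrors the direction of the effect reported in the summary (0.66 versus 0.83). The numbers themselves are illustrative only.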