Learning protein fitness models from evolutionary and assay-labeled data

Bibliographic Details
Published in: Nature biotechnology, Vol. 40, No. 7, pp. 1114-1122
Main Authors: Hsu, Chloe; Nisonoff, Hunter; Fannjiang, Clara; Listgarten, Jennifer
Format: Journal Article
Language: English
Published: New York: Nature Publishing Group US, 1 July 2022
Nature Publishing Group
Springer Nature
ISSN: 1087-0156, 1546-1696
Description
Abstract: Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines. A simple machine learning algorithm combines evolutionary and experimental data for improved protein fitness prediction.
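The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes that the site-specific amino acid features are a one-hot encoding of each sequence position, that the evolutionary density model has already produced one score per variant, and that a single ridge penalty `alpha` applies to all features. The helper names (`one_hot`, `build_features`, `fit_ridge`, `predict`) and the toy sequences, scores, and labels are hypothetical and serve only as illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Site-specific features: one indicator per (position, amino acid) pair."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()


def build_features(seqs, density_scores):
    """Append one density feature per variant to the one-hot features."""
    site = np.stack([one_hot(s) for s in seqs])
    dens = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([site, dens])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Assumption: one shared penalty for every feature, chosen for simplicity.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    return X @ w + b


# Hypothetical toy data: four equal-length variants, made-up density scores
# (standing in for the output of an evolutionary model) and made-up labels.
seqs = ["ACD", "ACE", "GCD", "GCE"]
density_scores = [-1.0, -2.0, -1.5, -3.0]
y = np.array([1.0, 0.5, 0.8, 0.1])

X = build_features(seqs, density_scores)
w, b = fit_ridge(X, y, alpha=1e-6)
preds = predict(X, w, b)
```

Here the density score enters as one extra column next to the one-hot columns, so any model that assigns a score to a sequence could supply it. In practice `alpha` would be chosen by cross-validation on the labeled variants.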
Notes: AC52-07NA27344
USDOE Office of Science (SC), Biological and Environmental Research (BER)
DOI: 10.1038/s41587-021-01146-5