Large language models generate functional protein sequences across diverse families

Bibliographic details
Published in: Nature Biotechnology, Vol. 41; no. 8; pp. 1099-1106
Main authors: Madani, Ali; Krause, Ben; Greene, Eric R.; Subramanian, Subu; Mohr, Benjamin P.; Holton, James M.; Olmos, Jose Luis; Xiong, Caiming; Sun, Zachary Z.; Socher, Richard; Fraser, James S.; Naik, Nikhil
Format: Journal Article
Language: English
Published: New York: Nature Publishing Group US, 1 August 2023
Nature Publishing Group
Springer Nature
ISSN: 1087-0156, 1546-1696
Description
Summary: Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase. A generative deep-learning model designs artificial proteins with desired enzymatic activities.
Bibliography:
AC02-05CH11231
USDOE Office of Science (SC), Basic Energy Sciences (BES)
National Institutes of Health (NIH)
USDOE Office of Science (SC), Biological and Environmental Research (BER)
Author Contributions
A.M. conceived and designed the study in collaboration with S.S. A.M. and B.K. designed and performed machine learning modeling, generation, and scoring. B.P.M. performed the cell-free expression and activity assay and was supervised by Z.Z.S. E.G. performed the cell-based expression and kinetics assay and was supervised by J.S.F. J.H., J.L.O., and J.S.F. performed the structure determination. A.M., S.S., B.K., and N.N. performed computational analysis and were advised by C.X. R.S. provided advice on machine learning and computational methods. A.M., J.S.F., and N.N. wrote the manuscript with feedback and contributions from all authors, in particular from E.G. and B.K. N.N. supervised and managed the project.
denotes equal contribution
DOI: 10.1038/s41587-022-01618-2