Learning protein constitutive motifs from sequence data

Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families f...

Celý popis

Uložené v:
Podrobná bibliografia
Vydané v:eLife Ročník 8
Hlavní autori: Tubiana, Jérôme, Cocco, Simona, Monasson, Rémi
Médium: Journal Article
Jazyk:English
Vydavateľské údaje: England eLife Science Publications, Ltd 12.03.2019
eLife Sciences Publications Ltd
eLife Sciences Publication
eLife Sciences Publications, Ltd
Predmet:
ISSN:2050-084X, 2050-084X
On-line prístup:Získať plný text
Tagy: Pridať tag
Žiadne tagy, Buďte prvý, kto otaguje tento záznam!
Popis
Shrnutí:Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helixes and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and 'turning up' or 'turning down' the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. Almost every process that keeps a cell alive depends on the activities of several proteins. All proteins are made from chains of smaller molecules called amino acids, and the specific sequence of amino acids determines the protein’s overall shape, which in turn controls what the protein can do. Yet, the relationships between a protein’s structure and its function are complex, and it remains unclear how the sequence of amino acids in a protein actually determine its features and properties. Machine learning is a computational approach that is often applied to understand complex issues in biology. It uses computer algorithms to spot statistical patterns in large amounts of data and, after 'learning' from the data, the algorithms can then provide new insights, make predictions or even generate new data. Tubiana et al. have now used a relatively simple form of machine learning to study the amino acid sequences of 20 different families of proteins. First, frameworks of algorithms –known as Restricted Boltzmann Machines, RBM for short – were trained to read some amino-acid sequence data that coded for similar proteins. After ‘learning’ from the data, the RBM could then infer statistical patterns that were common to the sequences. Tubiana et al. saw that many of these inferred patterns could be interpreted in a meaningful way and related to properties of the proteins. For example, some were related to known twists and loops that are commonly found in proteins; others could be linked to specific activities. This level of interpretability is somewhat at odds with the results of other common methods used in machine learning, which tend to behave more like a ‘black box’. Using their RBM, Tubiana et al. then proposed how to design new proteins that may prove to have interesting features. In the future, similar methods could be applied across computational biology as a way to make sense of complex data in an understandable way.
Bibliografia:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
PMCID: PMC6436896
ISSN:2050-084X
2050-084X
DOI:10.7554/eLife.39397