Amino Acid Encoding for Deep Learning Applications

Bibliographic Details
Published in:	BMC Bioinformatics Vol. 21; no. 1; pp. 235 - 14
Main Authors:	ElAbd, Hesham, Bromberg, Yana, Hoarfrost, Adrienne, Lenz, Tobias, Wendorff, Mareike
Format:	Journal Article
Language:	English
Published:	09.06.2020; BMC; BioMed Central; BioMed Central Ltd; Springer Nature B.V
Subjects:	Algorithms; Amino acid encoding; Amino acids; Amino Acids - metabolism; Amino acids embedding; Artificial neural networks; Bioinformatics; Biomedical and Life Sciences; Cable television broadcasting industry; Computational Biology - methods; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Computer applications; Convoluted-neural network (CNN); Cybernetics, Artificial Intelligence and Robotics; Data mining; Deep learning; Deep Learning - standards; Embedding; HLA-II peptide interaction; Human-leukocyte antigen (HLA); Humans; Iterative methods; Learning algorithms; Life Sciences; Machine learning; Machine-learning (ML); Machine Learning and Artificial Intelligence in Bioinformatics; Mathematical analysis; Matrix algebra; Matrix methods; Methodology; Methodology Article; Microarrays; Neural networks; Nucleotides; Observational learning; Peptides; Protein-protein interaction (PPI); Protein-protein interactions; Proteins; Recurrent neural network (RNN); Recurrent neural networks; Technology application; Training; United States
ISSN:	1471-2105

Description
Summary:	Background: The number of applications of deep learning algorithms in bioinformatics is increasing, as they usually achieve superior performance over classical approaches, especially when larger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as continuous vectors through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction – a process called ‘end-to-end learning’ – has led to state-of-the-art results in many fields. Although the use of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN.
Results: By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension, even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacity. We found that the embedding dimension is a major factor controlling model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension.
Conclusion: Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.
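The idea of encoding amino acids through an embedding matrix, and the random-vector baseline recommended in the conclusion, can be illustrated with a minimal Python sketch. It is not the authors' code. The residue ordering, the function names (`one_hot_matrix`, `random_matrix`, `encode`), the uniform sampling range, and the example peptide are all assumptions made for illustration. In end-to-end learning the rows of the matrix would instead be trainable parameters updated together with the rest of the model, which this sketch does not attempt to show.

```python
import random

# The 20 standard amino acids in one-letter code.
# The ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_matrix():
    """Return a 20 x 20 identity matrix; row i encodes amino acid i."""
    n = len(AMINO_ACIDS)
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def random_matrix(dim, seed=0):
    """Return a 20 x dim matrix of fixed random vectors.

    Hypothetical baseline: entries are drawn uniformly from [-1, 1];
    the sampling scheme is an assumption, not taken from the paper.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in AMINO_ACIDS]


def encode(peptide, matrix):
    """Look up one matrix row per residue: a len(peptide) x dim encoding."""
    return [matrix[INDEX[aa]] for aa in peptide]


peptide = "ACDW"  # made-up example sequence
one_hot = encode(peptide, one_hot_matrix())
baseline = encode(peptide, random_matrix(dim=20))

# Both encodings have the same shape, so a model can be trained on either
# one and the results compared at equal embedding dimension.
print(len(one_hot), len(one_hot[0]))
print(len(baseline), len(baseline[0]))
```

Because every scheme is just a different 20-row lookup table, swapping in a BLOSUM62 or VHSE8 matrix, or a random matrix of the same width, leaves the rest of a model unchanged. That is what makes a same-dimension comparison straightforward.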
DOI:	10.1186/s12859-020-03546-x