SPACE: STRING proteins as complementary embeddings

Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but they have proven challenging to use in machine learning, especially in a cross-species set...

Bibliographic details
Published in: bioRxiv
Main authors: Hu, Dewei; Szklarczyk, Damian; von Mering, Christian; Jensen, Lars Juhl
Format: Paper
Language: English
Published: Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 26 November 2024
Edition: 1.1
ISSN: 2692-8205
Online access: Full text
Description
Abstract: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but they have proven challenging to use in machine learning, especially in a cross-species setting. To address this, we leveraged the STRING database of protein networks and orthology relations for 1,322 eukaryotes to generate network-based cross-species protein embeddings. We did this by first creating species-specific network embeddings and subsequently aligning them based on orthology relations to facilitate direct cross-species comparisons. We show that these aligned network embeddings ensure consistency across species without sacrificing quality compared to species-specific network embeddings. We also show that the aligned network embeddings are complementary to sequence embedding techniques, despite the use of sequence-based orthology relations in the alignment process. Finally, we demonstrate the utility and quality of the embeddings by using them for two well-established tasks: subcellular localization prediction and protein function prediction. Training logistic regression classifiers on aligned network embeddings and sequence embeddings improved the accuracy over using sequence alone, reaching performance numbers close to state-of-the-art deep-learning methods. A set of precomputed cross-species network embeddings and ProtT5 embeddings for all eukaryotic proteins has been included in STRING version 12.0.
Competing Interest Statement: The authors have declared no competing interest.
Footnotes: https://github.com/deweihu96/SPACE
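The abstract describes training logistic regression classifiers on network embeddings combined with sequence embeddings. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes the two embeddings are joined by simple concatenation, uses a hand-written gradient-descent logistic regression, and runs on synthetic two-dimensional toy vectors. All names (`concat_features`, `train_logreg`, the `toy_protein_*` identifiers) are hypothetical, and no real STRING or ProtT5 data is loaded.

```python
import math
import random


def concat_features(network_emb, sequence_emb):
    """Concatenate network and sequence vectors for proteins present in both maps."""
    return {pid: network_emb[pid] + sequence_emb[pid]
            for pid in network_emb if pid in sequence_emb}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit binary logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Synthetic toy data (assumption): the label depends on one network
# dimension and one sequence dimension, so both views carry signal.
rng = random.Random(0)
network_emb, sequence_emb, labels = {}, {}, {}
for i in range(200):
    pid = f"toy_protein_{i}"
    network_emb[pid] = [rng.gauss(0, 1), rng.gauss(0, 1)]
    sequence_emb[pid] = [rng.gauss(0, 1), rng.gauss(0, 1)]
    labels[pid] = 1 if network_emb[pid][0] + sequence_emb[pid][0] > 0 else 0

features = concat_features(network_emb, sequence_emb)
ids = sorted(features)
X = [features[pid] for pid in ids]
y = [labels[pid] for pid in ids]

w, b = train_logreg(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

In practice one would more likely use an established library classifier and held-out evaluation data; the hand-written version here only keeps the sketch free of dependencies.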
DOI: 10.1101/2024.11.25.625140