Empathi: embedding-based phage protein annotation tool by hierarchical assignment

Bibliographic Details
Published in: Nature Communications Vol. 16; no. 1; pp. 9114 - 9
Main Authors: Boulay, Alexandre, Leprince, Audrey, Enault, François, Rousseau, Elsa, Galiez, Clovis
Format: Journal Article
Language: English
Published: London Nature Publishing Group UK 14.10.2025
Nature Publishing Group
Nature Portfolio
ISSN: 2041-1723
Description
Summary: Bacteriophages, viruses infecting bacteria, are estimated to outnumber their cellular hosts by 10-fold, acting as key players in all microbial ecosystems. Under evolutionary pressure by their host, they evolve rapidly and encode a large diversity of protein sequences. Consequently, the majority of functions carried by phage proteins remain elusive. Current tools to comprehensively identify phage protein functions from their sequence either lack sensitivity (those relying on homology, for instance) or specificity (assigning a single coarse-grained function to a protein). Here, we introduce Empathi, a protein-embedding-based classifier that assigns functions in a hierarchical manner. New categories were specifically elaborated for phage protein functions and organized such that molecular-level functions are respected in each category, making them well suited for training machine learning classifiers based on protein embeddings. Empathi outperforms homology-based methods on a dataset of cultured phage genomes, tripling the number of annotated homologous groups. On the EnVhogDB database, the most recent and extensive database of metagenomically-sourced phage proteins, Empathi doubled the annotated fraction of protein families from 16% to 33%. Having a more global view of the repertoire of functions a phage possesses will assuredly help to understand them and their interactions with bacteria better.
Bacteriophages (the viruses that infect bacteria) play key roles in microbial communities, but the functions of most of their genes remain unknown. Here, Boulay et al. present a machine-learning classifier that uses protein language models to assign functions to bacteriophage proteins more accurately than existing approaches.
DOI: 10.1038/s41467-025-64177-5