SORFPP: Enhancing rich sequence-driven information to identify SEPs based on fused framework on validation datasets

Bibliographic details
Published in:	PLoS ONE, Volume 20, Issue 4, p. e0320314
Main authors:	Feng, Hongqi; Nie, Qi; Yang, Sen
Format:	Journal Article
Language:	English
Publication details:	United States: Public Library of Science (PLoS), 28 April 2025
Subjects:	Algorithms; Amino acid sequence; Amino acids; Analysis; Biology and Life Sciences; Computational Biology - methods; Computer and Information Sciences; Computer applications; Correlation coefficient; Correlation coefficients; Datasets; Deep learning; DNA sequencing; Ensemble learning; Gene expression; Gene sequencing; Genomes; Genomics; Humans; Identification; Identification and classification; Machine learning; Methods; MicroRNAs; Nucleotide sequencing; Open reading frames; Open Reading Frames - genetics; Peptides; Peptides - genetics; Proteins; Regression models; Research and Analysis Methods; RNA, Long Noncoding - genetics; Social Sciences; Software; Whole genome sequencing; Taiwan; China
ISSN:	1932-6203

Description
Summary:	Genome sequencing has enabled us to find functional peptides encoded by short open reading frames (sORFs) in long non-coding RNAs (lncRNAs). sORF-encoded peptides (SEPs) regulate gene expression, signaling, and other processes and, unlike common peptides, play significant roles. Various computational methods have been proposed to identify them. However, existing methods lack informative features and effective models, so a high-throughput computational method to predict SEPs is needed. We propose a computational method, SORFPP, to predict SEPs by mining feature information from multiple perspectives in an experimentally validated dataset from TranLnc. SORFPP extracts SEP sequence information using the protein language model ESM-2 together with curated traditional encodings, including QSOrder and k-mer. SORFPP uses CatBoost to address the sparsity problem of the traditional encodings, and it analyzes the ESM-2 pre-trained representations with a self-attention model. Finally, an ensemble learning framework combines the two models: their outputs are fed into a logistic regression model for accurate and robust predictions. In comparison, SORFPP outperforms other state-of-the-art models in Matthews correlation coefficient by 12.2%-24.2% on three benchmark datasets. Integrating the ensemble learning strategy with informative traditional features and protein language model encodings yields better performance. Datasets and code are accessible at https://doi.org/10.6084/m9.figshare.28079897 and http://111.229.198.94:5000/.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.
DOI:	10.1371/journal.pone.0320314
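
Illustrative sketch: the summary names k-mer encoding as one of the traditional features and the Matthews correlation coefficient (MCC) as the headline metric. The Python below is a minimal, standard-library illustration of both ideas, not the SORFPP code. The function names `kmer_frequencies` and `matthews_corrcoef`, the choice of k = 2, and the 20-letter amino-acid alphabet are assumptions made for this example; the record does not state the settings the authors used.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues


def kmer_frequencies(peptide, k=2):
    """Return normalized frequencies for all 20**k possible k-mers.

    Windows containing a non-standard residue are skipped.
    Most entries are zero for a short peptide, so the vector is sparse.
    """
    allowed = set(AMINO_ACIDS)
    windows = (peptide[i:i + k] for i in range(len(peptide) - k + 1))
    counts = Counter(w for w in windows if set(w) <= allowed)
    total = sum(counts.values())
    features = {}
    for letters in product(AMINO_ACIDS, repeat=k):
        kmer = "".join(letters)
        features[kmer] = counts[kmer] / total if total else 0.0
    return features


def matthews_corrcoef(y_true, y_pred):
    """Compute MCC for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For a four-residue peptide such as "MKMK", only 2 of the 400 dipeptide features are non-zero, which illustrates the kind of sparsity the summary says CatBoost is used to handle. MCC ranges from -1 to 1 and accounts for all four confusion-matrix cells, which makes it a common choice when classes are imbalanced.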