Identification of long non-coding transcripts with feature selection: a comparative study



Bibliographic details
Published in:	BMC Bioinformatics, Vol. 18, No. 1, p. 187
Main authors:	Ventola, Giovanna M. M.; Noviello, Teresa M. R.; D’Aniello, Salvatore; Spagnuolo, Antonietta; Ceccarelli, Michele; Cerulo, Luigi
Format:	Journal Article
Language:	English
Published:	London: BioMed Central, 23.03.2017; Springer Nature B.V
Subjects:	Algorithms; Annotations; Art exhibits; Bioinformatics; Biomedical and Life Sciences; Comparative studies; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Computer applications; Danio rerio; Feature extraction; Feature selection; Gene expression; Genomes; Humans; Identification methods; Learning algorithms; Life Sciences; Machine learning; Microarrays; Non-coding RNA; Nucleotide sequence; Nucleotides; Predictions; Protein structure; Proteins; Proteins - genetics; Reading; Research Article; Ribonucleic acid; RNA; RNA, Long Noncoding - genetics; Secondary structure; Signatures; Species; Stability analysis; Support vector machines; Transcriptome analysis; Zebrafish; lncRNA; Classification
ISSN:	1471-2105

Description
Summary:	Background: The unveiling of long non-coding RNAs as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs from transcripts assembled with high-throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish between coding and non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used with success in the literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs.
Results: In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature, and we include, as a novel set of features, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test for signature stability, and evaluate the prediction ability of a signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish, highlighting that some features are shared among species, while others tend to be species-specific. Including novel signatures, such as those identified here, in a machine learning algorithm improves prediction performance over coding-potential tools and similar supervised approaches, in terms of area under the precision-recall curve, by 1 to 24%, depending on the species and on the signature.
Conclusions: Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines, especially relevant for poorly annotated genomes such as zebrafish. We provide a web tool that recognizes novel long non-coding RNAs with the obtained signatures from input in FASTA and GTF formats. The tool is available at the following URL: http://www.bioinformatics-sannio.org/software/ .
DOI:	10.1186/s12859-017-1594-z