Indexing k-mers in linear space for quality value compression

Bibliographic details
Published in: Journal of bioinformatics and computational biology, Vol. 17, No. 5, p. 1940011
Main authors: Shibuya, Yoshihiro; Comin, Matteo
Format: Journal Article
Language: English
Published: Singapore, 01.10.2019
ISSN: 1757-6334
Description
Summary: Many bioinformatics tools rely heavily on k-mer dictionaries to describe the composition of sequences and to allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, which makes them difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.
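Illustration: the summary contrasts a naive k-mer dictionary with a linear reference and mentions using dictionary k-mers to make quality values more compressible. The Python sketch below is only a toy illustration of those two ideas and is not taken from the yalff software. The function names (`kmers`, `build_naive_dictionary`, `smooth_qualities`) and the smoothing rule (overwrite every quality symbol covered by a dictionary k-mer with one constant symbol) are assumptions made for this example.

```python
from typing import Iterable, Iterator, Set


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_naive_dictionary(seqs: Iterable[str], k: int) -> Set[str]:
    """Naive dictionary: store each distinct k-mer explicitly in a set."""
    dictionary: Set[str] = set()
    for s in seqs:
        dictionary.update(kmers(s, k))
    return dictionary


def smooth_qualities(read: str, quals: str, dictionary: Set[str],
                     k: int, high: str = "I") -> str:
    """Hypothetical smoothing rule (an assumption, not the published method):
    replace the quality symbol of every base covered by a dictionary k-mer
    with the constant symbol `high`; leave all other symbols unchanged."""
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in dictionary:
            for j in range(i, i + k):
                covered[j] = True
    return "".join(high if c else q for c, q in zip(covered, quals))


# Toy usage: a dictionary of 3-mers taken from one short sequence.
dictionary = build_naive_dictionary(["ACGTAC"], 3)
naive_chars = sum(len(x) for x in dictionary)  # characters stored explicitly
smoothed = smooth_qualities("ACGTTT", "ABCDEF", dictionary, 3)
```

In this toy case the set holds four 3-mers (12 characters), while the 6-character string "ACGTAC" already contains all of them, which is the intuition behind storing overlapping k-mers in a linear reference. Replacing covered quality symbols with one repeated symbol produces longer runs of identical characters, which general-purpose compressors handle well.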
DOI: 10.1142/S0219720019400110