High efficiency referential genome compression algorithm

With the development and the gradually popularized application of next-generation sequencing technologies (NGS), genome sequencing has been becoming faster and cheaper, creating a massive amount of genome sequence data which still grows at an explosive rate. The time and cost of transmission, storag...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Bioinformatics (Oxford, England) Jg. 35; H. 12; S. 2058 - 2065
Hauptverfasser:	Shi, Wei, Chen, Jianhua, Luo, Mao, Chen, Min
Format:	Journal Article
Sprache:	Englisch
Veröffentlicht:	England 01.06.2019
Schlagworte:	algorithms bioinformatics genome humans medicine nucleotide sequences
ISSN:	1367-4803, 1367-4811, 1367-4811
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	With the development and the gradually popularized application of next-generation sequencing technologies (NGS), genome sequencing has been becoming faster and cheaper, creating a massive amount of genome sequence data which still grows at an explosive rate. The time and cost of transmission, storage, processing and analysis of these genetic data have become bottlenecks that hinder the development of genetics and biomedicine. Although there are many common data compression algorithms, they are not effective for genome sequences due to their inability to consider and exploit the inherent characteristics of genome sequence data. Therefore, the development of a fast and efficient compression algorithm specific to genome data is an important and pressing issue. We have developed a referential lossless genome data compression algorithm with better performance than previous algorithms. According to a carefully designed matching strategy selection mechanism, the advantages of local matching and global matching are reasonably combined together to improve the description efficiency of the matched sub-strings. The effects of the length and the position of matched sub-strings to the compression efficiency are jointly taken into consideration. The proposed algorithm can compress the FASTA data of complete human genomes, each of which is about 3 GB, in about 18 min. The compressed file sizes are ranging from a few megabytes to about forty megabytes. The averaged compression ratio is higher than that of the state-of-the-art genome compression algorithms, the time complexity is at the same order of the best-known algorithms. https://github.com/jhchen5/SCCG. Supplementary data are available at Bioinformatics online.
Bibliographie:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
ISSN:	1367-4803 1367-4811 1367-4811
DOI:	10.1093/bioinformatics/bty934