A novel lossless encoding algorithm for data compression–genomics data as an exemplar

Bibliographic details
Published in: Frontiers in bioinformatics, Volume 4; p. 1489704
Main authors: Al-okaily, Anas; Tbakhi, Abdelghani
Format: Journal Article
Language: English
Published: Switzerland, Frontiers Media S.A, 23.01.2025
ISSN: 2673-7647
Description
Summary: Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to increase, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and its associated characteristics. The proposed algorithm follows a divide-and-conquer approach: it scans the whole genome, classifies subsequences based on similarities in their content, and bins similar subsequences together. The data in each bin are then compressed independently. This approach differs from the currently known approaches: entropy, dictionary, predictive, or transform-based methods. Proof-of-concept performance was evaluated using a benchmark dataset of seventeen genomes ranging in size from kilobytes to gigabytes. The results showed a considerable improvement in the compression of each genome, saving several megabytes compared to state-of-the-art tools. Moreover, the algorithm can be applied to the compression of other data types, mainly text, numbers, images, audio, and video, which are being generated daily in unprecedented volumes.
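The scan, classify, bin, and compress-per-bin idea in the summary can be illustrated with a minimal sketch. This is not the authors' algorithm: the fixed chunk length, the `classify` rule (most frequent symbol), and the use of `zlib` as the per-bin compressor are all assumptions made only for illustration, and the names `compress_binned` and `decompress_binned` are hypothetical.

```python
import zlib
from collections import defaultdict

CHUNK = 8  # assumed subsequence length; the record does not state one


def classify(chunk: str) -> str:
    """Hypothetical similarity rule: label a chunk by its most frequent symbol."""
    # sorted() makes tie-breaking deterministic (alphabetically first wins).
    return max(sorted(set(chunk)), key=chunk.count)


def compress_binned(seq: str, chunk_size: int = CHUNK):
    """Split seq into chunks, bin chunks by label, compress each bin on its own."""
    bins = defaultdict(list)
    order = []  # label of each chunk in original order, needed to stay lossless
    for i in range(0, len(seq), chunk_size):
        chunk = seq[i:i + chunk_size]
        label = classify(chunk)
        bins[label].append(chunk)
        order.append(label)
    packed = {
        label: zlib.compress("".join(chunks).encode("ascii"))
        for label, chunks in bins.items()
    }
    return packed, order, len(seq)


def decompress_binned(packed, order, total_len, chunk_size: int = CHUNK) -> str:
    """Rebuild the original sequence by reading chunks back from each bin."""
    streams = {k: zlib.decompress(v).decode("ascii") for k, v in packed.items()}
    offsets = dict.fromkeys(streams, 0)
    out = []
    remaining = total_len
    for label in order:
        n = min(chunk_size, remaining)  # only the final chunk can be short
        start = offsets[label]
        out.append(streams[label][start:start + n])
        offsets[label] = start + n
        remaining -= n
    return "".join(out)


if __name__ == "__main__":
    genome = "AAAAAAAACGCGCGCGAAAATTTTTTTTTTTTGGG"
    packed, order, n = compress_binned(genome)
    assert decompress_binned(packed, order, n) == genome
```

In this sketch the `order` list is the price of binning: it must be stored alongside the compressed bins so that decompression can restore the original chunk order. Whether binning pays off depends on how much more compressible the grouped chunks are than the unsorted sequence.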
Bibliography:
Reviewed by: Soham Sengupta, St. Jude Children’s Research Hospital, United States
Edited by: Ali Kadhum Idrees, University of Babylon, Iraq
Bhavika Mam, Independent Researcher, Palo Alto, CA, United States
DOI:10.3389/fbinf.2024.1489704