CREMSA: compressed indexing of (ultra) large multiple sequence alignments

Detailed bibliography
Title: CREMSA: compressed indexing of (ultra) large multiple sequence alignments
Authors: Salson, Mikaël; Boddaert, Arthur; Gueye, Awa Bousso; Bulteau, Laurent; Hernandez--Courbevoie, Yohan; Marchet, Camille; Pan, Nan; Will, Sebastian; Ponty, Yann
Contributors: Salson, Mikaël
Source: Bioinformatics
Publisher information: Oxford University Press (OUP), 2025.
Publication year: 2025
Subjects: Genome Sequence Analysis, Compression, Indexing, Multiple sequence alignment, [INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM]
Description: Motivation: Recent viral outbreaks motivate the systematic collection of pathogen genomes in order to accelerate their study and monitor the emergence and spread of variants. Because they are short and sequenced close together in time, viral genomes are usually organized and analyzed as oversized Multiple Sequence Alignments (MSAs). Such MSAs are largely ungapped and mostly homogeneous column-wise, but not along each sequence because of local variations, which hinders the performance of sequential compression algorithms. Results: To enable efficient handling of MSAs, including subsequent statistical analyses, we introduce CREMSA (Column-wise Run-length Encoding for MSAs), a new index that builds on sparse bitvector representations to compress an existing or streamed MSA while supporting an expressive set of accelerated queries on the alignment without prior decompression. Using CREMSA, a 65 GB MSA consisting of 1.9M SARS-CoV-2 genomes could be compressed into 22 MB using less than half a gigabyte of main memory, while executing access requests on the order of 100 ns. Such a speedup enables a comprehensive analysis of covariation over this very large MSA. We further assess the impact of sequence ordering on the compressibility of MSAs and propose a re-sorting strategy that, despite the proven NP-hardness of finding an optimal order, greatly increases compression ratios at a marginal computational cost. Availability and implementation: CREMSA is freely accessible at https://gitlab.univ-lille.fr/cremsa/cremsa. The Snakemake workflow for the benchmarks is available at https://gitlab.univ-lille.fr/cremsa/bench. The data used in the paper are on Zenodo at https://zenodo.org/records/14698859 and https://zenodo.org/records/15100011.
Document type: Article
Other literature type
Conference object
File description: application/pdf
Language: English
ISSN: 1367-4811
1367-4803
DOI: 10.1093/bioinformatics/btaf211
Access URL: https://hal.science/hal-04910677v2/document
https://hal.science/hal-04910677v2
https://doi.org/10.1093/bioinformatics/btaf211
Rights: CC BY
URL: http://creativecommons.org/licenses/by/4.0/
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Accession number: edsair.doi.dedup.....ee6910c6a5732ea73c8cc08768a19038
Database: OpenAIRE
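Illustrative sketch: The abstract describes compressing an MSA column by column with run-length encoding and answering access queries without decompression. The following minimal Python sketch illustrates that general idea only; it is not CREMSA's implementation or API. The function names (`encode_column`, `build_index`, `access`, `count_runs`) are hypothetical, and a sorted list of run-start rows stands in for a sparse bitvector with one set bit per run boundary, with binary search playing the role of a rank query.

```python
from bisect import bisect_right


def encode_column(column):
    """Run-length encode one alignment column.

    Returns (starts, symbols): starts[i] is the row at which run i begins
    and symbols[i] is the character repeated throughout that run.
    """
    starts, symbols = [], []
    for row, ch in enumerate(column):
        if not symbols or symbols[-1] != ch:
            starts.append(row)
            symbols.append(ch)
    return starts, symbols


def build_index(rows):
    """Encode every column of an MSA given as equal-length strings."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) transposes the alignment so that we iterate over columns.
    return [encode_column(col) for col in zip(*rows)]


def access(index, row, col):
    """Return the character at (row, col) without decoding the column."""
    starts, symbols = index[col]
    # The run containing `row` is the last run starting at or before it.
    return symbols[bisect_right(starts, row) - 1]


def count_runs(index):
    """Total number of runs, a rough proxy for the compressed size."""
    return sum(len(starts) for starts, _ in index)


# Toy alignment (assumed data, for illustration only).
msa = ["ACGT", "ACGT", "ACTT", "ACTT", "A-TT"]
index = build_index(msa)
```

In this toy setting, a column that is identical across all sequences costs a single run regardless of the number of rows, which is why column-wise homogeneity helps. Row order also matters: interleaving two variants in a column creates a new run at every switch, whereas grouping similar sequences together merges them. Plain lexicographic sorting, used here only as a naive stand-in for the re-sorting strategy mentioned in the abstract, already shows the effect when comparing `count_runs` before and after sorting.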