FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)

Bibliographic details
Published in: Bioinformatics Advances, pp. 1-10
Main authors: Sladký, Ondřej; Veselý, Pavel; Břinda, Karel
Format: Journal Article
Language: English
Published: Oxford Academic, 12 November 2025
ISSN: 2635-0041
Description
Summary: The growing volumes and heterogeneity of genomic data call for scalable and versatile k-mer-set indexes. However, state-of-the-art indexes such as SBWT and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k, sampled data, or high-diversity settings. Here, we introduce FMSI, a superstring-based index for arbitrary k-mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k-mer superstrings and uses the Masked Burrows-Wheeler Transform, a novel extension of the classical Burrows-Wheeler Transform that incorporates position masking. Across a range of k values and dataset types – including genomic, pangenomic, and metagenomic – FMSI consistently achieves superior query space efficiency, using up to 2–3× less memory than state-of-the-art methods, while maintaining competitive query times. Only a space-optimized version of SBWT can match FMSI's footprint in some cases, but then FMSI is 2–3× faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k-mer sets across diverse bioinformatics applications.
DOI: 10.1093/bioadv/vbaf290
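
Note: The summary does not describe FMSI's data structure in detail. As a rough illustration of the "superstring with position masking" idea it mentions, the sketch below stores a k-mer set as one string plus a 0/1 mask and answers membership by a naive scan. It assumes that a k-mer counts as present when at least one of its occurrences starts at a masked-on position. The function names are hypothetical, and this is not FMSI's implementation, which according to the summary relies on the Masked Burrows-Wheeler Transform rather than on scanning.

```python
def build_masked_superstring(kmers, k):
    """Naive construction (illustrative only): concatenate the k-mers.

    mask[i] == 1 marks that the k-mer starting at position i belongs to
    the set; k-mers that merely span a concatenation boundary stay at 0.
    """
    ordered = sorted(kmers)
    if any(len(kmer) != k for kmer in ordered):
        raise ValueError("all k-mers must have length k")
    superstring = "".join(ordered)
    mask = [0] * len(superstring)
    for start in range(0, len(ordered) * k, k):
        mask[start] = 1
    return superstring, mask


def contains(superstring, mask, kmer):
    """Membership query: True if any occurrence of kmer is masked on."""
    start = superstring.find(kmer)
    while start != -1:
        if mask[start] == 1:
            return True
        start = superstring.find(kmer, start + 1)
    return False


if __name__ == "__main__":
    superstring, mask = build_masked_superstring({"ACG", "CGT", "GTA"}, k=3)
    print(superstring, mask)
    print(contains(superstring, mask, "CGT"))  # a stored k-mer
    print(contains(superstring, mask, "CGC"))  # spans a boundary, masked off
```

A shorter superstring can represent the same set: under the same assumption, "ACGTA" with mask 1,1,1,0,0 also encodes ACG, CGT, and GTA, which is why overlapping the k-mers saves space.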