FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)
| Published in: | Bioinformatics Advances, pp. 1-10 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Oxford Academic, 12.11.2025 |
| Subjects: | |
| ISSN: | 2635-0041 |
| Summary: | The growing volumes and heterogeneity of genomic data call for scalable and versatile k-mer-set indexes. However, state-of-the-art indexes such as SBWT and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k, sampled data, or high-diversity settings. Here, we introduce FMSI, a superstring-based index for arbitrary k-mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k-mer superstrings and uses the Masked Burrows-Wheeler Transform, a novel extension of the classical Burrows-Wheeler Transform that incorporates position masking. Across a range of k values and dataset types – including genomic, pangenomic, and metagenomic – FMSI consistently achieves superior query space efficiency, using up to 2–3× less memory than state-of-the-art methods, while maintaining competitive query times. Only a space-optimized version of SBWT can match FMSI’s footprint in some cases, but then FMSI is 2–3× faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k-mer sets across diverse bioinformatics applications. |
|---|---|
| DOI: | 10.1093/bioadv/vbaf290 |
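
The summary describes representing a k-mer set as a superstring with position masking. As a rough illustration only, the sketch below assumes one common reading of that idea: a mask bit of `1` at position i marks the k-mer starting at i as a member of the set. The function names are hypothetical, the lookup is a naive linear scan rather than the Masked Burrows-Wheeler Transform, and reverse complements are ignored, so this is not FMSI's algorithm or API.

```python
def kmers_from_masked_superstring(superstring: str, mask: str, k: int) -> set:
    """Collect the k-mers whose start position carries a '1' in the mask.

    Assumption for this sketch: mask[i] == '1' means the k-mer starting
    at position i belongs to the represented set.
    """
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have equal length")
    return {
        superstring[i:i + k]
        for i in range(len(superstring) - k + 1)
        if mask[i] == "1"
    }


def contains(superstring: str, mask: str, k: int, query: str) -> bool:
    """Naive membership test: True if some occurrence of the query
    starts at a position whose mask bit is '1'."""
    if len(query) != k:
        return False
    start = superstring.find(query)
    while start != -1:
        if mask[start] == "1":
            return True
        start = superstring.find(query, start + 1)
    return False


# Toy example with k = 3: the superstring also spells CGT, but its
# mask bit is '0', so only ACG and GTT are in the represented set.
SUPERSTRING = "ACGTT"
MASK = "10100"
K = 3
```

With these toy inputs, `contains(SUPERSTRING, MASK, K, "GTT")` is true and `contains(SUPERSTRING, MASK, K, "CGT")` is false, because CGT occurs in the superstring only at a masked-out position.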