An Efficient Parallel Sketch-Based Algorithmic Workflow for Mapping Long Reads


Bibliographic details
Published in: IEEE Transactions on Computational Biology and Bioinformatics, Vol. 22, No. 1, pp. 13-26
Main authors: Rahman, Tazin; Bhowmik, Oieswarya; Kalyanaraman, Ananth
Format: Journal Article
Language: English
Published: United States, IEEE, 01.01.2025
ISSN: 2998-4165
Description
Summary: Long read technologies are continuing to evolve at a rapid pace, with the latest of the high fidelity technologies delivering reads over 10 Kbp with high accuracy. Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide ways to combine partially constructed assemblies (or contigs) with newly sequenced long reads in order to generate genomic scaffolds. Under either setting, the main computational bottleneck is the step of mapping the long reads. While many tools implement the mapping step through overlap computations, designing alignment-free approaches is necessary for large-scale computations. In this paper, we visit the problem of mapping long reads to a database of subject sequences, in a fast and accurate manner. We present an efficient parallel algorithmic workflow, called JEM-mapper, that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads. For implementation and evaluation, we consider two application settings: (i) the hybrid scaffolding setting, which aims to map long reads to partial assemblies; and (ii) the classical long read assembly setting, which aims to map long reads to one another. We implemented an MPI+OpenMP version of JEM-mapper to enable parallelism at both distributed- and shared-memory layers. Experimental evaluation shows that JEM-mapper produces high-quality mapping while significantly improving the time to solution compared to state-of-the-art tools; e.g., in the hybrid setting for a large genome with 429K HiFi long reads and 98K contigs, JEM-mapper produces a mapping with 99.41% precision and 97.91% recall, and a 6.9x speedup over a state-of-the-art mapper.
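The summary refers to alignment-free mapping with a minimizer-based Jaccard sketch. The following is a minimal sketch of that general idea only, not the JEM sketch or the JEM-mapper implementation, whose details are not given in this record. It assumes textbook (w, k)-minimizers and an exact Jaccard similarity over the resulting minimizer sets. The function names, the hash function, and the values of `k` and `w` are all illustrative assumptions.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Set of (w, k)-minimizers: in every window of w consecutive k-mers,
    keep the k-mer with the smallest hash value."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    hashes = [kmer_hash(km) for km in kmers]
    picked = set()
    # If the sequence has fewer than w k-mers, treat it as a single window.
    for start in range(max(1, len(kmers) - w + 1)):
        window = range(start, min(start + w, len(kmers)))
        best = min(window, key=lambda i: hashes[i])
        picked.add(kmers[best])
    return picked


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_subject(read, subjects, k=5, w=4):
    """Map a read to the subject whose minimizer set is most similar.
    `subjects` is a dict of name -> sequence (hypothetical input format)."""
    read_sketch = minimizers(read, k, w)
    scores = {
        name: jaccard(read_sketch, minimizers(seq, k, w))
        for name, seq in subjects.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data: the read is a substring of contig c1.
    subjects = {
        "c1": "ACGTTGCATGCCGTAAGCTAGGCTA",
        "c2": "TTTTTTTTTTTTTTTTTTTTTTTTT",
    }
    read = subjects["c1"][5:20]
    name, scores = best_subject(read, subjects)
    print(name, scores)
```

Because every window of a substring is also a window of the containing sequence, the read's minimizers here are a subset of those of `c1`, which is why no base-level alignment is needed to pick the right subject. A tool operating at scale would typically compare compact fixed-size sketches rather than full minimizer sets, and would distribute the subjects and reads across processes.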
DOI: 10.1109/TCBB.2024.3489478