An Efficient Parallel Sketch-Based Algorithmic Workflow for Mapping Long Reads

Bibliographic details
Published in: IEEE Transactions on Computational Biology and Bioinformatics, Vol. 22, No. 1, pp. 13-26
Main authors: Rahman, Tazin; Bhowmik, Oieswarya; Kalyanaraman, Ananth
Format: Journal Article
Language: English
Published: United States, IEEE, 2025-01-01
ISSN: 2998-4165
Online access: Full text
Abstract: Long read technologies are continuing to evolve at a rapid pace, with the latest of the high-fidelity technologies delivering reads over 10 Kbp with high accuracy. Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide ways to combine partially constructed assemblies (or contigs) with newly sequenced long reads in order to generate genomic scaffolds. Under either setting, the main computational bottleneck is the step of mapping the long reads. While many tools implement the mapping step through overlap computations, designing alignment-free approaches is necessary for large-scale computations. In this paper, we visit the problem of mapping long reads to a database of subject sequences, in a fast and accurate manner. We present an efficient parallel algorithmic workflow, called JEM-mapper, that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads. For implementation and evaluation, we consider two application settings: (i) the hybrid scaffolding setting, which aims to map long reads to partial assemblies; and (ii) the classical long read assembly setting, which aims to map long reads to one another. We implemented an MPI+OpenMP version of JEM-mapper to enable parallelism at both distributed- and shared-memory layers. Experimental evaluation shows that JEM-mapper produces high-quality mapping while significantly improving the time to solution compared to state-of-the-art tools; e.g., in the hybrid setting for a large genome with 429K HiFi long reads and 98K contigs, JEM-mapper produces a mapping with 99.41% precision and 97.91% recall, and a 6.9x speedup over a state-of-the-art mapper.
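The general idea of combining minimizers with a Jaccard (MinHash-style) estimate, which the abstract names, can be illustrated with a small sketch. This is not the authors' JEM sketch or the JEM-mapper code: the function names (`kmer_hash`, `minimizers`, `sketch`, `jaccard_estimate`), the parameters `k`, `w`, and `trials`, and the choice of BLAKE2b as the hash are all assumptions made for illustration, and the published method may differ in every detail.

```python
import hashlib


def kmer_hash(kmer: str, seed: int = 0) -> int:
    """Deterministic 64-bit hash of a k-mer; the seed is used as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq: str, k: int = 5, w: int = 4) -> set:
    """Collect the smallest-hash k-mer from each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    if not kmers:
        return picked
    # Short sequences with fewer than w k-mers are treated as a single window.
    for start in range(max(1, len(kmers) - w + 1)):
        picked.add(min(kmers[start:start + w], key=kmer_hash))
    return picked


def sketch(seq: str, trials: int = 64, k: int = 5, w: int = 4) -> list:
    """MinHash over the minimizer set: one minimum hash value per seeded trial."""
    mins = minimizers(seq, k, w)
    if not mins:
        raise ValueError("sequence is shorter than k; no minimizers to sketch")
    return [min(kmer_hash(m, seed=t + 1) for m in mins) for t in range(trials)]


def jaccard_estimate(sketch_a: list, sketch_b: list) -> float:
    """Fraction of trials in which both sketches agree on the minimum hash."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of trials")
    return sum(x == y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)


if __name__ == "__main__":
    contig = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"  # made-up toy sequences
    read = "TGCATGCCGATAGCTAGGCTAAC"
    print(jaccard_estimate(sketch(contig), sketch(read)))
```

In a mapping setting, a read would be assigned to the subject sequence whose sketch yields the highest estimate; no alignment is computed at any point, which is the sense in which such an approach is alignment-free.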
Authors:
1. Rahman, Tazin (tazin.rahman@wsu.edu), Washington State University, Pullman, WA, USA
2. Bhowmik, Oieswarya (oieswarya.bhowmik@wsu.edu), Washington State University, Pullman, WA, USA; ORCID 0009-0005-0041-9914
3. Kalyanaraman, Ananth (ananth@wsu.edu), Washington State University, Pullman, WA, USA; ORCID 0000-0001-6721-233X
CODEN ITCBDZ
DOI 10.1109/TCBB.2024.3489478
EISSN 2998-4165
EndPage 26
Genre orig-research
Journal Article
GrantInformation NSF, grant IDs: OAC 1910213; CCF 1919122; CCF 2316160
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
ORCID 0000-0001-6721-233X
0009-0005-0041-9914
PMID 40811229
PageCount 14
PublicationDate 2025-01-01
PublicationPlace United States
PublicationTitle IEEE Transactions on Computational Biology and Bioinformatics
PublicationTitleAbbrev TCBBIO
PublicationTitleAlternate IEEE Trans Comput Biol Bioinform
PublicationYear 2025
Publisher IEEE
StartPage 13
SubjectTerms Accuracy
Alignment-free
Assembly
Bioinformatics
Computational biology
Genomics
hybrid assembly
Hybrid power systems
long read mapping
MinHashing
parallel algorithms
Parallel processing
Pipelines
Sequential analysis
sketching
Title An Efficient Parallel Sketch-Based Algorithmic Workflow for Mapping Long Reads
URI https://ieeexplore.ieee.org/document/10755011
https://www.ncbi.nlm.nih.gov/pubmed/40811229
https://www.proquest.com/docview/3239784509
Volume 22