An Efficient Parallel Sketch-Based Algorithmic Workflow for Mapping Long Reads
| Published in: | IEEE Transactions on Computational Biology and Bioinformatics Vol. 22; No. 1; pp. 13-26 |
|---|---|
| Main authors: | Rahman, Tazin; Bhowmik, Oieswarya; Kalyanaraman, Ananth |
| Format: | Journal Article |
| Language: | English |
| Published: | United States: IEEE, 2025-01-01 |
| Subjects: | |
| ISSN: | 2998-4165 |
| Online access: | Full text |
| Abstract | Long read technologies are continuing to evolve at a rapid pace, with the latest of the high-fidelity technologies delivering reads over 10 Kbp with high accuracy. Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide ways to combine partially constructed assemblies (or contigs) with newly sequenced long reads in order to generate genomic scaffolds. Under either setting, the main computational bottleneck is the step of mapping the long reads. While many tools implement the mapping step through overlap computations, designing alignment-free approaches is necessary for large-scale computations. In this paper, we visit the problem of mapping long reads to a database of subject sequences, in a fast and accurate manner. We present an efficient parallel algorithmic workflow, called JEM-mapper, that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads. For implementation and evaluation, we consider two application settings: (i) the hybrid scaffolding setting, which aims to map long reads to partial assemblies; and (ii) the classical long read assembly setting, which aims to map long reads to one another. We implemented an MPI+OpenMP version of JEM-mapper to enable parallelism at both distributed- and shared-memory layers. Experimental evaluation shows that JEM-mapper produces high-quality mapping while significantly improving the time to solution compared to state-of-the-art tools; e.g., in the hybrid setting for a large genome with 429K HiFi long reads and 98K contigs, JEM-mapper produces a mapping with 99.41% precision and 97.91% recall, and a 6.9x speedup over a state-of-the-art mapper. |
|---|---|
| Author | Rahman, Tazin; Kalyanaraman, Ananth; Bhowmik, Oieswarya |
| Author_xml | – sequence: 1 givenname: Tazin surname: Rahman fullname: Rahman, Tazin email: tazin.rahman@wsu.edu organization: Washington State University, Pullman, WA, USA – sequence: 2 givenname: Oieswarya orcidid: 0009-0005-0041-9914 surname: Bhowmik fullname: Bhowmik, Oieswarya email: oieswarya.bhowmik@wsu.edu organization: Washington State University, Pullman, WA, USA – sequence: 3 givenname: Ananth orcidid: 0000-0001-6721-233X surname: Kalyanaraman fullname: Kalyanaraman, Ananth email: ananth@wsu.edu organization: Washington State University, Pullman, WA, USA |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/40811229 (View this record in MEDLINE/PubMed) |
| CODEN | ITCBDZ |
| ContentType | Journal Article |
| DBID | 97E RIA RIE AAYXX CITATION NPM 7X8 |
| DOI | 10.1109/TCBB.2024.3489478 |
| DatabaseName | IEEE Xplore (IEEE) IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) CrossRef PubMed MEDLINE - Academic |
| DatabaseTitle | CrossRef PubMed MEDLINE - Academic |
| DatabaseTitleList | PubMed MEDLINE - Academic |
| Database_xml | – sequence: 1 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher – sequence: 3 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| EISSN | 2998-4165 |
| EndPage | 26 |
| ExternalDocumentID | 40811229 10_1109_TCBB_2024_3489478 10755011 |
| Genre | orig-research Journal Article |
| GrantInformation_xml | – fundername: NSF grantid: OAC 1910213; CCF 1919122; CCF 2316160 |
| ISICitedReferencesCount | 0 |
| ISSN | 2998-4165 |
| IngestDate | Fri Aug 15 19:39:44 EDT 2025 Sat Aug 16 01:30:43 EDT 2025 Sat Nov 29 08:05:59 EST 2025 Thu Apr 10 08:19:10 EDT 2025 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 1 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| LinkModel | DirectLink |
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
| ORCID | 0000-0001-6721-233X 0009-0005-0041-9914 |
| PMID | 40811229 |
| PQID | 3239784509 |
| PQPubID | 23479 |
| PageCount | 14 |
| ParticipantIDs | crossref_primary_10_1109_TCBB_2024_3489478 proquest_miscellaneous_3239784509 pubmed_primary_40811229 ieee_primary_10755011 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-01-01 |
| PublicationDateYYYYMMDD | 2025-01-01 |
| PublicationDate_xml | – month: 01 year: 2025 text: 2025-01-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States |
| PublicationTitle | IEEE Transactions on Computational Biology and Bioinformatics |
| PublicationTitleAbbrev | TCBBIO |
| PublicationTitleAlternate | IEEE Trans Comput Biol Bioinform |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssib057963335 ssib057913686 |
| Score | 2.381698 |
| SourceID | proquest pubmed crossref ieee |
| SourceType | Aggregation Database Index Database Publisher |
| StartPage | 13 |
| SubjectTerms | Accuracy; Alignment-free; Assembly; Bioinformatics; Computational biology; Genomics; hybrid assembly; Hybrid power systems; long read mapping; MinHashing; parallel algorithms; Parallel processing; Pipelines; Sequential analysis; sketching |
| Title | An Efficient Parallel Sketch-Based Algorithmic Workflow for Mapping Long Reads |
| URI | https://ieeexplore.ieee.org/document/10755011 https://www.ncbi.nlm.nih.gov/pubmed/40811229 https://www.proquest.com/docview/3239784509 |
| Volume | 22 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVIEE databaseName: IEEE Electronic Library (IEL) customDbUrl: eissn: 2998-4165 dateEnd: 99991231 omitProxy: false ssIdentifier: ssib057913686 issn: 2998-4165 databaseCode: RIE dateStart: 20250101 isFulltext: true titleUrlDefault: https://ieeexplore.ieee.org/ providerName: IEEE |