A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases
| Published in: | Journal of Computational Biology, Vol. 25, No. 7, p. 766 |
|---|---|
| Main authors: | Jain, Chirag; Dilthey, Alexander; Koren, Sergey; Aluru, Srinivas; Phillippy, Adam M |
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 1 July 2018 |
| ISSN: | 1557-8666 |
| Abstract | Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy but scale poorly, while faster alignment-free methods typically trade precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290× faster than Burrows-Wheeler Aligner-MEM (BWA-MEM), with a lower memory footprint and a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes. |
|---|---|
| Author | Aluru, Srinivas; Koren, Sergey; Dilthey, Alexander; Phillippy, Adam M; Jain, Chirag |
| Author_xml | 1. Jain, Chirag (National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland); 2. Dilthey, Alexander (National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland); 3. Koren, Sergey (National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland); 4. Aluru, Srinivas (School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia); 5. Phillippy, Adam M (National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland) |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/29708767$$D View this record in MEDLINE/PubMed |
| CitedBy_id | crossref_primary_10_1186_s40246_024_00684_8 crossref_primary_10_1186_s12859_022_05014_0 crossref_primary_10_1007_s12275_021_0632_8 crossref_primary_10_1093_ve_veac031 crossref_primary_10_1186_s12859_021_04089_5 crossref_primary_10_1109_TKDE_2022_3231780 crossref_primary_10_1093_gigascience_giab063 crossref_primary_10_1089_cmb_2023_0094 crossref_primary_10_1128_jb_00422_24 crossref_primary_10_1093_gigascience_giaa029 crossref_primary_10_1038_s41467_019_10934_2 crossref_primary_10_1093_bib_bbaf267 crossref_primary_10_1099_mgen_0_000849 crossref_primary_10_1128_msystems_01661_24 crossref_primary_10_1177_11769343221150585 crossref_primary_10_1371_journal_pone_0321246 crossref_primary_10_1038_s41385_020_00372_5 crossref_primary_10_1038_s41592_022_01457_8 crossref_primary_10_1093_gigascience_giaa061 crossref_primary_10_1101_gr_275648_121 crossref_primary_10_1146_annurev_animal_020518_115005 crossref_primary_10_1186_s13059_021_02443_7 crossref_primary_10_1186_s13059_023_02914_z crossref_primary_10_7717_peerj_10906 crossref_primary_10_12688_f1000research_26930_1 crossref_primary_10_1093_bioinformatics_btae493 crossref_primary_10_1186_s13059_024_03414_4 crossref_primary_10_1093_nar_gkaa265 crossref_primary_10_1101_gr_280383_124 |
| ContentType | Journal Article |
| DBID | CGR CUY CVF ECM EIF NPM 7X8 |
| DOI | 10.1089/cmb.2018.0036 |
| DatabaseName | Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic |
| DatabaseTitle | MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic |
| DatabaseTitleList | MEDLINE - Academic MEDLINE |
| Database_xml | – sequence: 1 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | no_fulltext_linktorsrc |
| Discipline | Biology Mathematics |
| EISSN | 1557-8666 |
| ExternalDocumentID | 29708767 |
| Genre | Research Support, U.S. Gov't, Non-P.H.S.; Journal Article; Research Support, N.I.H., Intramural |
| IEDL.DBID | 7X8 |
| ISICitedReferencesCount | 36 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000431101400001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 1557-8666 |
| IngestDate | Sun Nov 09 11:26:01 EST 2025 Thu Apr 03 06:58:39 EDT 2025 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 7 |
| Keywords | Jaccard; minimizers; sketching; winnowing; MinHash; long-read mapping |
| Language | English |
| LinkModel | DirectLink |
| OpenAccessLink | https://www.ncbi.nlm.nih.gov/pmc/articles/6067103 |
| PMID | 29708767 |
| PQID | 2033390845 |
| PQPubID | 23479 |
| ParticipantIDs | proquest_miscellaneous_2033390845 pubmed_primary_29708767 |
| PublicationCentury | 2000 |
| PublicationDate | 2018-07-00 20180701 |
| PublicationDateYYYYMMDD | 2018-07-01 |
| PublicationDate_xml | – month: 07 year: 2018 text: 2018-07-00 |
| PublicationDecade | 2010 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States |
| PublicationTitle | Journal of computational biology |
| PublicationTitleAlternate | J Comput Biol |
| PublicationYear | 2018 |
| SSID | ssj0013607 |
| Score | 2.4235694 |
| SourceID | proquest pubmed |
| SourceType | Aggregation Database Index Database |
| StartPage | 766 |
| SubjectTerms | Algorithms; Databases, Factual; Genome, Human - genetics; High-Throughput Nucleotide Sequencing - methods; Humans; Sequence Alignment - methods; Sequence Analysis, DNA; Software |
| Title | A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/29708767 https://www.proquest.com/docview/2033390845 |
| Volume | 25 |
| WOSCitedRecordID | wos000431101400001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| linkProvider | ProQuest |
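
The abstract in this record names two building blocks, minimizer sampling (winnowing) and MinHash-based identity estimation, without showing how either works. The following is a minimal Python sketch of both ideas, shown separately; it is not the authors' implementation. The function names, the BLAKE2b hash, and the parameters `k`, `w`, and `s` are illustrative assumptions, and the Jaccard-to-identity conversion uses the Mash distance formula, which is assumed here to be a reasonable stand-in for the estimator described in the article.

```python
import hashlib
import math


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Winnowing: keep the smallest-hash k-mer in every window of w
    consecutive k-mers. Returns a set of (hash, position) pairs.
    Sequences with fewer than w k-mers yield an empty set."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        best = min(window)
        # Leftmost occurrence breaks ties inside a window.
        picked.add((best, start + window.index(best)))
    return picked


def sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes."""
    distinct = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(distinct)[:s]


def jaccard_estimate(a, b, k, s):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    sa, sb = set(sketch(a, k, s)), set(sketch(b, k, s))
    union_sketch = sorted(sa | sb)[:s]  # bottom-s sketch of the union
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in sa and h in sb)
    return shared / len(union_sketch)


def identity_estimate(jaccard, k):
    """Convert a Jaccard estimate to an identity estimate using the
    Mash distance d = -(1/k) * ln(2J / (1 + J)); identity = 1 - d."""
    if jaccard <= 0.0:
        return 0.0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, 1.0 - distance)
```

In a mapper along these lines, the reference would be indexed once by its minimizers, and each read would be compared against candidate regions through sketches rather than base-level alignment, which is the source of the speed-up the abstract reports. Smaller `w` or larger `s` would trade memory and time for a tighter estimate.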