A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

Bibliographic details
Published in: Journal of computational biology, Vol. 25, No. 7, p. 766
Main authors: Jain, Chirag; Dilthey, Alexander; Koren, Sergey; Aluru, Srinivas; Phillippy, Adam M
Format: Journal Article
Language: English
Published: United States, 1 July 2018
ISSN: 1557-8666
Abstract Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290 × faster than Burrows-Wheeler Aligner-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
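The abstract names two ingredients, minimizer sampling and MinHash-based identity estimation, without showing how they fit together. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the BLAKE2b hash, and the parameters `k`, `w`, and `s` are illustrative assumptions. The conversion from Jaccard similarity to identity assumes the Mash-style relation under a Poisson error model, which the abstract itself does not state.

```python
import hashlib
import math


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (assumed stand-in for a fast hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> set:
    """Return the smallest k-mer hash from every window of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    w = min(w, len(hashes))  # short sequences form a single window
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def minhash_jaccard(a: set, b: set, s: int) -> float:
    """Bottom-s MinHash estimate of the Jaccard similarity of two hash sets."""
    union_sketch = sorted(a | b)[:s]  # the s smallest hashes of the union
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in a and h in b)
    return shared / len(union_sketch)


def identity_from_jaccard(j: float, k: int) -> float:
    """Convert a Jaccard estimate to an identity estimate (assumed Mash-style model)."""
    if j <= 0.0:
        return 0.0
    error_rate = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - error_rate)


if __name__ == "__main__":
    read = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
    region = "ACGTACGTTAGCCGATAGGCTAAACGGATCCA"  # one substitution
    j = minhash_jaccard(minimizers(read, 8, 4), minimizers(region, 8, 4), s=16)
    print(f"Jaccard estimate: {j:.3f}, identity estimate: {identity_from_jaccard(j, 8):.3f}")
```

In this sketch a read and a candidate reference window are compared only through their sampled minimizers, so the cost depends on sketch size rather than sequence length. That is the kind of trade-off the abstract describes as combining scalability with an identity estimate for each mapping.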
Authors and affiliations (in author order):
1. Jain, Chirag: National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
2. Dilthey, Alexander: National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
3. Koren, Sergey: National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
4. Aluru, Srinivas: School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia
5. Phillippy, Adam M: National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
BackLink https://www.ncbi.nlm.nih.gov/pubmed/29708767 (View this record in MEDLINE/PubMed)
ContentType Journal Article
DOI 10.1089/cmb.2018.0036
Discipline Biology
Mathematics
EISSN 1557-8666
ExternalDocumentID 29708767
Genre Research Support, U.S. Gov't, Non-P.H.S
Journal Article
Research Support, N.I.H., Intramural
ISICitedReferencesCount 36
ISSN 1557-8666
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 7
Keywords Jaccard
minimizers
sketching
winnowing
MinHash
long-read mapping
Language English
OpenAccessLink https://www.ncbi.nlm.nih.gov/pmc/articles/6067103
PMID 29708767
PublicationDate 2018-07-01
PublicationPlace United States
PublicationTitle Journal of computational biology
PublicationTitleAlternate J Comput Biol
PublicationYear 2018
StartPage 766
SubjectTerms Algorithms
Databases, Factual
Genome, Human - genetics
High-Throughput Nucleotide Sequencing - methods
Humans
Sequence Alignment - methods
Sequence Analysis, DNA
Software
Title A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases
URI https://www.ncbi.nlm.nih.gov/pubmed/29708767
https://www.proquest.com/docview/2033390845
Volume 25