Weighted minimizer sampling improves long read mapping
| Published in: | Bioinformatics (Oxford, England) Vol. 36; no. Supplement_1; pp. i111 - i118 |
|---|---|
| Main Authors: | Jain, Chirag; Rhie, Arang; Zhang, Haowen; Chu, Claudia; Walenz, Brian P; Koren, Sergey; Phillippy, Adam M |
| Format: | Journal Article |
| Language: | English |
| Published: | England: Oxford University Press, 01.07.2020 |
| Subjects: | |
| ISSN: | 1367-4803, 1367-4811 |
| Abstract | Abstract
Motivation
In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.
Results
We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.
Availability and implementation
Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap. |
|---|---|
| Author | Chu, Claudia Koren, Sergey Rhie, Arang Phillippy, Adam M Jain, Chirag Zhang, Haowen Walenz, Brian P |
| AuthorAffiliation | b1 National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA; b2 College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA |
| Author_xml | – sequence: 1 givenname: Chirag surname: Jain fullname: Jain, Chirag email: chirag.jain@nih.gov organization: National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA – sequence: 2 givenname: Arang surname: Rhie fullname: Rhie, Arang organization: National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA – sequence: 3 givenname: Haowen surname: Zhang fullname: Zhang, Haowen organization: College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA – sequence: 4 givenname: Claudia surname: Chu fullname: Chu, Claudia organization: College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA – sequence: 5 givenname: Brian P surname: Walenz fullname: Walenz, Brian P organization: National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA – sequence: 6 givenname: Sergey surname: Koren fullname: Koren, Sergey organization: National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA – sequence: 7 givenname: Adam M surname: Phillippy fullname: Phillippy, Adam M organization: National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/32657365 (record in MEDLINE/PubMed) |
| ContentType | Journal Article |
| Copyright | Published by Oxford University Press 2020. |
| DOI | 10.1093/bioinformatics/btaa435 |
| Discipline | Biology |
| DocumentTitleAlternate | ISMB 2020 Proceedings |
| EISSN | 1367-4811 |
| EndPage | i118 |
| ExternalDocumentID | PMC7355284 32657365 10_1093_bioinformatics_btaa435 10.1093/bioinformatics/btaa435 |
| Genre | Journal Article Research Support, N.I.H., Intramural |
| ISICitedReferencesCount | 124 |
| ISSN | 1367-4803 1367-4811 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | Supplement_1 |
| Language | English |
| License | This work is written by US Government employees and is in the public domain in the US. Published by Oxford University Press 2020. |
| OpenAccessLink | https://www.ncbi.nlm.nih.gov/pmc/articles/7355284 |
| PMID | 32657365 |
| PublicationCentury | 2000 |
| PublicationDate | 2020-07-01 |
| PublicationDateYYYYMMDD | 2020-07-01 |
| PublicationDate_xml | – month: 07 year: 2020 text: 2020-07-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationPlace | England |
| PublicationPlace_xml | – name: England |
| PublicationTitle | Bioinformatics (Oxford, England) |
| PublicationTitleAlternate | Bioinformatics |
| PublicationYear | 2020 |
| Publisher | Oxford University Press |
| Publisher_xml | – name: Oxford University Press |
| StartPage | i111 |
| SubjectTerms | Algorithms; Comparative and Functional Genomics; Data Compression; Genomics; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA; Software |
| Title | Weighted minimizer sampling improves long read mapping |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/32657365 https://www.proquest.com/docview/2424991489 https://pubmed.ncbi.nlm.nih.gov/PMC7355284 |
| Volume | 36 |
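
The weighted sampling idea described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the Winnowmap source. It assumes a weighted min-hash style ordering, `1 - h(x)^(1/w)`, where `h` maps a k-mer to a value between 0 and 1 and `w` is its weight, so a lower weight pushes the order value toward 1 and makes the k-mer less likely to be the minimum of a window. The function names, the BLAKE2 hash, and the example weight of 0.125 are all hypothetical choices made for illustration.

```python
import hashlib


def unit_hash(kmer: str) -> float:
    """Map a k-mer to a strictly positive pseudo-random value of at most 1.

    The hash function is an illustrative choice, not the one used by any tool.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)


def weighted_order(kmer: str, weights: dict, default_weight: float = 1.0) -> float:
    """Order value for a k-mer; smaller values are preferred as minimizers.

    Assumed ordering: 1 - h^(1/w). A small weight w makes h^(1/w) small,
    so the order value moves toward 1 and the k-mer is picked less often.
    """
    w = weights.get(kmer, default_weight)
    return 1.0 - unit_hash(kmer) ** (1.0 / w)


def weighted_minimizers(seq: str, k: int, window: int, weights: dict) -> list:
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Each window of `window` consecutive k-mers contributes the k-mer with
    the smallest order value (ties broken by leftmost position), so no
    window is ever left without a sampled k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    keys = [weighted_order(km, weights) for km in kmers]
    picked = set()
    for start in range(len(kmers) - window + 1):
        best = min(range(start, start + window), key=lambda i: (keys[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


# Hypothetical example: down-weight a repetitive k-mer instead of discarding it.
example_weights = {"AAAAA": 0.125}
example_hits = weighted_minimizers(
    "ACGTAAAAAAAAAAACGGTCA", k=5, window=4, weights=example_weights
)
```

Because down-weighted k-mers are only made less likely to win a window, never dropped, two sequences that share a substring spanning a full window still select the same k-mer from it. This is the guarantee the abstract contrasts with frequency-based filtering.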