Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes
This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the...
| Published in: | Journal of computational biology, Vol. 31, No. 10, p. 975 |
|---|---|
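| Main authors: | Renders, Luca; Depuydt, Lore; Rahmann, Sven; Fostier, Jan |
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 1 October 2024 |
| Subjects: | Algorithms; Computational Biology - methods; Humans; Pattern Recognition, Automated - methods; Programming, Linear; Software |
| ISSN: | 1557-8666 |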
| Abstract | This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of k errors. While existing literature offers designed schemes for up to k = 4 errors, designing search schemes for larger k values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to k = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher k values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset. |
|---|---|
| Author | Renders, Luca; Depuydt, Lore; Fostier, Jan; Rahmann, Sven |
| Author_xml | 1. Luca Renders (ORCID: 0000-0002-2244-1427), Internet Technology and Data Science Lab, Ghent University, Ghent, Belgium; 2. Lore Depuydt (ORCID: 0000-0001-8517-0479), Internet Technology and Data Science Lab, Ghent University, Ghent, Belgium; 3. Sven Rahmann (ORCID: 0000-0002-8536-6065), Center for Bioinformatics Saar, Saarland University, Saarbrücken, Germany; 4. Jan Fostier (ORCID: 0000-0002-9994-8269), Internet Technology and Data Science Lab, Ghent University, Ghent, Belgium |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/39344875 (View this record in MEDLINE/PubMed) |
| CitedBy_id | crossref_primary_10_1093_nargab_lqaf025 crossref_primary_10_1089_cmb_2024_0925 crossref_primary_10_1186_s13015_025_00281_x |
| ContentType | Journal Article |
| DOI | 10.1089/cmb.2024.0664 |
| Discipline | Biology; Mathematics |
| EISSN | 1557-8666 |
| ExternalDocumentID | 39344875 |
| Genre | Research Support, Non-U.S. Gov't; Journal Article |
| ISICitedReferencesCount | 3 |
| ISSN | 1557-8666 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 10 |
| Keywords | approximate pattern matching; integer linear program; sequence alignment; search schemes |
| Language | English |
| ORCID | 0000-0002-2244-1427 0000-0002-9994-8269 0000-0001-8517-0479 0000-0002-8536-6065 |
| PMID | 39344875 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-10-00 20241001 |
| PublicationDateYYYYMMDD | 2024-10-01 |
| PublicationDate_xml | – month: 10 year: 2024 text: 2024-10-00 |
| PublicationDecade | 2020 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States |
| PublicationTitle | Journal of computational biology |
| PublicationTitleAlternate | J Comput Biol |
| PublicationYear | 2024 |
| StartPage | 975 |
| SubjectTerms | Algorithms; Computational Biology - methods; Humans; Pattern Recognition, Automated - methods; Programming, Linear; Software |
| Title | Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/39344875 https://www.proquest.com/docview/3111202423 |
| Volume | 31 |
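
The abstract describes a search scheme as a set of searches, each giving a processing order for the pattern parts plus cumulative lower and upper bounds on the number of errors. A minimal sketch of what "lossless" means under that description follows. It is not taken from the paper, Hato, or Columba: the tuple layout `(order, lower, upper)`, the function names `covers` and `uncovered_distributions`, and the two-part scheme for k = 1 are all assumptions chosen for illustration, and errors are modelled simply as a count per part.

```python
from itertools import product


def covers(search, errors):
    """Return True if one search accepts a given error distribution.

    search: (order, lower, upper), where order lists part indices in
            processing order and lower/upper are cumulative error bounds
            after each processed part (hypothetical layout).
    errors: tuple with the number of errors in each part of the pattern.
    """
    order, lower, upper = search
    total = 0
    for part, lo, hi in zip(order, lower, upper):
        total += errors[part]  # cumulative errors seen so far
        if not lo <= total <= hi:
            return False
    return True


def uncovered_distributions(scheme, num_parts, k):
    """List every error distribution with at most k errors that no search covers."""
    missing = []
    for errors in product(range(k + 1), repeat=num_parts):
        if sum(errors) <= k and not any(covers(s, errors) for s in scheme):
            missing.append(errors)
    return missing


# Illustrative two-part scheme for k = 1 (an assumption, not quoted from the paper).
scheme_k1 = [
    ((0, 1), (0, 0), (0, 1)),  # part 0 exact, then part 1 with at most 1 error
    ((1, 0), (0, 1), (0, 1)),  # part 1 exact, then part 0 with exactly 1 error
]

# Dropping the second search leaves a gap, so the scheme is no longer lossless.
incomplete = scheme_k1[:1]

if __name__ == "__main__":
    print(uncovered_distributions(scheme_k1, num_parts=2, k=1))
    print(uncovered_distributions(incomplete, num_parts=2, k=1))
```

An empty result from `uncovered_distributions` means every way of spreading up to k errors over the parts is accepted by at least one search, which is the property the abstract calls lossless. The brute-force enumeration grows quickly with the number of parts and k, which is consistent with the abstract's remark that designing schemes for larger k becomes computationally expensive.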