Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes

Bibliographic details
Published in: Journal of computational biology, Volume 31, Issue 10, p. 975
Main authors: Renders, Luca; Depuydt, Lore; Rahmann, Sven; Fostier, Jan
Format: Journal Article
Language: English
Published: United States, 1 October 2024
ISSN: 1557-8666
Abstract: This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of k errors. While existing literature offers designed schemes for up to k = 4 errors, designing search schemes for larger k values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to k = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher k values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset.
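Illustrative note (not part of the published record): the abstract's definition of a search scheme can be made concrete with a small sketch. The Python below assumes a search is a triple (order, lower, upper), where order lists the parts in processing order and lower/upper are the cumulative error bounds after each processed part. The names covers and is_lossless are hypothetical and do not come from Hato or Columba. The two-search scheme for k = 1 errors over two parts is a standard small example; the brute-force check only tests coverage of error distributions and does not verify that the processing order is contiguous.

```python
from itertools import product

def covers(search, errors):
    """Return True if the search accepts this per-part error distribution."""
    order, lower, upper = search
    total = 0
    for step, part in enumerate(order):
        total += errors[part]  # cumulative errors after processing this part
        if not (lower[step] <= total <= upper[step]):
            return False
    return True

def is_lossless(scheme, num_parts, k):
    """Check that every distribution of at most k errors is covered by some search."""
    for errors in product(range(k + 1), repeat=num_parts):
        if sum(errors) <= k and not any(covers(s, errors) for s in scheme):
            return False
    return True

# Two parts, at most k = 1 error (assumed illustrative scheme).
scheme_k1 = [
    ((0, 1), (0, 0), (0, 1)),  # part 0 exact, then up to 1 error in total
    ((1, 0), (0, 1), (0, 1)),  # part 1 exact, then exactly 1 error in total
]

# Dropping the second search loses the case "one error in part 0".
incomplete = scheme_k1[:1]

if __name__ == "__main__":
    print(is_lossless(scheme_k1, num_parts=2, k=1))
    print(is_lossless(incomplete, num_parts=2, k=1))
```

Enumerating all error distributions grows quickly with the number of parts and k, which is consistent with the abstract's remark that designing schemes for larger k incurs escalating computational costs; the sketch is only meant for small cases.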
Authors:
1. Renders, Luca (ORCID: 0000-0002-2244-1427), Internet Technology and Data Science Lab, Ghent University, Ghent, Belgium
2. Depuydt, Lore (ORCID: 0000-0001-8517-0479), Internet Technology and Data Science Lab, Ghent University, Ghent, Belgium
3. Rahmann, Sven (ORCID: 0000-0002-8536-6065), Center for Bioinformatics Saar, Saarland University, Saarbrücken, Germany
4. Fostier, Jan (ORCID: 0000-0002-9994-8269), Internet Technology and Data Science Lab, Ghent University, Ghent, Belgium
PubMed record: https://www.ncbi.nlm.nih.gov/pubmed/39344875
DOI: 10.1089/cmb.2024.0664
Discipline: Biology; Mathematics
Genre: Journal Article; Research Support, Non-U.S. Gov't
Peer reviewed: yes
Scholarly: yes
Keywords: approximate pattern matching; integer linear program; sequence alignment; search schemes
PMID: 39344875
Subject terms: Algorithms; Computational Biology - methods; Humans; Pattern Recognition, Automated - methods; Programming, Linear; Software