All-pairs suffix/prefix in optimal time using Aho-Corasick space
| Published in: | Information processing letters Vol. 178; p. 106275 |
|---|---|
| Main Authors: | Loukides, Grigorios; Pissis, Solon P. |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V, 01.11.2022 |
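| Subjects: | Aho-Corasick machine; Algorithms; Computer Science; Data structures; Failure transition tree; String algorithms |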
| ISSN: | 0020-0190, 1872-6119 |
| Online Access: | https://inria.hal.science/hal-03832860 |
| Abstract | The all-pairs suffix/prefix (APSP) problem is a classic problem in computer science with many applications in bioinformatics. Given a set {S_1, …, S_k} of k strings of total length n, we are asked to find, for each string S_i, i ∈ [1, k], its longest suffix that is a prefix of string S_j, for all j ≠ i, j ∈ [1, k]. Several algorithms running in the optimal O(n + k^2) time for solving APSP are known. All of these algorithms are based on suffix sorting and thus require space Ω(n) in any case. We consider the parameterized version of the APSP problem, denoted by ℓ-APSP, in which we are asked to output only the pairs whose suffix/prefix overlap is of length at least ℓ. We give an algorithm for solving ℓ-APSP that runs in the optimal O(n + |OUTPUT_ℓ|) time using O(n) space, where OUTPUT_ℓ is the set of output pairs. Our algorithm is thus optimal for the APSP problem as well by setting ℓ = 0. Notably, our algorithm is fundamentally different from all optimal algorithms solving the APSP problem: it does not rely on sorting the suffixes of all input strings but on a novel traversal of the Aho-Corasick machine, and it thus requires space linear in the size of the machine.
• All-pairs suffix/prefix of length at least ℓ in optimal time. • All-pairs suffix/prefix in optimal time without suffix sorting. • All-pairs suffix/prefix in optimal time using Aho-Corasick space. |
|---|---|
| ArticleNumber | 106275 |
| Author | Loukides, Grigorios; Pissis, Solon P. |
| Author_xml | 1. Grigorios Loukides (grigorios.loukides@kcl.ac.uk), Department of Informatics, King's College London, London, UK; 2. Solon P. Pissis (solon.pissis@cwi.nl, ORCID 0000-0002-1445-1932), CWI, Amsterdam, the Netherlands |
| BackLink | https://inria.hal.science/hal-03832860 (record in HAL) |
| ContentType | Journal Article |
| Copyright | 2022 The Authors. Distributed under a Creative Commons Attribution 4.0 International License. |
| DOI | 10.1016/j.ipl.2022.106275 |
| Discipline | Computer Science |
| ISICitedReferencesCount | 3 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | String algorithms; Data structures; Failure transition tree; Algorithms; Aho-Corasick machine |
| Language | English |
| License | This is an open access article under the CC BY license. Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0 |
| ORCID | 0000-0002-1445-1932 |
| OpenAccessLink | https://inria.hal.science/hal-03832860 |
| PublicationDate | November 2022 |
| PublicationTitle | Information processing letters |
| PublicationYear | 2022 |
| Publisher | Elsevier B.V Elsevier |
| StartPage | 106275 |
| SubjectTerms | Aho-Corasick machine; Algorithms; Computer Science; Data structures; Failure transition tree; String algorithms |
| Title | All-pairs suffix/prefix in optimal time using Aho-Corasick space |
| URI | https://dx.doi.org/10.1016/j.ipl.2022.106275; https://inria.hal.science/hal-03832860 |
| Volume | 178 |
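
To make the ℓ-APSP definition in the abstract concrete, the following is a minimal Python sketch; it is not the authors' algorithm. It builds the trie and failure links of an Aho-Corasick machine and, for each string, walks the failure chain from its terminal node, since the nodes on that chain spell exactly the suffixes of the string that are prefixes of some input string. Several things here are assumptions made for illustration: the names `build_trie`, `failure_links`, `l_apsp` and `min_len` (standing for ℓ), 0-based string indices, the convention that the overlap may be all of S_i when S_i is a prefix of S_j, and the omission of pairs with empty overlap. The per-node index lists are a convenience of the sketch, and this simple walk can take more time than the optimal O(n + |OUTPUT_ℓ|) bound stated in the abstract.

```python
from collections import deque


def build_trie(strings):
    """Build the trie of all input strings (illustrative helper)."""
    children = [{}]   # children[v]: character -> child node
    depth = [0]       # depth[v]: length of the prefix spelled by node v
    passing = [[]]    # passing[v]: indices j such that S_j has that prefix
    terminal = []     # terminal[j]: node spelling all of S_j
    for j, s in enumerate(strings):
        v = 0
        for ch in s:
            if ch not in children[v]:
                children[v][ch] = len(children)
                children.append({})
                depth.append(depth[v] + 1)
                passing.append([])
            v = children[v][ch]
            passing[v].append(j)
        terminal.append(v)
    return children, depth, passing, terminal


def failure_links(children):
    """Standard breadth-first computation of Aho-Corasick failure links."""
    fail = [0] * len(children)
    queue = deque(children[0].values())  # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            f = fail[v]
            while f and ch not in children[f]:
                f = fail[f]
            fail[w] = children[f].get(ch, 0)
            queue.append(w)
    return fail


def l_apsp(strings, min_len=0):
    """Map (i, j), i != j, to the longest suffix/prefix overlap of S_i and S_j.

    Only non-empty overlaps of length at least min_len are reported.
    """
    children, depth, passing, terminal = build_trie(strings)
    fail = failure_links(children)
    result = {}
    for i in range(len(strings)):
        v = terminal[i]
        # Depths strictly decrease along the failure chain, so the first
        # value recorded for a pair (i, j) is its longest overlap.
        while v and depth[v] >= min_len:
            for j in passing[v]:
                if j != i and (i, j) not in result:
                    result[(i, j)] = depth[v]
            v = fail[v]
    return result
```

For example, under these assumptions `l_apsp(["abcd", "cdef", "efab", "bcd"], min_len=2)` reports overlap 3 for the pair (0, 3) and overlap 2 for the pairs (0, 1), (1, 2), (2, 0) and (3, 1); with `min_len=0` it additionally reports overlap 1 for (2, 3).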