Sorted Consecutive Occurrence Queries in Substrings

Uložené v:
Podrobná bibliografia
Názov: Sorted Consecutive Occurrence Queries in Substrings
Autori: Akram, Waseem, Mieno, Takuya
Prispievatelia: Waseem Akram and Takuya Mieno
Informácie o vydavateľovi: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025.
Rok vydania: 2025
Predmety: suffix tree, ddc:004, string algorithm, consecutive occurrences
Popis: The string indexing problem is a fundamental computational problem with numerous applications, including information retrieval and bioinformatics. It aims to efficiently solve the pattern matching problem: given a text T of length n for preprocessing and a pattern P of length m as a query, the goal is to report all occurrences of P as substrings of T. Navarro and Thankachan [CPM 2015, Theor. Comput. Sci. 2016] introduced a variant of this problem called the gap-bounded consecutive occurrence query, which reports pairs of consecutive occurrences of P in T such that their gaps (i.e., the distances between them) lie within a query-specified range [g₁, g₂]. Recently, Bille et al. [FSTTCS 2020, Theor. Comput. Sci. 2022] proposed the top-k close consecutive occurrence query, which reports the k closest consecutive occurrences of P in T, sorted in non-decreasing order of distance. Both problems are optimally solved in query time with O(n log n)-space data structures. In this paper, we generalize these problems to the range query model, which focuses only on occurrences of P in a specified substring T[a.. b] of T. Our contributions are as follows: - We propose an O(n log² n)-space data structure that answers the range top-k consecutive occurrence query in O(|P| + log log n + k) time. - We propose an O(n log^{2+ε} n)-space data structure that answers the range gap-bounded consecutive occurrence query in O(|P| + log log n + output) time, where ε is a positive constant and output denotes the number of outputs. Additionally, as by-products, we present algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which are of independent interest.
Druh dokumentu: Conference object
Popis súboru: application/pdf
Jazyk: English
DOI: 10.4230/lipics.cpm.2025.24
Prístupová URL adresa: https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.24
Rights: CC BY
Prístupové číslo: edsair.od......1814..431ea8fb65bffeca45e833044c69a4c1
Databáza: OpenAIRE
Popis
Abstrakt:The string indexing problem is a fundamental computational problem with numerous applications, including information retrieval and bioinformatics. It aims to efficiently solve the pattern matching problem: given a text T of length n for preprocessing and a pattern P of length m as a query, the goal is to report all occurrences of P as substrings of T. Navarro and Thankachan [CPM 2015, Theor. Comput. Sci. 2016] introduced a variant of this problem called the gap-bounded consecutive occurrence query, which reports pairs of consecutive occurrences of P in T such that their gaps (i.e., the distances between them) lie within a query-specified range [g₁, g₂]. Recently, Bille et al. [FSTTCS 2020, Theor. Comput. Sci. 2022] proposed the top-k close consecutive occurrence query, which reports the k closest consecutive occurrences of P in T, sorted in non-decreasing order of distance. Both problems are optimally solved in query time with O(n log n)-space data structures. In this paper, we generalize these problems to the range query model, which focuses only on occurrences of P in a specified substring T[a.. b] of T. Our contributions are as follows: - We propose an O(n log² n)-space data structure that answers the range top-k consecutive occurrence query in O(|P| + log log n + k) time. - We propose an O(n log^{2+ε} n)-space data structure that answers the range gap-bounded consecutive occurrence query in O(|P| + log log n + output) time, where ε is a positive constant and output denotes the number of outputs. Additionally, as by-products, we present algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which are of independent interest.
DOI:10.4230/lipics.cpm.2025.24