Bidirectional String Anchors for Improved Text Indexing and Top- K Similarity Search

The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good wo...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:IEEE transactions on knowledge and data engineering Ročník 35; číslo 11; s. 1 - 18
Hlavní autoři: Loukides, Grigorios, Pissis, Solon P., Sweering, Michelle
Médium: Journal Article
Jazyk:angličtina
Vydáno: IEEE 01.11.2023
Témata:
ISSN:1041-4347, 1558-2191
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given an integer <inline-formula><tex-math notation="LaTeX">\ell</tex-math></inline-formula>, our mechanism selects the lexicographically smallest rotation in every length-<inline-formula><tex-math notation="LaTeX">\ell</tex-math></inline-formula> fragment. We show that, like minimizers samples, bd-anchors samples are approximately uniform, locally consistent, and computable in linear time. Furthermore, our experiments demonstrate that the bd-anchors sample sizes decrease proportionally to <inline-formula><tex-math notation="LaTeX">\ell</tex-math></inline-formula>; and that these sizes are competitive to or smaller than the minimizers sample sizes. We theoretically justify these results by analyzing the expected size of bd-anchors samples. We also prove that computing a total order on the input alphabet which minimizes the bd-anchors sample size is NP-hard. We next highlight the benefits of bd-anchors in two important applications: text indexing and top-<inline-formula><tex-math notation="LaTeX">K</tex-math></inline-formula> similarity search. For the first application, we develop an index for performing on-line pattern searches in near-optimal time, and show experimentally that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index; we also show that it is substantially faster than two classic text indexes. For the second application, we develop a heuristic for top-<inline-formula><tex-math notation="LaTeX">K</tex-math></inline-formula> similarity search under edit distance, and show experimentally that it is generally as accurate as the state-of-the-art tool for the same purpose but more than one order of magnitude faster .
ISSN:1041-4347
1558-2191
DOI:10.1109/TKDE.2022.3231780