Constructing Antidictionaries in Output-Sensitive Space

Bibliographic details
Published in: DCC (Los Alamitos, Calif.), pp. 538-547
Main authors: Ayad, Lorraine A.K., Badkobeh, Golnaz, Fici, Gabriele, Heliou, Alice, Pissis, Solon P.
Format: Conference paper
Language: English
Published: IEEE, 1 March 2019
ISSN:2375-0359
Description
Summary: A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, y_2, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}_{y_1\#\cdots\#y_k}$ of minimal absent words of length at most $\ell$ of the word $y = y_1\#y_2\#\cdots\#y_k$, where $\# \notin \Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n = |y|$ using any of the many available $O(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\|M^{\ell}_{y_1\#\cdots\#y_N}\| = o(n)$ for all $N \in [1, k]$. For instance, in the human genome, $n \approx 3 \times 10^9$ but $\|M^{12}_{y_1\#\cdots\#y_k}\| \approx 10^6$. We consider a constant-sized alphabet for stating our results. We show that all of $M^{\ell}_{y_1}, \ldots, M^{\ell}_{y_1\#\cdots\#y_k}$ can be computed in $O(kn + \sum_{N=1}^{k} \|M^{\ell}_{y_1\#\cdots\#y_N}\|)$ total time using $O(\mathrm{MaxIn} + \mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $y_1, \ldots, y_k$ and $\mathrm{MaxOut} = \max\{\|M^{\ell}_{y_1\#\cdots\#y_N}\| : N \in [1, k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
DOI:10.1109/DCC.2019.00062
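
Illustrative example: the definition in the summary can be made concrete with a small brute-force sketch in Python. This is not the output-sensitive algorithm of the paper. It stores every factor of length at most ℓ, so its space use is not output-sensitive, and it is only meant to show what the set of minimal absent words contains. The function names `factors_up_to` and `minimal_absent_words` are hypothetical. The sketch also assumes that the separator # never appears inside a candidate word, so the factors of y_1#...#y_k over Σ are simply the union of the factors of the individual words.

```python
def factors_up_to(word, max_len):
    """Return all non-empty factors (substrings) of `word` of length <= max_len."""
    n = len(word)
    return {
        word[i:j]
        for i in range(n)
        for j in range(i + 1, min(i + max_len, n) + 1)
    }


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force minimal absent words of length <= max_len for a collection.

    A word x is a minimal absent word if x does not occur in any input word
    while both x[:-1] and x[1:] do occur (the empty word always occurs).
    Every proper factor of x is a factor of x[:-1] or x[1:], so this check
    is equivalent to requiring that all proper factors occur.
    """
    # The empty word is included so that absent single letters are handled
    # by the same rule as longer words.
    factors = {""}
    for w in words:
        factors |= factors_up_to(w, max_len)

    maws = set()
    for left in factors:
        if len(left) >= max_len:
            continue  # extending would exceed the length bound
        for b in alphabet:
            x = left + b
            if x not in factors and x[1:] in factors:
                maws.add(x)
    return maws


if __name__ == "__main__":
    collection = ["abaab", "bb"]
    # Recompute the set for each prefix y_1, y_1#y_2, ... of the collection.
    for upto in range(1, len(collection) + 1):
        print(upto, sorted(minimal_absent_words(collection[:upto], "ab", 3)))
```

For the single word abaab over {a, b} with ℓ = 4, the sketch yields the four words aaa, aaba, bab and bb. Adding a second word bb removes bb from the set and introduces abb, bba and bbb, which shows how the set changes as words are appended to the collection.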