Constructing Antidictionaries of Long Texts in Output-Sensitive Space


Detailed description


Bibliographic details
Published in:	Theory of computing systems, Vol. 65, No. 5, pp. 777-797
Main authors:	Ayad, Lorraine A.K.; Badkobeh, Golnaz; Fici, Gabriele; Héliou, Alice; Pissis, Solon P.
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 1 July 2021; Springer Nature B.V.
Subjects:	Algorithms; Alphabets; Bioinformatics; Computation; Computer Science; Data compression; Genomes; Theory of Computation; String algorithm; Output sensitive algorithm; Absent word; Antidictionary
ISSN:	1432-4350, 1433-0490
Online access:	Full text

Description
Abstract:	A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}_{\{y_1, \ldots, y_k\}}$ of minimal absent words of length at most $\ell$ of the collection $\{y_1, \ldots, y_k\}$. The set $M^{\ell}_{\{y_1, \ldots, y_k\}}$ contains all the words $x$ such that $x$ is absent from all the words of the collection while there exist $i, j$ such that the maximal proper suffix of $x$ is a factor of $y_i$ and the maximal proper prefix of $x$ is a factor of $y_j$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. Indeed, the set $M^{\ell}_{y}$ of minimal absent words of a word $y$ is equal to $M^{\ell}_{\{y_1, \ldots, y_k\}}$ for any decomposition of $y$ into a collection of words $y_1, \ldots, y_k$ such that there is an overlap of length at least $\ell - 1$ between any two consecutive words in the collection. This computation generally requires $\Omega(n)$ space for $n = |y|$ using any of the many available $O(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\|M^{\ell}_{\{y_1, \ldots, y_N\}}\| = o(n)$ for all $N \in [1, k]$, where $\|S\|$ denotes the sum of the lengths of words in set $S$. For instance, in the human genome, $n \approx 3 \times 10^{9}$ but $\|M^{12}_{\{y_1, \ldots, y_k\}}\| \approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all $M^{\ell}_{y_1}, \ldots, M^{\ell}_{\{y_1, \ldots, y_k\}}$ can be computed in $O(kn + \sum_{N=1}^{k} \|M^{\ell}_{\{y_1, \ldots, y_N\}}\|)$ total time using $O(\mathrm{MaxIn} + \mathrm{MaxOut})$ space, where MaxIn is the length of the longest word in $\{y_1, \ldots, y_k\}$ and $\mathrm{MaxOut} = \max\{\|M^{\ell}_{\{y_1, \ldots, y_N\}}\| : N \in [1, k]\}$. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
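To make the definition in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the output-sensitive algorithm of the article: it simply enumerates every factor of length at most ℓ, so its space use is not output-sensitive. The function names (`factors`, `minimal_absent_words`, `split_with_overlap`) and the toy word `abaab` are illustrative assumptions, not taken from the record. The last lines check, on this toy input only, the stated property that splitting a word into pieces overlapping by ℓ − 1 letters leaves the set of minimal absent words of length at most ℓ unchanged.

```python
def factors(words, max_len):
    """Return every factor (substring) of length 0..max_len of any word."""
    occ = {""}  # the empty word is a factor of every word
    for y in words:
        for i in range(len(y)):
            for j in range(i + 1, min(len(y), i + max_len) + 1):
                occ.add(y[i:j])
    return occ


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force minimal absent words of length <= max_len of a collection.

    A word x qualifies if x occurs in no word of the collection, while its
    maximal proper prefix x[:-1] and maximal proper suffix x[1:] each occur
    in some word of the collection.
    """
    occ = factors(words, max_len)
    maws = set()
    for u in occ:  # u plays the role of the maximal proper prefix
        if len(u) >= max_len:
            continue
        for a in alphabet:
            x = u + a
            if x not in occ and x[1:] in occ:
                maws.add(x)
    return maws


def split_with_overlap(y, chunk_len, overlap):
    """Cut y into windows of chunk_len letters; neighbours share `overlap` letters.

    Assumes chunk_len > overlap (hypothetical helper for the demonstration).
    """
    step = chunk_len - overlap
    return [y[i:i + chunk_len] for i in range(0, max(len(y) - overlap, 1), step)]


if __name__ == "__main__":
    y, ell = "abaab", 3
    print(sorted(minimal_absent_words([y], "ab", 4)))  # ['aaa', 'aaba', 'bab', 'bb']
    pieces = split_with_overlap(y, chunk_len=3, overlap=ell - 1)
    print(pieces)  # ['aba', 'baa', 'aab']
    # Same set for the whole word and for the overlapping pieces.
    print(minimal_absent_words(pieces, "ab", ell) == minimal_absent_words([y], "ab", ell))
```

The sketch stores all factors up to length ℓ in a set, which is exactly the kind of input-sized index the abstract says the article avoids; it is meant only as an executable restatement of the definition.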
DOI:	10.1007/s00224-020-10018-5