Constructing Antidictionaries of Long Texts in Output-Sensitive Space
A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ of minimal absent words of length at most $\ell$ of the collection $\{y_1, \ldots, y_k\}$. The s...
| Published in: | Theory of Computing Systems, Vol. 65, No. 5, pp. 777-797 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Publication details: | New York; Springer US; 01.07.2021; Springer Nature B.V |
| Subjects: | |
| ISSN: | 1432-4350, 1433-0490 |
| Summary: | A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ of minimal absent words of length at most $\ell$ of the collection $\{y_1, \ldots, y_k\}$. The set $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ contains all the words $x$ such that $x$ is absent from all the words of the collection while there exist $i, j$, such that the maximal proper suffix of $x$ is a factor of $y_i$ and the maximal proper prefix of $x$ is a factor of $y_j$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. Indeed, the set $\mathrm{M}^{\ell}_{y}$ of minimal absent words of a word $y$ is equal to $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ for any decomposition of $y$ into a collection of words $y_1, \ldots, y_k$ such that there is an overlap of length at least $\ell - 1$ between any two consecutive words in the collection. This computation generally requires $\Omega(n)$ space for $n = \lvert y \rvert$ using any of the plenty available $\mathcal{O}(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$ which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\lVert \mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}} \rVert = o(n)$, for all $N \in [1, k]$, where $\lVert S \rVert$ denotes the sum of the lengths of words in set $S$. For instance, in the human genome, $n \approx 3 \times 10^{9}$ but $\lVert \mathrm{M}^{12}_{\{y_1,\ldots,y_k\}} \rVert \approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all $\mathrm{M}^{\ell}_{y_1}, \ldots, \mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ can be computed in $\mathcal{O}\left(kn + \sum_{N=1}^{k} \lVert \mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}} \rVert\right)$ total time using $\mathcal{O}(\mathrm{MaxIn} + \mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $\{y_1, \ldots, y_k\}$ and $\mathrm{MaxOut} = \max\{\lVert \mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}} \rVert : N \in [1, k]\}$. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution. |
|---|---|
| DOI: | 10.1007/s00224-020-10018-5 |
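
The definition of minimal absent words in the summary, and its claim about overlapping decompositions, can be illustrated with a small brute-force sketch. This is not the output-sensitive algorithm of the article: it stores every factor of length at most $\ell$ and is only meant to make the definition concrete on short inputs. The function names `factors_up_to`, `minimal_absent_words`, and `overlapping_chunks` are hypothetical and chosen for this illustration only.

```python
def factors_up_to(words, max_len):
    """Collect every factor (substring) of length <= max_len of any word."""
    facs = {""}  # the empty word is a factor of every word
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                facs.add(w[i:j])
    return facs


def minimal_absent_words(words, ell, alphabet):
    """Brute-force set of minimal absent words of length <= ell of a collection.

    A word x qualifies if x occurs in no word of the collection, while its
    maximal proper prefix and its maximal proper suffix each occur in some word.
    """
    facs = factors_up_to(words, ell)
    maws = set()
    for p in facs:  # p plays the role of the maximal proper prefix
        if len(p) >= ell:
            continue  # p + c would be longer than ell
        for c in alphabet:
            x = p + c
            if x not in facs and x[1:] in facs:
                maws.add(x)
    return maws


def overlapping_chunks(y, chunk_len, overlap):
    """Split y into consecutive chunks that overlap by exactly `overlap` letters."""
    step = chunk_len - overlap
    if step <= 0:
        raise ValueError("chunk_len must exceed overlap")
    chunks, start = [], 0
    while True:
        chunks.append(y[start:start + chunk_len])
        if start + chunk_len >= len(y):
            break
        start += step
    return chunks


if __name__ == "__main__":
    ell = 3
    y = "abaababaabaabbabab"
    whole = minimal_absent_words([y], ell, "ab")
    # An overlap of ell - 1 keeps every factor of length <= ell inside some chunk.
    parts = overlapping_chunks(y, chunk_len=6, overlap=ell - 1)
    print(sorted(whole), whole == minimal_absent_words(parts, ell, "ab"))
```

With an overlap of at least $\ell - 1$, every factor of $y$ of length at most $\ell$ lies entirely inside some chunk, so the collection and the original word have the same factors up to that length and therefore the same minimal absent words. The sketch keeps all chunks in memory at once; processing them one at a time in space bounded by the longest chunk and the output is the contribution the summary describes.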