Sketching and Streaming for Dictionary Compression
We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much atte...
Uloženo v:
| Vydáno v: | DCC (Los Alamitos, Calif.) s. 213 - 222 |
|---|---|
| Hlavní autoři: | , , , , , |
| Médium: | Konferenční příspěvek |
| Jazyk: | angličtina |
| Vydáno: |
IEEE
19.03.2024
|
| Témata: | |
| ISSN: | 2375-0359 |
| On-line přístup: | Získat plný text |
| Tagy: |
Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
|
| Shrnutí: | We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in O(\sqrt n \cdot \log n) working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction. |
|---|---|
| ISSN: | 2375-0359 |
| DOI: | 10.1109/DCC58796.2024.00029 |