Sketching and Streaming for Dictionary Compression

We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much atte...

Full description

Saved in:
Bibliographic Details
Published in:DCC (Los Alamitos, Calif.) pp. 213 - 222
Main Authors: Becker, Ruben, Canton, Matteo, Cenzato, Davide, Kim, Sung-Hwan, Kodric, Bojana, Prezza, Nicola
Format: Conference Proceeding
Language:English
Published: IEEE 19.03.2024
Subjects:
ISSN:2375-0359
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in O(\sqrt n \cdot \log n) working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.
ISSN:2375-0359
DOI:10.1109/DCC58796.2024.00029