A grammar-based compression using a variation of Chomsky normal form of context free grammar

Bibliographic Details
Published in:	ISITA 2016 : proceedings of 2016 International Symposium on Information Theory and Its Applications : Hyatt Regency Monterey Hotel, Monterey, California, USA, October 30 - November 2, 2016 pp. 246 - 250
Main Author:	Arimura, Mitsuharu
Format:	Conference Proceeding
Language:	English
Published:	IEICE 01.10.2016
Subjects:	Context; Data compression; Encoding; Grammar; Mars; Pattern matching; Production

Description
Summary:	This paper proposes a new class of grammar-based lossless source codes. Grammar-based codes are a class of universal data compression algorithms that use a context-free grammar. A semi-Chomsky normal form (semi-CNF) of context-free grammars, a modified form of the Chomsky normal form (CNF), is newly introduced. The proposed algorithm encodes a given sequence into a binary codeword in three steps. In the first step, a semi-CNF set of production rules is constructed by repeatedly substituting a new variable for a pair of symbols or variables. In the second step, the semi-CNF grammar is translated into an irreducible or smaller grammar by eliminating production rules that are used only once in the other production rules. In the third step, the resulting grammar is encoded into a binary codeword. The LZ78, Multilevel Pattern Matching (MPM), and Byte Pair Encoding (BPE) algorithms can be treated as examples of this class of codes. The LZ78 and MPM algorithms do not use the second step of this procedure, so the proposed method can improve their compression performance through the unified procedure. An advantage of this method is that the transformation from a given sequence to the grammar is quite simple, thanks to the three-step algorithm through semi-CNF.
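
The first two steps described in the summary can be illustrated with a minimal Python sketch. This is not the paper's algorithm. The record does not say how the pair to substitute is chosen, nor does it define semi-CNF precisely, so the sketch assumes a BPE-style rule (replace the most frequent adjacent pair) and represents the grammar as a start sequence plus binary rules. The function names build_pair_grammar, inline_single_use, and expand are hypothetical, and the third step (binary encoding of the grammar) is omitted.

```python
from collections import Counter


def build_pair_grammar(seq):
    """Step 1 (assumed BPE-style): repeatedly replace the most frequent
    adjacent pair with a fresh variable until no pair occurs twice.
    Terminals are single characters; variables are integers."""
    seq = list(seq)
    rules = {}  # variable -> (left, right)
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        var = len(rules)  # fresh integer variable
        rules[var] = pair
        out, i = [], 0
        while i < len(seq):  # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(var)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def inline_single_use(start, rules):
    """Step 2: eliminate every rule whose variable is used exactly once
    by substituting its right-hand side at the single place of use."""
    start = list(start)
    rules = {v: list(rhs) for v, rhs in rules.items()}
    while True:
        uses = Counter(
            s for rhs in [start, *rules.values()] for s in rhs if s in rules
        )
        once = [v for v in rules if uses[v] == 1]
        if not once:
            return start, rules
        v = once[0]
        body = rules.pop(v)

        def subst(rhs):
            return [t for s in rhs for t in (body if s == v else [s])]

        start = subst(start)
        rules = {u: subst(rhs) for u, rhs in rules.items()}


def expand(sym, rules):
    """Expand a symbol back to the terminal string it derives."""
    if sym in rules:
        return "".join(expand(s, rules) for s in rules[sym])
    return sym


start, rules = build_pair_grammar("abababab")
start, rules = inline_single_use(start, rules)
assert "".join(expand(s, rules) for s in start) == "abababab"
```

After step 2 the right-hand sides may be longer than two symbols, which matches the summary's description of moving from the pairwise form to a smaller grammar; whether the result is irreducible in the formal sense depends on details the record does not give.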