An optimal text compression algorithm based on frequent pattern mining

Bibliographic details
Published in:	Journal of ambient intelligence and humanized computing, Vol. 9, No. 3, pp. 803-822
Main authors:	Oswald, C., Sivaselvan, B.
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg: Springer Berlin Heidelberg, 2018-06-01; Springer Nature B.V.
Subjects:	Algorithms; Artificial Intelligence; Codes; Coding; Compression ratio; Computational Intelligence; Data compression; Data integrity; Data mining; Datasets; Dictionaries; Engineering; Entropy; Huffman codes; Original Research; Pattern analysis; Robotics and Automation; Tables (data); User Interfaces and Human Computer Interaction; Apriori algorithm; Frequent pattern mining; Lossless compression; Huffman encoding
ISSN:	1868-5137, 1868-5145

Description
Summary:	Data compression has been explored in depth as a research area over the years, resulting in Huffman Encoding, LZ77, LZW, GZip, RAR, etc. Much of this research has focused on conventional character- or word-based mechanisms without considering the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of Data Mining suggested by Naren Ramakrishnan et al., wherein Huffman Encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel frequent pattern mining based Huffman Encoding algorithm for text data and employs a hash table for frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is also efficient in terms of database scans and storage space, because it reduces the code table size. An optimal (pruned) set of patterns is used in the encoding process instead of the character-based approach of conventional Huffman. Simulation results over 18 benchmark corpora demonstrate an improvement in compression ratio ranging from 18.49% over sparse datasets to 751% over dense datasets. It is also demonstrated that the proposed algorithm achieves a pattern space reduction ranging from 5% over sparse datasets to 502% over dense corpora.
DOI:	10.1007/s12652-017-0540-2
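
Illustrative sketch (not part of the catalog record, and not the authors' algorithm): the summary above describes counting frequent patterns in a hash table, pruning them, and Huffman-coding the surviving patterns instead of single characters. The Python below is a minimal sketch of that general idea under stated assumptions. The function names, the `max_len` and `min_count` parameters, the use of contiguous substrings as "patterns", and the greedy longest-match segmentation are all hypothetical choices made for illustration; the paper's actual mining, pruning, and frequency rules are not given in this record.

```python
import heapq
from collections import Counter
from itertools import count


def frequent_patterns(text, max_len=3, min_count=2):
    """Count substrings of length 1..max_len in a hash table; keep frequent ones."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Assumption: single characters are always kept so any text stays encodable.
    return {p: c for p, c in counts.items() if len(p) == 1 or c >= min_count}


def tokenize(text, patterns):
    """Greedy longest-match segmentation of text into known patterns."""
    if not text:
        return []
    max_len = max(len(p) for p in patterns)
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in patterns:  # always succeeds by n == 1
                tokens.append(piece)
                i += n
                break
    return tokens


def huffman_codes(freqs):
    """Build a prefix code from a symbol -> frequency mapping."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(text, max_len=3, min_count=2):
    """Return (bit string, code table) for text."""
    patterns = frequent_patterns(text, max_len, min_count)
    tokens = tokenize(text, patterns)
    # Only patterns actually used in the segmentation enter the code table.
    codes = huffman_codes(Counter(tokens))
    return "".join(codes[t] for t in tokens), codes


def decode(bits, codes):
    """Invert encode() using the same code table."""
    inverse = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

Calling `encode(text)` returns the bit string together with the code table, and `decode(bits, codes)` recovers the original text. Building the table only from patterns that survive segmentation is one simple way to keep it small; it stands in for, and should not be read as, the pruning strategy evaluated in the paper.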