An optimal text compression algorithm based on frequent pattern mining

Bibliographic details
Published in:	Journal of Ambient Intelligence and Humanized Computing, Vol. 9, No. 3, pp. 803-822
Main authors:	Oswald, C., Sivaselvan, B.
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.06.2018; Springer Nature B.V.
Subjects:	Algorithms; Artificial Intelligence; Codes; Coding; Compression ratio; Computational Intelligence; Data compression; Data integrity; Data mining; Datasets; Dictionaries; Engineering; Entropy; Huffman codes; Original Research; Pattern analysis; Robotics and Automation; Tables (data); User Interfaces and Human Computer Interaction; Apriori algorithm; Frequent pattern mining; Lossless compression; Huffman encoding
ISSN:	1868-5137, 1868-5145

Description
Summary:	Data compression as a research area has been explored in depth over the years, resulting in Huffman Encoding, LZ77, LZW, GZip, RAR, etc. Much of the research has focused on conventional character/word-based mechanisms without looking at the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of Data Mining suggested by Naren Ramakrishnan et al., wherein Huffman Encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel frequent pattern mining based Huffman Encoding algorithm for text data and employs a hash table in the process of frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is also efficient in terms of database scans and storage space, since it reduces the code table size. An optimal (pruned) set of patterns is employed in the encoding process instead of the character-based approach of conventional Huffman. Simulation results over 18 benchmark corpora demonstrate an improvement in compression ratio ranging from 18.49% over sparse datasets to 751% over dense datasets. It is also demonstrated that the proposed algorithm achieves pattern space reduction ranging from 5% over sparse datasets to 502% in a dense corpus.
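The general idea in the summary, Huffman coding over frequent patterns rather than single characters, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the substring-based notion of a pattern, the greedy longest-match tokenizer, and the names `mine_frequent_patterns`, `tokenize`, `huffman_codes`, `min_support` and `max_len` are all assumptions made for illustration. The only elements taken from the summary are the hash table used for counting, the pruning of the pattern set, and the Huffman code table built over patterns.

```python
import heapq
from collections import Counter
from itertools import count


def mine_frequent_patterns(text, min_support=2, max_len=4):
    """Count substrings of length 1..max_len in a hash table (Counter).

    Assumption: a "pattern" is a contiguous substring. Patterns seen fewer
    than min_support times are pruned; single characters are always kept so
    that every input remains encodable.
    """
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(text) - k + 1):
            counts[text[i:i + k]] += 1
    return {p for p, c in counts.items() if len(p) == 1 or c >= min_support}


def tokenize(text, patterns, max_len=4):
    """Greedy longest-match split of text into known patterns (an assumption)."""
    tokens, i = [], 0
    while i < len(text):
        for k in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + k] in patterns:
                tokens.append(text[i:i + k])
                i += k
                break
    return tokens


def huffman_codes(freqs):
    """Build a prefix code from a {symbol: frequency} mapping."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode(text, min_support=2, max_len=4):
    """Return (bit string, code table); the table holds only patterns in use."""
    patterns = mine_frequent_patterns(text, min_support, max_len)
    tokens = tokenize(text, patterns, max_len)
    codes = huffman_codes(Counter(tokens))
    return "".join(codes[t] for t in tokens), codes


def decode(bits, codes):
    """Invert encode() by walking the bit string against the code table."""
    inverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

Calling `bits, codes = encode(text)` followed by `decode(bits, codes)` round-trips the input. Because the code table is built only from tokens that actually occur after tokenization, unused candidate patterns never enter it, which loosely mirrors the code-table reduction the summary describes.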
DOI:	10.1007/s12652-017-0540-2