An optimal text compression algorithm based on frequent pattern mining
| Published in: | Journal of ambient intelligence and humanized computing, Vol. 9, No. 3, pp. 803 - 822 |
|---|---|
| Main authors: | , |
| Format: | Journal Article |
| Language: | English |
| Published: | Berlin/Heidelberg: Springer Berlin Heidelberg, 01.06.2018 (Springer Nature B.V) |
| Subjects: | |
| ISSN: | 1868-5137, 1868-5145 |
| Online access: | Full text |
| Abstract: | Data Compression as a research area has been explored in depth over the years, resulting in Huffman Encoding, LZ77, LZW, GZip, RAR, etc. Much of the research has focused on conventional character/word-based mechanisms without looking at the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of Data Mining suggested by Naren Ramakrishnan et al., wherein Huffman Encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel frequent pattern mining based Huffman Encoding algorithm for text data and employs a hash table in the process of frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is also efficient in terms of database scans and storage space, since it reduces the code table size. The optimal (pruned) set of patterns is employed in the encoding process instead of the character-based approach of conventional Huffman. Simulation results over 18 benchmark corpora demonstrate an improvement in compression ratio ranging from 18.49% over sparse datasets to 751% over dense datasets. It is also demonstrated that the proposed algorithm achieves pattern space reduction ranging from 5% over sparse datasets to 502% over dense corpora. |
|---|---|
| Bibliography: | ObjectType-Article-1; SourceType-Scholarly Journals-1; ObjectType-Feature-2; content type line 14 |
| DOI: | 10.1007/s12652-017-0540-2 |
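
The core idea described in the abstract (counting candidate patterns in a hash table, pruning them to a frequent set, and building Huffman codes over those patterns rather than over single characters) can be illustrated with a minimal sketch. The Python below is not the authors' algorithm: the substring-based candidate generation, the support threshold, the maximum pattern length, and the greedy longest-match segmentation are assumptions made purely for illustration.

```python
import heapq
from collections import Counter

def mine_frequent_patterns(text, max_len=4, min_support=3):
    """Count all substrings up to max_len in a hash table (dict) and keep
    those meeting the support threshold; a stand-in for the FPM phase."""
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            counts[text[i:j]] += 1
    # Always keep single characters so every input remains encodable.
    return {p: c for p, c in counts.items() if len(p) == 1 or c >= min_support}

def segment(text, patterns, max_len=4):
    """Greedy longest-match segmentation of the text into mined patterns."""
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + l]
            if cand in patterns:
                tokens.append(cand)
                i += l
                break
    return tokens

def huffman_codes(freqs):
    """Standard Huffman code construction over pattern frequencies."""
    heap = [[f, idx, {sym: ""}] for idx, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {next(iter(freqs)): "0"}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, idx, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, [f1 + f2, idx, merged])
    return heap[0][2]

if __name__ == "__main__":
    text = "abracadabra abracadabra abracadabra"
    patterns = mine_frequent_patterns(text)
    tokens = segment(text, patterns)
    codes = huffman_codes(Counter(tokens))
    encoded = "".join(codes[t] for t in tokens)
    print(f"{len(text) * 8} bits raw -> {len(encoded)} bits encoded, "
          f"code table of {len(codes)} patterns")
```

Running the example prints raw versus encoded bit counts for a repetitive toy string; the cost of transmitting the code table is ignored here, whereas the paper's reported compression ratios would have to account for that overhead.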