A study on word-based and integral-bit Chinese text compression algorithms

Bibliographic Details
Published in:	Journal of the American Society for Information Science, Vol. 50, No. 3, pp. 218-228
Main Authors:	Cheng, Kwok-Shing; Young, Gilbert H.; Wong, Kam-Fai
Format:	Journal Article
Language:	English
Published:	New York John Wiley & Sons, Inc 1999 John Wiley & Sons American Documentation Institute
Subjects:	Automatic text analysis Chinese Chinese materials Computer System Design Electronic Text Exact sciences and technology General aspects Information and communication sciences Information Processing Information processing and retrieval Information science. Documentation Information Systems Integral bit models Language Processing Logical and linguistic tools Sciences and techniques of general use Text Compression Text Condensation Word based models Word Compression ratio Character Segmentation Huffman code Oriental language Chinese Entropy Sampling Algorithm Comparative study
ISSN:	0002-8231, 1097-4571
Online Access:	Get full text

Description
Summary:	Experimental results show that a word‐based arithmetic coding scheme can achieve higher compression performance for Chinese text. However, arithmetic coding is a fractional‐bit compression algorithm, which is known to be time consuming. In this article, we instead study how to cascade the word segmentation model with a faster alternative, the integral‐bit compression algorithm. It is shown that the cascaded algorithm is more suitable for practical usage. Among several word‐based integral‐bit compression algorithms, WLZSSHUF achieves the best compression results. Not only does it achieve a compression ratio comparable to that of a PPM compressor, COMP‐2, but it also compresses and decompresses faster. In the last part of this article, the relation between the accuracy of the word segmentation model (match ratio) and the performance of the compression algorithm (compression ratio) is analyzed. By varying the match ratio, it was discovered that the growth rate of the compression ratio is content‐dependent and close to linear. The results of our study will help practitioners of information retrieval to design word‐based compression algorithms for Chinese. This is particularly useful for multilingual digital libraries, in which a massive volume of data is often involved.
Bibliography:	RGC - No. POLY4174/98E; ark:/67375/WNG-5Q1LJJZJ-C; ArticleID:ASI4; istex:4A450846A0BDD76186E09F76749071C958FA2398
DOI:	10.1002/(SICI)1097-4571(1999)50:3<218::AID-ASI4>3.0.CO;2-1
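
Illustration:	The summary describes cascading a word segmentation model with an integral‐bit coder. The sketch below is only a toy illustration of that general idea and is not the authors' WLZSSHUF algorithm. It assumes a greedy forward maximum-matching segmenter over a small hypothetical lexicon (the record does not say which segmentation model was used), and it uses plain Huffman coding as a stand-in for the integral‐bit stage. The names `segment`, `huffman_code_lengths`, `coded_bits`, and the sample text are all invented for the example.

```python
import heapq
from collections import Counter
from itertools import count


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (an assumed segmenter, not the paper's)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            # Fall back to a single character when no lexicon entry matches.
            if n == 1 or piece in lexicon:
                words.append(piece)
                i += n
                break
    return words


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    tie = count()  # tie-breaker so heapq never has to compare the dicts
    heap = [(f, next(tie), {s: 0}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def coded_bits(symbols):
    """Total payload size in bits when each symbol gets a whole-bit Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


# Hypothetical sample data, chosen only to exercise the functions.
text = "中文文本压缩中文文本"
lexicon = {"中文", "文本", "压缩"}

words = segment(text, lexicon)
print(words)
print("word-based bits:", coded_bits(words))
print("character-based bits:", coded_bits(list(text)))
```

The bit counts cover the coded payload only. A real compressor would also have to store or transmit the code table and the lexicon, so a comparison on a toy string like this one says nothing about the compression ratios reported in the article.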