We present a new compression format for natural language texts, allowing both exact and approximate search without decompression.
This new code —called End-Tagged Dense Code— has some advantages with respect to other compression techniques with similar
features such as the Tagged Huffman Code of [Moura et al., ACM TOIS 2000]. Our compression method obtains (i) better compression ratios, (ii) a simpler vocabulary representation, and (iii) a simpler and faster encoding. At the same time, it retains the most interesting features of the method based on the Tagged
Huffman Code, i.e., exact search for words and phrases directly on the compressed text using any known sequential pattern
matching algorithm, efficient word-based approximate and extended searches without any decoding, and efficient decompression
of arbitrary portions of the text. As a side effect, our analytical results give new upper and lower bounds for the redundancy
of d-ary Huffman codes.
Keywords Text compression -
D-ary Huffman coding - text databases
This work is partially supported by CICYT grant (#TEL99-0335-C04), CYTED VII.19 RIBIDI Project, and (for the third author)
Fondecyt Grant 1-020381.