The Data Compression Book-:Dictionary-Based Compression

Chapter 7
Dictionary-Based Compression

So far, the compression methods we have looked at used a statistical model to encode single symbols. They achieve compression by encoding symbols into bit strings that use fewer bits than the original symbols. The quality of the compression goes up or down depending on how good the program is at developing a model. The model not only has to accurately predict the probabilities of symbols, it also has to predict probabilities that deviate from the mean. More deviation achieves better compression.

But dictionary-based compression algorithms use a completely different method to compress data. This family of algorithms does not encode single symbols as variable-length bit strings; it encodes variable-length strings of symbols as single tokens. The tokens form an index to a phrase dictionary. If the tokens are smaller than the phrases they replace, compression occurs.

In many respects, dictionary-based compression is easier for people to understand. It represents a strategy that programmers are familiar with—using indexes into databases to retrieve large amounts of storage. In everyday life, we use phone numbers, Dewey Decimal numbers, and postal codes to encode larger strings of text. This is essentially what a dictionary-based encoder does.

An Example

A good example of how dictionary based compression works can be created by using a standard dictionary. For this example, I will use the Random House Dictionary of the English Language, Second Edition, Unabridged. Using this dictionary’s system as a key for encoding messages, I can achieve a reasonable amount of compression. Using my proprietary scheme, the first eight words of the first sentence in this paragraph would read:

1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2

This dictionary-based encoding scheme consists of a simple lookup table. The first number gives the page of the dictionary, and the second number tells the number of the word on that page. The dictionary has 2,200 pages with less than 256 entries on each page. Thus, 1/1 encodes the first word on the first page, which is “A.” 822/3 encodes the third word on the 822nd page, which is “good.”

To see how much space this scheme would save, look at the number of bits actually used to encode a word. Since a word can land on any of 2,200 pages, we need 12 bits to encode the page number. Each page has fewer than 256 entries, so the number of the entry will take just 8 bits to encode. This gives a total of 20 bits to encode any word in the dictionary, or 2.5 bytes per word.

The ASCII representation of the eight words in our encoded message takes 43 bytes. The encoded message takes 2.5 × 8 bytes, or 20 bytes. Thus, we compressed our text to 50 percent of its original size using dictionary encoding.

In theory, a different encoding method can probably improve on this. The dictionary has about 315,000 words. Shannon’s formula for information content tells us that any one of the words in the dictionary can be encoded using just a little over eighteen bits. We used the page number/entry number scheme to make it easier to look up the encoded word, a general theme in dictionary-based compression.

Static vs. Adaptive

In general, dictionary-based compression replaces phrases with tokens. If the number of bits in the token is less than the number of bits in the phrase, compression will occur. But this definition of dictionary-based compression still leaves enormous room for variation. Consider, for example, the methods for building and maintaining a dictionary.

In some cases, it is advantageous to use a predefined dictionary to encode text. If the text to be encoded is a database containing all motor-vehicle registrations for Texas, we could develop a dictionary with only a few thousand entries that concentrated on words like “General Motors,” “Smith,” “Main,” and “1977.” Once this dictionary were compiled, it could be kept on-line and used by both the encoder and decoder as needed.

A dictionary like this is called a static dictionary. It is built up before compression occurs, and it does not change while the data is being compressed. It has advantages and disadvantages. One of the biggest advantages is that a static dictionary can be “tuned” to fit the data it is compressing. With the motor-vehicle registration database, for example, Huffman encoding could allocate fewer bits to strings such as “Ford” and more bits to “Yugo.” Of course, we could use different bit strings depending on which field is being compressed.

Adaptive compression schemes can’t tune their dictionaries in advance, which in principle would seem a major disadvantage. But static dictionary schemes have to deal with the problem of how to pass the dictionary from the encoder to the decoder. Chapters 3 and 5 showed that passing statistics along with compressed data can significantly harm compression, particularly on small files.

But this doesn’t have to be a disadvantage in every case. In many situations, a static dictionary could remain the same over long periods of time and be kept on line, available to both the compressor and the decompressor. The motor-vehicle database dictionary could be calculated once, for example, then kept on hand. In the case of an exceptionally large amount of data, the compression ratio may not be significantly degraded if the dictionary is passed with the compressed text.

Table of Contents

Chapter 7Dictionary-Based Compression

An Example

Static vs. Adaptive

Chapter 7
Dictionary-Based Compression