

Adaptive Methods

At present, dictionary-based compression schemes using static dictionaries are mostly ad hoc, implementation-dependent, and not general purpose. Most well-known dictionary algorithms are adaptive. Instead of having a completely defined dictionary when compression begins, adaptive schemes start out either with no dictionary or with a default baseline dictionary. As compression proceeds, the algorithms add new phrases to be used later as encoded tokens.

The basic principle behind adaptive dictionary programs is relatively easy to follow. Imagine a section of code that compresses text using an algorithm that looks something like this:

for ( ; ; ) {
  word = read_word( input_file );                    /* parse the next fragment         */
  if ( word == NULL )                                /* stop at the end of the input    */
    break;
  dictionary_index = look_up( word, dictionary );    /* test it against the dictionary  */
  if ( dictionary_index < 0 ) {                      /* not found:                      */
    output( word, output_file );                     /*   emit it as plain text         */
    add_to_dictionary( word, dictionary );           /*   and remember it for later     */
  } else
    output( dictionary_index, output_file );         /* found: emit the index instead   */
}

If the dictionary index used here could be encoded as an integer index into a table, we would achieve respectable compression with what is actually a very simple algorithm. This code is specialized for written documents, since it parses its input into words, but the principle behind it is similar to that behind many more sophisticated algorithms. It illustrates the basic components of an adaptive dictionary compression algorithm:

1.  To parse the input text stream into fragments that can be tested against the dictionary.
2.  To test the input fragments against the dictionary; it may or may not be desirable to report on partial matches.
3.  To add new phrases to the dictionary.
4.  To encode dictionary indices and plain text so that they are distinguishable (a sketch of one way to do this follows the list).
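
As a concrete illustration of the fourth requirement, the fragment below shows one simple way to keep the two kinds of output distinguishable: every token begins with a one-byte flag. The byte-oriented layout, the 16-bit index size, and the routine names here are assumptions made for this sketch; a real program would more likely pack the flag and the index together at the bit level to avoid wasting space.

#include <stdio.h>
#include <string.h>

/* Emit a dictionary index: a flag byte of 1, then a 16-bit index. */
void output_index( FILE *output_file, unsigned int dictionary_index )
{
  putc( 1, output_file );
  putc( ( dictionary_index >> 8 ) & 0xff, output_file );
  putc( dictionary_index & 0xff, output_file );
}

/* Emit plain text: a flag byte of 0, a length byte, then the characters. */
void output_plain( FILE *output_file, const char *word )
{
  size_t length = strlen( word );

  putc( 0, output_file );
  putc( (int) length, output_file );
  fwrite( word, 1, length, output_file );
}

The decompressor can then read one flag byte and know immediately whether the bytes that follow are an index to look up or characters to copy straight to the output.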

The corresponding decompression program has a slightly different set of requirements. It no longer has to parse the input text stream into fragments, and it doesn’t have to test fragments against the dictionary. Instead, it has the following requirements: (1) to decode the input stream into either dictionary indices or plain text; (2) to add new phrases to the dictionary; (3) to convert dictionary indices into phrases; and (4) to output phrases as plain text. The ability to accomplish these tasks at relatively low cost in system resources has made dictionary-based programs popular over the last ten years.
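A decompression loop in the same pseudocode style as the compression loop above might look like the following. The read_token() routine, its token structure, and phrase_for() are assumptions made for this sketch; the other routines parallel the undefined helpers used in the encoder.

for ( ; ; ) {
  token = read_token( input_file );                  /* decode an index or plain text */
  if ( token.type == TOKEN_PLAIN ) {
    output( token.text, output_file );               /* output the phrase as text     */
    add_to_dictionary( token.text, dictionary );     /* add the new phrase            */
  } else
    output( phrase_for( token.index, dictionary ),   /* convert the index to a phrase */
            output_file );
}

Because the decompressor adds a phrase to its dictionary at exactly the point where the compressor did, the indices assigned by the two programs always agree, and no dictionary ever has to be transmitted with the compressed data.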

A Representative Example

Compressing data when sending it to magnetic tape has several nice side effects. First, it reduces the use of magnetic tape. Though magnetic tape is not particularly expensive, some applications make prodigious use of it. Second, the effective transfer rate to and from the tape is increased. Improvements in transfer speed through hardware are generally expensive, but compression through software is in a sense “free.” Finally, in some cases, the overall CPU time involved may actually be reduced. If the CPU cost of writing a byte to magnetic tape is sufficiently high, writing half as many compressed bytes may save enough cycles to pay for the compression.

While the benefits of compressing data before sending it to magnetic tape have long been clear, compression was applied only sporadically until the late 1980s. In 1989, however, Stac Electronics successfully implemented a dictionary-based compression algorithm on a chip. This algorithm was quickly embraced as an industry standard and is now widely used by tape-drive manufacturers worldwide.

This compression method is generally referred to by the standard which defines it: QIC-122. (QIC refers to the Quarter Inch Cartridge industry group, a trade association of tape-drive manufacturers.) As you may know, Stac Electronics expanded the scope of this algorithm beyond tape drives to the consumer hard disk utility market in the form of its successful Stacker program (discussed later in this chapter).

QIC-122 provides a good example of how a sliding-window, dictionary-based compression algorithm actually works. It is based on the LZ77 sliding-window concept. As symbols are read in by the encoder, they are added to the end of a 2K window that forms the phrase dictionary. To encode a symbol, the encoder checks to see if it is part of a phrase already in the dictionary. If it is, it creates a token that defines the location of the phrase and its length. If it is not, the symbol is passed through unencoded.
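
The fragment below sketches the matching step just described: a brute-force search of a 2K window of previously seen symbols for the longest phrase that matches the text at the current position. It is a simplified illustration only; the bit-level token layout defined by QIC-122 is not reproduced here, and emit_phrase() and emit_literal() are hypothetical output routines.

#define WINDOW_SIZE 2048
#define MIN_MATCH   2

void emit_phrase( long offset, long length );    /* hypothetical output routines */
void emit_literal( unsigned char symbol );

void encode( const unsigned char *data, long data_length )
{
  long pos = 0;

  while ( pos < data_length ) {
    long window_start = ( pos > WINDOW_SIZE ) ? ( pos - WINDOW_SIZE ) : 0;
    long best_length = 0;
    long best_offset = 0;
    long i;

    /* Search the window for the longest phrase matching the input. */
    for ( i = window_start ; i < pos ; i++ ) {
      long match = 0;
      while ( pos + match < data_length &&
              data[ i + match ] == data[ pos + match ] )
        match++;
      if ( match > best_length ) {
        best_length = match;
        best_offset = pos - i;
      }
    }

    if ( best_length >= MIN_MATCH ) {            /* phrase found in the window */
      emit_phrase( best_offset, best_length );
      pos += best_length;
    } else {                                     /* pass the symbol through    */
      emit_literal( data[ pos ] );
      pos++;
    }
  }
}

A production encoder would not scan the window byte by byte like this; it would index the window with a hash table or a tree so that candidate matches can be found quickly. That refinement changes how fast matches are found, not the principle of the algorithm.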

