Previous Table of Contents Next


Chapter 8
Sliding Window Compression

The genesis of modern dictionary-based compression can be traced to the 1977 Ziv and Lempel paper, “ A Universal Algorithm for Sequential Data Compression,” published in IEEE Transactions on Information Theory. In retrospect, this algorithm (referred to hereafter as LZ77) does not seem particularly remarkable. It is simple enough that it could have easily been described thirty or forty years earlier, and there is no doubt that it could have been implemented at least as a “proof of principle” program well before 1977.

However, as was discussed in the previous chapter, till the late 1970s most data compression work concentrated on improved ways to drive Huffman coders and perhaps on more exotic studies of digrams or other statistical topics. So the LZ77 paper truly broke new ground.

The Algorithm

LZ77 compression uses previously seen text as a dictionary. It replaces variable-length phrases in the input text with fixed-size pointers into the dictionary to achieve compression. The amount of compression depends on how long the dictionary phrases are, how large the window into previously seen text is, and the entropy of the source text with respect to the LZ77 model.

The main data structure in LZ77 is a text window, divided into two parts. The first consists of a large block of recently decoded text. The second, normally much smaller, is a look-ahead buffer. The look-ahead buffer has characters read in from the input stream but not yet encoded.

The normal size of the text window is several thousand characters. The look-ahead buffer is generally much smaller, maybe ten to one hundred characters. The algorithm tries to match the contents of the look-ahead buffer to a string in the dictionary. A simplistic example of a text window is shown in Figure 8.1.


Figure 8.1  A text window in use.

Figure 8.1 shows a snippet of C code being compressed. The text window has a total width of 64 characters, with 16 of those characters used by the look-ahead buffer. The LZ77 algorithm, as originally conceived, issued sequences of tokens. Each token consists of three different data items which defined a phrase of variable length in the current look-ahead buffer. The three items in the token are: (1) an offset to a phrase in the text window; (2) the length of the phrase; and (3) the first symbol in the look-ahead buffer that follows the phrase.

In the above example, the look-ahead buffer contains the phrase “<MAX;j++)\r.” By searching through the buffer, we find that “<MAX” is located at position 14 in the text window. It matches the look-ahead buffer for the first four symbols. The first symbol not present in the look-ahead buffer is the space character. So this token is encoded as: 14, 4, ‘ ’.

The compression program that implements the LZ77 algorithm first emits the token, then shifts the text window over by five characters, which is the width of the phrase just encoded. Five new symbols are then read into the look-ahead buffer, and the process repeats.


Figure 8.2  The window after encoding 14, 4, ‘ ’

The next token issued by the compression algorithm would encode the phrase “;j+” as “40, 2,‘+’.” The syntax of this token allows for phrases that have no match of any length in the window. If the look-ahead buffer shown above had no match, for example, it could be encoded a single character at a time using a phrase length of zero: “0, 0, ‘;’.” This method is not efficient, but it ensures that the algorithm can encode any message.

The code to implement this compression algorithm should be fairly simple. It merely has to look through the entire text window for the longest match, encode it, then shift. A brute force application of this algorithm might look something like this:

int window_cmp( char *w, int i, int j, int length )
{
 int count = 0;

 while ( length-- ) {
  if ( w[ i++ ] == w[ j++ ] )
   count++;
  else
   return( count );
 }
 return( count );
}

    .
    .
    .
 match_position = 0;
 match_length = 0;
 for ( i = 0 ; i < ( WINDOW_SIZE - LOOK_AHEAD_SIZE ); i++ ) {
  length = window_cmp( window, i, LOOK_AHEAD, LOOK_AHEAD_SIZE );
  if ( length > match_length ) {
   match_position = i;
   match_length = length;
  }
 }
 encode( match_position, match_length,
         window[ LOOK_AHEAD+match_length ] );
 memmove( window, window+match_length+1, WINDOW_SIZE - match_length );
 for ( i = 0 ; i < match_length+1 ; i++ )
  window[ WINDOW_SIZE - match_length + i ] = getc( input );
   .
   .
   .


Previous Table of Contents Next