The Data Compression Book-:Sliding Window Compression

Table of Contents

The decompression algorithm for LZ77 is even simpler, since it doesn’t have to do comparisons. It reads in a token, outputs the indicated phrase, outputs the following character, shifts, and repeats. It maintains the window, but it does not work with string comparisons. A decompression program that used the output of the previous program might have a loop like this:

decode( &match_position, &match_length, &character );
fwrite( window+match_position, 1, match_length, output );
putc( character, output );
for ( i = 0 ; i < match_length ; i++ )
window[ LOOK_AHEAD+i ] = window[ match_position+i ];
window[ LOOK_AHEAD+i ] = character;
memmove( window, window+match_length+1, WINDOW_SIZE - match_length );

One interesting side effect of this decompression method is that it can use phrases that have not yet been encoded to encode existing phrases. In a file that had one hundred consecutive ‘A’ characters, for example, we would encode the first A as (0, 0, ‘A’). Our window would then look like that shown in Figure 8.3.

Figure 8.3 Coding for one hundred consecutive A characters.

We could then encode the next nine A characters as (38, 9, ‘A’). It may seem odd to use a nine-character phrase here. Though we can see the eight characters in the phrase presently in the look-ahead buffer, the decoder won’t be able to. When the decoder receives the (38, 9, ‘A’) token, its buffer will look like Figure 8.4.

Figure 8.4 The buffer for the decoder when it receives the (38, 9, ‘A’) token.

But by examining the decompression algorithm, you can see how the decompression routine manages this trick. It sits in a loop, copying from the match position to the look-ahead buffer. After the first character has been copies, the buffer looks like Figure 8.5.

Figure 8.5 What the buffer looks like after it copies the first character.

The next time through the loop the second A character will be available to be copied though it was not in the window when the decoding started. After the entire copy is complete, along with the single character store, the buffer is ready to shift, as shown in Figure 8.6.

Figure 8.6 The buffer, when ready to shift.

This illustrates a powerful feature of LZ77 compression: rapid adaptation to the character of the input stream. In this example, it encoded a sequence of ten characters when its “dictionary” had only been loaded with a single character.

Problems with LZ77

The implementation of LZ77 shown here is deliberately crude. It also has to be considered a somewhat liberal interpretation of the algorithm. The authors presented little discussion of implementation details when they presented their method.

There is clearly a major performance bottleneck in the LZ77 approach. When encoding, it has to perform string comparisons against the look-ahead buffer for every position in the text window. As it tries to improve compression performance by increasing the size of the window, and thus the dictionary, this performance bottleneck only gets worse. On the bright side, however, the decompression portion of this algorithm does not have to suffer through this bottleneck. Since it only copies the phrases, it can operate at a much higher rate. Even better, the LZ77 decompressor will not be severely affected by increases in either the size of the text window or the look-ahead buffer.

A second performance problem occurs with the way the sliding window is managed. For conceptual convenience, the discussion here treated the sliding window as though it were truly sliding “across” the text, progressing from the end of the buffer to the front as the encoding process was executed.

This may be conceptually superior, but it is certainly not the best way to code an LZ77 program. In fact, it is much better to have a sliding index or pointer into a fixed buffer. Instead of moving the phrases toward the front of the window, a sliding pointer would keep the next in the same place in the window and move the start and end pointers along the buffer as text is encoded.

Using sliding pointers does create a few problems. For one, we can’t use a straightforward string-compare function like strncmp() to look for longest phrases because a phrase may land across the end of the physical window, with the first character at window[ WINDOW_SIZE - 1 ] and the second at window[ 0 ]. This means that as we do string comparisons we need to use a modulo index into the window instead of a normal index. A recoded version of strncmp() that would work properly under these revised circumstances might look like this:

int window_cmp( char *w, int i, int j, int length )
{
 int count = 0;
 while ( length-- ) {
  if ( w[ i ] == w[ j ] )
   count++;
  else
   return( count );
  i = ++i % WINDOW_SIZE;
  j = ++j % WINDOW_SIZE;
 }
 return( count );
}

This routine is slightly more complicated, but it will pay for itself in savings on calls to memmove(). Keeping the buffer in one place is a big savings in CPU cycles. In addition, the routine can be made even more efficient if WINDOW_SIZE is an integral power of 2. The modulus operator can then be replaced by a logical AND, saving even more time.

An Encoding Problem

Besides the CPU cost problems, the LZ77 algorithm has a major efficiency problem. When encoding phrases, LZ77 achieves good compression rapidly. Even if the phrases being substituted for input text are short, they will still generally cause very effective compression to take place.

The problem occurs when matching phrases are not found in the dictionary. When this is the case, the compression program still has to use the same three component tokens to encode a single character. To realize the cost of this, imagine encoding a single character when using a 4,096-byte window and a sixteen-byte look-ahead buffer. This would take twelve bits to encode a window position and another four bits to encode a phrase length. Using this system, encoding the (0, 0, c) token would take twenty-four bits, all to encode a single eight-bit symbol. This is a very high price to pay, and there ought to be a way to improve it.

Table of Contents