The Data Compression Book-:Sliding Window Compression

The Expansion Routine

The LZSS compression algorithm is highly asymmetrical. The compression routine is fairly complicated, and it does quite a bit of work for every character or phrase that is compressed. In comparison, the expansion code is extremely simple. It has very little work to do, and in fact it can operate nearly as fast as an ordinary copy routine. This makes LZSS an excellent choice for data that needs to be compressed once and expanded many times.

void ExpandFile( BIT_FILE *input, FILE *output, int argc, char *argv[] )
{
 int i;
 int current_position;
 int c;
 int match_length;
 int match_position;

 current_position = 1;
 for ( ; ; ) {
  if ( InputBit( input ) ) {
   c = InputBits( input, 8 );
   putc( c, output );
   window[ current_position ] = c;
   current_position = MOD_WINDOW( current_position + 1 );
  } else {
   match_position = InputBits( input, INDEX_BIT_COUNT );
   if ( match_position == END_OF_STREAM )
    break;
   match_length = InputBits( input, LENGTH_BIT_COUNT );
   match_length += BREAK_EVEN;
   for ( i = 0 ; i <= match_length ; i++ ) {
    c = window[MOD_WINDOW( match_position + i ) ];
    putc( c, output );
    window[ current_position ] = c;
    current_position = MOD_WINDOW( current_position + 1 );
   }
  }
 }
}

Virtually all the time in the expansion routine is spent in the main loop. The first step reads in a single bit. If this is a zero, the next byte will contain an unencoded character. The character is read in, output to the file, and put in the text window at the current position.

If the input bit was a one instead of a zero, the expansion routine reads in a match_position and length instead of a character. It checks to see if the match_position is actually the encoded END_OF_STREAM message. If so, the program is done, and it can exit. Otherwise, the match_length is read in and adjusted to range 2 through 17.

Once the match_position and length are known, a short loop executes. In this loop, the character from the match string is pulled out of the text window, stored in the file, and put in the window at the current position, which is then updated.

That is the entire expansion routine. It is easy to see why expansion takes place so quickly.

Improvements

While LZSS makes for a fairly good compression algorithm, improvements can be made to it. For instance, LZSS compresses poorly when it starts since it does not have any data in the text window for matches. It is fairly simple to preload the window with WINDOW_SIZE-LOOK_AHEAD_SIZE characters, then add all the appropriate strings to the binary tree.

The trouble with preloading the window is deciding what to preload it with, since we have no idea what type of data will come up in the input stream. Probably the easiest thing to do is to insert 256 strings that contain 16 consecutive occurrences of all of the possible symbols in the alphabet. Unfortunately, runs of repeated characters are the types of initial data that will be helped the least by preloading, but there will be some improvement.

Much experimentation can be done with the number of bits used in the index and the length codes. It is possible to achieve better compression by increasing the number of bits used for each of these, but there are at least two negative effects. First, the compression speed will suffer as the window grows, due to the extra work required to navigate and maintain a larger binary tree. Second, compression of smaller files will suffer due to the increased time needed to build a full dictionary. Some of this startup overhead can be reduced by starting the compression code with a smaller code size and working up to the larger sizes as the dictionary fills up.

One way the programs used here can be sped up dramatically is by using blocked I/O. The tokens output by the compression program here are all exactly either one or two bytes. It is possible to buffer these bytes up while accumulating the flag bits. When eight flag bits have been output, the flag byte can be sent, followed by between 8 and 16 bytes of byte-oriented data. Since the eight to sixteen bytes can be written using conventional byte-oriented code, the compression routines will run considerably faster.

Another technique which speeds up both compression and expansion is to create a “ghost buffer” at the end of the text window. In the case of this program the ghost buffer would hold seventeen characters, which would be identical to the characters in the first seventeen locations of the window. By maintaining a copy of the first seventeen bytes in the ghost buffer, the comparison routines can run without using modulo arithmetic on the indices. Since the compression program spends so much time on string comparison, this results in big time savings.

One final bit of inefficiency found in LZSS relates to the handling of duplicate strings. We remove duplicate strings from the binary tree, but we leave them in the text window, where they take up valuable space. It is possible to free up the space used by these duplicate strings in the text window, allowing for expansion of the dictionary. However, the side effect of this is that the decompression program has to keep track of duplicate strings, which will result in a significant cutback in expansion speed.

Table of Contents