The Data Compression Book-:Sliding Window Compression

LZSS Compression

LZSS compression seeks to avoid some of the bottlenecks and performance problems in the original LZ77 algorithm. It makes two major changes to the way the algorithm works. The first is in the way the text window is maintained. Under LZ77, the phrases in the text window were stored as a single contiguous block of text, with no other organization on top of it. LZSS still stores text in contiguous windows, but it creates an additional data structure that improves on the organization of the phrases.

As each phrase passes out of the look-ahead buffer and into the encoded portion of the text windows, LZSS adds the phrase to a tree structure. In the implementation that will be used in this chapter, the tree is a binary search tree. By sorting the phrases into a tree such as this, the time required to find the longest matching phrase in the tree will no longer be proportional to the product of the window size and the phrase length. Instead, it will be proportional to the base 2 logarithm of the window size multiplied by the phrase length.

The savings created by using the tree not only makes the compression side of the algorithm much more efficient, it also encourages experimentation with longer window sizes. Doubling the size of the text window now might only cause a small increase in the compression time, whereas before it would have doubled it.

The second change lies in the actual tokens output by the compression algorithm. Recall that LZ77 output tokens consisted of a phrase offset, a match length, and the character that followed the phrase. This meant that LZ77 was compelled to alternate pointers with plain characters, regardless of the nature of the input text.

LZSS instead allows pointers and characters to be freely intermixed. When first starting up, for example, the compression algorithm may not find any phrase matches to output for the first dozen or so input symbols. Under the LZ77 system, the encoder would still have to output a dummy match position with a length of zero for every symbol it output.

LZSS instead uses a single bit as a prefix to every output token to indicate whether it is an offset/length pair or a single symbol for output. When outputting several consecutive single characters, this method reduces the overhead from possibly several bytes per character down to a single byte per character.

Once the data is well characterized, the compressor may efficiently match up pointers every time it loads new data into the look-ahead buffer. LZ77 had some inefficiency here as well, since every offset/length pair had to be accompanied by a single character. This is not as bad as the previous type of inefficiency, but it still reduces the compression ratio.

Data Structures

Two important data structures are used in the implementation of LZSS shown in this chapter. They are the text window, which contains the previously encoded text buffer, and the look-ahead buffer. The text buffer is a simple character buffer declared and used as might normally be expected in a C program:

unsigned char window[ WINDOW_SIZE ];

As discussed previously, while the idea of a sliding window might imply that the text should “slide” through the window, this would actually be an inefficient way to implement it. Instead, the look-ahead buffer moves through the array and tracks its index as it goes along. This means that once a phrase is stored in the array, it stays there until it is overwritten after WINDOW_SIZE characters have been encoded.

This method of working with the text window also means that all string operations, copies, etc. performed on the window have to be done using modulo WINDOW_SIZE arithmetic. Computing (i+1) mod WINDOW_SIZE is usually done most efficiently if WINDOW_SIZE is an integral power of 2, and this implementation of LZSS expects that to be the case. Having WINDOW_SIZE be an integral power of 2 also leads to the most efficient way of encoding the window indices, so this is almost always the method used in sliding-window data compression.

The second data structure in this program is the binary tree used to store the phrases currently in the text window. The tree is defined by the tree structure shown here:

struct {
 int parent;
 int smaller_child;
 int larger_child;
} tree[ WINDOW_SIZE + 1 ];

Table of Contents