The Data Compression Book: Sliding Window Compression

Greedy vs. Best Possible

Both LZ77 and LZSS are called “greedy” algorithms: they don’t look ahead into the input stream to analyze it for the best combination of indices and characters. Consider a dictionary-based encoding scheme that uses nine bits to encode a single character and twenty-five bits to encode a combined index/offset pair. This scheme has a break-even point between two and three characters: two literal characters cost 18 bits and three cost 27, so it would encode a match of two characters as two individual symbols and a match of three characters as an index/offset token.
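The break-even arithmetic can be checked with a few lines of C. This is only a sketch built on the nine-bit and twenty-five-bit costs of the example; the names CHAR_BITS, PHRASE_BITS, break_even_length and match_cost are illustrative and do not come from the program presented later in the chapter.

```c
/* Costs assumed from the example in the text; not from the book's listing. */
#define CHAR_BITS   9    /* bits needed to emit one literal character   */
#define PHRASE_BITS 25   /* bits needed to emit one index/offset token  */

/* Shortest match length for which one phrase token costs fewer bits
 * than emitting the same characters as literals. */
int break_even_length(int char_bits, int phrase_bits)
{
    int length = 1;

    while (length * char_bits <= phrase_bits)
        length++;
    return length;
}

/* Bits the scheme would spend on a match of the given length:
 * whichever of the two encodings is cheaper. */
int match_cost(int length, int char_bits, int phrase_bits)
{
    int literal_cost = length * char_bits;

    return literal_cost < phrase_bits ? literal_cost : phrase_bits;
}
```

Calling break_even_length(CHAR_BITS, PHRASE_BITS) gives 3: a two-character match is cheaper as literals (18 bits), while a three-character match is cheaper as a token (25 bits instead of 27).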

Consider now how we would go about encoding the phrase “Go To Statement Considered Harmful” if the phrase dictionary contained the following fragments: “Go T”, “o S”, “tat”, and “Stat”. A greedy encoder would naturally encode the four-character phrase “Go T” first, followed by the three-character phrase “o S”, then the three-character phrase “tat”. The output of the encoder up to this point would look like this:

Offset/Length of “Go T”   :   25 bits
Offset/Length of “o S”    :   25 bits
Offset/Length of “tat”    :   25 bits
                              -------
                              75 bits

The encoder appears to be doing the sensible thing, emitting phrases rather than individual characters wherever it can. But an optimal encoder would encode the fragment as shown:

Offset/Length of “Go ”    :   25 bits
Character ‘T’             :    9 bits
Character ‘o’             :    9 bits
Offset/Length of “Stat”   :   25 bits
                              -------
                              68 bits

These figures clearly show that the greedy encoder did not do as well as the optimal encoder. But it should also be noted that even in this contrived example, the difference between the two is only about 10 percent. When using dictionary coding, it is difficult to find examples of optimal encoders outperforming greedy encoders by more than a few percent. The largest differences occur when only short phrases are in the dictionary, and there is a real possibility that encoding single symbols will take less space than a phrase.
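The two totals can be reproduced with a small experiment. The sketch below is not the book's code. It assumes the four-fragment dictionary and the 9-bit and 25-bit costs of the example, matches only whole dictionary phrases, and uses hypothetical names throughout (phrases, greedy_bits, optimal_bits). The optimal parse is found with a simple right-to-left dynamic program, and it may split the text differently from the listing above while arriving at the same total.

```c
#include <string.h>

#define CHAR_BITS   9    /* cost of one literal character (from the example)  */
#define PHRASE_BITS 25   /* cost of one index/offset token (from the example) */
#define MIN_MATCH   3    /* break-even length for these two costs             */
#define MAX_TEXT    64   /* arbitrary limit for this sketch                   */

/* Hypothetical phrase dictionary holding the fragments from the example. */
static const char *phrases[] = { "Go T", "o S", "tat", "Stat" };
#define PHRASE_COUNT (sizeof(phrases) / sizeof(phrases[0]))

/* Length of the longest dictionary phrase found at text[pos], or 0. */
static size_t longest_match(const char *text, size_t pos)
{
    size_t best = 0;
    size_t i;

    for (i = 0; i < PHRASE_COUNT; i++) {
        size_t len = strlen(phrases[i]);
        if (len > best && strncmp(text + pos, phrases[i], len) == 0)
            best = len;
    }
    return best;
}

/* Greedy parse: always take the longest match available right now. */
int greedy_bits(const char *text)
{
    size_t pos = 0;
    size_t n = strlen(text);
    int bits = 0;

    while (pos < n) {
        size_t len = longest_match(text, pos);
        if (len >= MIN_MATCH) {
            bits += PHRASE_BITS;
            pos += len;
        } else {
            bits += CHAR_BITS;
            pos += 1;
        }
    }
    return bits;
}

/* Optimal parse: cost[pos] is the fewest bits needed to encode text[pos..].
 * Returns -1 if the text is longer than this sketch supports. */
int optimal_bits(const char *text)
{
    int cost[MAX_TEXT + 1];
    size_t n = strlen(text);
    size_t pos;
    size_t i;

    if (n > MAX_TEXT)
        return -1;
    cost[n] = 0;
    for (pos = n; pos-- > 0; ) {
        cost[pos] = CHAR_BITS + cost[pos + 1];      /* emit a literal */
        for (i = 0; i < PHRASE_COUNT; i++) {        /* or any matching phrase */
            size_t len = strlen(phrases[i]);
            if (strncmp(text + pos, phrases[i], len) == 0 &&
                PHRASE_BITS + cost[pos + len] < cost[pos])
                cost[pos] = PHRASE_BITS + cost[pos + len];
        }
    }
    return cost[0];
}
```

For the ten characters “Go To Stat”, greedy_bits returns 75 and optimal_bits returns 68, matching the totals above. The dynamic program has to examine every position against every phrase before it can commit to any output, which hints at why optimal parsing costs more CPU time than the greedy approach.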

The problem with optimal coding is simply one of payback. Implementing an optimal encoder generally means that encoding speed will be drastically reduced. While optimizing algorithms are available, they tend to be CPU intensive, and the profit derived is generally small. In the world of data compression, a few good heuristics are often more respected than a provably superior algorithm. The greedy heuristic in this case is definitely the choice of most compression programmers.

The Code

The C implementation of LZSS shown here is relatively simple. A production program would probably want to take advantage of numerous potential improvements, which will be discussed at the end of the chapter.

By the very nature of LZSS compression, the compression program will be considerably more complicated than the decoder. The decoder does not have to worry about maintaining the tree or searching for matches. Those two activities are what the encoder spends most of its time doing.

Constants and Macros

All of the constants and global data used in this program are shown below. The parameters of the text window are initially defined by deciding how many bits to allocate to the two fields used to define a pointer or index into the text window. In this example, INDEX_BIT_COUNT is set to twelve: it will use twelve bits to define an index into the text window. The LENGTH_BIT_COUNT macro is set to four, which means it will use a four-bit field to encode the length of a matching phrase.

After determining the size of the two bit fields, other macros can be given values derived from them. First, WINDOW_SIZE is determined directly by INDEX_BIT_COUNT. In this case, our text window will consist of 4,096 bytes, or 1 << 12. Since we have allocated four bits for the length parameter used to encode a phrase, the length field can take sixteen distinct values, so we will be able to encode a length of up to sixteen bytes, or 1 << 4. This is defined as the RAW_LOOK_AHEAD_SIZE.
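As a rough sketch, the four macros described so far could be written as follows. The names and values are the ones given in the text; the exact layout is an assumption, and the program's full listing defines further constants that are not reproduced here.

```c
/* Bit widths of the two fields in an index/length token (values from the text). */
#define INDEX_BIT_COUNT      12
#define LENGTH_BIT_COUNT     4

/* Sizes derived from the bit widths. */
#define WINDOW_SIZE          ( 1 << INDEX_BIT_COUNT )    /* 4,096 bytes */
#define RAW_LOOK_AHEAD_SIZE  ( 1 << LENGTH_BIT_COUNT )   /* 16 bytes    */
```

Deriving the sizes from the bit counts means that changing a field width automatically resizes the window or the look-ahead buffer, with no second constant to keep in step.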
