The Data Compression Book-:Sliding Window Compression

Table of Contents

For every phrase in the window, a corresponding structure element defines the position that phrase occupies in the tree. Each phrase has a parent and up to two children. Since this is a binary tree, the two child nodes are defined as “smaller” and “larger” children. Every phrase that resides under the smaller_child node must be smaller than the phrase defined by the current node, and every phrase under the larger_ child node must be larger. The terms “larger” and “smaller” refer to where the phrases fall in the collating sequence used by the compression program. In this particular program, one phrase is “larger” or “smaller” in the same sense as that used by the standard library strcmp() function.

The tree used in this program has a couple of unusual features that need to be explained. First, though only WINDOW_SIZE phrases are in the window, we have defined the tree to have WINDOW_SIZE + 1 elements. In the implementation used here, tree[ WINDOW_SIZE ] is the special node used to locate the tree’s root. This element doesn’t have a phrase of its own, as do all other nodes in the tree. It also doesn’t have smaller and greater children like the other nodes. Instead, it has the index of a larger_child only, and this index points to the root node of the tree.

Pointing to the root node in this fashion saves processing time and simplifies the code. When working our way up through the tree during an insertion or deletion, therefore, we don’t have to check to see if a parent node points to the root or to another node. Instead, we can assume that a node’s parent node is always a valid tree element. When deleting node i from the tree, for example, you will typically have a section of code that looks something like this: tree[ tree[ i ].parent ].child = tree[ i ].child. Because the pointer to the root node is stored in the same tree, we don’t have to perform any special checks to see if i is the root node. Even if i is the root node, tree[ i ].parent still points to a valid node in the tree.

The second unusual feature is the use of another node for special purposes. Like the other programs in this book, LZSS uses a special code to indicate when the end of the compressed data is reached. In this case, a window index of zero indicates an end-of-stream condition.

Since index 0 has a special purpose, it can never be used as a valid phrase. So the code to insert a new phrase into the tree automatically returns without even trying to insert the phrase at index 0. Since phrase 0 is not used, we can achieve even more code savings by using node 0 as the special UNUSED index. This becomes useful when writing code to maintain the tree. A typical operation performed when deleting node i from the tree, for example, is to reassign a new parent node to i’s children. Code to perform this might look like what follows.

if ( tree[ i ].smaller_child != UNUSED )
 tree[ tree[ i ].smaller_child ].parent = tree[ i ].parent;
if ( tree[ i ].larger_child != UNUSED )
 tree[ tree[ i ].larger_child ].parent = tree[ i ].parent;

But if the UNUSED index actually points to a legitimate storage area, the test for validity can be bypassed, with the resulting code looking like the following:

tree[ tree[ i ].smaller_child ].parent = tree[ i ].parent;
tree[ tree[ i ].larger_child ].parent = tree[ i ].parent;

If either of the children in this example are UNUSED, no harm is done—the parent node for tree[ 0 ] will merely be modified. Since tree [ 0 ] is never used for any tree navigation, no harm is done, and significant CPU time is saved during tree updates.

A Balancing Act

Saving phrases in a binary tree can simplify the search for the best match. But a binary tree can deteriorate when given data that is ordered in some fashion. In the worst case, a binary tree can turn into nothing more than a linked list. Imagine a file that had the string “ABCDEFGHIJKLMNOP” in it. Since the phrases in that string would have to be added to the tree in order, the structure in Figure 8.7 would evolve.

Figure 8.7 The structure that would evolve from the sequence “ABCDEFGHIJKLMNOP.”

This structure may have a pleasing pattern, but it is not well built for locating strings. Given that data compressed from computer files will frequently have patterns of increasing or decreasing phrases, what can we do to avoid this problem?

Of course we can do a lot to help maintain a balanced tree. Many well-known algorithms are built expressly to keep nicely built trees from turning into cycle-stealing unbalanced lists.

In the case of sliding-window data compression, however, it is relatively safe to ignore the problem. Severely unbalanced trees may develop as data is compressed, but the nature of the sliding window almost mandates that unbalanced situations quickly converge to more balanced states. Since old phrases are pulled out of the tree as rapidly as new ones are put in, the effects of an ordered sequence quickly disappear.

As a result, tree balancing is usually not built into sliding-window programs. Probably the only time it would be considered would be in a production version of a compression program that was under severe constraints in terms of CPU cost allowed per byte compressed.

Table of Contents