The Data Compression Book-:A Significant Improvement: Adaptive Huffman Coding

A Rescaling Bonus

An interesting side effect comes out of rescaling our tree at periodic intervals. Though we lose accuracy by scaling our counts, testing reveals that rescaling generally results in better compression ratios than if rescaling is postponed. This occurs because data streams frequently have a “decaying recency” effect, or the statistics for recently seen symbols are generally more valid than those accumulated farther back in the data stream. To put it simply, current symbols are more like recent symbols than older symbols.

The rescaling operation tends to discount the effect of older symbols, while increasing the importance of recent symbols. Though difficult to quantify, this seems to have a good effect on compression. Experimenting with various rescaling points will yield different results at differing values, but it doesn’t seem possible to pin down an optimal strategy. There may be an optimal value for rescaling, but it moves around with different types of data streams.

The Code

The sample code for this chapter is a simple order-0 adaptive Huffman compression routine. It is linked with the standard I/O and user interface routines from the previous chapter to create a standalone compression program and a decompression program.

The key to understanding how this sample code operates lies in understanding the data structures in the program. The data structure that describes the tree is shown next.

struct tree {
     int leaf[ SYMBOL_COUNT ];
     int next_free_node;
     struct node {
          unsigned int weight;
          int parent;
          int child_is_leaf;
          int child;
      } nodes[ NODE_TABLE_COUNT ];
} Tree;

Two arrays describe the tree. The tree itself is entirely represented by the nodes[] array. This array is a set of structures with the following elements:

unsigned int weight:	This weight element is the weight of individual node, just as it has been described previously in this chapter.
int parent:	This int is the index of the parent node. The parent node information is necessary both when encoding a symbol, and when updating the model.
int child_is_leaf:	The child of a given node can either be a leaf or a pair of nodes. This flag is used to indicate which it is.
int child:	If the child is a leaf, this int holds the value of the symbol encoded at the leaf. If the child is a pair of nodes, this value is the index to the first node. Because of the sibling property, the two nodes will always be adjacent to one another, so we know the first node will be child, and the second node will be child+1.

As described earlier in the chapter, every node in the tree is kept in a number list. When discussing the list before, we had the nodes with the lowest weight starting at 1 and working up to higher numbers until reaching the root. The implementation in this program is backwards from that, though the same principles apply. The list of nodes is the nodes[] array, with the highest number on the list appearing at nodes[0]. As we work our way down through the lower weights, we go to higher indices in the nodes list.

When the tree is first initialized, nodes[0] is the root node, nodes [1] is set to the end-of-stream symbol, and nodes[2] is set to the escape symbol. The next_free_node element in the tree is then set to 3, and the next time a character is added to the tree, it will be placed in nodes[3].

The leaf[] array in the tree data structure is used to find the leaf node for a particular symbol. To encode a symbol, start at the leaf node and work up to the root node of the tree, accumulating bits on the way (in reverse order). Without a leaf[] array to keep track of the leaf nodes, we would have to do a search through the entire tree every time we wanted to encode a character.

Initialization of the Array

Regardless of whether we perform compression or expansion, we initialize the Huffman tree using the same routine. When performing adaptive compression, it is extremely important to use an identical algorithm for both initialization and updating of the compression model. In this case, we use the same code to ensure that it happens.

The initialization routine, InitializeTree(tree), is the first thing called by both the compression and expansion code. It uses the following code to initialize nodes 0, 1, and 2:

tree->nodes[ ROOT_NODE].child                = ROOT_NODE + 1;
tree->nodes[ ROOT_NODE].child_is_leaf        = FALSE;
tree->nodes[ ROOT_NODE].weight               = 2;
tree->nodes[ ROOT_NODE].parent               = -1;

tree->nodes[ ROOT_NODE + 1 ].child           = END_OF_STREAM;
tree->nodes[ ROOT_NODE + 1 ].child_is_leaf   = TRUE;
tree->nodes[ ROOT_NODE + 1 ].weight          = 1;
tree->nodes[ ROOT_NODE + 1 ].parent          = ROOT_NODE;
tree->leaf[ END_OF_STREAM ]                  = ROOT_NODE + 1;

tree->nodes[ ROOT_NODE + 2 ].child           = ESCAPE;
tree->nodes[ ROOT_NODE + 2 ].child_is_leaf   = TRUE;
tree->nodes[ ROOT_NODE + 2 ].weight          = 1;
tree->nodes[ ROOT_NODE + 2 ].parent          = ROOT_NODE;
tree->leaf[ ESCAPE ]                         = ROOT_NODE + 2;

tree->next_free_node                         = ROOT_NODE + 3;

for ( i = 0 ; i < END_OF_STREAM ; i++ )
    tree->leaf[ i ] = -1;

The initialization of the tree->nodes[] elements is fairly direct. We assign the escape and end-of-stream nodes a weight of 1, which gives the root a weight of 2. The escape and end-of stream elements in the tree->leaf[] array are initialized to point to the appropriate nodes, and the parent and child pointers for each of the three nodes are initialized.

The final details required during initialization are to set up the tree->next_free_node index and to initialize the remaining elements of the tree->leaf[] array. Since none of the leaf[] elements for our conventional symbols have been initialized, they are all set to values of -1. During the encoding process, we will compare the tree->leaf[] value for a given symbol to -1 to see if it has already been defined.

Table of Contents