The Data Compression Book-:LZ78 Compression

Table of Contents

The exception handler takes advantage of the knowledge that this problem can happen only in the special circumstances of CHARACTER+ STRING+CHARACTER+STRING+CHARACTER. Given that, any time an unknown code occurs, it can determine what the unknown code is given knowledge of the previous string from the input.

   old_string[ 0 ] = input_bits();
   old_string[ 1 ] = '\0';
   putc( old_string[ 0 ], output )
   while ( ( new_code = input_bits() ) != EOF ) {
     new_string = dictionary_lookup( new_code );
     if ( new_string == NULL ) {
       strcpy( new_string, old_string );
       append_character_to_string( new_string, new_string[ 0 ] );
     }
     fputs( new_string, output );
     append_character_to_string( old_string, new_string[ 0 ] );
     add_to_dictionary( old_string );
     strcpy( old_string, new_string );
 }

LZW Implementation

The concepts in the compression algorithm are so simple that the whole algorithm can be expressed in a dozen lines. Implementation of this algorithm is somewhat more complicated, mainly due to management of the dictionary. A short example program that uses twelve-bit codes is in LZW12.C, and it will illustrate some of the techniques used here.

Tree Maintenance and Navigation

As in the LZ78 algorithm, the LZW dictionary is maintained as a multiway tree. But in the case of LZW, the way the data is stored doesn’t look much like a tree. A little analysis, however will reveal a multiway tree hidden behind the dictionary data structures.

   struct dictionary {
    int code_value;
    int parent_code;
    char character;
   } dict[ TABLE_SIZE ];

The structure shown in the preceding figure holds the entire dictionary tree. Each element in the data structure represents a single node. The node is defined by three items: (1) Code_value. This number is the actual code for the string that terminates at this node and is what the compression program emits when it wants to encode the string; (2) Parent_code. Under LZ78-style compression, every string in the dictionary has a parent string one character shorter than it. This integer is the code for that parent string; (3) Character. This is the character for this particular node. If the string encoded by the parent of a node were “GREENLEA,” and the character value was “F,” this node would encode “GREENLEAF.”

Something that immediately becomes noticeable as a problem here is that each dictionary node does not have a pointer or pointers to its child nodes. As we navigate the tree, how are we supposed to find the children of each node if there are no pointers to children?

The answer is that this tree maintains the dictionary pointers through a hashed array of nodes. To find the child of a particular node, we apply a hashing function to see where that puts us in the list. The hashing function used in LZW12.C is shown next.

unsigned int find_child_node( parent_code, child_character )
int parent_code;
int child_character;
{
 int index;
 int offset;

 index = ( child_character << ( BITS - 8 ) ) ^ parent_code;
  if ( index == 0 )
   offset = 1;
  else
   offset = TABLE_SIZE - index;
  for ( ; ; ) {
   if ( dict[ index ].code_value == UNUSED )
     return( index );
   if ( dict[ index ].parent_code == parent_code &&
        dict[ index ].character == (char) child_character )
   return( index );
   index -= offset;
   if ( index < 0 )
     index += TABLE_SIZE;
  }
}

This hashing function is essentially the same one used in the UNIX compress program. It combines the numeric values of the parent_code and the child_character to form a sixteen-bit offset into the list of nodes. After finding the target node, it checks for collisions, since that node may be in use by some other element in the tree. Eventually, one of two things happens. Either this function finds a node already defined as belonging to the parent and child, or it finds an empty node that can be used that way.

This hashing function performs fairly well. The collision avoidance mechanism depends on having TABLE_SIZE be a prime number, and performance depends on it being at least 20 percent larger than two raised to the BITS power. In LZW12.C, TABLE_SIZE needs to be larger than 4,096. The number actually used was 5,021.

With the hashing function in place, we can now effectively navigate down through the tree. The data structures used to maintain the dictionary during compression don’t help us move up the tree, but during compression we don’t need to move up the tree, only down.

During decompression, the hashing function is no longer used. Instead, each node in the tree has its parent code and character value stored at the array offset defined by its own code. This allows for quick lookup of dictionary values, which lets us move up the tree quickly. We need to move up the tree during decompression to determine the entire contents of a string, and this different storage method makes this possible. We never need to move down the tree during decompression, so the hashing function is no longer needed.

One additional feature of the dictionary tree used in LZW12.C needs explanation. The first 256 nodes are considered “special” nodes by the program. Each of them represents the one character string that corresponds with its node value. In other words, code 65 will always represent the character “A,” and it will automatically be assumed not to have a parent. These nodes are all predefined when the program is first initialized.

Compression

Armed just with the hashing function and the data structure, the compression program can be written fairly easily. The program goes through a short initialization phase, then sits in an encoding loop reading characters in and sending codes out. Finally, it does a small amount of cleanup work, then exits.

next_code = FIRST CODE;
for ( i = 0 ; i < TABLE_SIZE ; i++ )
  dict[ i ].code_value = UNUSED;
if ( ( string_code = getc( input ) ) == EOF )
  string_code = END_OF_STREAM;
while ( ( character = getc( input ) ) != EOF ) {
  index = find_child_node( string_code, character );
  if ( dict[ index ].code_value != - 1)
    string_code = dict[ index ].code_value;
  else {
    if ( next_code <= MAX_CODE ) {
      dict[ index ].code_value = next_code++;
      dict[ index ].parent-code = string_code;
      dict[ index ].character = (char) character;
    }
    OutputBits( output, string_code, BITS );
    string_code = character;
  }
}
OutputBits( output, string_code, BITS );
OutputBits( output, END_OF_STREAM, BITS );

This routine first initializes the dictionary array. It does this by marking all nodes in the tree as unused. Remember that the first 256 nodes are special and will be considered used automatically.

Table of Contents