Previous Table of Contents Next


The next macro defined, BREAK_EVEN, determines whether it is better to encode a phrase as an index/length pair or as single characters. In this program, the BREAK_EVEN point is determined by adding up the INDEX_BIT_COUNT and the LENGTH_BIT_COUNT plus 1. These add up to seventeen: it take seventeen bits to encode an index/length pair. Because of this, we set our BREAK_EVEN point to one character. This means that in the program, any matching phrase that is one character or fewer will be encoded as single characters instead of as a phrase.

#define INDEX_BIT_COUNT        12
#define LENGTH_BIT_COUNT       4
#define WINDOW_SIZE            ( 1 << INDEX_BIT_COUNT )
#define RAW_LOOK_AHEAD_SIZE    ( 1 << LENGTH_BIT_COUNT )
#define BREAK_EVEN    ( ( 1 + INDEX_BIT_COUNT + LENGTH_BIT_COUNT ) / 9 )
#define LOOK_AHEAD_SIZE        ( RAW_LOOK_AHEAD_SIZE + BREAK_EVEN )
#define TREE_ROOT              WINDOW_SIZE
#define END-OF_STREAM          0
#define UNUSED                 0
#define MOD_WINDOW( a )        ( ( a ) & ( WINDOW_SIZE - 1 ) )

The BREAK_EVEN point adjusts our LOOK_AHEAD_SIZE. Since we aren’t going to code any phrases with lengths 0 or 1, we can adjust our LOOK_AHEAD_SIZE upward by two. So when we want to encode a phrase length, instead of outputting the length, we output the length - BREAK_EVEN - 1. This means that the length numbers 0 through 15 will actually correspond to phrases of length 2 through 17.

The TREE_ROOT macro defines the node that points to the binary tree root. Since TREE_ROOT is defined as index WINDOW_SIZE, it is a special node that actually lies outside the binary tree. Whenever the program searches the binary tree, it looks at the child of the node at TREE_ROOT to find the root of the tree.

The END_OF_STREAM constant defined the special index used to place an end-of-file indicator in the output stream. In this implementation, END_OF_STREAM is set to zero, specifically because the UNUSED node index is also set to zero. Having the UNUSED node index set to zero leads to a slight improvement in the program’s performance. By using static initialization or be creating the tree with calloc(), we will automatically have a three with every node pointer set to UNUSED—which means we don’t have to have a specific initialization step.

The final macro used in the program in MOD_WINDOW. Since input strings can wrap around the end of the tree and head back to the front, we need to perform all of our arithmetic on window indices modulo the tree size. The MOD_WINDOW macro provides a convenient way to do that.

Global Variables

The data structures that hold both the text window and the binary tree are defined as global variables in this program. They could just as easily be dynamically allocated and passed to the encoder and decoder as arguments, but here we cut a few corners by using globals.

The window[ WINDOW_SIZE ] variable holds the last 4,096 characters read in from the input file. The last seventeen of those characters will have been read in from the file but not yet encoded. These constitute the look-ahead buffer.

For comparison purposes, we also consider those 4,096 characters to be 4,096 strings, with each character being the first character in an 17-byte string. The code in this program also universally refers to a string by the index of its first character in the text window. So when a piece of code does a comparison on string ‘r’, it is looking at the seventeen-byte string starting at position ‘r’ in the text window.

unsigned char window[ WINDOW_SIZE ];

struct {
 int parent;
 int smaller_child;
 int larger_child;
} tree[ WINDOW_SIZE + 1 ];

The binary tree in this program is the data structure tree, which consists of an array of unnamed structures. Each tree node has only three elements: a parent index, a smaller_child index, and a larger_child index. Each of these indices are single numbers referring to the string at that position in the text window. An example of how this tree might look after reading in twenty-five characters follows. Remember that position 0 in the text window is used for other purposes, so strings there don’t get added to the tree.


Figure 8.8  The binary tree after reading in 25 characters.

One nice feature of this binary tree is that we know in advance how many total nodes will be in it, so we can allocate the space in advance instead of while building the tree. This saves the code needed for allocating and freeing nodes, and we know that the node for the string at position ‘i’ will be in the tree structure at position ‘i.’

The Compression Code

The compression code follows. Like all previous programs in this book, it is called with pointers to an open input file and to a special BIT_FILE structure that takes the compressed output. The BIT_FILE structure lets us use the bit-oriented I/O routines in BITIO.H. The compression routine breaks down into three different sections of code: an initialization section, the main compression loop, and a termination section.

void CompressFile( FILE *input. BIT_FILE *output, int argc,
                            char *argv[] )
{
  int i;
  int c;
  int look_ahead_bytes;
  int current_position;
  int replace_count;
  int match_length;
  int match_position;

  current_position = 1;
  for ( i = 0 ; i < LOOK_AHEAD_SIZE ; i++ ) {
   if ( ( c = getc( input ) ) == EOF )
     break;
   window[ current_position + i ] = c;
  }
  look_ahead_bytes = i;
  InitTree( current_position );
  match_length = 0;
  while ( look_ahead_bytes > 0 ) {
   if ( match_length > look_ahead_bytes )
     match_length = look_ahead_bytes;
   if ( match_length <= BREAK_EVEN ) {
    replace_count = 1;
    OutputBit( output, 1 );
    OutputBits( output, window[ current_position ], 8 );
  }  else {
    OutputBit( output, 0 );
    OutputBits( output, match_position, INDEX_BIT_COUNT );
    OutputBits( output, match_length - ( BREAK_EVEN + 1 ),
                LENGTH_BIT_COUNT );
    replace_count = match_length;
   }
   for ( i = 0 ; i < replace_count ; i++ ) {
    DeleteString( MOD_WINDOW( current_position + LOOK_AHEAD_SIZE ) );
    if ( ( c = getc( input ) ) == EOF )
     look_ahead_bytes--;
    else
     window[ MOD_WINDOW( current_position + LOOK_AHEAD_SIZE ) ] = c;
    current_position = MOD_WINDOW( current_position + 1 );
    if ( look_ahead_bytes )
     match_length = AddString( current_position, &match_position );
   }
 }
 OutputBit( output, 0 );
 OutputBits( output, END_OF_STREAM, INDEX_BIT_COUNT );
}


Previous Table of Contents Next