Previous Table of Contents Next


AddString()

The bulk of the work done by the compression routine takes place in AddString(). This routine does two jobs. First, it adds a new string to the binary tree. Second, it tracks the string currently in the tree that best matches the one being inserted.

The process of locating the node for inspection of the new string uses standard techniques for traversing a binary tree. AddString first checks to see if the new string is the END_OF_STREAM node. If it is, it shouldn’t be inserted into the tree, so it takes immediate return with a match_length of zero. This forces the encoder to output a single character instead of trying to encode a phrase at index 0.

After checking for a bad node, the test_node and initial match_length are set up. Throughout the main loop, test_node will point to the current node that will be compared to the new_node. The match_length variable will contain the current longest match found during traversal of the tree.

int AddString( int new_node, int *match_position )
{
  int i;
  int test_node;
  int delta;
  int match_length;
  int *child;

  if ( new_node == END_OF_STREAM )
   return( 0 );
  test_node = tree[ TREE_ROOT ].larger_child;
  match_length = 0;
  for ( ; ; ) {
    for ( i = 0 ; i < LOOK_AHEAD_SIZE ; i++ ) {
     delta = window[ MOD_WINDOW( new_node + i ) ] -
       window[ MOD_WINDOW( test_node + i ) ];
     if ( delta != 0 )
      break;
  }
  if ( i >= match_length ){
    match_length = i;
    *match_position = test_node;
       if ( match_length >= LOOK_AHEAD_SIZE ) {
         ReplaceNode( test_node, new_node );
         return( match_length );
       }
    }
    if ( delta >= 0 )
      child = &tree[ test_node ].larger_child;
    else
      child = &tree[ test_node ].smaller_child;
    if ( *child == UNUSED ) {
      *child = new_node;
      tree[ new_node ].parent = test_node;
      tree[ new_node ].larger_child = UNUSED;
      tree[ new_node ].smaller_child = UNUSED;
      return( match_length );

    }
    test_node = *child;
  }
}

At this point, the main comparison loop is entered. The first section executes a comparison of the test_node and the new_node. Two pieces of information are available after falling out of the loop. The first is the value of delta. The delta variable will be less than one if the string at new_node is less than the test_node, zero if they are the same, and one if the new_node is greater. The second, found in the loop variable i, tells how many characters in the two strings were identical, or the match_length for a particular string.

for ( i = 0 ; i < LOOK_AHEAD_SIZE ; i++ ) {
  delta = window[ MOD_WINDOW( new_node + i ) ] -
          window[ MOD_WINDOW( test_node + i ) ];
  if ( delta != 0 )
    break;
}

After the comparison code completes, the main loop tests whether the match for this phrase is the longest one recorded so far. If it is, the match_length variable is updated, and the test_node position is saved.

if ( i >= match_length ) {
  match_length = i;
  *match_position = test_node;
  if ( match_length >= LOOK_AHEAD_SIZE ) {
    ReplaceNode( test_node, new_node );
    return( match_length );
  }
}

Frequently, the phrase in the look-ahead buffer is an exact match for the test_node. When this is the case, two things happen. First, since the longest match is found, the code will exit the AddString() routine. But before exiting, it performs a node replacement by deleting the test_node and replacing it with the new_node. It could just add new_node to the binary tree, but there is really no point to it, test_node will be redundant data taking up time in the search path if it just uses the normal insertion code. Instead, a special routine replaces test_node with new_node and returns. This leaves a sparser tree that can be searched more quickly. And, since test_node would have been deleted before new_node, it doesn’t sacrifice any compression by doing this.

The final section of the main test loop is the tree navigation step. The delta variable tells whether to follow the larger_child or smaller_child branches from the test_node. If the child we are supposed to follow is UNUSED, we have gone as far as we can in the tree. At this point, the code inserts new_node into the binary tree at the correct child and returns. Otherwise, it moves to the new test_node and goes back to the start of the test loop.

if ( delta >= 0 )
  child = &tree[ test_node ].larger_child;
else
  child = &tree[ test_node ].smaller_child;
if ( *child == UNUSED ) {
  *child = new_node;
  tree[ new_node ].parent = test_node;
  tree[ new_node ].larger_child = UNUSED;
  tree[ new_node ].smaller_child = UNUSED;
  return( match_length );
}
test_node = *child;


Previous Table of Contents Next