The Data Compression Book-:Sliding Window Compression

Initialization

The compression loop needs a steady state before it can start. Two things need to be done: first, the look-ahead buffer needs to be loaded; and second, the tree needs to be initialized.

The code to load the look-ahead buffer is shown next. It tries to load up to seventeen bytes into the buffer. After the loading is complete, two local variables are set up so that the main loop can begin executing. First, current_position is set to one. This means that the look-ahead buffer now starts at position 1. Second, look-ahead_bytes is set to the number of bytes left to be encoded in the look-ahead buffer.

current_position = 1;
for ( i = 0 ; i < LOOK_AHEAD_SIZE ; i++ ) {
  if ( ( c = getc( input ) ) == EOF )
    break;
  window[ current_position + i ] = c;
}
look_ahead_bytes = i;

The look-ahead_bytes variable will be set to seventeen most of the time the main loop executes. Usually seventeen characters are left in the look-ahead buffer to encode. Once the program approaches the end of the file, that number will start to drop.

The next step in the initialization program calls InitTree(). InitTree() establishes a root node for the tree. The first node put into the tree will be at the current position, position 1. The code in InitTree() executes a standard insertion algorithm, establishing the child of the root pointer node and setting up the parents and children of position 1.

void InitTree( int r )
{
    tree[ TREE_ROOT ].larger_child = r;
    tree[ r ].parent = TREE_ROOT;
    tree[ r ].larger_child = UNUSED;
    tree[ r ].smaller_child = UNUSED;
}

The final step in the initialization of CompressFile sets up a match_length of one. This forces the encoding loop to output the first character of the look-ahead buffer as a single character instead of as a phrase. It would not be possible at this point even to search for a match to the string at position 1, since it is the only string in the tree.

The Main Loop

The main loop runs as long as characters are left in the look-ahead buffer to encode. It does three things in the loop: (1) It encodes the current phrase in the look-ahead buffer; (2) it reads new characters into the look-ahead buffer while deleting the oldest from the tree; and (3) it inserts the new strings defined by the new characters into the binary tree while the new characters are being loaded into the look-ahead buffer.

while ( look_ahead_bytes > 0 ) {
 if ( match_length > look_ahead_bytes )
  match_length = look_ahead_bytes;
 if ( match_length <= BREAK_EVEN ) {
  replace_count = 1;
  OutputBit( output, 1 );
  OutputBits( output, window[ current_position ], 8 );
 } else {
  OutputBit( output, 0 );
  OutputBits( output, match_position, INDEX_BIT_COUNT );
  OutputBits( output, match_length - ( BREAK_EVEN + 1 ),
              LENGTH_BIT_COUNT );
  replace_count = match_length;
 }
 for ( i = 0 ; i < replace_count ; i++ ) {
  DeleteString( MOD_WINDOW( current_position + LOOK_AHEAD_SIZE ) );
  if ( ( c = getc( input ) ) == EOF )
   look_ahead_bytes--;
  else
   window[ MOD_WINDOW( current_position + LOOK_AHEAD_SIZE ) ] = c;
  current_position = MOD_WINDOW( current_position + 1 );
  if ( look_ahead_bytes )
   match_length = AddString( current_position, &match_position );
 }
}

The main loop assumes that at the top of the loop, the best match length and position are stored in variables match_length and match_position. The AddString() operation normally does this at the bottom of the loop, but the first time through the loop the initialization code set match_length to zero.

Since the match_length is known, the code just has to decide whether to encode the current phrase in the look-ahead buffer as an index/length pair or whether to output a single character. All that is necessary here is a simple comparison against BREAK_EVEN. If the current phrase match length is more than BREAK_EVEN, it makes sense to encode it as a phrase. Otherwise it is encoded as a single character.

The encoding process is straightforward. The output sequence for a solo character is output as a single 1-bit, followed by the eight bits in the character. For an index/position token, the encoder outputs a 0-bit, followed by the twelve-bit position and the four-bit length. The length is encoded as a number from zero to fifteen that corresponds to a length of two to seventeen.

After encoding, the look-ahead buffer has to be loaded with new characters to replace the ones that have been output. If a phrase was encoded, the variable replace_count is set to the length of the phrase, otherwise, replace_count is set to one to indicate that a single character needs to be replace.

The replacement loop code is shown in the following code excerpt.. New characters read into the look-ahead buffer land on top of the oldest phrases in the text window. Accordingly, before each character is read in, the DeleteString() routine deletes the older phrase.

for ( i = 0 ; i < replace_count ; i++ ) {
 DeleteString( MOD_WINDOW( current_position + LOOK_AHEAD_SIZE ) );
 if ( ( c = getc( input ) ) == EOF )
  look_ahead_bytes--;
 else
  window[ MOD_WINDOW( current_position + LOOK_AHEAD_SIZE ) ] = c;
 current_position = MOD_WINDOW( current_position + 1 );
 if ( look_ahead_bytes )
  match_length = AddString( current_position, &match_position );
}

After the new character is read in, the current_position is updated, and the AddString() routine adds a new phrase to the tree. AddString() also returns the position and length of the best match for the inserted string. These variables will then be used at the top of the loop to encode the current phrase in the look-ahead buffer.

The Exit Code

The exit code for the compression routine is very simple to implement in this program. All that needs to be done is to encode the special END_OF_STREAM position code so that the decoder will know that there is no more data to pull out of the compressed stream. Its job is completed, and it can then return.

Table of Contents