

When a context issues an escape symbol, we generally fall back to a lower-order context. In our next sample program, we escape to the escape context, a context that never gets updated. It contains 258 symbols, each of which has a count of 1. This guarantees that any symbol encountered in the message can be encoded by outputting an escape code from the current context and by encoding the symbol using the escape context.
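
One way the escape context might be set up is sketched below. The names used here (ESCAPE_CONTEXT_SYMBOLS, escape_totals) and the cumulative-count layout are illustrative assumptions, not the identifiers used in the actual program; the point is simply that the table holds 258 symbols, each with a count of 1, and is built once and never touched again.

#define ESCAPE_CONTEXT_SYMBOLS 258

unsigned short int escape_totals[ ESCAPE_CONTEXT_SYMBOLS + 1 ];

/*
 * Build the cumulative counts for the escape context.  Every one of
 * the 258 symbols gets a count of 1, so escape_totals[ i ] is simply
 * i.  This table is never updated after it is built.
 */
void initialize_escape_context( void )
{
    int i;

    for ( i = 0 ; i <= ESCAPE_CONTEXT_SYMBOLS ; i++ )
        escape_totals[ i ] = (unsigned short int) i;
}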

How does this affect the example used for the letter u? As it turns out, it makes an enormous difference. The first u, which took eight bits in the previous example, will take about eight bits here as well. The escape code takes almost no bits to encode, and in the escape context the u has a probability of 1/258, which costs roughly eight bits. After that, however, the u is added to the table and given a count of 1, so its next appearance requires only one bit to encode, since it now has a probability of 1/2. By the time sixteen u’s have appeared, the previous model is still spending four bits on each one, while the escape-driven model needs less than a tenth of a bit!
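
These figures are easy to check. The short program below is not part of ARITH-1.C; it simply prints -log2( probability ) for the u as its count grows, under the assumption that the context holds nothing but the escape symbol (count 1) and the u itself.

#include <math.h>
#include <stdio.h>

int main( void )
{
    int count;

    /* Cost of the u in the never-updated escape context */
    printf( "u in the escape context: %5.2f bits\n", -log2( 1.0 / 258.0 ) );

    /* Cost of the next u once it has a count in its own context */
    for ( count = 1 ; count <= 16 ; count *= 2 )
        printf( "u with a count of %2d:    %5.2f bits\n",
                count, -log2( (double) count / ( count + 1 ) ) );
    return 0;
}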

The escape code frees us from burdening our models with characters that may never appear. This lets the model adjust rapidly to changing probabilities and quickly reduces the number of bits needed to encode high-probability symbols.

The encoding process for this particular implementation of a multi-order model requires only a few modifications to the previous program. The convert_int_to_symbol() routine now has to check whether the symbol is present in the given context. If it is not, the routine substitutes the escape code and returns true, telling the main encoding loop to send the symbol a second time from the escape context, as shown:

context = 0;
initialize_model();
initialize_arithmetic_encoder();
for ( ; ; ) {
  c = getc( input );
  if ( c == EOF )
    c = END_OF_STREAM;
  escaped = convert_int_to_symbol( c, context, &s );
  encode_symbol( output, &s );
  if ( escaped ) {
    convert_int_to_symbol( c, ESCAPE, &s );   /* resend the symbol,  */
    encode_symbol( output, &s );              /* this time from the  */
  }                                           /* escape context      */
  if ( c == END_OF_STREAM )
    break;
  update_model( c, context );
  context = c;
}

In the main compression loop shown, the compressor first tries to send the symbol in the current context. If convert_int_to_symbol() returns true, the symbol did not appear in that context, so the loop resends it using the escape context. Note that only the current context’s model is updated with the symbol just sent; the escape model never changes.
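
What that test looks like inside convert_int_to_symbol() is sketched below. The data layout is an assumption made for illustration: each context keeps cumulative counts in totals[ context ][], with the escape code stored at the top of the table, and SYMBOL is the low_count/high_count/scale structure used by the arithmetic coder in the earlier programs. The actual arrangement in ARITH-1.C may differ in its details.

#define END_OF_STREAM   256
#define ESCAPE          257
#define SYMBOL_COUNT    258

typedef struct {
    unsigned short int low_count;
    unsigned short int high_count;
    unsigned short int scale;
} SYMBOL;

unsigned short int totals[ ESCAPE + 1 ][ SYMBOL_COUNT + 1 ];

/*
 * Load the arithmetic coder's SYMBOL structure with the range for
 * character c in the given context.  If c has a count of zero there,
 * load the escape code's range instead and return 1 so the caller
 * knows to send the character again from the escape context.
 */
int convert_int_to_symbol( int c, int context, SYMBOL *s )
{
    s->scale = totals[ context ][ SYMBOL_COUNT ];
    s->low_count = totals[ context ][ c ];
    s->high_count = totals[ context ][ c + 1 ];
    if ( s->low_count == s->high_count ) {
        s->low_count = totals[ context ][ ESCAPE ];
        s->high_count = totals[ context ][ ESCAPE + 1 ];
        return 1;
    }
    return 0;
}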

The decompression loop for this program follows a similar pattern. The code shown next makes either one or two passes through the inner loop, depending on whether an escape code is detected. The complete order-1 context-switching program is on the program diskette that accompanies this book.

context = 0;
initialize_model();
initialize_arithmetic_decoder( input );
for ( ; ; ) {
  last_context = context;
  do {
    get_symbol_scale( context, &s );
    count = get_current_count( &s );
    c = convert_symbol_to_int( count, context, &s );
    remove_symbol_from_stream( input, &s );
    context = c;
  } while ( c == ESCAPE );   /* an escape sends us back for another pass */
  if ( c == END_OF_STREAM )
    break;
  putc( (char) c, output );
  update_model( c, last_context );   /* update the original context only */
  context = c;
}
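
On the decoding side, convert_symbol_to_int() has to walk the same table to find which symbol’s range brackets the target count, and the symbol it finds may turn out to be the escape code itself. The sketch below uses the same assumed totals[][] layout and SYMBOL structure as the encoder sketch shown earlier, so it is illustrative rather than a copy of the routine in ARITH-1.C.

/*
 * Find the symbol in the given context whose cumulative range contains
 * count, then load its range into s for remove_symbol_from_stream().
 * The scale field was already filled in by get_symbol_scale().
 */
int convert_symbol_to_int( int count, int context, SYMBOL *s )
{
    int c;

    for ( c = 0 ; count >= totals[ context ][ c + 1 ] ; c++ )
        ;
    s->low_count = totals[ context ][ c ];
    s->high_count = totals[ context ][ c + 1 ];
    return c;
}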

Improvements

One problem with the method of encoding in ARITH-1.C is the high cost of the operations associated with maintaining the model. Each time we update the count for symbol c, every count in totals[context][] from c up to 256 has to be incremented, an average of 128 increment operations for every character encoded or decoded. For a simple demonstration program like the one shown here, this may not be a major problem, but a production program should be modified to be more efficient.
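
The sketch below shows the update step this paragraph is describing, again under the cumulative-count layout assumed in the earlier sketches, and with the rescaling a real program needs when the counts grow too large left out. Bumping the count of symbol c means incrementing every cumulative total above it, which is where the average of 128 increments per character comes from.

/*
 * Add one to symbol c's count in the given context.  Because the
 * counts are stored cumulatively, every entry above c must be bumped
 * as well.  (Rescaling of oversized counts is omitted here.)
 */
void update_model( int c, int context )
{
    int i;

    for ( i = c + 1 ; i <= SYMBOL_COUNT ; i++ )
        totals[ context ][ i ]++;
}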

One way to reduce the number of increment operations is to move the counts for the most frequently accessed symbols to the top of the array. This requires the model to keep track of each symbol’s position in the totals[context] array, but it reduces the number of increment operations by an order of magnitude. It is a relatively simple enhancement to make to this program. A very good example of a program that uses this technique was published as part of the paper by Ian H. Witten, Radford M. Neal, and John G. Cleary, “Arithmetic Coding for Data Compression,” Communications of the ACM (June 1987). This paper is an excellent source of information on arithmetic coding, with sample C source code illustrating the text.
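
A rough sketch of the idea follows. It is modeled on the CACM code rather than copied from it: the symbols are kept sorted by frequency, two small maps translate between a character and its current slot, and an update swaps the character toward the front before bumping its count, so for common characters only a handful of cumulative totals need to be touched. All of the names and array layouts here are illustrative, and the initialization of the tables is omitted.

#define NO_OF_SYMBOLS 258

int char_to_index[ NO_OF_SYMBOLS ];       /* character -> current slot     */
int index_to_char[ NO_OF_SYMBOLS + 1 ];   /* slot -> character             */
int freq[ NO_OF_SYMBOLS + 1 ];            /* counts for slots 1..N, kept   */
                                          /* in decreasing order           */
int cum_freq[ NO_OF_SYMBOLS + 1 ];        /* cum_freq[ i ] = counts of all */
                                          /* slots above i; cum_freq[ 0 ]  */
                                          /* is the grand total            */

void update_sorted_model( int c )
{
    int i;
    int swapped;

    /* Slide c's slot toward the front, past any slots holding the    */
    /* same count, so the array stays sorted after the increment.     */
    for ( i = char_to_index[ c ] ; i > 1 && freq[ i - 1 ] == freq[ i ] ; i-- )
        ;
    if ( i < char_to_index[ c ] ) {
        swapped = index_to_char[ i ];
        index_to_char[ i ] = c;
        index_to_char[ char_to_index[ c ] ] = swapped;
        char_to_index[ swapped ] = char_to_index[ c ];
        char_to_index[ c ] = i;
    }
    freq[ i ]++;                          /* bump the count, then the few  */
    while ( --i >= 0 )                    /* cumulative totals in front of */
        cum_freq[ i ]++;                  /* the symbol's new slot         */
}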

