The Data Compression Book: Statistical Modeling

Adaptive Modeling

It seems logical that as the order of the model increases, compression ratios ought to improve as well. The probability of the letter u appearing in the text of this book may only be 5 percent, for example, but if the previous context character is q, the probability goes up to 95 percent. Predicting characters with high probability lowers the number of bits needed, and larger contexts ought to let us make better predictions.
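To put rough numbers on this, the ideal cost of coding a symbol to which the model assigns probability p is -log2(p) bits. That is the standard information-theoretic measure, brought in here as an assumption rather than taken from the text above. For the two probabilities in the example:

```latex
\begin{align*}
-\log_2 0.05 &\approx 4.32 \text{ bits} \\
-\log_2 0.95 &\approx 0.074 \text{ bits}
\end{align*}
```

Under that measure, knowing that the previous character was q cuts the cost of coding u from more than four bits to less than a tenth of a bit.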

Unfortunately, as the order of the model increases linearly, the memory consumed by the model increases exponentially. With an order 0 model, the space consumed by the statistics could be as small as 256 bytes. Once the order of the model increases to 2 or 3, even the most cleverly designed models will consume hundreds of kilobytes.
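The growth can be sketched with a little arithmetic. The function below is a hypothetical helper, not part of the book's code. It assumes a dense table over a 256-symbol alphabet with one counter for every symbol in every possible context, which is the worst case.

```c
#include <stdint.h>

/* Number of counters in a dense order-n table over a 256-symbol alphabet.
 * Each added context byte multiplies the table size by 256, so the
 * result is 256^(order + 1). Valid for order 0 through 6; higher
 * orders overflow 64 bits. */
uint64_t dense_table_entries(unsigned order)
{
    uint64_t entries = 256;          /* order 0: one counter per symbol */
    for (unsigned i = 0; i < order; i++)
        entries *= 256;              /* one more byte of context */
    return entries;
}
```

With one-byte counters (again an assumption), order 0 needs 256 bytes, order 1 needs 64 KB, and order 2 needs 16 MB. A model that stores only the contexts that actually occur in the data can get by with far less than this dense bound.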

The conventional way of compressing data is to make a pass over the symbols to gather statistics for the model. Then a second pass is made to actually encode the data. The statistics are usually carried with the compressed data so the decoder will have a copy. This approach obviously has serious problems if the statistics for the model take more space than the data to be compressed.

Adaptive compression is the solution to this problem. In adaptive data compression, both the compressor and the decompressor start with the same model. The compressor encodes a symbol using the existing model, then it updates the model to account for the new symbol. The decompressor likewise decodes a symbol using the existing model, then it updates the model. As long as the algorithm to update the model operates identically for the compressor and the decompressor, the process can operate perfectly without needing to pass a statistics table from the compressor to the decompressor.
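The lockstep between the two sides can be illustrated with a toy sketch. It is not the arithmetic coder used in this chapter. As a stand-in for a real entropy coder, it "encodes" each symbol as its rank under the current model (0 for the symbol the model considers most likely). The Model type and every function name here are hypothetical.

```c
#include <stddef.h>

#define ALPHABET 256

/* Hypothetical order-0 model: one frequency count per symbol. */
typedef struct { unsigned count[ALPHABET]; } Model;

void model_init(Model *m)
{
    for (int s = 0; s < ALPHABET; s++)
        m->count[s] = 1;             /* both sides start from this state */
}

void model_update(Model *m, unsigned char s) { m->count[s]++; }

/* Rank of s: how many symbols the model currently ranks ahead of it
 * (higher count, ties broken by lower symbol value). */
static int rank_of(const Model *m, unsigned char s)
{
    int rank = 0;
    for (int t = 0; t < ALPHABET; t++)
        if (m->count[t] > m->count[s] ||
            (m->count[t] == m->count[s] && t < s))
            rank++;
    return rank;
}

/* Inverse of rank_of. A slow linear search, acceptable for a toy. */
static unsigned char symbol_at(const Model *m, int rank)
{
    for (int s = 0; s < ALPHABET; s++)
        if (rank_of(m, (unsigned char) s) == rank)
            return (unsigned char) s;
    return 0;                        /* not reached: ranks are a permutation */
}

void toy_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    Model m;
    model_init(&m);
    for (size_t i = 0; i < n; i++) {
        out[i] = (unsigned char) rank_of(&m, in[i]); /* use existing model */
        model_update(&m, in[i]);                     /* then update it */
    }
}

void toy_decode(const unsigned char *in, size_t n, unsigned char *out)
{
    Model m;
    model_init(&m);
    for (size_t i = 0; i < n; i++) {
        out[i] = symbol_at(&m, in[i]);   /* use existing model */
        model_update(&m, out[i]);        /* identical update */
    }
}
```

Running toy_encode and then toy_decode on the same buffer returns the original bytes, even though no statistics table is ever passed between the two functions. The decoder rebuilds the model one symbol at a time from the data it has already decoded.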

Adaptive data compression has a slight disadvantage in that it starts compressing with less than optimal statistics. By subtracting the cost of transmitting the statistics with the compressed data, however, an adaptive algorithm will usually perform better than a fixed statistical model.

Adaptive compression also pays a price for updating the model. With arithmetic coding, for example, incrementing the count for one symbol may require updating the cumulative counts for all the other symbols as well. Using the modeling techniques needed for arithmetic coding, this leads to code that performs an average of 128 arithmetic operations for every symbol encoded or decoded.
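A minimal sketch of where that cost comes from, assuming a table of cumulative counts in which entry i holds the total count of all symbols below i. The function name and the value of END_OF_STREAM are assumptions made for illustration.

```c
#define END_OF_STREAM 256   /* assumed: 256 byte values plus an end marker */

/* Add one occurrence of symbol to a cumulative-count table and return
 * the number of increments performed. Symbol s owns the range
 * [totals[s], totals[s + 1]), so every entry above s must move up. */
int update_cumulative(int totals[END_OF_STREAM + 2], int symbol)
{
    int ops = 0;
    for (int i = symbol + 1; i <= END_OF_STREAM + 1; i++) {
        totals[i]++;
        ops++;
    }
    return ops;
}
```

A symbol near the start of the alphabet touches almost the whole table, while one near the end touches only a few entries. Averaged over all byte values, roughly half the table is updated for each symbol coded.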

Because of high cost in both memory and CPU time, higher-order adaptive models have only become practical in perhaps the last ten years. It is ironic that as the cost of disk space and memory goes down, the cost of compressing the data stored there also goes down. As these costs continue to decline, we will be able to implement even more effective programs than are practical today.

A Simple Example

The sample program in Chapter 4 used Huffman coding to demonstrate adaptive compression. In this chapter, the sample program will use adaptive arithmetic coding. When performing finite-context modeling, we need a data structure to describe each context used while compressing the data. If we move up from an order-0 model to an order-1 model, for example, we will use the previous symbol as a context for encoding the current symbol.

An array of 256 context arrays is probably the simplest way to create the data structures for an order-1 model. As we saw in the last chapter, a simple context model for an arithmetic encoder can be created using an array of cumulative counts for each symbol. If we have 256 symbols in our alphabet, an array of pointers to 256 different context arrays can be created like this:

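#include <stdlib.h>

/* END_OF_STREAM and fatal_error() are assumed to be defined elsewhere in
 * the program. END_OF_STREAM must not exceed 256, the size of totals. */

int *totals[ 256 ];   /* one cumulative-count array per context */

void initialize_model( void )
{
     int context;
     int i;

     for ( context = 0 ; context < END_OF_STREAM ; context++ ) {
       totals[ context ] = (int *) calloc( END_OF_STREAM + 2,
                       sizeof( int ) );
       if ( totals[ context ] == NULL )
          fatal_error( "Failure allocating context %d", context );
       /* Cumulative counts: entry i is the total count of all symbols
        * below i. Starting every symbol with a count of 1 therefore
        * means setting entry i to i. */
       for ( i = 0 ; i <= ( END_OF_STREAM + 1 ) ; i++ )
          totals[ context ][ i ] = i;
     }
}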
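As a sketch of how such a table might be consulted, the hypothetical helper below looks up the range an arithmetic coder would need for a symbol, given the previous symbol as context. It assumes each context array holds cumulative counts, with entry i equal to the total count of all symbols below i and the last entry equal to the grand total. The SymbolRange type, the function name, and the value of END_OF_STREAM are illustrative assumptions.

```c
#define END_OF_STREAM 256   /* assumed: 256 byte values plus an end marker */

/* Hypothetical description of one symbol's slice of the probability range. */
typedef struct {
    int low_count;    /* cumulative count of all symbols below this one */
    int high_count;   /* low_count plus this symbol's own count */
    int scale;        /* total count for the whole context */
} SymbolRange;

/* Select the context array for the previous symbol, then read the
 * current symbol's range out of its cumulative counts. */
SymbolRange range_in_context(int *totals[], int context, int symbol)
{
    const int *t = totals[context];
    SymbolRange r;
    r.low_count  = t[symbol];
    r.high_count = t[symbol + 1];
    r.scale      = t[END_OF_STREAM + 1];
    return r;
}
```

To code a u that follows a q, for instance, the coder would call range_in_context( totals, 'q', 'u' ) and narrow its interval to the fraction from low_count / scale to high_count / scale.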
