The Data Compression Book-:Huffman One Better: Arithmetic Coding

Table of Contents

The code for count_bytes scans the entire input file, then returns the input pointer to where it was when called. We assume that the number of counts of a given symbol in the file cannot exceed the span of an unsigned long. If this is not true, other measures will need to be taken to avoid overflow of the counts[] array elements.

After the array has been counted, the counts have to be scaled down. This is done in the scale_counts() routine. The first step here is to scale down the unsigned long counts[] array to fit in an array of unsigned characters. This stores the counts in less space in the output file. The code for this is listed here.

max_count = 0;
for ( i = 0 ; i < 256 ; i++ )
  if (counts[ i ] > max_count )
    max_count = counts[ i ];
scale = max_count / 256;
scale = scale + 1;
for ( i = 0 ; i < 256 ; i++ ) {
    scaled_counts[ i ] = (unsigned char ) ( counts[ i ] /scale );
    if ( scaled_counts[ i ] == 0 && counts[ i ] != 0 )
      scaled_counts[ i ] = 1;
}

After this is complete, one more scaling may need to be done. As part of the limitations on performing arithmetic coding using fixed-length registers, we have to restrict the total of our counts to less than 16,384, or fourteen bits. The second part of scale_counts does this with brute force, checking the total, then dividing by either two or four to get the correct results. An additional count has to be added because of the single count used by the end-of-stream character.

total = 1;
for ( i = 0 ; i < 256 ; i++ )
  total += scaled_counts[ i ];
if ( total > ( 32767 - 256 ) )
  scale = 4;
else if ( total > 16383 )
  scale = 2;
else
  return;
for ( i = 0 ; i < 256 ; i ++ )
   scaled_counts[ i ] /= scale;

There is certainly room in the scale_counts() routine for more sophistication. Every time we lose some of the accuracy in our counts by scaling, we also lose a certain amount of compression. An ideal scaling routine would scale down the counts to fit in the correct range while doing as little scaling as possible. The routines listed here don’t work very hard at doing that.

Once the counts have been scaled down to fit in the unsigned character array, the output_counts() routine is called to save them to the output file. This program employs the same method to store the counts as used in Chapter 3 for the Huffman coding example. Instead of storing the entire array of counts, we only store runs on nonzero values. For details on this, refer to Chapter 3.

The last step in building the model is to set up the cumulative totals array in totals[]. This is the array used when actually performing the arithmetic coding. The code shown below builds that array. Remember that after the totals[] array has been built, the range for symbol x is found in totals[ x ] and totals[ x + 1 ]. The range used for the entire alphabet is found in totals[ END_OF_STREAM +1 ].

totals[ 0 ] = 0;
for ( i = 0 ; i < END_OF_STREAM ; i++ )
  totals[ i + 1 ] = totals[ i ] + scaled_counts[ i ];
totals[ END_OF_STREAM + 1 ] = totals[ END_OF_STREAM ] + 1;

Reading the Model

For expansion, the code needs to build the same model array in totals[] that was used in the compression routine. Since the original file is not available to scan for counts, the program reads in the scaled_counts[] array stored in the compressed file. The code that accomplishes this is identical to the Huffman expansion code in Chapter 3. Refer to Chapter 3 for details on how this code works.

After the scaled_counts[] array has been read in, the same routine used by the compression code can be invoked to build the totals[] array. Calling build_totals() in both the compression and expansion routines helps ensure that we are working with the same array.

Initializing the Encoder

Before compression can begin, we have to initialize the variables that constitute the arithmetic encoder. Three 16-bit variables define the arithmetic encoder: low, high, and underflow_bits. When the encoder first starts to run, the range of the output floating-point number is anywhere between 0 and 1. The low variable is initialized to 0 and the high to 0xFFFF. These two variables have an implied decimal point just ahead of the most significant bit and an infinite trail of ones or zeros. The ones will be shifted into the high variable and the zeros into the low.

low = 0;
high = 0xffff;
underflow_bits = 0;

Table of Contents