Chapter 6Statistical Modeling

The previous three chapters have shown several coding techniques used to compress data. The two coding methods disscussed, Huffman and arithmetic coding, can be implemented using either the fixed or adaptive approaches, but in all cases a statistical model needs to drive them. The chapters which discuss these coding techniques all used a simple order-0 model, which provides fairly good compression. This chapter discusses how to combine more sophisticated modeling techniques with standard coding methods to acheive better compression.

Higher-Order Modeling

To compress data using arithmetic or Huffman coding, we need a model of the data stream. The model needs to do two things to achieve compression: (1) it needs to accurately predict the frequency/probability of symbols in the input data stream, and (2) the symbol probabilities generated by the model need to deviate from a uniform distribution.

Accurately predicting the probability of symbols in the input data is an inherent need in arithmetic or Huffman coding. This type of coding reduces the number of bits needed to encode a character as its probability of appearance increases. If the letter E represents 25 percent of the input data, it should take only two bits to code. If the letter Z represents only .1 percent of the input data, it might take ten bits to code. If the model is not generating probabilities accurately, it might take ten bits to code E and two bits to code Z, causing data expansion instead of compression.

The model also needs to make predictions that deviate from a uniform distribution. The better the model is at making these predictions, the better the compression ratios will be. A model could be created, for example, that assigned all 256 possible symbols a uniform probability of 1/256. This model would create an output file exactly the same size as the input file, since every symbol would take exactly eight bits to encode. Only by correctly finding probabilities that deviate from a normal distribution can the number of bits be reduced, leading to compression. The increased probabilities have to accurately reflect reality, of course, as prescribed by the first condition.

It may seem that the probability of a given symbol occurring in a data stream is fixed, but this is not quite true. The probability of a character can change quite a bit, depending on the model. When compressing a C program, for example, the probability of a new-line character in the text might be 1/40, a probability determined by scanning the entire text and dividing the number of occurrences of the character by the total number of characters. But if a modeling technique looks at a single previous character, the probabilities change. In that case, if the previous character were a ‘}’, the probability of a new-line character goes up to 1/2. This improved modeling technique leads to better compression, though both models were generating accurate probabilities.

Finite Context Modeling

The modeling discussed in this chapter is called “finite-context” modeling. It is based on a simple idea; calculate the probabilities for each incoming symbol based on the context in which the symbol appears. In all of the examples shown here, the context consists of nothing more than symbols previously encountered. The “order” of the model refers to the number of previous symbols that make up the context.

The simplest finite-context model would be an order-0 model that calculates the probability of each symbol independently of any previous symbols. To implement this model, all that is needed is a single table containing the frequency counts for each symbol that might be encountered in the input stream. An order-1 model keeps track of 256 different tables of frequencies, since it needs a separate set of counts for each possible context. Similarly, an order-2 model needs to handle 65,536 different tables of contexts.

The models used in chapters 3, 4, and 5 were all order-0 models. They didn’t take up much storage space, and they were simple to manipulate. By confining ourselves to order-0 modeling, however, we ensured that our data-compression ratios were relatively modest.