

The Finishing Touches: Tables -1 and -2

The final touch to the context tree in ARITH-N.C is the addition of two special tables. The order(-1) table has been discussed previously. It assigns a fixed probability to every symbol in the alphabet, so any symbol not found in one of the higher-order models will always show up here. This is the table of last resort. Since it has to guarantee that it can always provide a code for every symbol in the alphabet, we never update it, and its probabilities never change.
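To make this concrete, the lookup in such a fixed table can reduce to a few lines of arithmetic. The sketch below assumes each of the 256 byte values carries an implicit count of one; the names and structure layout are illustrative stand-ins, not the actual ARITH-N.C definitions.

/* A sketch of a fixed order(-1) table: every one of the 256 byte
 * values carries an implicit count of one, so the range for symbol c
 * is simply [ c, c+1 ) out of a total scale of 256. */
#define SYMBOL_COUNT 256

typedef struct {
    unsigned short int low_count;    /* cumulative count below c     */
    unsigned short int high_count;   /* cumulative count through c   */
    unsigned short int scale;        /* total count in the table     */
} SYMBOL_RANGE;

/* No statistics need to be stored or updated; the range for any
 * symbol is computed directly from its value. */
void order_minus_one_range( int c, SYMBOL_RANGE *s )
{
    s->low_count  = (unsigned short int) c;
    s->high_count = (unsigned short int) ( c + 1 );
    s->scale      = SYMBOL_COUNT;
}

Because the counts never change, the table needs no storage for statistics at all, which is part of what makes it so cheap to keep around as a fallback.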

ARITH-N.C also has a special order(-2) table used for passing control information from the encoder to the decoder. The encoder can pass a -1 to the decoder to indicate end-of-file. Since normal symbols are always defined as unsigned values ranging from 0 to 255, the modeling code recognizes any negative number as a special symbol, one that generates escapes all the way back to the order(-2) table. And because -1 is negative, the modeling code also knows to ignore the symbol when update_model() is called.
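On the decoding side, this arrangement might look roughly like the sketch below. decode_next_symbol() is an assumed stand-in for the real decoding routine; update_model() is the routine named above.

#include <stdio.h>

/* END_OF_STREAM's value of -1 comes from the text; since real data
 * symbols are unsigned values from 0 to 255, a negative value can
 * never collide with them. */
#define END_OF_STREAM ( -1 )

extern int decode_next_symbol( void );
extern void update_model( int symbol );

void expand( FILE *output )
{
    int c;

    while ( ( c = decode_next_symbol() ) != END_OF_STREAM ) {
        putc( c, output );
        update_model( c );  /* negative control codes never get here */
    }
}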

Model Flushing

The creation of the order(-2) model lets us pass a second control code from the encoder to the decoder: the flush code, which tells the decoder to flush statistics out of the model. The compressor issues it when the model’s performance starts to slip. The threshold is adjustable and is set in this implementation to 10 percent; when the compression ratio falls below it, the model is “flushed” by dividing all counts by two. This gives more weight to newer statistics, which should improve the compression.
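The flush itself can be as simple as one pass over each context table. The sketch below uses a simplified stand-in for the statistics structures in ARITH-N.C:

/* A sketch of the flush operation over one context table. */
typedef struct {
    unsigned char symbol;
    unsigned char counts;
} STATS;

typedef struct {
    int max_index;         /* index of the last symbol in the table */
    STATS *stats;
} CONTEXT;

/* Halve every count so that statistics gathered from here on carry
 * twice the weight of everything seen before the flush. */
void flush_table( CONTEXT *table )
{
    int i;

    for ( i = 0 ; i <= table->max_index ; i++ )
        table->stats[ i ].counts /= 2;
}

Note that integer division drops counts of one to zero, so symbols seen only rarely before the flush effectively age out; a complete implementation has to decide whether to purge those entries or leave them in place with a zero count.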

In reality the model should probably be flushed whenever the input symbol stream drastically changes character. If the program is compressing an executable file, for example, the statistics accumulated during the compression of the executable code are probably of no value when compressing the program’s data. Unfortunately, it isn’t easy to define an algorithm that detects a “change in the nature” of the input.

Implementation

Even with the Byzantine data structures used here, the compression and expansion programs built around ARITH-N.C have prodigious memory requirements. When running on DOS machines confined to 640K, these programs have to be limited to order-1, or perhaps order-2 for text with higher redundancy.

To examine compression ratios for higher orders on binary files, there are a couple of choices for these programs. First, they can be built using a DOS Extender, such as Rational Systems/16M. Or they can be built on a machine that has either a larger address space or support for virtual memory, such as Windows 95, VMS, or UNIX. The code distributed here was written in an attempt to be portable across all these options.

Testing shows that with an extra megabyte of extended memory and a DOS Extender, virtually any ASCII file can be compressed on a PC using order-3 compression. Some binary files require more memory. A UNIX system had no problem with order-3 compression and turned in the best performance overall in terms of speed.

Conclusions

Compression-ratio tests show that statistical modeling can perform at least as well as dictionary-based methods. But these programs are at present somewhat impractical because of their high resource requirements. ARITH-N is fairly slow, compressing data at speeds of around 1KB per second and needing huge amounts of memory to use higher-order modeling. As memory becomes cheaper and processors become more powerful, however, schemes such as the ones shown here may become practical. They could be applied today in circumstances where either storage or transmission costs are extremely high.

Order-0 adaptive modeling using arithmetic coding could be useful today in situations requiring extremely low consumption of memory. The compression ratios might not be as good as those gained with sophisticated models, but memory consumption is minimized.

Enhancement

The performance of these algorithms could be improved significantly beyond the implementation discussed here. The first area for improvement would be memory management. Right now, when the programs run out of memory, they abort. A more sensible approach would be to have the programs start with a fixed amount of memory available for statistics. When the statistics fill the space, the program would stop updating the tables and just use what it has. This would mean implementing internal memory-management routines instead of relying on the C run-time library routines.
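One way to structure this is a fixed-size pool allocator, where a NULL return is the caller's cue to freeze the model rather than abort. The names and pool size below are invented for illustration:

#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE ( 512L * 1024L )

static char *pool = NULL;
static size_t pool_used = 0;

int init_pool( void )
{
    pool = malloc( (size_t) POOL_SIZE );
    pool_used = 0;
    return pool != NULL;
}

/* Like malloc(), but drawing from the fixed pool.  A NULL return is
 * the signal to stop growing the statistics tables, not a fatal
 * error.  Sizes are rounded up to keep allocations aligned. */
void *pool_alloc( size_t size )
{
    void *p;

    size = ( size + sizeof( long ) - 1 ) & ~( sizeof( long ) - 1 );
    if ( pool_used + size > POOL_SIZE )
        return NULL;
    p = pool + pool_used;
    pool_used += size;
    return p;
}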

Another potential improvement could come in the tree structure for the context tables. Locating tables through hashing could be quite a bit faster and might require less memory. The context tables themselves could also be improved: when a table has more than 50 percent of its potential symbols defined, an alternative data structure with the counts stored in a linear array could be used. This would allow faster indexing and would reduce memory requirements.
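The sketch below shows what the two representations might look like. The layout is invented for illustration; a real implementation would size the sparse arrays dynamically rather than statically as here.

#include <string.h>

#define ALPHABET_SIZE 256

typedef struct {
    int is_dense;
    union {
        struct {
            int count;                            /* pairs in use    */
            unsigned char symbols[ ALPHABET_SIZE ];
            unsigned char counts[ ALPHABET_SIZE ];
        } sparse;                                 /* linear scan     */
        unsigned char dense_counts[ ALPHABET_SIZE ]; /* index by symbol */
    } u;
} CONTEXT_TABLE;

/* Convert a sparse table to the dense form once more than half the
 * alphabet has been seen.  Build into a temporary first, since the
 * two forms share storage in the union. */
void make_dense( CONTEXT_TABLE *t )
{
    unsigned char dense[ ALPHABET_SIZE ] = { 0 };
    int i;

    for ( i = 0 ; i < t->u.sparse.count ; i++ )
        dense[ t->u.sparse.symbols[ i ] ] = t->u.sparse.counts[ i ];
    memcpy( t->u.dense_counts, dense, ALPHABET_SIZE );
    t->is_dense = 1;
}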

Finally, it might be worth trying ways to adaptively modify the order of the model being used. When compressing with order-3 statistics, early portions of the input text generate a lot of escapes while the statistics tables fill up. It ought to be possible to start encoding using order-0 statistics while still keeping order-3 statistics. As the tables fill up, the order used for encoding could be incremented until it reaches the maximum.
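A crude version of this escalation could be driven by nothing more than a count of symbols processed. MAX_ORDER matches the order-3 example above, but the threshold schedule below is invented purely for illustration:

#define MAX_ORDER 3

static int encoding_order = 0;
static long symbols_seen = 0;

/* Called once per input symbol.  Statistics are always maintained at
 * MAX_ORDER; only the order used for encoding ramps up over time. */
int current_encoding_order( void )
{
    symbols_seen++;
    /* One more order for every 10,000 symbols, up to the maximum. */
    if ( encoding_order < MAX_ORDER &&
         symbols_seen > ( encoding_order + 1 ) * 10000L )
        encoding_order++;
    return encoding_order;
}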

