Data Structures

All improvements to the basic statistical modeling assume that higher-order modeling can actually be accomplished on the target computer. The problem with increasing the order is one of memory. The cumulative totals table in the order-0 model in Chapter 5 occupied 516 bytes of memory. If we used the same data structures for an order-1 model, the memory used would shoot up to 133K, which is still probably acceptable. But going to order-2 will increase the RAM requirements for the statistics unit to thirty-four megabytes! Since we would like to try orders even higher than 2, we need to redesign the data structures that hold the counts.

To save memory space, we have to redesign the context statistics tables. In Chapter 5, the table is about as simple as it can be, with each symbol being used as an index directly into a pair of counts. In the order-1 model, the appropriate context tables would be found by indexing once into an array of context tables, then indexing again to the symbol in question, a procedure like that shown here:

```low_count = totals[ last_char ][ current_char ];
high_count = totals[ last_char ][ current_char + 1 ];
range = totals[ last_char ][ 256 ];
```

This is convenient, but enormously wasteful. Full context space is allocated even for unused tables, and within the tables space is allocated for all symbols, seen or not. Both factors waste enormous amounts of memory in higher-order models.

The solution to the first problem, reserving space for unused contexts, is to organize the context tables as a tree. Place the order-0 context table at a known location and use it to contain pointers to order-1 context tables. The order-1 context tables will hold their own statistics and pointers to order-2 context tables. This continues until reaching the “leaves” of the context tree, which contain order_n tables but no pointers to higher orders. Using a tree structure can keep the unused pointer nodes set to null pointers until a context is seen. Once the context is seen, a table is created and added to the parent node of the tree.

The second problem is creating a table of 256 counts every time a new context is created. In reality, the highest-order contexts will frequently have only a few symbols, so we can save a lot of space by only allocating space for symbols seen for a particular context.

After implementing these changes, we have a set of data structures that look like this:

```     typedef struct {
unsigned char symbol;
unsigned char counts;
} STATS;

typedef struct {
struct context *next;

typedef struct context {
int max_index;
STATS *stats;
struct context *lesser_context;
} CONTEXT;
```

The new context structure has four major elements. The first is the counter, max_index, which tells how many symbols are presently defined for this particular context table. When a table is first created, it has no defined symbols, and max_index is -1. A completely defined table will have a max_index of 255. The max_index variable tells how many elements are allocated for the arrays pointed to by stats and links. Stats is an array of structures, each containing a symbol and a count for that symbol. If the context table is not one of the highest-order tables, it will also have a links array. Each symbol defined in the stats array will have a pointer to the next higher-order context table in the links table.

A sample of the context table tree is shown in figure 6.1. The table shown is one that will have just been created after the input text “ABCABDABE” when keeping maximum order-3 statistics. Just nine input symbols have already generated a fairly complicated data structure, but it is orders of magnitude smaller than one consisting of arrays of arrays. Figure 6.1  A context table tree: “ABCABDABE.”

One element in this structure that hasn’t been explained is the lesser_context pointer. This pointer is needed when using higher-order models. If the modeling code is trying to locate an order-3 context table, it first has to scan through the order-0 symbol list looking for the first symbol, the match, the order-1 symbol list, and so on. If the symbol lists in the lower orders are relatively full, this can be a lengthy process. Even worse, every time an escape is generated, the process has to be repeated when looking up the lower-order context. These searches can consume an inordinate amount of CPU time.

The solution to this is to maintain a pointer for each table that points to the table for the context one order less. The context table ABC should have its back pointer point to BC, for example, which should have a back pointer to C, which should have a pointer to “”, the null table. Then the modeling code only needs to keep a pointer to the current highest order context. Given that, finding the order-1 context table is simply a matter of performing (n-1) pointer operations.

With the table shown in Figure 6.1, for example, suppose the next incoming text symbol is X and the current context is ABE. Without the benefit of the lesser context pointers, I have to check the order-3, 2, 1, and 0 tables for X. This takes 15 symbol comparisons and three table lookups. Using reverse pointers eliminates all the symbols comparisons and performs just three table lookups.

To update the figure 6.1 context tree to contain an X after ABE, the modeling code has to perform a single set of lookups for each order/context. This code is shown in ARITH-N.C in the add_character_to_model() routine. Every time a new table is created, it needs to have its back pointer created properly at the same time, which requires a certain amount of care in the design of update procedures.