The Data Compression Book: The Dawn Age: Minimum Redundancy Coding


Note, however, that the Huffman codes differ in length from Shannon-Fano codes. The code length for A is only a single bit, instead of two, and the B and C symbols have 3-bit codes instead of two bits. The following table shows what effect this has on the total number of bits produced by the message.


Symbol	Count	Shannon-Fano Size (bits)	Shannon-Fano Total Bits	Huffman Size (bits)	Huffman Total Bits
A	15	2	30	1	15
B	7	2	14	3	21
C	6	2	12	3	18
D	6	3	18	3	18
E	5	3	15	3	15

This adjustment in code size adds 13 bits to the number needed to encode the B and C symbols, but it saves 15 bits when coding the A symbol, for a net savings of 2 bits. Thus, for a message with an information content of 85.25 bits, Shannon-Fano coding requires 89 bits, but Huffman coding requires only 87.
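The totals quoted above can be checked mechanically. The following is a minimal sketch, not code from the book's programs: the struct, the table array, and the function name total_bits() are hypothetical names introduced only for this illustration, and the data is copied from the table above.

```c
#include <stddef.h>

/* One row of the table: symbol count and code length under each scheme. */
struct code_row {
    char     symbol;
    unsigned count;
    unsigned shannon_fano_len;  /* bits per symbol */
    unsigned huffman_len;       /* bits per symbol */
};

/* Values copied from the table in the text. */
static const struct code_row TABLE[] = {
    { 'A', 15, 2, 1 },
    { 'B',  7, 2, 3 },
    { 'C',  6, 2, 3 },
    { 'D',  6, 3, 3 },
    { 'E',  5, 3, 3 },
};

#define TABLE_ROWS (sizeof TABLE / sizeof TABLE[0])

/*
 * Total message size in bits: the sum of count * code length over all rows.
 * Pass 0 for the Shannon-Fano lengths, nonzero for the Huffman lengths.
 */
unsigned total_bits(int use_huffman)
{
    unsigned total = 0;

    for (size_t i = 0; i < TABLE_ROWS; i++) {
        unsigned len = use_huffman ? TABLE[i].huffman_len
                                   : TABLE[i].shannon_fano_len;
        total += TABLE[i].count * len;
    }
    return total;
}
```

Calling total_bits( 0 ) sums the Shannon-Fano column and total_bits( 1 ) sums the Huffman column, so the two results differ by the 2-bit net savings described above.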

In general, Shannon-Fano and Huffman coding are close in performance. But Huffman coding will always at least equal the efficiency of Shannon-Fano coding, so it has become the predominant coding method of its type. Since both algorithms take a similar amount of processing power, it seems sensible to take the one that gives slightly better performance. Huffman was also able to prove that his method cannot be improved on by any other coding scheme that assigns each symbol a code with an integral number of bits.

Since D. A. Huffman first published his 1952 paper, “A Method for the Construction of Minimum Redundancy Codes,” his coding algorithm has been the subject of an overwhelming amount of additional research. Information theory journals to this day carry numerous papers on the implementation of various esoteric flavors of Huffman codes, searching for ever better ways to use this coding method. Huffman coding is used in commercial compression programs, FAX machines, and even the JPEG algorithm. The next logical step in this book is to outline the C code needed to implement the Huffman coding scheme.

Huffman in C

A Huffman coding tree is built as a binary tree, from the leaf nodes up. Huffman may or may not have had digital computers in mind when he developed his code, but programmers use the tree data structure all the time.
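To make the "leaf nodes up" construction concrete, here is a minimal sketch that computes only the code lengths. It is not the tree-building code used by the programs in this chapter: the node layout, the MAX_SYMBOLS limit, and the function name huffman_code_lengths() are assumptions made for this illustration, and it uses a simple linear scan to find the two lightest nodes at each step.

```c
#define MAX_SYMBOLS 256

/* Leaves occupy slots 0..n-1; internal nodes are appended after them. */
struct node {
    unsigned long weight;
    int parent;              /* index of the parent node, or -1 for a root */
};

/*
 * Compute Huffman code lengths for n symbols (2 <= n <= MAX_SYMBOLS).
 * counts[i] is the frequency of symbol i; lengths[i] receives its code
 * length in bits. Returns 0 on success, -1 if n is out of range.
 */
int huffman_code_lengths(const unsigned long counts[], int n, int lengths[])
{
    struct node nodes[2 * MAX_SYMBOLS - 1];
    int used = n;

    if (n < 2 || n > MAX_SYMBOLS)
        return -1;

    for (int i = 0; i < n; i++) {
        nodes[i].weight = counts[i];
        nodes[i].parent = -1;
    }

    /* n leaves need n - 1 merges to end up under a single root. */
    for (int merge = 0; merge < n - 1; merge++) {
        int min1 = -1, min2 = -1;   /* lightest and second-lightest roots */

        for (int i = 0; i < used; i++) {
            if (nodes[i].parent != -1)
                continue;           /* already merged into a subtree */
            if (min1 == -1 || nodes[i].weight < nodes[min1].weight) {
                min2 = min1;
                min1 = i;
            } else if (min2 == -1 || nodes[i].weight < nodes[min2].weight) {
                min2 = i;
            }
        }

        /* Join the two lightest roots under a new internal node. */
        nodes[used].weight = nodes[min1].weight + nodes[min2].weight;
        nodes[used].parent = -1;
        nodes[min1].parent = used;
        nodes[min2].parent = used;
        used++;
    }

    /* A symbol's code length is the depth of its leaf below the root. */
    for (int i = 0; i < n; i++) {
        int depth = 0;
        for (int j = i; nodes[j].parent != -1; j = nodes[j].parent)
            depth++;
        lengths[i] = depth;
    }
    return 0;
}
```

Given the counts 15, 7, 6, 6, and 5 from the earlier table, this sketch assigns a 1-bit code to A and 3-bit codes to B through E, matching the Huffman column. A production encoder would normally replace the linear scan with a priority queue or a sorted list.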

Two programs used here illustrate Huffman coding. The compressor, HUFF-C, implements a simple order-0 model and a single Huffman tree to encode it. HUFF-E expands files compressed using HUFF-C. Both programs use a few pieces of utility code that will be seen throughout this book. Before we go on to the actual Huffman code, here is a quick overview of what some of the utility modules do.

BITIO.C

Data-compression programs perform a great deal of input/output (I/O) that reads or writes unconventional numbers of bits. Huffman coding, for example, reads and writes bits one at a time. LZW programs read and write codes that can range in size from 9 to 16 bits. The standard C I/O library defined in STDIO.H only accommodates I/O on byte boundaries. Routines like putc() and getc() read and write single bytes, while fread() and fwrite() read and write whole blocks of bytes at a time. The library offers no help for programmers needing a routine to write a single bit at a time.

To support this unconventional I/O in a conventional way, bit-oriented I/O routines are confined to a single source module, BITIO.C. Access to these routines is provided via a header file called BITIO.H, which contains a structure definition and several function prototypes.

Two routines open files for bit I/O, one for input and one for output. As defined in BITIO.H, they are

BIT_FILE *OpenInputBitFile( char *name );
BIT_FILE *OpenOutputBitFile( char *name );

These two routines return a pointer to a new structure, BIT_FILE. BIT_FILE is also defined in BITIO.H as shown:

typedef struct bit_file {
     FILE *file;
     unsigned char mask;
     int rack;
     int pacifier_counter;
} BIT_FILE;

Both OpenInputBitFile() and OpenOutputBitFile() perform a conventional fopen() call and store the returned FILE structure pointer in the BIT_FILE structure. The remaining structure elements are initialized to their startup values, and a pointer to the resulting BIT_FILE structure is returned.
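As a rough illustration of that description, an open routine might look like the sketch below. This is not the BITIO.C source: the sketch_ names are hypothetical, and the startup values (an empty rack and a mask of 0x80, meaning bits are handled most significant bit first) are assumptions made for the example.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct bit_file {
    FILE *file;
    unsigned char mask;
    int rack;
    int pacifier_counter;
} BIT_FILE;

/* Hypothetical helper shared by both directions; mode is "rb" or "wb". */
static BIT_FILE *sketch_open_bit_file(const char *name, const char *mode)
{
    BIT_FILE *bit_file = calloc(1, sizeof *bit_file);

    if (bit_file == NULL)
        return NULL;

    /* The conventional fopen() call described in the text. */
    bit_file->file = fopen(name, mode);
    if (bit_file->file == NULL) {
        free(bit_file);
        return NULL;
    }

    /* Assumed startup values: empty rack, mask on the top bit. */
    bit_file->rack = 0;
    bit_file->mask = 0x80;
    bit_file->pacifier_counter = 0;
    return bit_file;
}

BIT_FILE *sketch_open_input(const char *name)
{
    return sketch_open_bit_file(name, "rb");
}

BIT_FILE *sketch_open_output(const char *name)
{
    return sketch_open_bit_file(name, "wb");
}
```

The files are opened in binary mode so that no newline translation disturbs the bit stream. Returning NULL when fopen() fails is a design choice of this sketch; it lets the caller test a single pointer.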

In the BIT_FILE structure defined in BITIO.H, rack contains the current byte of data either read in from the file or waiting to be written out to the file. mask contains a single-bit mask used either to set or clear the current output bit or to mask in the current input bit.
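The interplay of rack and mask is easiest to see in code. The following is a minimal sketch of single-bit output and input, assuming that mask starts at 0x80 and moves toward the least significant bit; the function names sketch_output_bit() and sketch_input_bit() are hypothetical and are not the routines declared in BITIO.H.

```c
#include <stdio.h>

typedef struct bit_file {
    FILE *file;
    unsigned char mask;
    int rack;
    int pacifier_counter;
} BIT_FILE;

/*
 * Append one bit to the rack. When the mask has walked off the low end,
 * the rack holds a full byte, so it is written out and both are reset.
 * Returns 0 on success, -1 if the byte could not be written.
 */
int sketch_output_bit(BIT_FILE *bf, int bit)
{
    if (bit)
        bf->rack |= bf->mask;       /* set the current output bit */
    bf->mask >>= 1;
    if (bf->mask == 0) {
        if (putc(bf->rack, bf->file) == EOF)
            return -1;
        bf->rack = 0;
        bf->mask = 0x80;
    }
    return 0;
}

/*
 * Read one bit. A fresh byte is loaded into the rack whenever the mask
 * is back at the top bit. Returns 0 or 1, or -1 at end of file.
 */
int sketch_input_bit(BIT_FILE *bf)
{
    int bit;

    if (bf->mask == 0x80) {
        bf->rack = getc(bf->file);
        if (bf->rack == EOF)
            return -1;
    }
    bit = (bf->rack & bf->mask) ? 1 : 0;   /* mask in the current bit */
    bf->mask >>= 1;
    if (bf->mask == 0)
        bf->mask = 0x80;
    return bit;
}
```

Note that in this sketch a partly filled rack is never written on its own. A matching close routine would need to flush it whenever mask is not 0x80, padding the unused low bits with zeros.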
