

With a tree like this, comparing an existing string to the dictionary is simple. It is just a matter of walking through the tree, traversing a single node for every character in the phrase. If the phrase terminates at a particular node, we have a match. If there are more characters in the phrase but we have reached a leaf node, there is no match. After the symbol has been encoded, adding the new phrase to the tree is also simple: just a matter of making room in the descendant list, then inserting a new descendant node at the node last matched.

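To make the walk concrete, here is a minimal sketch of one way the dictionary tree might look in C. It assumes a fixed 256-way descendant array per node; the node layout and the helper names (find_phrase, add_descendant) are illustrative choices for this sketch, not code taken from any particular implementation.

  #include <stdlib.h>

  #define ALPHABET_SIZE 256

  typedef struct tree_node {
      int code;                                        /* dictionary code for the phrase ending here */
      struct tree_node *descendants[ ALPHABET_SIZE ];  /* one slot per possible next symbol          */
  } tree_node;

  /* Walk one node per character; return the node where the phrase ends,
   * or NULL as soon as a character has no matching descendant. */
  tree_node *find_phrase( tree_node *root, const unsigned char *phrase, int length )
  {
      tree_node *node = root;

      for ( int i = 0 ; i < length && node != NULL ; i++ )
          node = node->descendants[ phrase[ i ] ];
      return node;
  }

  /* After the symbol has been encoded, hang a new descendant for the
   * unmatched character off the last node that did match. */
  tree_node *add_descendant( tree_node *last_match, unsigned char character, int new_code )
  {
      tree_node *child = calloc( 1, sizeof( tree_node ) );

      child->code = new_code;
      last_match->descendants[ character ] = child;
      return child;
  }
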
One negative side effect of LZ78 not found in LZ77 is that the decoder has to maintain this tree as well. With LZ77, a dictionary index was just a pointer or index to a previous position in the data stream. But with LZ78, the index is the number of a node in the dictionary tree. The decoder, therefore, has to keep up the tree in exactly the same fashion as the encoder, or a disastrous mismatch will occur.

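As a rough illustration of what keeping up the tree means on the decoding side, the sketch below decodes a single LZ78 token and then performs the same dictionary update the encoder performed. The token layout (an index plus an explicit character, with index 0 meaning an empty prefix) and the helper names are assumptions made for the sake of the example.

  #include <stdio.h>
  #include <string.h>

  #define MAX_PHRASE 1024

  extern const char *phrase_for_code( int index );         /* assumed lookup routine        */
  extern void add_phrase_to_dictionary( const char *p );   /* same routine the encoder uses */

  /* Decode one (index, character) token: emit the phrase, then make exactly
   * the dictionary update the encoder made, keeping the two trees in lockstep. */
  void decode_token( int index, char character, FILE *output )
  {
      char phrase[ MAX_PHRASE ];
      size_t length;

      if ( index == 0 )                     /* index 0: no prefix, just the character */
          phrase[ 0 ] = '\0';
      else
          strcpy( phrase, phrase_for_code( index ) );
      length = strlen( phrase );
      phrase[ length ] = character;         /* append the explicit character          */
      phrase[ length + 1 ] = '\0';
      fputs( phrase, output );              /* emit the decoded text                  */
      add_phrase_to_dictionary( phrase );   /* mirror the encoder's update            */
  }
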
Another issue ignored so far is that of the dictionary filling up. Regardless of how big the dictionary space is, it is going to fill up sooner or later. If we are using a sixteen-bit code, the dictionary will fill up after it has 65,535 phrases defined in it.

There are several alternative choices regarding a full dictionary. Probably the safest default choice is to stop adding new phrases to the dictionary after it is full. This only requires an extra line or two of code in the add_phrase_to_dictionary() routine.

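The extra line or two might look something like this. The names MAX_CODE, next_code, and insert_phrase are illustrative assumptions; only the early return matters.

  #define MAX_CODE 65535                     /* largest code the output token can hold */

  extern void insert_phrase( const char *phrase, unsigned int code );   /* assumed */

  static unsigned int next_code;             /* next unassigned code, set up elsewhere */

  void add_phrase_to_dictionary( const char *phrase )
  {
      if ( next_code > MAX_CODE )            /* dictionary is full: quietly do nothing */
          return;
      insert_phrase( phrase, next_code++ );  /* normal insertion path                  */
  }
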
But just leaving the dictionary alone may not be the best choice. When compressing large streams of data, we may see significant changes in the character of the incoming data. When compressing a program’s binary image (such as an EXE file), for example, we would expect to see a major shift in the statistical model of the data as we move out of the code section of the file and into the data section.

If we keep using our existing phrase dictionary, we may be stuck with an out-of-date dictionary that isn’t compressing very well. At the same time, we have to be careful not to throw away a dictionary that is compressing well.

The UNIX compress program, which uses an LZ78 variant, manages the full dictionary problem by monitoring the compression ratio of the file. If the compression ratio ever starts to deteriorate, the dictionary is deleted and the program starts over from scratch. Otherwise, the existing dictionary continues to be used, though no new phrases are added to it.

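The monitoring itself can be sketched along these lines. This is only a rough illustration of the idea, not the actual compress source; every variable and routine name here is an assumption.

  static long input_bytes;          /* bytes read since the last checkpoint    */
  static long output_bytes;         /* bytes written since the last checkpoint */
  static long best_ratio;           /* best (scaled) ratio seen so far         */

  extern void clear_dictionary( void );   /* assumed: discards the phrase tree */

  /* Called periodically; if the ratio has stopped improving, throw the
   * dictionary away and let the encoder rebuild it from scratch. */
  void check_compression( void )
  {
      long ratio;

      if ( input_bytes == 0 )
          return;
      ratio = ( output_bytes * 100 ) / input_bytes;    /* lower is better        */
      if ( best_ratio == 0 || ratio < best_ratio )
          best_ratio = ratio;                          /* still improving        */
      else if ( ratio > best_ratio ) {                 /* deteriorating          */
          clear_dictionary();                          /* start over             */
          best_ratio = 0;
      }
      input_bytes = output_bytes = 0;                  /* new measurement window */
  }
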
An Effective Variant

As with LZ77, LZ78 was first published in a research journal and discussed in a very technical and abstract fashion. It wasn’t until 1984 that a variant of LZ78 made headway in the programming world. This was when Terry Welch published “A Technique for High-Performance Data Compression” in IEEE Computer.

Work on the UNIX compress program began almost immediately after Terry Welch’s article appeared. The technique Welch described, and the implementation in compress, are referred to as LZW compression.

LZSS improved on LZ77 compression by eliminating the requirement that each token output a phrase and a character. LZW makes the same improvement on LZ78. In fact, under LZW, the compressor never outputs single characters, only phrases.

To do this, the major change in LZW is to preload the phrase dictionary with one single-symbol phrase for every symbol in the alphabet. Thus, every symbol can be encoded immediately, even if it has not yet appeared in the input stream.

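For byte-oriented input, the preloading step is just a loop over the 256 possible byte values. The sketch below assumes a hypothetical define_phrase() routine; a real table would also be indexed by code number rather than by C strings, since a zero byte cannot be stored in a C string.

  #define FIRST_FREE_CODE 256                /* codes 0-255 are the one-symbol phrases */

  extern void define_phrase( unsigned int code, unsigned char symbol );   /* assumed */

  void initialize_dictionary( void )
  {
      for ( unsigned int code = 0 ; code < FIRST_FREE_CODE ; code++ )
          define_phrase( code, (unsigned char) code );   /* phrase = one symbol */
  }
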
The LZW compression algorithm in its simplest form follows. A quick examination of the algorithm shows that LZW always tries to output codes for strings that are already known. And each time a code is output, a new string is added to the string table.

  old_string[ 0 ] = getc( input );                /* prime with the first input symbol   */
  old_string[ 1 ] = '\0';
  while ( !feof( input ) ) {
    character = getc( input );
    strcpy( new_string, old_string );
    strncat( new_string, &character, 1 );         /* candidate phrase = old + character  */
    if ( in_dictionary( new_string ) )
      strcpy( old_string, new_string );           /* known phrase: keep extending it     */
    else {
      code = look_up_dictionary( old_string );
      output_code( code );                        /* emit the longest known phrase       */
      add_to_dictionary( new_string );            /* define the new, longer phrase       */
      old_string[ 0 ] = character;                /* start again from the last character */
      old_string[ 1 ] = '\0';
    }
  }
  code = look_up_dictionary( old_string );        /* flush whatever phrase is left       */
  output_code( code );

A sample string used to demonstrate the algorithm is shown next. The input string is a set of English words from a spelling dictionary, separated by the ‘ ’ character. On the first pass through the loop, a check is performed to see if the string “ W” is in the table. Since it isn’t, the code for ‘ ’ is output, and the string “ W” is added to the table. Since the dictionary already has codes 0-255 defined for the 256 possible character values, the first string definition is assigned to code 256. After the third letter, ‘E’, has been read in, the second string code, “WE”, is added to the table, and the code for the letter ‘W’ is output. In the second word, the characters ‘ ’ and ‘W’ are read in, matching string number 256. Code 256 is then output, and a three-character string is added to the string table. The process continues until the string is exhausted and all codes have been output.

Input String: " WED WE WEE WEB WET"


Characters Input    Code Output    New code value and associated string
“ W”                ‘ ’            256 = “ W”
“E”                 ‘W’            257 = “WE”
“D”                 ‘E’            258 = “ED”
“ ”                 ‘D’            259 = “D ”
“WE”                256            260 = “ WE”
“ ”                 ‘E’            261 = “E ”
“WEE”               260            262 = “ WEE”
“ W”                261            263 = “E W”
“EB”                257            264 = “WEB”
“ ”                 ‘B’            265 = “B ”
“WET”               260            266 = “ WET”
<EOF>               ‘T’

