The Data Compression Book-:LZ78 Compression

Table of Contents

The sample output for the string is shown with the resulting string table. The string table fills up rapidly, since a new string is added each time a code is output. In this highly redundant input, five code substitutions were output, along with seven characters. If we were using nine-bit codes for output, the nineteen-character input string would be reduced to a 13.5-byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn’t begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.

Decompression

The companion algorithm for compression is the decompression algorithm. It takes the stream of codes output from the compression algorithm and uses them to recreate the exact input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the dictionary to the decompressor. The table can be built exactly as it was during compression, using the input stream as data. This is possible because the compression algorithm always outputs the phrase and character components of a code before it uses it in the output stream, so the compressed data is not burdened with carrying a large dictionary.

    old_string[ 0 ] = input_bits();
    old_string[ 1 ] = '\0';
    putc( old_string[ 0 ], output )
    while ( ( new_code = input_bits() ) != EOF )
       new_string = dictionary_lookup( new_code );
       fputs( new_string, output );
       append_char_to_string( old_string, new_string[ 0 ] );
       add_to_dictionary( old_string );
       strcpy( old_string, new_string );
    }

Preceding is a rough C implementation. Like the compression algorithm, it adds a new string to the string table each time it reads in a new code. In addition, it translates each incoming code into a string and sends it to the output.

Following is the output of the algorithm given the input created by the earlier compression. Note that the string table ends up looking exactly like the table built during compression. The output string is identical to the input string from the compression algorithm. Note also that the first 256 codes are already defined to translate to single-character strings, as in the compression code.

Input Codes: "WED<256>E<260><261><257>B<260>T"


Input/ NEW_CODE	OLD_CODE	STRING/ Output	CHARACTER	New table entry
‘ ’	‘ ’	“ ”
‘W’	‘ ’	“W”	‘W’	256 = “ W”
‘E’	‘W’	“E”	‘E’	257 = “WE”
‘D’	‘E’	“D”	‘D’	258 = “ED”
256	‘D’	“ W”	‘ ’	259 = “D”
‘E’	256	“E”	‘E’	260 = “ WE”
260	‘E’	“ WE”	‘ ’	261 = “E”
261	260	“E “	‘E’	262 = “ WEE”
257	261	“WE”	‘W’	263 = “E W”
‘B’	257	“B”	‘B’	264 = “WEB”
260	‘B’	“ WE”	‘ ’	265 = “B”
‘T’	260	“T”	‘T’	266 = “ WET”

The Catch

Unfortunately, the decompression algorithm shown is just a little too simple. A single exception in the LZW compression algorithm causes some trouble in decompression. Each time the compressor adds a new string to the phrase table, it does so before the entire phrase has actually been output to the file. If for some reason the compressor used that phrase as its next code, the expansion code would have a problem. It would be expected to decode a string that was not yet in its table.

Unfortunately, there is a way this can occur. If there is a phrase already in the table composed of a CHARACTER, STRING pair, and the input stream then sees a sequence of CHARACTER, STRING, CHARACTER, STRING, CHARACTER, the compression algorithm will output a code before the decompressor defines it.

A simple example will illustrate the point. Imagine the string IWOMBAT is defined in the table as code 300. Later, the sequence IWOMBATIWOMBATI occurs in the table. The compression output will look like the following:

Input String: IWOMBAT.......IWOMBATIWOMBATIXXX

**<Problem section>**

Character Input	New code value and associated string	Code Output
...I
WOMBATA	300 = IWOMBAT	288 (IWOMBA)
.	.	.
.	.	.
...I	.	.
WOMBATI	400 = IWOMBATI	300 (IWOMBAT)
WOMBATIX	401 = IWOMBATIX	400 (IWOMBATI)

When the decompression algorithm sees this input stream, it first decodes code 300 and outputs the IWOMBATI string. It will then add the definition for code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table.

Fortunately, this is the only time when the decompression algorithm will encounter an undefined code. Since it is the only time, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code and handles it. In the sample, the decompression routine sees a code of 400. Since 400 is undefined the program goes back to the previous code/string, which was “IWOMBAT”, or code 300. It then appends the first character of the string to the end of the string, yielding “IWOMBATI,” the correct value for code 400. Processing then proceeds as normal.

Table of Contents