The Data Compression Book-:Dictionary-Based Compression

Dictionary Compression: Where It Shows Up

Dictionary-based compression has found more and more homes in the last ten years as both hardware and software improvements make it practical. We can subdivide applications for dictionary-based compression into two areas: general-purpose programs and hardware-specific code.

As shown, dictionary-based coding took over desktop general-purpose compression. In the MS-DOS world, programs such as PKZIP, ARC, ARJ, and LHarc all use dictionary-based algorithms to compress and archive files in a general-purpose manner. Most of these programs have ports to at least one or two other platforms, UNIX being the most popular.

Dictionary-based compression is also used in some special-purpose desktop programs. Most backup programs, for example, use some form of compression to make their operation faster and more efficient. PC Backup, developed by Central Point Software developed (a company later acquired by symantic), uses a dictionary-based algorithm from Stac Electronics, the company that produces the Stacker disk compression utility and which initiated the QIC-122 compression standard.

Compuserve Information Service developed a dictionary-based compression scheme used to encode bit-mapped graphical images. The GIF format uses an LZW variant to compress repeated sequences in screen images. Compression is clearly needed when using these type of images. Computer images take up lots of storage space. As video resolutions improve, the size of the saved images grows dramatically. Compuserve users also typically use modems to upload or download these images. When running on older 2400 baud modems, compressing images becomes even more crucial. Even at 14400 bps and faster speeds, the exploding use of the World-Wide Web on the Internet means increased demand for speedy transfer of graphics, and increased reliance on the GIF format for non-photographic images. (Image-oriented formats will be discussed in later chapters.)

Compressing files before transmitting them saves telecommunications bandwidth. But this requires compatible compression software on both ends. A more convenient method of conserving bandwidth’s to build data compression directly into the modem. Microcom Corp. originally developed this idea, which used Huffman coding to compress data before it was transmitted by its modems. Microcom’s compression algorithm, MNP-5, uses a dynamic Huffman coding scheme that performs well as a general-purpose compressor on most data streams.

In recent years, the international telecommunications industry has widely ratified and adopted a new compression algorithm used by modem manufacturers: V.42bis, a dictionary-based compression scheme which offers better compression ratios than MNP-5. With the adoption of an international standard, modem builders can now implement data compression in their modems and have confidence that they can communicate with modems from other manufacturers.

As mentioned previously, tape drive manufacturers have adopted an industry-standard compression algorithm: QIC-122. QIC-122 is generally implemented on the tape drive itself or on the tape controller using a dedicated microcontroller or the Stac Electronics compression engine chip. Hewlett-Packard has proposed an alternative compression standard known as DCLZ. DCLZ uses an LZ78-type algorithm, which supposedly offers better compression performance than programs based on QIC-122.

The hard disk has been an active battleground for dictionary-based compression. In recent years, the utility programs for archiving and compressing files have been supplemented by programs that work at the device-driver level to transparently compress data stored on disk. The most visible of these utilities is Stacker, from Stac Electronics. There are others, including Disk Doubler on the Macintosh platform. In version 6 of MS-DOS, Microsoft added Stacker-like compression to its operating system, and then was forced to remove it after a well-publicized lawsuit between Microsoft and Stac Electronics. When operating at the device-driver level, these programs add a level of convenience to compression that is hard to beat.

One of the primary difficulties with compressing data on the hard disk is that most of today’s dictionary schemes are adaptive. A general-purpose algorithm would be needed to operate on a disk drive or controller, making a static dictionary difficult to implement. Most adaptive algorithms do not perform very well until they have built up some statistical information, which may take several K of input data.

Unfortunately, disk drives are used in a random-access mode, which means a program can begin reading data at any point in the file. If the file were compressed using a conventional adaptive method, we might have to go back to the start and begin reading there in order to properly decompress the file. This may not be much of a problem in a 16 Kbyte text file, but imagine the performance problems in a 16-Mbyte database!

At present, manufactures of software- and hardware-based disk compression drivers avoid these problems by compressing at the sector level. Since the device driver typically reads in a sector or more at a time, the adaptive algorithm will restart at the begining of each sector boundary. Even better, if the device driver controls the size of the sector, it can be set to a somewhat larger value than might normally be used, giving the adaptive algorithm a chance to improve its compression.

With algorithms such as QIC-122, increasing the sector size past a certain point will not likely improve matters, since the dictionary is only composed of the previous 2K bytes of data. But more powerful compression algorithms that take advantage of older information will frequently want to increase the sector size.

In practice, many users of disk-compression programs that work at the device-driver level find performance to be less of an issue than one might expect. This is due to the fact that, while compression/decompression does consume additional CPU cycles, this effort is compensated for by the reduced amount of data that needs to be transferred between memory and the hard disk. When a mechanical procedure (physically moving the hard disk arm across the platter) competes with an electronic one (decompressing data in a CPU cache), the electronic process usually wins out.

Table of Contents