Glossary

Adaptive compression, Adaptive modeling
Data compression techniques that use a model can either use a fixed model for the entire stream being processed or modify the model as the stream is processed. Techniques that modify the model on the fly are said to use adaptive modeling. LZW compression is an example of an adaptive technique.
ADPCM
Adaptive Differential Pulse Code Modulation. Standard PCM encoding is a common technique for encoding audio data; telephone conversations and audio CDs both use conventional PCM. PCM samples a waveform at uniform steps and encodes the level of each sample. DPCM, Differential Pulse Code Modulation, does not encode the level itself; instead, it encodes the difference from the previous sample. ADPCM takes this a step further, adapting the coding of the difference to the state of the waveform. PCM encoding in telephone systems uses 64 Kbps. ADPCM can reduce that rate to 32 or even 16 Kbps with relatively little reduction in voice quality.
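The differencing step at the heart of DPCM is simple enough to sketch in C. The sample values below are invented, and the quantizer and adaptive step-size logic that make real ADPCM adaptive are omitted.

    /* A minimal sketch of plain DPCM: each sample is encoded as the
       difference from the previous sample, and decoding accumulates
       the differences back into levels. */
    #include <stdio.h>

    void dpcm_encode(const int *samples, int *deltas, int n)
    {
        int prev = 0;
        for (int i = 0; i < n; i++) {
            deltas[i] = samples[i] - prev;  /* store the difference */
            prev = samples[i];
        }
    }

    void dpcm_decode(const int *deltas, int *samples, int n)
    {
        int prev = 0;
        for (int i = 0; i < n; i++) {
            prev += deltas[i];              /* accumulate differences */
            samples[i] = prev;
        }
    }

    int main(void)
    {
        int in[] = { 100, 104, 110, 111, 108 };  /* invented samples */
        int deltas[5], out[5];
        dpcm_encode(in, deltas, 5);
        dpcm_decode(deltas, out, 5);
        for (int i = 0; i < 5; i++)
            printf("%d -> delta %+d -> %d\n", in[i], deltas[i], out[i]);
        return 0;
    }

The payoff is that the deltas cluster near zero, so they can be coded in fewer bits than the raw levels.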
Alphabet
An Alphabet is the set of all of the possible symbols that may appear in a message. For example, when compressing ASCII text files, the Alphabet would probably consist of characters 0x00 through 0x7f.
Archive
An archive is a volume or file containing one or more files that may or may not have been compressed. An archive is typically used as a convenient way to store or transport files. Programs such as ARC and PKZip compress files before placing them into archives.
Arithmetic coding
Traditional coding techniques such as ASCII or Huffman coding encode each symbol as a unique pattern of bits. Arithmetic coding instead takes an entire text and encodes it as a single floating point number less than 1 and greater than or equal to 0. Arithmetic coding can encode texts more efficiently because it eliminates the quantization effects of other techniques, which must round every symbol's code up to a whole number of bits.
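The interval-narrowing idea can be shown with a toy sketch in C. The three-symbol model and its probabilities are invented for illustration; a production coder would use scaled integer arithmetic rather than doubles, which run out of precision after a few dozen symbols.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Hypothetical model: P(a)=0.5, P(b)=0.3, P(c)=0.2, giving the
           cumulative ranges a=[0.0,0.5), b=[0.5,0.8), c=[0.8,1.0). */
        const char   alpha[] = "abc";
        const double lo_r[]  = { 0.0, 0.5, 0.8 };
        const double hi_r[]  = { 0.5, 0.8, 1.0 };

        const char *message = "bac";
        double low = 0.0, high = 1.0;

        for (const char *p = message; *p; p++) {
            int i = (int)(strchr(alpha, *p) - alpha);
            double range = high - low;
            high = low + range * hi_r[i];   /* narrow the interval to  */
            low  = low + range * lo_r[i];   /* this symbol's sub-range */
        }
        /* Any number in [low, high) now encodes the whole message. */
        printf("\"%s\" encodes to the interval [%f, %f)\n",
               message, low, high);
        return 0;
    }

Each symbol shrinks the interval in proportion to its probability, so likely symbols cost little precision and unlikely ones cost more, which is exactly the fractional-bit allocation Huffman coding cannot achieve.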
ARC, MS-DOS program
ARC is a commercial archiving program created by System Enhancement Associates, of Wayne, N.J. ARC was one of the earliest compression/archive utilities to achieve wide popularity in the desktop computing world, beginning in the mid-1980s.
ARJ, MS-DOS program
ARJ is a commercial archiving program created by Robert Jung. ARJ is free of charge for individual use, but commercial users must pay a license fee. ARJ is also supplied with ANSI C source code, distributable without restriction, for extracting files from ARJ archives.
Block Coding
Compression of images is frequently done by coding smaller blocks of the image independently of one another. For example, the JPEG algorithm uses an 8-by-8 block size when compressing graphics.
CCITT
CCITT is the International Telegraph and Telephone Consultative Committee. This standards organization is responsible for the sanctioning of many compression and transmission methods in use today, including several PCM and ADPCM techniques, FAX transmission, and the evolving JPEG and MPEG standards.
Codes, (en)coding
Symbols that are to be stored or manipulated by a computer are converted to codes. This process is referred to as coding. ASCII and EBCDIC are two of the most common methods of coding written text. Data compression can occur if more efficient methods of coding, such as Huffman coding, are used.
COMPACT, UNIX program
COMPACT is a UNIX program that uses Dynamic Huffman coding to compress files. It has generally fallen out of use in favor of the COMPRESS program.
COMPRESS, UNIX program
COMPRESS is a UNIX program that uses an LZW implementation to compress files. COMPRESS has found widespread use in the UNIX community and is available in the public domain. It has recently been suggested that COMPRESS may infringe on a Unisys patent covering LZW, which may curtail its use and distribution.
Compression ratios
Compression ratios describe the difference between a file and a compressed copy of itself. There are several ways of expressing this number. One common method is a ratio between input and output, as in "a 4:1 compression ratio." Another popular method is to express the difference between the files as a percentage ranging from 0% to 100% (or greater, if the compression failed to actually reduce the size of the file). Some people invert this scale, using 100% as the "best" compression ratio. Occasionally, you still see the ratio of compressed output to original input expressed as bits per byte.
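A short sketch, using invented file sizes, shows the same measurement expressed all three ways:

    #include <stdio.h>

    int main(void)
    {
        long original   = 100000;   /* bytes in (invented)  */
        long compressed = 25000;    /* bytes out (invented) */

        printf("ratio:          %.1f:1\n",
               (double)original / compressed);
        printf("space saved:    %.0f%%\n",
               100.0 * (1.0 - (double)compressed / original));
        printf("bits per byte:  %.2f\n",
               8.0 * (double)compressed / original);
        return 0;
    }

For a 100,000-byte file compressed to 25,000 bytes, this prints a 4.0:1 ratio, a 75% space saving, and 2.00 bits per byte: three names for the same result.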
CRC, Cyclical Redundancy Check
A CRC is a number generated by applying a formula to a block of data, generally for use as a checksum. A good CRC formula should generate a different number for as many different error conditions as possible. The CRC formula referred to in this book is the commonly used 32-bit CCITT-specified formula.
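As a sketch of the idea, the following computes the widely used 32-bit CRC bit by bit, assuming the common reflected polynomial 0xEDB88320 with preconditioning and final inversion. Table-driven implementations compute the same value much faster.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    uint32_t crc32(const unsigned char *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;            /* preconditioning */
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int bit = 0; bit < 8; bit++)  /* one shift per bit */
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u
                                : (crc >> 1);
        }
        return crc ^ 0xFFFFFFFFu;              /* final inversion */
    }

    int main(void)
    {
        /* "123456789" is the customary test message; its CRC-32 check
           value is CBF43926. */
        const char *msg = "123456789";
        printf("CRC-32 of \"%s\" = %08lX\n", msg,
               (unsigned long)crc32((const unsigned char *)msg,
                                    strlen(msg)));
        return 0;
    }

Flipping any single bit of the input changes the result, which is what makes the CRC useful as a checksum.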
Discrete Cosine Transform
The DCT is used in the JPEG image compression method. The DCT is similar to the Fast Fourier Transform, in that it transforms a set of data from the spatial domain to the frequency domain and back again. Once a photographic image is transformed by the DCT into a set of frequency information, it can be effectively compressed using "lossy" techniques. Expanding the same image involves converting the frequency information back to spatial information.
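As a sketch of what the transform computes, here is the 8-point one-dimensional DCT written directly from its defining equation; JPEG applies this transform to the rows and then the columns of each 8-by-8 block. The input row is invented, and real codecs use fast factored implementations rather than this direct form.

    #include <stdio.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define N 8

    void dct_1d(const double in[N], double out[N])
    {
        for (int u = 0; u < N; u++) {
            /* normalization term: 1/sqrt(2) for the DC coefficient */
            double c = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                sum += in[x] * cos((2 * x + 1) * u * M_PI / (2.0 * N));
            out[u] = 0.5 * c * sum;
        }
    }

    int main(void)
    {
        double in[N] = { 52, 55, 61, 66, 70, 61, 64, 73 };  /* invented row */
        double out[N];
        dct_1d(in, out);
        for (int u = 0; u < N; u++)
            printf("F(%d) = %8.2f\n", u, out[u]);
        return 0;
    }

For smooth photographic data, most of the energy lands in the first few coefficients; the lossy step consists of coarsely quantizing or discarding the rest.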
Dictionary, adaptive/static
Macro substitution methods use a dictionary to compress data: a string of symbols is encoded as a pointer into a dictionary. An adaptive method, such as LZ77, continually modifies its dictionary as the stream is processed. A static method uses the same dictionary for the entire stream.
Entropy, Information content
Entropy is a measure of the amount of information in an object. The concept of "absolute entropy" remains elusive; in general, entropy is calculated with respect to a given model. Entropy can be expressed in bits, in which form it is generally referred to as "information content." With respect to a model that assigns a symbol a probability p, that symbol's information content is -log2(p) bits.
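A minimal sketch of this calculation, assuming a simple order-0 model over a buffer of bytes:

    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    double entropy(const unsigned char *buf, size_t len)
    {
        long counts[256] = { 0 };
        for (size_t i = 0; i < len; i++)
            counts[buf[i]]++;            /* order-0 frequency model */

        double h = 0.0;
        for (int s = 0; s < 256; s++) {
            if (counts[s] == 0)
                continue;
            double p = (double)counts[s] / len;
            h -= p * log2(p);            /* add -p * log2(p) */
        }
        return h;                        /* bits per symbol */
    }

    int main(void)
    {
        const char *text = "abracadabra";
        printf("entropy = %.3f bits/symbol\n",
               entropy((const unsigned char *)text, strlen(text)));
        return 0;
    }

The result is a lower bound on how compactly this model could code the buffer; a better model (say, one that uses context) generally reports lower entropy for the same data.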
Escape code
An Escape code is a special symbol used to 'escape' from the current context. In data compression, escape codes are frequently used when a symbol not found in the current dictionary of symbols needs to be encoded. The Escape code tells the decoder to change to a different context, where the symbol can be properly coded.
Freeware
Freeware is a term applied to software that is distributed without charge and may be used freely by anyone. It is distinguished from public domain software by the fact that the owner of the software retains the copyright to the work. Retaining the copyright allows the owner to restrict or control any modification or redistribution of the package. Software distributed by the Free Software Foundation, such as the EMACS editor, is generally referred to as freeware.
Group 3 FAX
Group 3 FAX is a CCITT standard for transmission of facsimile data. It can compress black and white images using a combination of differential, run length, and Huffman coding.
Huffman coding
Huffman coding is a method of encoding symbols that varies the length of a symbol's code in proportion to its information content. Symbols with a low probability of appearance are encoded with codes using many bits; symbols with a high probability of appearance are represented with codes using fewer bits. Huffman codes can be properly decoded because they obey the prefix property: no code is a prefix of another code. Huffman coding was first described in a seminal paper by D.A. Huffman in 1952.
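The prefix property can be illustrated with a toy decoder. The four-symbol code table below is invented; because no code is a prefix of another, a single left-to-right scan of the bit string decodes unambiguously, with no separators needed.

    #include <stdio.h>
    #include <string.h>

    struct code { char symbol; const char *bits; };

    /* hypothetical prefix-free codes, shortest for the most frequent */
    static const struct code table[] = {
        { 'e', "0"   },
        { 't', "10"  },
        { 'a', "110" },
        { 'q', "111" },
    };

    int main(void)
    {
        const char *stream = "110100";   /* 110 | 10 | 0  ->  "ate" */
        const char *p = stream;

        while (*p) {
            int matched = 0;
            for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
                size_t n = strlen(table[i].bits);
                if (strncmp(p, table[i].bits, n) == 0) {
                    putchar(table[i].symbol);  /* emit decoded symbol */
                    p += n;
                    matched = 1;
                    break;
                }
            }
            if (!matched)                      /* corrupt stream */
                break;
        }
        putchar('\n');
        return 0;
    }

A real Huffman decoder walks the coding tree bit by bit instead of comparing strings, but the guarantee it relies on is the same.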
Information theory
Information theory is the study of the storage, processing, and transmission of information. This branch of science is generally acknowledged as having been created by Claude Shannon at Bell Labs shortly after World War II.
ISO
ISO is the International Standards Organization. ISO is one of the bodies (along with the CCITT) involved in the JPEG and MPEG standardization efforts.
JPEG
JPEG stands for the Joint Photographic Experts Group. It is referred to as a "joint" group because the committee is sanctioned by both the CCITT and the ISO, two prominent international standards groups. JPEG refers both to the committee and to its work: a compression standard that defines a method for compressing photographic images. Images compressed with the JPEG algorithm undergo "lossy" compression; the amount of compression can be varied, with a corresponding loss or gain in resolution. Most of the JPEG-like algorithms in use today rely on powerful dedicated processors to perform compression and decompression; software-only techniques on general-purpose desktop processors are still fairly slow. JPEG compression can achieve impressive results, reducing the storage required by an image to less than 10% of the size of the original with only very slight loss of resolution. By sacrificing more resolution, images can be reduced in size by 95% or more.
LHarc
LHarc is a freeware compression program authored by Haruyasu Yoshizaki. It uses the LZSS variant of LZ77 compression, followed by a Dynamic Huffman postprocessing stage. The freeware status of this program, combined with the availability of source code, has made it a popular program.
Linear Predictive Coding
LPC is a coding technique that transmits voice data using a model of the vocal tract, sending the parameters of the model rather than the waveform itself.
Lossless
Lossless compression is used to compress a text stream so that it can be expanded into an identical copy of the stream. This type of compression is normally required for data files.
Lossy
Lossy compression refers to a compression technique where the compressed stream cannot be expanded into an exact copy of the input. Lossy compression can be used on digitally stored representations of analog phenomena, such as graphics images and stored audio samples. The ability to sacrifice small amounts of resolution allows lossy algorithms to compress files to significantly smaller sizes than lossless methods can achieve. Lossy compression is sometimes referred to as "noisy" compression.
LZW, LZSS, LZ-77, LZ-78
Jacob Ziv and Abraham Lempel published a pair of papers in 1977 and 1978 that described two different dictionary-based compression techniques. LZ77 replaces strings with pointers into a fixed-size window of previously seen text. LZ78 builds up a phrase dictionary from previously seen text, with no limit on how far back a phrase may have appeared. These papers spurred a flurry of activity by other researchers who refined the techniques, resulting in compression algorithms superior to earlier statistics-based Huffman coding. Dictionary methods are widely used today in V.42bis modems, in software such as LHarc, ARJ, and PKZIP, and in QIC magnetic tape drives.
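A bare-bones sketch of LZ77-style parsing: at each position, find the longest match in the window of previously seen text and emit a (distance, length) pair, or a literal when no worthwhile match exists. The window size and minimum match length here are invented; real implementations use hashed searches and pack the tokens into bits rather than printing them.

    #include <stdio.h>
    #include <string.h>

    #define WINDOW 32   /* tiny window, for illustration only */

    int main(void)
    {
        const char *text = "abcabcabcd";
        size_t len = strlen(text), pos = 0;

        while (pos < len) {
            size_t best_len = 0, best_dist = 0;
            size_t start = pos > WINDOW ? pos - WINDOW : 0;

            for (size_t i = start; i < pos; i++) {   /* try each start */
                size_t l = 0;
                while (pos + l < len && text[i + l] == text[pos + l])
                    l++;                             /* extend match  */
                if (l > best_len) {
                    best_len = l;
                    best_dist = pos - i;
                }
            }
            if (best_len >= 3) {                     /* worth a pair  */
                printf("(back %zu, copy %zu)\n", best_dist, best_len);
                pos += best_len;
            } else {
                printf("literal '%c'\n", text[pos]);
                pos++;
            }
        }
        return 0;
    }

On "abcabcabcd" this emits three literals, one (back 3, copy 6) pair, and a final literal; note that the copy length may exceed the distance, since the copy can overlap itself.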
Model
Compression algorithms generally maintain a “model,” which is a set of accumulated statistics describing the state of the encoder. For example, in a simple compression program, the model may be a count of the frequency of every symbol in an input file.
MPEG
MPEG stands for the Moving Picture Experts Group. Like JPEG, MPEG refers both to a group and to the standard that the group is developing. MPEG is a committee sanctioned by the ISO to work on the digital transmission of broadcast-quality full motion video and sound. The goal of the group is to be able to send a high-quality picture and stereo soundtrack through a 1.5 Mbps channel. The group has drafted a standard which is presently under review.
Nyquist Rate
The rate at which an analog signal of frequency f must be sampled in order to be reproduced accurately. Harry Nyquist determined that this rate must be greater than 2*f. For example, audio CDs sample at 44.1 kHz, comfortably above twice the roughly 20 kHz upper limit of human hearing.
Order (in re. model)
The order of a model refers to how many previous symbols are taken into consideration when encoding. For example, an order-0 model ignores all previous symbols when determining what code to use for a given symbol. So even if the previous character in a file was ‘q’, the probability of a ‘u’ in an order-0 model will not go up. An order-1 model would take note of the ‘q’ and greatly increase the probability for the ‘u’.
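The bookkeeping difference is small: an order-0 model keeps one count per symbol, while an order-1 model keeps a count for every (previous symbol, symbol) pair. A sketch, with an invented sample string:

    #include <stdio.h>

    long order0[256];        /* counts[symbol]                    */
    long order1[256][256];   /* counts[previous symbol][symbol]   */

    void update(const unsigned char *buf, long n)
    {
        int prev = 0;                 /* arbitrary starting context */
        for (long i = 0; i < n; i++) {
            order0[buf[i]]++;         /* context-free count         */
            order1[prev][buf[i]]++;   /* count in previous context  */
            prev = buf[i];
        }
    }

    int main(void)
    {
        const unsigned char text[] = "quick quiet quorum";
        update(text, (long)(sizeof text - 1));
        printf("order-0 count of 'u':            %ld\n", order0['u']);
        printf("order-1 count of 'u' after 'q':  %ld\n", order1['q']['u']);
        return 0;
    }

In the sample string, 'u' follows 'q' in every word, so the order-1 model assigns 'u' a very high probability in that context while the order-0 model sees only its overall frequency.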
Phrase
A phrase is a string of symbols of an arbitrary length. When programming in C, the term “string” can usually be substituted for phrase.
PKZIP
PKZIP is a popular desktop compression/archiving program. PKZIP is distributed via shareware channels by PKWare, Glendale, WI. This program uses several different dictionary-based compression algorithms to compress input files. It has achieved enough popularity in the MS-DOS world to be accepted as a “standard,” although the source code for the program remains proprietary.
p X 64
p x 64 refers to the CCITT standard regarding digital transmission of audiovisual information, commonly referred to as videoconferencing. ISDN communications networks allocate bandwidth in 64K bit/second “chunks.” The ‘p’ in p X 64 refers the notion that a video transmission channel will be allocated a certain number of 64 Kbps channels. The quality of transmission will be affected by the magnitude of p. Videoconferencing can be done with p as low as 1 or 2, but transmission of motion is severely restricted. As p grows, higher quality transmission becomes possible.
Run Length Encoding
Run Length Encoding, or RLE, is a simple technique used to compress runs of identical symbols in a data stream. Typically RLE encodes a run of symbols as a symbol and a count. The simplicity of RLE and ease of implementation result in its frequent appearance in compression programs, although its performance is relatively poor.
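A minimal run-length encoder fits in a few lines. This sketch prints each run as a (count, byte) pair and caps runs at 255 so a count fits in one byte; a practical version would add an escape mechanism so that non-repeating data is not expanded.

    #include <stdio.h>
    #include <string.h>

    void rle_encode(const unsigned char *in, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            size_t run = 1;
            while (i + run < len && in[i + run] == in[i] && run < 255)
                run++;                        /* extend the current run */
            printf("(%zu x '%c') ", run, in[i]);
            i += run;
        }
        putchar('\n');
    }

    int main(void)
    {
        const char *data = "aaaabbbcccccd";   /* invented sample */
        rle_encode((const unsigned char *)data, strlen(data));
        return 0;
    }

Output: (4 x 'a') (3 x 'b') (5 x 'c') (1 x 'd'). The single-byte run shows the weakness: without an escape mechanism, isolated symbols cost two output bytes apiece.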
Shannon-Fano coding
Shannon-Fano coding is a technique developed in the late 1940s that attempts to minimize the number of bits used in a message when the probabilities of the symbols in the message are known. Shannon-Fano coding has generally been superseded by Huffman coding, which produces provably optimum code sets, resulting in marginally better performance than Shannon-Fano codes.
Shannon, Claude
Claude Shannon is known as the father of information theory for his work done in the 1940s and 1950s. Shannon defined the concept of “information content” and entropy as it relates to data. Shannon is also an ardent student of the art of juggling, both by man and machine.
Shareware
Shareware refers to an increasingly common method of software distribution. Shareware is based on a "try before you buy" concept: the user of a shareware program is authorized by the creator to try the program for a limited time. After the evaluation period is over, the user is expected to "register" the program if he or she intends to keep using it. Registration generally consists of a payment in return for improved documentation, support, and other considerations.
While shareware is a decidedly ad hoc concept, some standardization in the industry is being attempted by the Association for Shareware Professionals. The ASP has established a set of guidelines for authors of shareware defining various standards for distribution and support.
While not part of the definition of shareware, most shareware programs also cost significantly less than their commercial counterparts. Many consumers mistakenly assume that shareware is "free," a misconception the ASP is working to correct.
SQ, MS-DOS program
One of the earliest desktop compression programs was SQ, along with its counterpart USQ. These two programs implemented a simple order-0 Huffman compression algorithm. They were first developed for the CP/M operating system and were ported to MS-DOS soon after its release. SQ did not perform file archiving, so it was frequently combined with LU, a program that combined files into a library.
The appearance of ARC as a shareware program soon drove SQ into obscurity. ARC not only offered superior compression performance with its LZW algorithm, it also combined the compression and librarian functions into a single program. This let users compress groups of files and move them into a single archive in one operation.
Static model
A static model is one that does not change while a stream of symbols is compressed. An example of this would be a simple order-0 Huffman code compression program. The classic implementation of this program counts all of the different characters in a file during a first pass. These counts are then used to build a Huffman coding tree, which constitutes the static model. A second pass is then made over the data to perform the actual compression.
Symbols
In data compression terminology, a symbol is an atomic unit of information. General-purpose compression programs frequently compress streams of bytes, in which case the byte and the symbol are one and the same. However, a symbol could just as easily be a floating point number, a spoken word, etc.
TAR, UNIX program
TAR is a standard UNIX program used to create archives. It takes a group of files and combines them into a single file or volume. TAR does not perform any compression of the files. UNIX archives are frequently created by using TAR to pack a group of files together, then using COMPRESS to perform the data compression. The resulting file is referred to as being in “TAR.Z” format.
Token
A token is an arbitrary object used to encode something else. In dictionary-based compression, a token is an object that can be used to decode a phrase.
Welch, Terry
Terry Welch took the LZ78 data compression algorithm and refined it into a “Technique for High-Performance Data Compression.” His patent on the LZW algorithm is now controlled by Unisys.
Ziv & Lempel
Jacob Ziv and Abraham Lempel are two Israeli information theorists who published the seminal papers on dictionary-based data compression in 1977 and 1978.

