The Data Compression Book: Speech Compression


Just how effective silence compression can be at compressing files is shown in the following table. As expected, files without much silence in them were not greatly affected. But files that contained significant gaps were compressed quite a bit.


File Name	Raw Size	Compressed Size	Compression
SAMPLE-1.RAW	50777	37769	26%
SAMPLE-2.RAW	12033	11657	3%
SAMPLE-3.RAW	73091	73072	0%
SAMPLE-4.RAW	13852	10962	21%
SAMPLE-5.RAW	27411	22865	17%
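
The Compression column can be reproduced from the two size columns. The following is a minimal sketch, assuming the figure is the percentage of the raw size that was removed, rounded to the nearest whole percent; the function name compression_percent is made up for illustration and does not come from the book's source code.

```c
#include <assert.h>

/* Percentage of the raw size removed by compression, rounded to the
 * nearest whole percent. Both sizes are in bytes. */
int compression_percent(long raw_size, long compressed_size)
{
    assert(raw_size > 0);
    double saved = 100.0 * (double)(raw_size - compressed_size)
                         / (double)raw_size;
    /* Round half up; this assumes the file did not grow. */
    return (int)(saved + 0.5);
}
```

Calling compression_percent(50777, 37769) gives 26, which agrees with the first row of the table.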

The final question to ask about silence detection is how it affects the fidelity of input files. The best way to answer that is to take the sample files, compress them, then expand them into new files. The expanded files should differ from the originals in only one respect: runs of samples near the silence value of 80H should all have been set to exactly 80H, changing slightly noisy silence to pure silence.
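
The check described above can also be automated. The sketch below assumes both files have already been read into memory as unsigned eight-bit samples, and that "near the silence value" means within some threshold of 80H. The threshold parameter and the function name only_silence_changed are assumptions made for illustration, not values taken from the compression program.

```c
#include <stddef.h>
#include <stdlib.h>

#define SILENCE_VALUE 0x80   /* midpoint of unsigned eight-bit samples */

/* Returns 1 if `expanded` differs from `original` only at positions where
 * a sample within `threshold` of the silence value was replaced by exactly
 * 0x80, and returns 0 otherwise. */
int only_silence_changed(const unsigned char *original,
                         const unsigned char *expanded,
                         size_t count, int threshold)
{
    for (size_t i = 0; i < count; i++) {
        if (original[i] == expanded[i])
            continue;                      /* sample unchanged */
        if (expanded[i] != SILENCE_VALUE)
            return 0;                      /* changed to something else */
        if (abs((int)original[i] - SILENCE_VALUE) > threshold)
            return 0;                      /* a non-silent sample was erased */
    }
    return 1;
}
```

A result of 0 would suggest that the parameters are erasing samples that were not close to silence.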

In most cases, it is not possible to tell the sound samples that have been through a compression/expansion cycle from the originals. By tinkering with the parameters, it is possible to start erasing significant sections of speech, but that obviously means the parameters are not set correctly. All in all, when applied correctly, silence compression provides an excellent way to squeeze redundancy out of sound files.

Companding

Silence compression can be a good way to remove redundant information from sound files, but in some cases it may be ineffective. In the preceding examples, SAMPLE-3.RAW had so few silent samples it was only reduced by a few bytes out of 73K. This situation is somewhat analogous to using run-length encoding on standard text or data files: it will sometimes produce great gains, but it is not particularly reliable.

In the early 1960s, telecommunications researchers were looking for a method of data compression that could always reduce the number of bits in a sound sample. Customer satisfaction tests showed that it took about thirteen bits of resolution in an ADC sampling at 8,000 Hz to provide an acceptable voice connection, but it seemed likely that much of that resolution was going to waste.

We need thirteen bits of resolution in a phone conversation because of the large dynamic range of the human voice. To accommodate a loud speaker, the voltage input range of the ADC has to be set at a fairly high level. The problem is that the input voltage from a very soft voice is several orders of magnitude lower than this. If the ADC had eight bits of resolution, it would only detect input signals close to 1 percent of the magnitude of the highest input. This proved unacceptable.
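
The "close to 1 percent" figure can be checked with a little arithmetic. The sketch below assumes a linear converter in which one bit serves as the sign, leaving the remaining bits for magnitude. That is an assumption made here for illustration, and smallest_step_fraction is a hypothetical helper name.

```c
#include <assert.h>

/* Smallest non-zero magnitude a linear converter can represent, expressed
 * as a fraction of full scale, when one of `bits` bits is the sign. */
double smallest_step_fraction(int bits)
{
    assert(bits >= 2 && bits <= 31);
    return 1.0 / (double)(1L << (bits - 1));
}
```

Under that assumption, eight bits give a smallest step of 1/128 of full scale, or roughly 0.8 percent, while thirteen bits give 1/4096, or roughly 0.02 percent.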

It turns out, however, that the thirteen bits of resolution needed to pick up the voice of the quietest speaker is overkill for the loudest speaker. If our microphone input for a loud speaker is in the neighborhood of 100 mV, we might only need one millivolt of resolution to provide good sound reproduction. The thirteen-bit ADC might be giving 200 microvolt resolution, which is finer than necessary.

The telecommunications industry solved this problem by using a matched set of non-linear ADCs and DACs. The normal ADC equipment used in desktop computers (and most electronic equipment) uses a linear conversion scheme in which each increase in a code value corresponds to a uniform increase in input/output voltage. This arrangement is shown in Figure 10.12.

Figure 10.12 A linear conversion scheme in which each increase in a code value corresponds to a uniform increase in input/output voltage.

Using a linear conversion scheme such as this, when we go from code 0 to code 1, the output voltage from the DAC might change from 0 mV to 1 mV. Likewise, going from code 100 to code 101 will change the DAC output voltage from 100 mV to 101 mV.
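
A linear scheme of this kind reduces to a multiplication in one direction and a rounded division in the other. The sketch below is illustrative only: the function names and the step_mv parameter (the voltage step per code, in millivolts) are assumptions, and real converters do this in hardware.

```c
#include <assert.h>

/* Linear DAC: every code step adds the same voltage, step_mv millivolts. */
double linear_dac_mv(int code, double step_mv)
{
    return code * step_mv;
}

/* Linear ADC: round a non-negative input voltage to the nearest code. */
int linear_adc_code(double input_mv, double step_mv)
{
    assert(step_mv > 0.0 && input_mv >= 0.0);
    return (int)(input_mv / step_mv + 0.5);
}
```

With a step of 1 mV, linear_dac_mv(101, 1.0) minus linear_dac_mv(100, 1.0) is the same 1 mV as the step from code 0 to code 1.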

The system in our telecommunications equipment today uses a “companding codec”—jargon for “compressing/expanding coder/decoder.” The codec is essentially a chip that combines several functions, including those of the DAC, ADC, and input and output filters. We are concerned with the DAC and ADC.

The codec used in virtually all modern digital telephone equipment does not use a standard linear function when converting codes to voltages and voltages to codes. Instead, it uses an exponential function that changes the size of the voltage step between codes as the codes grow larger. Figure 10.13 shows an example of what this curve looks like. The resolution for smaller code values is much finer than at the extremes of the range. For example, the difference between a code of zero and a code of one might be 1 mV, while the difference between code 100 and code 101 could be 10 mV.

Figure 10.13 An exponential function that changes the size of voltage steps.

The exponential curve defined for telecommunications codecs gives an effective range of thirteen bits out of a codec that only uses eight-bit samples. We can do the same thing with our eight-bit sound files by squeezing eight-bit samples into a smaller number of codes.

Our eight-bit sound files can be considered approximately seven-bit samples with a single sign bit, indicating whether the output voltage is positive or negative. This gives us a magnitude range running from zero to 127 that the output of our non-linear compression function has to cover.

If we let N be the highest code value, so that the codes run from zero to N, we can develop a transfer function that maps each code onto the range zero to 127 using the following equation:

output = 127.0 * ( pow( 2.0, code / N ) - 1.0 )

In other words, we calculate the output by raising 2 to the code/N power, subtracting one, and scaling the result by 127. The value of code/N ranges from zero for code 0 up to one for code N, so the output runs from zero to 127, with a decidedly non-linear look.

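As an example, suppose we use eight codes, numbered 0 through 7, to encode the range zero to 127. This, in effect, compresses seven bits to three. Because the highest code is 7, the equation is evaluated with code/7 as the exponent, and each result is rounded to the nearest integer. The output value produced by each input code is shown in the table that follows.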

Transforming three bits to seven

Input Code	Output Value
0	0
1	13
2	28
3	44
4	62
5	81
6	103
7	127
