Internet Telephony and Voice Compression

by Kelly Ann Smith and Daniel Brushteyn

www.cis.upenn.edu/~kellyann/papers/iphone.html


Introduction

Internet telephone service is one of the hottest new features on the Internet. Paying nothing more than Internet connection fees, users can make long-distance calls via computer. The voice is compressed and transmitted over the Internet, just like any other data, then expanded at the receiver's end. This paper summarizes the advantages and disadvantages of using Internet telephony software. Voice transfer and compression are also examined: Internet telephone calls are always sent in compressed form, with the goal of low bandwidth and high quality.

Internet Telephony

Internet telephony issues include quality of service and financial implications. The quality is usually not quite as good as that of a conventional phone, but the low price compensates for it. The telephony software is relatively cheap: often the basic package is free, with additional services costing extra, and the whole package is only $30 - $50. The only additional cost is for the Internet connection.

Features

Internet telephony products have all of the features of standard telephones and more. Common telephone-type features available through Internet phone include call holding, call waiting, muting and volume control, caller ID, call blocking and screening, directory assistance, speed dial, and voice mail [1]. Non-telephone features available include whiteboarding, document sharing and file transfer. IRC-type conversations are also possible, using voice instead of text, although text chat is available for when the network gets busy or the system is down. A log book allows you to keep track of all incoming and outgoing calls; the time, date, length of call, caller name, IP address, and source and destination gateways are all recorded. Automatic routing/traffic balancing algorithms are used, which determine the best available route based on the destination and current network traffic.

The user often has a choice of network parameters, including sampling rate and compression algorithm. A high sampling rate gives more information about the speech, but can flood the channel; words, phrases, or even whole sentences may be lost [2]. A better quality connection requires a longer delay than a mediocre one; the choice between high quality and minimal delay is up to the user. The user is often offered the choice of compression algorithm; connections through VocalTec's Internet Phone use one of the following compression algorithms: TrueSpeech 8.5 [3], VSC (VocalTec's own algorithm), and GSM. Some compression algorithms may require more bandwidth or computing power than a user has access to, so the most powerful one may not always be accessible by a certain user.

In general, people who want to talk to each other over the Internet each log onto a computer equipped with a microphone and speaker and establish a connection. However, a user doesn't have to be online to reap the benefits of online telephone service. Any combination of computer/telephone calls can be made, even telephone to telephone, over the Internet. Whenever a telephone is used, the call must be transferred between the Internet and the local telephone system. The companies that provide Internet phone software also provide gateways through which these conversions occur. The user pays a fee for using the gateway, but these charges are very small compared to standard long-distance charges. For example, a transatlantic phone call using a telephone over the Internet could cost as little as $0.04/minute, as opposed to $1.00/minute over telephone lines.

Drawbacks

The voice quality of an Internet telephone call is its main drawback. The quality is often compared to that of a speakerphone; a little choppy at times, but generally understandable [2, 4, 5]. This is mainly a function of network congestion. When traffic causes delays or out-of-order packets, some packets are dropped, causing breaks in the signal. Also, there is a noticeable delay in the time it takes for the message to be sent through the channel. This is due to the complexity of the compression algorithm used (it takes time for the signal to be compressed and expanded) and the traffic on the network.

There are additional drawbacks besides low quality. Both users must use the same brand of software. Only one firewall, CheckPoint FireWall-1, allows Internet phone calls to pass through, and only for one Internet telephone product, VocalTec's Internet Phone [6]. The standard connection is half-duplex: only one person may be transmitting at one time. Although some sound cards are full-duplex, both people must use the correct hardware and software to have a full-duplex conversation.

Security

The levels of encryption vary from excellent to nonexistent, depending on the specific brand of Internet telephone software. Phil Zimmermann's PGPfone has excellent encryption; its encryption algorithms are as good as the ones used in the secure phones AT&T sells to the US government. One algorithm is used to create the session key for the conversation, then one of two algorithms may be selected to encrypt the bit stream itself [2]. Third Planet's DigiPhone allows you to add new encryption software; the standard encryption on this program is minimal. VocalTec's Internet Phone does not currently offer any encryption, although it plans to in its next version. It is likely that all new software will employ some type of security.

Summary of Internet Telephony

Internet phone users enjoy free long distance phone calls, coupled with numerous additional features. Traditional telephone features, such as call waiting and voice mail, are included, as are non-traditional features, such as group chats, text-based document sharing, and whiteboarding. Users can choose various parameters, including quality vs. delay and the compression algorithm. Although the quality isn't as good as that of conventional phone calls, the discounted price makes up for it.

Although the price of the call is now negligible, the phone companies are likely to object to the free long distance service offered by the Internet and may raise the price of local phone calls in response. It remains to be seen whether Internet telephone calls will continue to be such a good bargain for the average user, since the pricing for voice traffic is now undergoing change.

Voice Transfer and Compression Techniques

All calls made over the Internet employ some type of voice compression. A brief discussion of voice transfer and compression techniques is presented here in order to better understand some of the issues involved in sending voice over the Internet.

The analog audio signal must first be converted into a digital signal in order to be transmitted over the Internet. The bandwidth of a telephone line is about 3400 - 4000 Hz, since filters cut off frequencies higher than this. Nyquist's theorem says:

If a signal has been run through a filter of bandwidth H, then the original filtered signal may be completely reconstructed by taking 2H samples per second.

Thus, if a phone-quality signal (H = 4000 Hz) is sampled 8000 times per second, the original filtered signal can be reconstructed.

Pulse Code Modulation with Mu-Law Encoding

We need to digitally encode every sample taken. The quality of the signal will depend on the number of bits used to encode it. If an infinite number of bits were used, we would be able to represent the signal exactly as it had been transmitted over the phone line. We would like to minimize the number of bits needed for each sample to reduce bandwidth and shorten the delays necessary to encode and recreate the signal at each end.

The standard for pulse code modulation is ITU G.711 [7]. It involves assigning a level to each sample at every 1/8000 second. Only eight bits are sent to encode each sample, so only 256 different levels may be encoded. This produces a channel rate of 64 kbps.
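The arithmetic behind these figures can be sketched in C. This is only an illustration; the function names nyquist_rate and pcm_bit_rate are made up for this example and are not part of G.711 or of any library.

```c
/* Minimum sampling rate (samples per second) for a signal that has
   been band-limited to bandwidth_hz, following Nyquist's theorem. */
unsigned long nyquist_rate(unsigned long bandwidth_hz)
{
    return 2UL * bandwidth_hz;
}

/* Channel bit rate (bits per second) when every sample is encoded
   with bits_per_sample bits. */
unsigned long pcm_bit_rate(unsigned long sample_rate_hz,
                           unsigned int bits_per_sample)
{
    return sample_rate_hz * bits_per_sample;
}
```

With a 4000 Hz bandwidth, nyquist_rate gives 8000 samples per second, and pcm_bit_rate(8000, 8) gives 64000 bits per second, which is the 64 kbps channel rate of standard PCM.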

At least twelve bits are needed to cover the range in amplitude of a voice signal. However, we are much more sensitive to changes at lower amplitude than high; therefore, to represent speech, we can use more bits to encode at low amplitudes and fewer bits to encode at high amplitudes. This nonuniform quantization can be done in several ways, such as mu-law and A-law encoding.

The ITU standard includes specifications for mu-law and A-law encoding and decoding. Mu-law is the standard for transmission over networks in the United States and Japan, while A-law is used in Europe. Both methods result in a signal sample being compressed down to 8 bits, from either 13 bits (mu-law) or 12 bits (A-law).

A short program segment in C follows, demonstrating how a 16-bit sample (where only the first 13 bits are used) is converted to an 8-bit signal using mu-law conversion [8]. (Since hardware already exists for taking 16-bit samples, the original sample is usually assumed to be 16 bits.)


#define BIAS 0x84   /* bias added to the magnitude so a '1' always falls in the exponent field */
#define CLIP 32635  /* largest magnitude that still fits in 15 bits once BIAS is added */

unsigned char linear2ulaw(int sample)            /* sample: 16-bit signed linear PCM */
{
  /* exp_lut[i] is the position of the highest '1' bit in i (0 for i < 2) */
  static const int exp_lut[256] =
       {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
        5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
  int sign, exponent, mantissa;

  sign = (sample < 0) ? 0x80 : 0x00;                  /* set aside the sign */
  if (sign != 0) sample = -sample;                    /* get magnitude      */
  if (sample > CLIP) sample = CLIP;                   /* clip the magnitude */

  sample = sample + BIAS;                   /* add bias for standardization */
  exponent = exp_lut[(sample >> 7) & 0xFF];    /* find exponent using table */
  mantissa = (sample >> (exponent + 3)) & 0x0F;       /* find mantissa      */
  /* pack sign, exponent, mantissa, then invert every bit of the byte */
  return (unsigned char) ~(sign | (exponent << 4) | mantissa);
}

Three parameters are computed from the 13-bit sample: the sign, the exponent, and the mantissa. This is simply scientific notation: -2.54 E+5 contains the same components, only in decimal form (- is the sign, 2.54 is the mantissa, and +5 is the exponent). This also demonstrates the beneficial nature of nonuniform quantization: we only care about the first three significant digits. If the number is large (e.g. -2.54 E+5), it isn't very important to know the value in the ones place. However, if the number is comparatively small (e.g. 3.42 E+2), we do care more about these less significant digits.

In the above algorithm, the exponent simply refers to the position of the first '1' in the (absolute value of the) sample. The mantissa is the next four bits. For example, consider the following sample:

             _______________________________________________
            |     |                   |           |         |
            |  1  |  0 0 0 0 0 1 0 1  |  1 1 0 1  |  X X X  |
            |_____|___________________|___________|_________|
                               * m m     m m
             sign    eight positions              3 unused bits
                     for the exponent

Here, the exponent is determined by calculating the value of the 8 exponent bits (= 5) and looking up the base-2 logarithm of this value, rounded down, in the table exp_lut (= 2). In other words, the first '1' (*) occurs 2 places to the left of the rightmost exponent position. The mantissa (m) is then simply the following 4 bits (= 0 1 1 1). A bias is added to the sample so that there will always be a '1' somewhere in the 8 positions for the exponent. All bits to the right of the mantissa are discarded, since we don't need that level of precision: always the last three, plus one more for each step of the exponent. The sample is thus converted from 1 00000101 1101 XXX to 1 010 0111; the final step of the code then inverts every bit, so the byte actually returned is 0 101 1000.
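For completeness, the reverse conversion can be sketched as follows. This is a minimal sketch rather than the reference decoder: the name ulaw2linear is chosen here to mirror the encoder, and the reconstruction formula is derived from the encoder above (undo the bit inversion, rebuild the biased magnitude at the middle of its quantization interval, remove the bias).

```c
#define ULAW_BIAS 0x84   /* same bias the encoder adds before compression */

/* Expand one 8-bit mu-law byte back to a signed linear sample. */
int ulaw2linear(unsigned char ulawbyte)
{
    int sign, exponent, mantissa, sample;

    ulawbyte = (unsigned char) ~ulawbyte;   /* undo the encoder's inversion */
    sign     = ulawbyte & 0x80;
    exponent = (ulawbyte >> 4) & 0x07;
    mantissa = ulawbyte & 0x0F;

    /* Restore the leading '1' and a half-step offset (both carried by
       the bias term), shift back into place, then remove the bias. */
    sample = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS;

    return (sign != 0) ? -sample : sample;
}
```

Because the discarded low-order bits cannot be recovered, the decoded value is only close to the original. For the example byte 0 101 1000 the function returns -620, while the original magnitude lay between 612 and 619.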

Pulse code modulation is not considered to be part of signal compression; it is simply a method to quantize the actual signal nonuniformly. We get more useful information from these 8 bits than we would get from 8 uniformly quantized bits. PCM is usually done as a first step in compression; most compression algorithms compress the PCM signal, not the original signal.

Compression Methods

The goal of compression methods is to optimize the interplay of the following parameters: bit rate, delay, quality, and complexity. Often, high quality comes at the price of a complex algorithm, which causes a delay. The delay is a function of both the complexity of the algorithm (which can be changed) and the amount of traffic on the network at a given time (which cannot be changed). Voice compression quality is measured by Mean Opinion Score (MOS), on the following scale:

5: person to person
4: phone quality
3: adequately understandable, but not very good quality
2: can understand words, but not recognize the speaker
1: can't make out the words or recognize the speaker

The two main techniques are waveform encoding, which sends information about the signal as purely a sequence of samples, and modeling of the human vocal tract, which sends certain parameters based on the physical characteristics which produce speech.

Waveform Encoding: ADPCM

An extremely simple version of waveform encoding could be performed by sending the signal amplitude at each sampling time. As discussed above, this would require 64 kbps. However, since we have prior information about the signal (e.g. what the previous samples were), we can use this to develop a method for sending fewer bits of data.

Speech may be accurately predicted over short periods of time, such as 1/8000 second. Knowledge of the previous samples provides a good basis for estimating the next sample. Using this estimate, we can simply encode the difference between the estimated signal and the actual signal. Since the prediction is likely to be very accurate, the error will be small, and we can encode the error in just a few bits: instead of the full 8-bit sample, we can send 2 - 5 bits for the error. When the signal varies rapidly from sample to sample, it is not exactly reproducible, but this produces a bad estimate for only one 1/8000 second sample. Subsequent estimates are based on the corrected signal, not on the faulty estimate, so the estimator always has the most accurate information and is quickly corrected.

Adaptive Differential Pulse Code Modulation (ADPCM; ITU G.726 [9]) uses 4 bits and a standard prediction algorithm. Since only 4 bits are sent per sample, the bit rate of the compressed signal is only 32 kbps, for a compression ratio of 2:1. This is an extremely simple compression algorithm, so there is not much delay in transmission. It also provides very good quality. Of course, the compression ratio is only 2:1, so good quality is to be expected.
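The idea of sending only the prediction error can be sketched in C. This is a deliberately simplified differential coder, not the G.726 algorithm: it assumes the prediction is just the previously reconstructed sample and uses a fixed step size, whereas real ADPCM adapts both. The names dpcm_state, dpcm_encode, dpcm_decode and the step of 16 are illustrative assumptions.

```c
#define DPCM_STEP 16          /* hypothetical fixed quantizer step size */

typedef struct {
    int predicted;            /* last reconstructed sample = next prediction */
} dpcm_state;

static int dpcm_clamp(int v, int lo, int hi)
{
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

/* Encode one sample as a signed 4-bit code in the range -8..7. */
int dpcm_encode(dpcm_state *st, int sample)
{
    int diff = sample - st->predicted;            /* prediction error */
    int code = dpcm_clamp(diff / DPCM_STEP, -8, 7);

    /* Track what the decoder will reconstruct, so both ends form
       the same prediction for the next sample. */
    st->predicted += code * DPCM_STEP;
    return code;
}

/* Rebuild one sample from a received 4-bit code. */
int dpcm_decode(dpcm_state *st, int code)
{
    st->predicted += code * DPCM_STEP;
    return st->predicted;
}
```

An encoder and a decoder each start from a dpcm_state of {0}. A sudden jump larger than the 4-bit range is clipped, so the reconstruction lags for a few samples and then catches up, which is the behavior described above for rapidly varying signals.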

Waveform Encoding: Transform Encoding

Transform encoding uses arithmetic transformations on the signal, producing an expression that describes the signal in fewer bits. The decoder uses the inverse transform to reproduce the signal.

This concept is demonstrated using Fourier series. Any signal may be represented by its Fourier coefficients. If the encoder simply sends these coefficients, it does not have to send the entire signal. The coefficients take up much less space than encoding the whole signal would, and the decoder can easily reconstruct the signal. In fact, only a few coefficients would probably be necessary to construct a voice signal over a short period of time, since speech is roughly periodic over short intervals.
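As a sketch of this idea, a block of signal of duration T can be approximated by its first K harmonics. The symbols x, T, K, a_k, and b_k are introduced here for illustration only.

```latex
x(t) \approx \frac{a_0}{2}
  + \sum_{k=1}^{K} \left[ a_k \cos\!\left(\frac{2\pi k t}{T}\right)
                        + b_k \sin\!\left(\frac{2\pi k t}{T}\right) \right],
\qquad
a_k = \frac{2}{T}\int_0^T x(t)\cos\!\left(\frac{2\pi k t}{T}\right)dt,
\quad
b_k = \frac{2}{T}\int_0^T x(t)\sin\!\left(\frac{2\pi k t}{T}\right)dt .
```

The encoder sends the 2K + 1 coefficients and the decoder evaluates the sum. For example, a 20 ms block sampled 8000 times per second contains 160 samples; with K = 10, only 21 coefficients would be sent in their place, before any savings from quantizing the coefficients themselves.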

Adaptive Transform Coding (ATC) uses a fast transformation (e.g. FFT) to split blocks of the speech signal into frequency bands, each of which is analyzed. The number of bits used to code each transformation coefficient is adapted, depending on which frequency range is being encoded. Speech has different qualities for different frequencies, and we can take advantage of this knowledge [9]. Using this encoding, phone quality speech can be produced at 16 kbps, for a compression ratio of 4:1. This is more complicated than ADPCM, but not so complicated as to produce significant delays.

Modeling the Human Vocal Tract

Encoders that exploit the physical knowledge we have about human speech signals are called vocoders. Here, the voice signals are modeled as coming from the human vocal tract. LPC (linear predictive coding) predicts the amplitude of voice frequencies from a model of the human vocal tract. Voice signals are represented by parameters, including gain (loudness), pitch, and other parameters representing the position of the mouth, nose, tongue, teeth, etc. [10].

The signal is analyzed to determine the most likely values of these parameters that would have produced this signal; these parameters are sent over the network. This is similar to the concepts behind transform encoding, in that parameters that characterize the signal are sent instead of the signal. It is unlike waveform encoding in that it cannot be applied to an arbitrary signal; rather, it is known that the signal represents speech, and it can be parameterized as such.

This method can compress speech down to 8 kbps (8:1), and even 4 kbps (16:1). The encoding causes a delay, since the signal analysis and parameter calculations are complex; however, the decoding does not add much delay. The quality is not perfect; if the bit rate is too low, the speech sounds synthesized. The speech is recognizable, but the speaker is not. This is due to the fact that the mouth positions are modeled instead of the actual signal.

Compression over Internet Phone

Internet telephony products generally use proprietary compression algorithms for encoding speech, so little is known about the exact methods these algorithms use. A brief summary of the published statistics of a common proprietary algorithm, DSP Group's TrueSpeech, is given here.

There are several different versions of TrueSpeech; the most common are TrueSpeech 8.5 and G.723 (which is ITU standard G.723). TrueSpeech 8.5 is a speech compression algorithm with relatively low complexity, with a 15:1 compression ratio and a MOS of 3.7. It can be encoded on a 486DX2 66 MHz machine and decoded on a 386 machine [11]. TrueSpeech G.723 is a higher complexity algorithm which includes Voice Activity Detection (VAD). This allows compression of the pauses in between words, increasing the compression ratio up to 35:1, with a MOS of 3.98, almost phone quality. This requires higher computing power: at least a 60 MHz Pentium to encode and a 486DX 33 MHz to decode [11].
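How TrueSpeech detects voice activity is not published, so the following is only a generic sketch of the principle: a frame whose average magnitude falls below a threshold is treated as a pause and need not be sent in full. The function name frame_is_active and the threshold value are assumptions made for this illustration.

```c
#include <stddef.h>

#define VAD_THRESHOLD 200   /* hypothetical mean-magnitude threshold */

/* Return 1 if the frame of 16-bit samples looks like speech,
   0 if it looks like a pause between words. */
int frame_is_active(const short *frame, size_t n)
{
    long total = 0;
    size_t i;

    if (n == 0)
        return 0;                       /* an empty frame carries no speech */

    for (i = 0; i < n; i++)             /* sum of absolute sample values */
        total += (frame[i] < 0) ? -(long) frame[i] : (long) frame[i];

    return (total / (long) n) > VAD_THRESHOLD;
}
```

A sender could call this once per frame and transmit only a short marker for inactive frames, which is how skipping pauses can raise the overall compression ratio.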

TrueSpeech is becoming the industry standard for videoconferencing and Internet telephony. Many companies already use this product, including AT&T, Intel, Microsoft, Prodigy, VDOnet, and VocalTec [11]. TrueSpeech 8.5 is built into Windows 95 and Windows NT.

Summary of Voice Transfer and Compression Techniques

The goal of voice transfer over the Internet is to be able to send as much information about the voice as possible, providing good reproduction, with as few bits as possible. The voice must first be digitized, then compressed and transmitted. The usual first step is to nonuniformly quantize the signal, using PCM with mu-law encoding.

Compression involves various other techniques, including simple waveform encoding and more complex algorithms, such as modeling the human vocal tract. The former technique compresses the actual signal, just as it would any other waveform signal, using methods such as sending the error of the predicted signal (e.g. ADPCM), or sending parameters about the signal (e.g. ATC). The latter technique uses the information we have about how humans produce speech to represent the signal as a set of parameters, which takes fewer bits to encode.

Summary and Conclusion

Internet telephony allows a user to make long-distance phone calls for only a slight investment. The voice is compressed and sent over the Internet, then expanded at the other side. Compression techniques were discussed to provide a background for understanding the issues involved with voice compression; however, the proprietary algorithms used in specific Internet telephony software are not known.

Due to the compression techniques used, perfect quality is never achieved. However, the quality may be extremely close to telephone quality, using the best algorithms on an uncongested network. Other drawbacks include a delay in receiving the signal, and restrictions on what software is used. Because of the great savings on long-distance phone calls, it is likely that many of these drawbacks can be overlooked. It remains to be seen whether or not local phone companies will therefore begin to charge more for this service, and how much an individual will be affected. As for now, Internet phone is an extremely cheap way to talk long-distance.


References

1. VocalTec Ltd. "VocalTec Telephony Gateway: Product Component Description and Functional Specifications"
2. Wayner, Peter. "Hey Baby, Call Me at My IP Address" & "Pronounced Packet Problems". Byte 21:142-4, April 1996.
3. DSP Group, Inc. "DSP Group's Truespeech Audio Licensed By Vocaltec For Use In Internet Telephone Software"
4. Salamone, Salvatore. "Companies Cut Phone Bills". Byte 20:36, May 1995.
5. Levy, Steven. "The rise of Internet telephony". Datamation 42:64-7, August 1996.
6. CheckPoint Software Technologies, Inc. "CheckPoint Software Announces Support for Secure Use of Internet Phone"
7. Hall, Jared. "Pulse Code Modulation"
8. Hunt, Andrew. "comp.speech FAQ: Q2.7: How do I convert to/from mu-law format?"
9. Byrne, R., Chelian, M., Chow, E., Goodhart, C., and Markley, R. "Very Low Bit Rate Compression"
10. Bar-Ziv, Itai. "My project - Voice compression in 1200 Bps"
11. DSP Group, Inc. "The Speech Technology for Videoconferencing, Telephony, and the Internet"