The user often has a choice of network parameters, including sampling rate and compression algorithm. A high sampling rate gives more information about the speech, but can flood the channel; words, phrases, or even whole sentences may be lost [2]. A better quality connection requires a longer delay than a mediocre one; the choice between high quality and minimal delay is up to the user. Connections through VocalTec's Internet Phone, for example, use one of the following compression algorithms: TrueSpeech 8.5 [3], VSC (VocalTec's own algorithm), and GSM. Some compression algorithms require more bandwidth or computing power than a given user has available, so the most powerful algorithm is not always an option.
In general, people who want to talk to each other over the Internet each log onto a computer equipped with a microphone and speaker and establish a connection. However, a user doesn't have to be online to reap the benefits of online telephone service. Any combination of computer and telephone calls can be made over the Internet, even telephone to telephone. Whenever a telephone is used, the call must be transferred between the Internet and the local telephone system. The companies that provide Internet phone software also provide gateways through which these conversions occur. The user pays a fee for using the gateway, but these charges are very small compared to standard long-distance charges. For example, a transatlantic phone call using a telephone over the Internet could cost as little as $0.04/minute, as opposed to $1.00/minute over telephone lines.
There are additional drawbacks besides low quality. Both users must use the same brand of software. Only one firewall, CheckPoint FireWall-1, allows Internet phone calls to pass through, and only for one Internet telephone product, VocalTec's Internet Phone [6]. The standard connection is half-duplex: only one person may transmit at a time. Although some sound cards are full-duplex, both parties need full-duplex hardware and software to hold a full-duplex conversation.
Although the price of the call is now negligible, the phone companies are likely to object to the free long-distance service offered by the Internet and may raise the price of local phone calls in response. It remains to be seen whether Internet telephone calls will continue to be such a good bargain for the average user, since the pricing for voice traffic is now undergoing change.
The analog audio signal must first be converted into a digital signal in order to be transmitted over the Internet. The bandwidth of a telephone line is about 3400 - 4000 Hz, since filters cut off frequencies higher than this. Nyquist's theorem says:
If a signal has been run through a filter of bandwidth H, then the filtered signal may be completely reconstructed by making 2H samples per second.
If a phone-quality signal (H = 4000 Hz) is sampled 8000 times per second, the original filtered signal can therefore be reconstructed.
The standard for pulse code modulation is ITU G.711 [7]. It assigns a level to a sample taken every 1/8000 second. Only eight bits are sent to encode each sample, so only 256 different levels may be encoded. This produces a channel rate of 64 kbps (8000 samples per second x 8 bits).
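The arithmetic behind these figures can be written out as a minimal sketch. The function names nyquist_rate and pcm_bit_rate are hypothetical and chosen only for illustration; they are not part of G.711.

```c
/* Minimum sampling rate (samples per second) for a signal that has been
 * filtered to a bandwidth of bandwidth_hz, following Nyquist's theorem. */
long nyquist_rate(long bandwidth_hz)
{
    return 2 * bandwidth_hz;
}

/* Channel rate (bits per second) of a PCM stream that sends
 * bits_per_sample bits for each of samples_per_sec samples. */
long pcm_bit_rate(long samples_per_sec, int bits_per_sample)
{
    return samples_per_sec * bits_per_sample;
}
```

With a 4000 Hz bandwidth, nyquist_rate gives 8000 samples per second, and pcm_bit_rate(8000, 8) gives 64000 bits per second, the 64 kbps quoted above.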
At least twelve bits are needed to cover the range in amplitude of a voice signal. However, we are much more sensitive to changes at lower amplitude than high; therefore, to represent speech, we can use more bits to encode at low amplitudes and fewer bits to encode at high amplitudes. This nonuniform quantization can be done in several ways, such as mu-law and A-law encoding.
The ITU standard includes specifications for mu-law and A-law encoding and decoding. Mu-law is the standard for transmission over networks in the United States and Japan, while A-law is used in Europe. Both methods result in a signal sample being compressed down to 8 bits, from either 13 bits (mu-law) or 12 bits (A-law).
A short program segment in C follows, demonstrating how a 16-bit sample (where only the first 13 bits are used) is converted to an 8-bit signal using mu-law conversion [8]. (Since hardware already exists for taking 16-bit samples, the original sample is usually assumed to be 16 bits.)
#define BIAS 0x84   /* add-in bias for 16-bit samples */
#define CLIP 32635  /* largest magnitude that still fits in 15 bits after BIAS is added */

/* Convert a 16-bit signed linear sample to an 8-bit mu-law byte. */
unsigned char linear2ulaw(int sample)
{
    /* exp_lut[i] is the position of the highest '1' bit in i */
    static const int exp_lut[256] = {
    0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
    6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
    6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
    int sign, exponent, mantissa;

    sign = (sample >> 8) & 0x80;                   /* set aside the sign */
    if (sign != 0) sample = -sample;               /* get magnitude */
    if (sample > CLIP) sample = CLIP;              /* clip the magnitude */
    sample = sample + BIAS;                        /* add bias for standardization */
    exponent = exp_lut[(sample >> 7) & 0xFF];      /* find exponent using table */
    mantissa = (sample >> (exponent + 3)) & 0x0F;  /* find mantissa */
    /* pack sign, exponent, mantissa and invert every bit */
    return (unsigned char)~(sign | (exponent << 4) | mantissa);
}
Three parameters are computed from the 13-bit sample: the sign, the exponent,
and the mantissa. This is simply scientific notation: -2.54 E+5 contains
the same components, only in decimal form (- is the sign, 2.54 is the mantissa,
and +5 is the exponent). This also demonstrates the beneficial nature of
nonuniform quantization: we only care about the first three significant
digits; if the number is large (e.g. -2.54 E+5), it isn't very important to
know the value in the ones place. However, if the number is comparatively small
(e.g. 3.42 E+2), we do care more about these less significant digits.
In the above algorithm, the exponent simply refers to the position of the first '1' in the (absolute value of the) sample. The mantissa is the next four bits. For example, consider the following sample:
 _______________________________________________
|     |                   |           |         |
|  1  | 0 0 0 0 0 1 0 1   |  1 1 0 1  |  X X X  |
|_____|___________________|___________|_________|
                  * m m      m m
 sign   eight positions                3 unused bits
        for the exponent
Here, the exponent is determined by calculating the value of the 8 exponent
bits (= 5) and looking up the log(base 2) of this value, rounded down, in the
table exp_lut (= 2). In other words, the first '1' (*) occurs 2 places to the
left of the rightmost exponent position. The mantissa (m) is simply the
following 4 bits (= 0 1 1 1). The sample includes a bias added in so that
there will always be a '1' somewhere in the 8 positions for the exponent.
The bits to the right of the mantissa (here 0 1 X X X) are discarded, since we
don't need that level of precision. The sample is thus converted from
1 00000101 1101 XXX to 1 010 0111; the routine then inverts every bit of this
byte (giving 0 101 1000) before returning it.
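For completeness, the reverse conversion can be sketched as follows. This is a minimal illustration and not a routine taken from the text or from [8]: the name ulaw2linear and the constant ULAW_BIAS are assumptions chosen to mirror linear2ulaw. The decoder undoes the bit inversion, restores the leading '1' and the bias, and places the mantissa at the position given by the exponent. Because the discarded low-order bits cannot be recovered, it returns the midpoint of the range that the encoded byte represents.

```c
#define ULAW_BIAS 0x84  /* same bias the encoder adds (assumed name) */

/* Convert an 8-bit mu-law byte back to a signed linear sample. */
int ulaw2linear(unsigned char ulawbyte)
{
    int u = ~ulawbyte & 0xFF;        /* undo the encoder's bit inversion */
    int sign = u & 0x80;             /* top bit is the sign */
    int exponent = (u >> 4) & 0x07;  /* next three bits */
    int mantissa = u & 0x0F;         /* last four bits */

    /* (mantissa << 3) + ULAW_BIAS rebuilds the leading '1', the four
     * mantissa bits and a half-step rounding offset; shifting by the
     * exponent moves them into place, and the bias is then removed. */
    int magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS;

    return sign ? -magnitude : magnitude;
}
```

Applied to the worked example, the byte 1 010 0111 is sent inverted as 0x58, and ulaw2linear(0x58) gives -620. Any original magnitude whose bits match the diagram lies between 612 and 619 once the bias is removed, so the error is small relative to the sample.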
Pulse code modulation is not considered to be part of signal compression; it is simply a method to quantize the actual signal nonuniformly. We get more useful information from these 8 bits than we would get from 8 uniformly quantized bits. PCM is usually done as a first step in compression; most compression algorithms compress the PCM signal, not the original signal.
The two main compression techniques are waveform encoding, which sends information about the signal purely as a sequence of samples, and modeling of the human vocal tract, which sends certain parameters based on the physical characteristics that produce speech.
Speech may be accurately predicted over short periods of time, such as 1/8000 second. Knowledge of the previous samples provides a good basis for estimating the next sample. Using this estimate, we can simply encode the difference between the estimated signal and the actual signal. Since the prediction is likely to be very accurate, the error will be small, and we can encode the error in just a few bits. On an 8-bit sample, we can send 2 - 5 bits for the error. When the signal varies rapidly from sample to sample, it is not exactly reproducible, but this produces a bad estimate for only one 1/8000 second sample. The subsequent sample estimates are based on the actual signal, not the estimate, so the estimator always has the most accurate information and is quickly corrected.
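The idea can be sketched with the simplest possible predictor: the estimate for each sample is just the previous reconstructed sample. This is an illustrative assumption, not the predictor of any particular standard, and the names DPCM_BITS, DPCM_STEP, dpcm_encode and dpcm_decode are hypothetical. The encoder tracks the value the decoder will reconstruct, so that both sides always make the same prediction.

```c
#define DPCM_BITS 4  /* bits sent per sample (assumed; the text allows 2 to 5) */
#define DPCM_STEP 2  /* fixed quantizer step size (assumed for illustration) */

/* Limit an error code to the range a signed DPCM_BITS-bit value can hold. */
static int clamp_code(int code)
{
    int max = (1 << (DPCM_BITS - 1)) - 1;  /*  7 for 4 bits */
    int min = -(1 << (DPCM_BITS - 1));     /* -8 for 4 bits */
    return code > max ? max : (code < min ? min : code);
}

/* Encode n samples as quantized differences from the predicted value. */
void dpcm_encode(const int *in, int *codes, int n)
{
    int predicted = 0;
    for (int i = 0; i < n; i++) {
        codes[i] = clamp_code((in[i] - predicted) / DPCM_STEP);
        predicted += codes[i] * DPCM_STEP;  /* what the decoder will rebuild */
    }
}

/* Rebuild the samples by adding each decoded difference to the prediction. */
void dpcm_decode(const int *codes, int *out, int n)
{
    int predicted = 0;
    for (int i = 0; i < n; i++) {
        predicted += codes[i] * DPCM_STEP;
        out[i] = predicted;
    }
}
```

For the input 0, 4, 8, 10, 8, 100, 102 the first five samples are reproduced exactly, while the sudden jump to 100 exceeds what a 4-bit error can express and is reconstructed as 22 and then 36. With a fixed step size like this one, the decoder needs several samples to catch up after a jump; adaptive schemes adjust the step size to recover faster.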
Adaptive Differential Pulse Code Modulation (ADPCM; ITU G.726 [9]) uses 4 bits per sample and a standard prediction algorithm. Since only 4 bits are used, the bit rate of the compressed signal is only 32 kbps (8000 samples per second x 4 bits), for a compression ratio of 2:1. This is an extremely simple compression algorithm, so it adds little delay to transmission. It also provides very good quality, as is to be expected at a compression ratio of only 2:1.
Transform encoding can be illustrated using Fourier series. Any signal may be represented by its Fourier coefficients. If the encoder simply sends these coefficients, it does not have to send the entire signal. The coefficients take up much less space than encoding the whole signal would, and the decoder can easily reconstruct the signal from them. In fact, only a few coefficients would probably be necessary to construct a voice signal over a short period of time, since speech is sinusoidal in nature.
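A small sketch makes this concrete. It uses an 8-point discrete Fourier transform with a hard-coded cosine table so that no math library is needed; the block length, the table and the names cos8, sin8, dft8 and idft8 are all assumptions made for illustration, and a real coder would use a fast transform on much longer blocks.

```c
#define N 8  /* block length (assumed for illustration) */

/* cos(2*pi*k/N) for k = 0..N-1 */
static const double COS8[N] = {
    1.0, 0.70710678118654752, 0.0, -0.70710678118654752,
    -1.0, -0.70710678118654752, 0.0, 0.70710678118654752
};

static double cos8(int k) { return COS8[((k % N) + N) % N]; }
static double sin8(int k) { return cos8(k - N / 4); }  /* sine is a shifted cosine */

/* Encoder side: compute the N complex Fourier coefficients of a block. */
void dft8(const double x[N], double re[N], double im[N])
{
    for (int k = 0; k < N; k++) {
        re[k] = 0.0;
        im[k] = 0.0;
        for (int n = 0; n < N; n++) {
            re[k] += x[n] * cos8(k * n);
            im[k] -= x[n] * sin8(k * n);
        }
    }
}

/* Decoder side: rebuild the real-valued block from its coefficients. */
void idft8(const double re[N], const double im[N], double x[N])
{
    for (int n = 0; n < N; n++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += re[k] * cos8(k * n) - im[k] * sin8(k * n);
        x[n] = sum / N;
    }
}
```

If the block holds one cycle of a cosine, only coefficients 1 and 7 are nonzero, and they are mirror images of each other. Sending that one value is enough for the decoder to rebuild all eight samples.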
Adaptive Transform Encoding (ATC) uses a fast transform (e.g. the FFT) to split blocks of the speech signal into frequency bands, each of which is analyzed. The number of bits used to code each transform coefficient is adapted to the frequency range being encoded. Speech has different qualities at different frequencies, and we can take advantage of this knowledge [9]. Using this encoding, phone quality speech can be produced at 16 kbps, for a compression ratio of 4:1. This is more complicated than ADPCM, but not so complicated that it produces significant delays.
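The text does not say how the bits are distributed, so the following is only one plausible sketch: a greedy allocation that hands out bits one at a time to whichever band currently has the largest estimated quantization noise, on the assumption that each extra bit cuts a band's noise power by a factor of four (about 6 dB). The names NUM_BANDS and allocate_bits are hypothetical.

```c
#define NUM_BANDS 4  /* number of frequency bands (assumed for illustration) */

/* Distribute total_bits among the bands in proportion to need:
 * bands with more energy end up with more bits. */
void allocate_bits(const double energy[NUM_BANDS], int total_bits,
                   int bits[NUM_BANDS])
{
    double noise[NUM_BANDS];

    /* With zero bits, a band's error is as large as its energy. */
    for (int b = 0; b < NUM_BANDS; b++) {
        noise[b] = energy[b];
        bits[b] = 0;
    }

    for (int i = 0; i < total_bits; i++) {
        int worst = 0;
        for (int b = 1; b < NUM_BANDS; b++)
            if (noise[b] > noise[worst])
                worst = b;
        bits[worst]++;        /* spend one bit on the noisiest band */
        noise[worst] /= 4.0;  /* one more bit: about 6 dB less noise */
    }
}
```

With band energies of 64, 16, 4 and 1 and a budget of 8 bits, this gives 4, 3, 1 and 0 bits respectively, so the weakest band is not coded at all.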
In vocal tract modeling, the signal is analyzed to determine the most likely values of the model parameters that would have produced it; these parameters are sent over the network. This is similar to the concept behind transform encoding, in that parameters that characterize the signal are sent instead of the signal itself. It is unlike waveform encoding in that it cannot be applied to an arbitrary signal; rather, it relies on knowing that the signal represents speech and can be parameterized as such.
This method can compress speech down to 8 kbps, and even 4 kbps (16:1). The encoding causes a delay, since the signal analysis and parameter calculations are complex; the decoding, however, adds little delay. The quality is not perfect; if the bit rate is too low, the speech sounds synthesized. The speech is recognizable, but the speaker is not. This is because the mouth positions are modeled instead of the actual signal.
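The decoding side of such a model can be sketched as follows. The text does not name a specific model, so this assumes the common linear predictive form, in which a periodic pulse train (standing in for the vocal cords) drives a recursive filter (standing in for the vocal tract). The names LPC_ORDER and synthesize, and the choice of a second-order filter, are illustrative assumptions.

```c
#define LPC_ORDER 2  /* number of filter coefficients (assumed; real coders use more) */

/* Rebuild n samples of speech from the transmitted parameters:
 * filter coefficients a[], a gain, and a pitch period in samples.
 * Each output sample is the excitation plus a weighted sum of
 * the previous LPC_ORDER output samples. */
void synthesize(const double a[LPC_ORDER], double gain, int pitch_period,
                double out[], int n)
{
    for (int i = 0; i < n; i++) {
        /* excitation: one pulse at the start of every pitch period */
        double s = (i % pitch_period == 0) ? gain : 0.0;

        /* vocal tract filter: feed back earlier output samples */
        for (int k = 1; k <= LPC_ORDER; k++)
            if (i - k >= 0)
                s += a[k - 1] * out[i - k];

        out[i] = s;
    }
}
```

Only the handful of parameters passed to synthesize needs to cross the network for each block of speech, which is why the bit rate can fall so far. It also suggests why the speaker becomes hard to recognize: everything not captured by those parameters is replaced by the model's idealized pulse train.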
There are several different versions of TrueSpeech; the most common are TrueSpeech 8.5 and TrueSpeech G.723 (which implements ITU standard G.723). TrueSpeech 8.5 is a speech compression algorithm with relatively low complexity, with a 15:1 compression ratio and a MOS of 3.7. It can be encoded on a 486DX2 66 MHz machine and decoded on a 386 machine [11]. TrueSpeech G.723 is a higher complexity algorithm which includes Voice Activity Detection (VAD). This allows compression of the pauses between words, increasing the compression ratio up to 35:1, with a MOS of 3.98, which is almost phone quality. It requires more computing power: at least a 60 MHz Pentium to encode and a 486DX 33 MHz to decode [11].
TrueSpeech is becoming the industry standard for videoconferencing and Internet telephony. Many companies already use this product, including AT&T, Intel, Microsoft, Prodigy, VDOnet, and VocalTec [11]. TrueSpeech 8.5 is built into Windows 95 and Windows NT.
Compression involves various other techniques, including simple waveform encoding and more complex algorithms, such as modeling the human vocal tract. The former technique compresses the actual signal, just as it would any other waveform signal, using methods such as sending the error of the predicted signal (e.g. ADPCM), or sending parameters about the signal (e.g. ATC). The latter technique uses the information we have about how humans produce speech to represent the signal as a set of parameters, which takes fewer bits to encode.
Due to the compression techniques used, perfect quality is never achieved. However, the quality may be extremely close to telephone quality, using the best algorithms on an uncongested network. Other drawbacks include a delay in receiving the signal, and restrictions on what software is used. Because of the great savings on long-distance phone calls, it is likely that many of these drawbacks can be overlooked. It remains to be seen whether or not local phone companies will therefore begin to charge more for this service, and how much an individual will be affected. As for now, Internet phone is an extremely cheap way to talk long-distance.