VoIP Quality & Technology


Interest in Voice over IP (VoIP) has increased steadily over the past few years. Enterprises, ISPs, ITSPs (Internet Telephony Service Providers), and carriers view VoIP as a viable way to implement packet voice. Reasons for implementing a VoIP network typically include toll-bypass, network consolidation, and service convergence. Toll-bypass allows long-distance calls to be placed without incurring the usual toll charges.

Through network consolidation, voice, video, and data can be carried over a single network infrastructure, thereby simplifying network management and reducing cost through the use of common equipment. With service convergence, enhanced functionality can be implemented through the coupling of multimedia services. This full integration permits new applications, such as unified messaging and web call centers.

However, designing a VoIP network requires careful planning to ensure that voice quality can be properly maintained. This document examines the factors that affect voice quality and the test and analysis strategy for a VoIP network.

VoIP Components


The image above shows the major components of a VoIP network. The gateway converts signals from the traditional telephony interfaces (POTS, T1/E1, ISDN, E&M trunks) to VoIP. An IP phone is a terminal that has native VoIP support and can connect directly to an IP network. In this paper, the term terminal refers to a gateway, an IP phone, or a PC with a VoIP interface. (This is consistent with the terminology used in H.323. However, this paper does not assume that the VoIP system is necessarily based on H.323.)

The server provides management and administrative functions to support the routing of calls across the network. In a system based on H.323, the server is known as a gatekeeper. In a system based on SIP/SDP, the server is a SIP server.

In a system based on MGCP or MEGACO, the server is a call agent. Finally, the IP network provides connectivity between all the terminals. The IP network can be a private network, an Intranet, or the Internet.

Once a call has been set up, speech is digitized and then transmitted across the network as IP frames. Voice samples are first encapsulated in RTP (Real-time Transport Protocol) and UDP (User Datagram Protocol) before being transmitted in an IP frame. Figure 2 shows an example of a VoIP frame on both a LAN and a WAN.


For example, if the CODEC used is G.711 and the packetization period is 20 ms, the payload will be 160 bytes. This will result in a total frame length of 206 bytes in WAN and 218 bytes in LAN.
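The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not a definitive model: the RTP, UDP, and IPv4 values are the minimum header sizes without options, while the link-layer overheads (6 bytes for PPP, 18 bytes for Ethernet) are assumptions chosen because they reproduce the totals quoted above. The function names are hypothetical.

```python
# Minimum header sizes in bytes (no RTP extensions, no IP options).
RTP_HEADER = 12
UDP_HEADER = 8
IPV4_HEADER = 20

# Assumed link-layer overheads; they reproduce the 206/218-byte totals.
PPP_OVERHEAD = 6        # assumption: PPP framing on the WAN link
ETHERNET_OVERHEAD = 18  # assumption: 14-byte header + 4-byte FCS on the LAN


def payload_bytes(codec_rate_bps, packetization_ms):
    """Bytes of encoded voice accumulated in one packetization period."""
    return codec_rate_bps * packetization_ms // 8000


def frame_bytes(payload, link_overhead):
    """Total frame length: payload plus RTP/UDP/IP and link-layer overhead."""
    return payload + RTP_HEADER + UDP_HEADER + IPV4_HEADER + link_overhead


g711_payload = payload_bytes(64_000, 20)             # G.711 at 64 kbps, 20 ms
wan_frame = frame_bytes(g711_payload, PPP_OVERHEAD)
lan_frame = frame_bytes(g711_payload, ETHERNET_OVERHEAD)
print(g711_payload, wan_frame, lan_frame)
```

Note that the 40 bytes of RTP/UDP/IP headers are fixed per packet, so a shorter packetization period or a lower-rate CODEC raises the proportion of overhead in each frame.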

Voice Quality

In designing a VoIP network, it is important to consider all the factors that will affect voice quality. A summary of the major factors follows.


Before analog voice can be transmitted over an IP network, it must first be digitized. The common coding standards are listed in the following table:


Coding Standard    Algorithm                                                    Data Rate
G.711              PCM (Pulse Code Modulation)                                  64 kbps
G.726              ADPCM (Adaptive Differential Pulse Code Modulation)          16, 24, 32, 40 kbps
G.728              LD-CELP (Low Delay Code Excited Linear Prediction)           16 kbps
G.729              CS-ACELP (Conjugate Structure Algebraic CELP)                8 kbps
G.723.1            MP-MLQ (Multi-Pulse Maximum Likelihood Quantization)         6.3 kbps
G.723.1            ACELP (Algebraic Code Excited Linear Prediction)             5.3 kbps


There is a general correlation between voice quality and data rate: the higher the data rate, the higher the voice quality. The relationship between the two will be examined in greater detail in “Mean Opinion Score (MOS)”.

Frame Loss

VoIP frames have to traverse an IP network, which is unreliable. Frames may be dropped as a result of network congestion or data corruption. Furthermore, for real-time traffic like voice, retransmission of lost frames at the transport layer is not practical because of the additional delays. Hence, voice terminals have to deal with missing voice samples, also referred to as frame erasures. The effect of frame loss on voice quality depends on how the terminals handle frame erasures.

In the simplest case, the terminal leaves a gap in the voice stream if a voice sample is missing. If too many frames are lost, the speech will sound choppy with syllables or words missing. One possible recovery strategy is to replay the previous voice sample. This works well if only a few samples are missing. To better cope with burst errors, interpolation is usually used. Based on the previous voice samples, the decoder will predict what the missing frames should be. This technique is known as Packet Loss Concealment (PLC).
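The replay strategy described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: frames are modeled as Python lists of samples, a lost frame is represented by None, and the function name and the caller-supplied silence frame are hypothetical. Interpolation-based concealment, as used by real decoders, is not shown.

```python
def conceal_by_repetition(frames, silence_frame):
    """Fill each lost frame (None) with the most recent good frame.

    If no good frame has been received yet, the gap is filled with the
    caller-supplied silence frame instead.
    """
    output = []
    last_good = None
    for frame in frames:
        if frame is not None:
            last_good = frame
            output.append(frame)
        elif last_good is not None:
            output.append(last_good)      # replay the previous frame
        else:
            output.append(silence_frame)  # nothing to replay yet
    return output


received = [[10, 12], None, [14, 15], None, None]
print(conceal_by_repetition(received, silence_frame=[0, 0]))
```

Repeating the same frame across a long burst of losses tends to sound artificial, which is one reason the text notes that replay works well only when a few samples are missing.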



Another important consideration in designing a VoIP network is the effect of delay. Impairments caused by delays include echo and talker overlap.

Sources of Delays

Before assessing the impact of delay, it is useful to first identify the sources of delays.

Algorithmic Delay.

This is the delay introduced by the CODEC and is inherent in the coding algorithm. The following table summarizes the algorithmic delay of common coding standards.

Coding Standards Algorithmic Delay (ms)
G.711 0.125*
G.726 1
G.728 3-5
G.729 15†
G.723.1 37.5††

* The algorithmic delay can be 3.75 ms if PLC is implemented.

† Includes a 5 ms lookahead buffer in addition to the 10 ms frame.

†† Includes a 7.5 ms lookahead buffer in addition to the 30 ms frame.

Packetization Delay.

In RTP, voice samples are often accumulated before being put into a frame for transmission, which reduces the amount of overhead. RFC 1890 specifies that the default packetization period should be 20 ms. For G.711, this means that 160 samples will be accumulated and then transmitted in a single frame. On the other hand, G.723.1 generates a voice frame every 30 ms, and each voice frame is usually transmitted as a single RTP packet.

Serialization Delay.

This is the time required to transmit the IP packet onto the link. For example, if G.711 is used and the packetization period is 20 ms (i.e., there are 160 bytes in the RTP payload), then the entire frame will be 206 bytes assuming PPP encapsulation. Transmitting the frame requires 1.1 ms on a T1 line, 3.2 ms at 512 kbps, and 25.8 ms at 64 kbps. Furthermore, the serialization delay is incurred each time the frame passes through a store-and-forward device such as a router or a switch. Thus, a frame that traverses 10 routers will incur this delay 10 times.
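The figures in this paragraph follow from dividing the frame size in bits by the link rate. The sketch below reproduces them; the function names are hypothetical, and the T1 rate of 1.544 Mbps is the nominal line rate, used here as an assumption about which rate the text intends.

```python
T1_BPS = 1_544_000  # assumption: nominal T1 line rate


def serialization_delay_ms(frame_bytes, link_bps):
    """Time in milliseconds to clock one frame onto a link."""
    return frame_bytes * 8 * 1000 / link_bps


def path_serialization_delay_ms(frame_bytes, link_rates_bps):
    """Total serialization delay over a series of store-and-forward hops."""
    return sum(serialization_delay_ms(frame_bytes, bps) for bps in link_rates_bps)


FRAME = 206  # G.711, 20 ms packetization, PPP encapsulation
for rate in (T1_BPS, 512_000, 64_000):
    print(rate, round(serialization_delay_ms(FRAME, rate), 2))
```

Because the delay is paid once per hop, `path_serialization_delay_ms(206, [64_000] * 10)` would describe the case of ten 64 kbps links in the worked example above.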

Propagation Delay.

This is the time required for the electrical or optical signal to travel along a transmission medium and is a function of the geographic distance. The propagation delay in a cable is approximately 4 to 6 microseconds per kilometer. For satellite transmission, the delay is 110 ms for a 14000-km altitude satellite and 260 ms for a 36000-km altitude satellite.
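As a rough illustration of the cable figure, the propagation delay can be estimated from the route length. The default of 5 microseconds per kilometer is an assumption (the midpoint of the 4 to 6 range quoted above), and the function name is hypothetical.

```python
def propagation_delay_ms(distance_km, us_per_km=5.0):
    """Estimated one-way propagation delay over cable, in milliseconds.

    us_per_km defaults to 5.0, an assumed midpoint of the 4-6 us/km range.
    """
    return distance_km * us_per_km / 1000.0


print(propagation_delay_ms(5000))        # 5000 km of cable at 5 us/km
print(propagation_delay_ms(5000, 4.0))   # lower bound of the range
print(propagation_delay_ms(5000, 6.0))   # upper bound of the range
```

The route length of a real circuit is usually longer than the straight-line distance between the endpoints, so an estimate based on geographic distance should be treated as a lower bound.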

Component Delay.

These are delays caused by the various components within the transmission system. For example, a frame passing through a router has to move from the input port to the output port through the backplane. There is some minimum delay due to the speed of the backplane and some variable delays due to queuing and router processing.

Echo Cancellation

The first impairment caused by delay is the effect of echo. Echo can arise in a voice network due to poor coupling between the earpiece and the mouthpiece in the handset. This is known as acoustic echo. It can also arise when part of the electrical energy is reflected back to the speaker by the hybrid circuit in the PSTN (Public Switched Telephone Network). This is known as hybrid echo.

When the one-way end-to-end delay is short, whatever echo is generated by the voice circuit will come back to the speaker very quickly and will not be noticeable. In fact, the guideline is that echo cancellation is not necessary if the one-way delay is less than 25 ms. In other words, if the echo comes back within 50 ms, it will not be noticeable. However, the one-way delay in a VoIP network will almost always exceed 25 ms. Therefore, echo cancellation is always required.

Talker Overlap

Even with perfect echo cancellation, carrying on a two-way conversation becomes difficult when the delay is too long because of talker overlap. This is the problem that occurs when one party cuts off the other party’s speech because of the long delay. G.114 provides the following guidelines regarding the one-way delay limit:

0 to 150 ms      Acceptable for most user applications
150 to 400 ms    Acceptable provided that Administrations are aware of the transmission time impact on the transmission quality
Above 400 ms     Unacceptable for general network planning purposes
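The guideline above can be expressed as a simple lookup, which is convenient when checking measured delays against a plan. This is a sketch only: the function name and the category labels are hypothetical, and because the table lists 150 ms in two ranges, assigning exactly 150 ms to the first range (and exactly 400 ms to the second) is an assumption.

```python
def g114_category(one_way_delay_ms):
    """Classify a one-way delay using the G.114 ranges quoted above."""
    if one_way_delay_ms < 0:
        raise ValueError("delay cannot be negative")
    if one_way_delay_ms <= 150:   # assumption: 150 ms falls in the first range
        return "acceptable for most user applications"
    if one_way_delay_ms <= 400:   # assumption: 400 ms falls in the second range
        return "acceptable if the impact on quality is understood"
    return "unacceptable for general network planning"


for delay in (96.5, 150, 260, 450):
    print(delay, "->", g114_category(delay))
```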

Delay Variation (Jitter)

When frames are transmitted through an IP network, the amount of delay experienced by each frame may differ. This is because the amount of queuing delay and processing time can vary depending on the overall load in the network. Even though the source gateway generates voice frames at regular intervals (say, every 20 ms), the destination gateway will typically not receive voice frames at regular intervals because of jitter.

In general, jitter will result in clumping and gaps in the incoming data stream. The general strategy in dealing with jitter is to hold the incoming frames in a playout buffer long enough to allow the slowest frames to arrive in time to be played in the correct sequence. The larger the amount of jitter, the longer some of the frames will be held in the buffer, which introduces additional delay.

Minimize the delay

To minimize the delay due to buffering, most implementations use an adaptive jitter buffer. In other words, if the amount of jitter in the network is small, the buffer size will be small. If the jitter increases due to increased network load, the buffer size will increase automatically to compensate for it.

Therefore, jitter in the network will impair voice quality to the extent that it increases the end-to-end delay due to the playout buffer. Sometimes when the jitter is too large, the playout buffer may choose to allow some frame loss to keep the additional delay from getting too long.
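One way an adaptive playout buffer might size itself is sketched below. It is not taken from any particular product. The running jitter estimate uses the interarrival jitter formula from the RTP specification (RFC 3550), J = J + (|D| - J)/16, where D is the change in transit time between consecutive packets. The class name, the multiplier of 4, and the 20 ms and 100 ms bounds are all assumptions made for illustration; the upper bound models the choice to accept some frame loss rather than let the added delay grow without limit.

```python
class AdaptiveJitterBuffer:
    """Tracks network jitter and derives a target playout delay from it."""

    def __init__(self, min_ms=20.0, max_ms=100.0, multiplier=4.0):
        self.min_ms = min_ms          # assumed floor on the buffer depth
        self.max_ms = max_ms          # assumed cap; later frames are dropped
        self.multiplier = multiplier  # assumed safety factor on the estimate
        self.jitter_ms = 0.0
        self._last_transit_ms = None

    def on_packet(self, send_ms, recv_ms):
        """Update the jitter estimate and return the new target delay."""
        transit = recv_ms - send_ms
        if self._last_transit_ms is not None:
            d = abs(transit - self._last_transit_ms)
            # RFC 3550 interarrival jitter estimator (gain of 1/16).
            self.jitter_ms += (d - self.jitter_ms) / 16.0
        self._last_transit_ms = transit
        return self.target_delay_ms()

    def target_delay_ms(self):
        """Buffer depth: scaled jitter, clamped to [min_ms, max_ms]."""
        return max(self.min_ms, min(self.max_ms, self.multiplier * self.jitter_ms))


buf = AdaptiveJitterBuffer()
# Frames sent every 20 ms; transit time alternates between 10 ms and 18 ms.
for i in range(50):
    send = i * 20.0
    recv = send + (10.0 if i % 2 == 0 else 18.0)
    target = buf.on_packet(send, recv)
print(round(buf.jitter_ms, 2), round(target, 2))
```

With steady transit times the estimate stays at zero and the buffer sits at its minimum depth; as transit times start to vary, the target delay grows until it reaches the cap.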


Delay Budget

Example 1


Consider an example VoIP network and its sources of delay. The following delay budget can be constructed, assuming an end-to-end delay target of 150 ms.


Delay Source                            Delay (ms)
G.723.1 (algorithmic delay)             37.5
G.723.1 (processing delay)              30.0
Serialization delay (two T1s)           2.0
Propagation delay (5000 km of fiber)    25.0
Other component delays                  2.0
Total fixed delay                       96.5

Variable delay limit = 150 – 96.5 = 53.5 ms

In this example, the fixed (minimum) delay is calculated to be 96.5 ms. The presence of jitter will add to the end-to-end delay. How much jitter can the system tolerate? If the end-to-end delay target is 150 ms, then the maximum jitter that can be tolerated is 53.5 ms. The assumption is that the jitter will be removed by a playout buffer which can delay frames by up to 53.5 ms to remove the jitter.
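The budget above is a simple sum and subtraction, sketched here so that the components can be varied. The dictionary keys and function names are hypothetical; the values are the ones listed in the example.

```python
DELAY_TARGET_MS = 150.0

# Fixed delay components from Example 1, in milliseconds.
fixed_delays_ms = {
    "G.723.1 algorithmic delay": 37.5,
    "G.723.1 processing delay": 30.0,
    "serialization (two T1s)": 2.0,
    "propagation (5000 km of fiber)": 25.0,
    "other component delays": 2.0,
}


def total_fixed_delay_ms(components):
    """Sum of all fixed (minimum) delay components."""
    return sum(components.values())


def variable_delay_limit_ms(components, target_ms=DELAY_TARGET_MS):
    """Delay left over for jitter once the fixed delays are subtracted."""
    return target_ms - total_fixed_delay_ms(components)


print(total_fixed_delay_ms(fixed_delays_ms))     # total fixed delay
print(variable_delay_limit_ms(fixed_delays_ms))  # room left for the playout buffer
```

A negative result from `variable_delay_limit_ms` would mean the fixed delays alone already exceed the target, before any jitter is considered.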

Example 2

However, the previous example assumes that you know the exact topology of the network and are thus able to calculate all the delay components. In the next example, we assume that the voice gateways are connected via a VPN service offered by an ISP.

Assume an end-to-end delay target of 150 ms:

Delay Source                    Delay (ms)
G.723.1 (algorithmic delay)     37.5
G.723.1 (processing delay)      30.0
Total fixed delay               67.5

Internet delay limit = 150 – 67.5 = 82.5 ms

In this example, we can only identify the delays due to the two gateways. To stay within the delay target of 150 ms, the delay introduced by the ISP must not exceed 82.5 ms. Note that this limit covers both the fixed and variable delays. For instance, if the minimum delay along the VPN path is 50 ms, the maximum jitter that the system can tolerate is 32.5 ms, which will be compensated by the playout buffer. Today, many ISPs offer VPN service with a Service Level Agreement (SLA). An SLA will typically guarantee a certain round-trip delay between sites.
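The same reasoning can be used to check an ISP offer against the budget, as in the sketch below. The function names are hypothetical. Since an SLA usually quotes a round-trip figure while the budget is one-way, the code halves the round-trip value; this assumes the forward and return paths are symmetric, which may not hold on a real network.

```python
DELAY_TARGET_MS = 150.0
GATEWAY_FIXED_MS = 37.5 + 30.0  # G.723.1 algorithmic + processing delay


def isp_delay_limit_ms(target_ms=DELAY_TARGET_MS, gateway_ms=GATEWAY_FIXED_MS):
    """Largest one-way delay the ISP path may add (fixed plus variable)."""
    return target_ms - gateway_ms


def jitter_allowance_ms(min_path_delay_ms):
    """Jitter the playout buffer can absorb given the path's minimum delay."""
    return isp_delay_limit_ms() - min_path_delay_ms


def sla_fits_budget(sla_round_trip_ms):
    """Check an SLA round-trip guarantee, assuming symmetric paths."""
    one_way_ms = sla_round_trip_ms / 2.0  # assumption: symmetric routing
    return one_way_ms <= isp_delay_limit_ms()


print(isp_delay_limit_ms())       # limit for the ISP path
print(jitter_allowance_ms(50.0))  # the 50 ms case from the text
print(sla_fits_budget(120.0), sla_fits_budget(200.0))
```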


Delay budget example 2