Packet loss

Packet loss occurs when one or more packets of data travelling across a computer network fail to reach their destination. Packet loss is one of the three main error types encountered in digital communications. Packet loss can be caused by signal degradation over the network medium due to multi-path fading, packet drop because of channel congestion, corrupted packets rejected in-transit, faulty networking hardware, faulty network drivers or normal routing routines.

RTP-Sight detects packet loss and stores the loss distribution in 10 loss intervals, so it is able to find longer runs of consecutive losses. This is important because, of two calls with two percent packet loss, one with random losses spread throughout will sound much better than one with a string of consecutive losses.
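The difference between random and consecutive losses can be illustrated with a minimal sketch. It assumes that a loss interval corresponds to the length of a run of consecutive lost packets, capped at 10, which the text above does not spell out. The function name and the cap are hypothetical and are not taken from RTP-Sight.

```python
from collections import Counter

MAX_INTERVAL = 10  # assumption: runs of 10 or more lost packets share the last interval


def loss_run_histogram(seqs, max_interval=MAX_INTERVAL):
    """Count runs of consecutive lost packets.

    seqs: RTP sequence numbers of received packets, assumed to be in order
    and without duplicates. Returns {run_length: number_of_runs}.
    """
    hist = Counter()
    for prev, cur in zip(seqs, seqs[1:]):
        # RTP sequence numbers are 16 bits wide, so handle wraparound.
        gap = (cur - prev) % 65536 - 1
        if gap > 0:
            hist[min(gap, max_interval)] += 1
    return dict(hist)


# Both streams lose 3 packets out of 10, but the distribution differs.
random_losses = [1, 2, 4, 5, 7, 8, 10]   # three isolated losses
burst_losses = [1, 2, 3, 4, 8, 9, 10]    # one run of three losses
print(loss_run_histogram(random_losses))
print(loss_run_histogram(burst_losses))
```

The first stream yields three runs of length 1 and the second a single run of length 3, which is the kind of difference a plain loss percentage hides.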

Packet delay variation (PDV)

In computer networking, packet delay variation (PDV) is the difference in end-to-end one-way delay between selected packets in a flow, with any lost packets being ignored. The effect is sometimes referred to as jitter, although the term has a different meaning in electronics, so its use may cause confusion. In this document jitter will always mean PDV.

The delay is from the start of the packet being transmitted at the source to the end of the packet being received at the destination. A component of the delay which does not vary from packet to packet can be ignored, hence if the packet sizes are the same and packets always take the same time to be processed at the destination then the packet arrival time at the destination could be used instead of the time the end of the packet is received. For interactive real-time applications, e.g., VoIP, PDV can be a serious issue and hence VoIP transmissions may need Quality of Service-enabled networks to provide a high-quality channel.

The effects of PDV in multimedia streams can be removed by a properly sized jitter buffer at the receiver, which may only cause a detectable delay before the start of media playback.

RTP-Sight checks for each RTP packet whether the delay differs from the optimal value (in most cases the delay between two RTP packets is 20ms). If the delay is higher than 50ms, it is counted in one of the PDV intervals, which are stored for each RTP direction in the cdr table. The PDV intervals are: 50 – 70ms, 70 – 90ms, 90 – 120ms, 120 – 150ms, 150 – 200ms, > 300ms.

The main advantage over the traditional single jitter metric is that you can search calls for specific delay characteristics.
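Counting delays into PDV intervals could look roughly like the sketch below. It rests on several assumptions that the text does not confirm: that "delay" means how much later than the expected 20ms spacing a packet arrives, that interval boundaries belong to the lower interval, and that delays between 200 and 300ms get their own interval (the list above goes from 150 – 200ms straight to > 300ms). All names are hypothetical.

```python
import bisect

# Lower edges of the PDV intervals in milliseconds.
# Assumption: the 200 edge is added so that no delay above 50 ms is dropped.
EDGES_MS = [50, 70, 90, 120, 150, 200, 300]
EXPECTED_MS = 20.0  # typical spacing between two RTP packets


def pdv_histogram(arrival_ms, expected_ms=EXPECTED_MS, edges=EDGES_MS):
    """Return {lower_edge_ms: count} for packets arriving later than expected."""
    counts = [0] * len(edges)
    for prev, cur in zip(arrival_ms, arrival_ms[1:]):
        delay = (cur - prev) - expected_ms  # only late packets can exceed 50 ms
        # bisect_left keeps a delay of exactly 50 ms out of the first interval.
        idx = bisect.bisect_left(edges, delay) - 1
        if idx >= 0:
            counts[idx] += 1
    return dict(zip(edges, counts))


# Arrival times in ms: one packet is 60 ms late, one is 340 ms late.
arrivals = [0, 20, 40, 120, 140, 500]
print(pdv_histogram(arrivals))
```

With counters like these stored per direction, a query such as "calls with more than N packets delayed by over 150ms" becomes a simple comparison.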

Jitter buffer

Jitter buffers or de-jitter buffers are used to counter PDV (jitter) introduced by queuing in packet-switched networks, so that a continuous stream of audio (or video) transmitted over the network can be played out without interruption. The maximum jitter that can be countered by a de-jitter buffer is equal to the buffering delay introduced before starting the play-out of the media stream. In the context of packet-switched networks, the term packet delay variation is often preferred over jitter. Some systems use sophisticated delay-optimal de-jitter buffers that are capable of adapting the buffering delay to changing network jitter characteristics. These are known as adaptive de-jitter buffers, and the adaptation logic is based on the jitter estimates calculated from the arrival characteristics of the media packets. Adaptive de-jittering involves introducing discontinuities in the media play-out, which may be irritating to the listener or viewer. Adaptive de-jittering is usually used for audio play-outs that feature VAD/DTX encoded audio, which allows the lengths of the silence periods to be adjusted, thus minimizing the perceptible impact of the adaptation.

MOS score

Mean opinion score (MOS) is a test that has been used for decades in telephony networks to obtain the human user's view of the quality of the network. Historically, and as implied by the word Opinion in its name, MOS was a subjective measurement where listeners would sit in a "quiet room" and score call quality as they perceived it; per ITU-T recommendation P.800, "The talker should be seated in a quiet room with volume between 30 and 120 m³ and a reverberation time less than 500 ms (preferably in the range 200-300 ms). The room noise level must be below 30 dBA with no dominant peaks in the spectrum." Measuring Voice over IP (VoIP) is more objective: the score is instead calculated from the performance of the IP network over which the call is carried. The calculation is defined in the ITU-T PESQ P.862 standard. Like most standards, the implementation is somewhat open to interpretation by the equipment or software manufacturer. Moreover, due to the technological progress of phone manufacturers, a calculated MOS of 3.9 in a VoIP network may actually sound better than a formerly subjective score of > 4.0.

In multimedia (audio, voice telephony, or video) especially when codecs are used to compress the bandwidth requirement (for example, of a digitized voice connection from the standard 64 kilobit/second PCM modulation), the MOS provides a numerical indication of the perceived quality of received media from the users' perspective after compression and/or transmission. The MOS is expressed as a single number in the range 1 to 5, where 1 is lowest perceived audio quality, and 5 is the highest.

MOS tests for voice are specified by ITU-T recommendation P.800.

The MOS is generated by averaging the results of a set of standard, subjective tests where a number of listeners rate the audio quality of test sentences read aloud by both male and female speakers over the communications medium being tested. A listener is required to give each sentence a rating using the following rating scheme:

MOS  Quality    Impairment
5    Excellent  Imperceptible
4    Good       Perceptible but not annoying
3    Fair       Slightly annoying
2    Poor       Annoying
1    Bad        Very annoying

The MOS is the arithmetic mean of all the individual scores, and can range from 1 (worst) to 5 (best).

Compressor/decompressor (codec) systems and digital signal processing (DSP) are commonly used in voice communications, and can be configured to conserve bandwidth, but there is a trade-off between voice quality and bandwidth conservation. The best codecs provide the most bandwidth conservation while producing the least degradation of voice quality. Bandwidth can be measured quantitatively, but voice quality requires human interpretation, although estimates of voice quality can be made by automatic test systems.

As an example, the following are mean opinion scores for one implementation of different codecs:

Codec          Data rate [kbit/s]  MOS
G.711 (ISDN)   64                  4.1
G.723.1 r63    6.3                 3.9
GSM EFR        12.2                3.8
G.726 ADPCM    32                  3.85
GSM FR         12.2                3.5

MOS prediction

RTP-Sight transforms PDV and packet loss into a MOS score according to the ITU-T E-model (please note that jitter is PDV). The voipmonitor MOS does not represent the audio signal but network parameters. Because the relation between PDV and MOS score depends on the jitterbuffer implementation, voipmonitor implements three jitterbuffer simulators and thus three MOS scores:

  • MOS F1 – fixed jitterbuffer simulator up to 50 ms buffer. Any PDV higher than 50ms will produce packet loss even though there is no packet loss in the stream.
  • MOS F2 – fixed jitterbuffer simulator up to 200 ms buffer. Any PDV higher than 200ms will produce packet loss.
  • MOS adapt – adaptive jitterbuffer simulator up to 500ms buffer. Any PDV higher than current buffer length which is changing adaptively will produce packet loss.
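How a fixed jitterbuffer turns PDV into packet loss can be sketched as follows. This is a deliberately simplified model and not the simulator used by the sniffer: it assumes each packet is described only by its delay variation in milliseconds and that a packet is discarded when that value exceeds the buffer length. The adaptive variant is not modelled, and all names are hypothetical.

```python
def effective_loss(pdv_ms, buffer_ms, lost_on_wire=0):
    """Fraction of packets that miss play-out with a fixed jitterbuffer.

    pdv_ms: delay variation of each received packet in milliseconds.
    buffer_ms: fixed buffer length; later packets count as lost.
    lost_on_wire: packets that never arrived at all.
    """
    total = len(pdv_ms) + lost_on_wire
    if total == 0:
        return 0.0
    too_late = sum(1 for d in pdv_ms if d > buffer_ms)
    return (too_late + lost_on_wire) / total


# Ten received packets, none lost on the wire.
pdv = [0, 5, 60, 10, 250, 0, 30, 0, 0, 0]
print("50 ms buffer: ", effective_loss(pdv, 50))   # 60 and 250 arrive too late
print("200 ms buffer:", effective_loss(pdv, 200))  # only 250 arrives too late
```

The same stream therefore shows 20% effective loss with the 50ms buffer and 10% with the 200ms buffer, which is why the three MOS values for one call can differ.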

If a call is long enough and there are only a few packet loss / PDV problems, the MOS score can average out to a good value even though the user remembers that the call had problems. This can happen on calls longer than 15 minutes. In the future we plan to calculate MOS scores over 20-second intervals and remember the worst MOS score.

RTP-Sight uses equations based on packet loss simulations scored with PESQ MOS. We simulated random packet loss between an RTP sender and receiver on a scale from 0 to 20% using a Markov model distribution. The degraded audio signal from every packet loss simulation is compared with the original sound by PESQ, which produces a MOS score. The resulting data is shown in the following chart, which contains three surfaces. The top surface is the MOS score (on the Z axis) for the G.711 codec with a PLC implementation (Asterisk's internal PLC). The middle surface is for G.729 with native PLC, and the bottom surface is for G.711 without PLC.

RTP-Sight uses the G.711 PLC surface variant for all calls regardless of codec, and this is the reason why the MOS score starts at 4.5 for every good call regardless of codec. This is intentional, because the parametric MOS score in our application is designed for searching for calls with bad packet loss / PDV combinations regardless of codec, or for watching sudden changes in MOS scores across whole SIP trunks. If G.729 and G.711 MOS scores were mixed, it would be difficult to tell whether a MOS score of 3.9 (which is the highest value for G.729) comes from G.729 calls or from packet loss / PDV problems in G.711 calls.

How exactly is the MOS score calculated? Based on our simulation data, we created an approximation function which transforms Ppl (packet loss probability) and BurstR (burst ratio) into a MOS score. The function is hardcoded directly in the sniffer.
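The hardcoded function itself is not reproduced here. For orientation only, the sketch below shows the generic mapping from Ppl and BurstR to MOS defined by the ITU-T G.107 E-model, which is not the same as the sniffer's fitted function. The default R value of 93.2 and the G.711-with-PLC constants (Ie = 0, Bpl = 25.1, as commonly quoted from ITU-T G.113) are assumptions of this example, and the function names are hypothetical.

```python
def effective_impairment(ppl, burst_r=1.0, ie=0.0, bpl=25.1):
    """G.107 effective equipment impairment factor Ie-eff.

    ppl: packet loss probability in percent.
    burst_r: burst ratio (1.0 means random loss, above 1.0 means bursty loss).
    ie, bpl: codec constants (assumed values for G.711 with PLC).
    """
    return ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)


def r_to_mos(r):
    """G.107 conversion from transmission rating factor R to MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def mos_from_loss(ppl, burst_r=1.0, r0=93.2):
    """MOS estimate when packet loss is the only impairment (assumption)."""
    return r_to_mos(r0 - effective_impairment(ppl, burst_r))


for ppl, burst_r in [(0.0, 1.0), (5.0, 1.0), (5.0, 2.0)]:
    print(f"Ppl={ppl}% BurstR={burst_r}: MOS={mos_from_loss(ppl, burst_r):.2f}")
```

With these defaults a loss-free call comes out at about 4.41 rather than the 4.5 mentioned above, which is one visible difference from the sniffer's own function. The sketch does show the same qualitative behaviour: for equal Ppl, a higher BurstR gives a lower MOS.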

Post Dial Delay (PDD)

Post Dial Delay (PDD) is experienced by the customer originating the call from the time the final digit is dialled to the point at which they hear ring tone or other in-band information. Where the originating network is required to play an announcement before completing the call then this definition of PDD excludes the duration of such announcements.


RTCP

The RTP Control Protocol (RTCP) is a sister protocol of the Real-time Transport Protocol (RTP). Its basic functionality and packet structure are defined in the RTP specification RFC 3550, superseding its original standardization in 1996 (RFC 1889).

RTCP provides out-of-band statistics and control information for an RTP flow. It partners RTP in the delivery and packaging of multimedia data, but does not transport any media streams itself. Typically RTP will be sent on an even-numbered UDP port, with RTCP messages being sent over the next higher odd-numbered port. The primary function of RTCP is to provide feedback on the quality of service (QoS) in media distribution by periodically sending statistics information to participants in a streaming multimedia session.

RTCP gathers statistics for a media connection and information such as transmitted octet and packet counts, lost packet counts, jitter, and round-trip delay time. An application may use this information to control quality of service parameters, perhaps by limiting flow, or using a different codec.

VoIPmonitor (version >= 5) is able to parse and store RTCP statistics. For each call, RTCP jitter, fraction loss and total loss are saved for each direction.
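To show where jitter, fraction loss and total loss come from, here is a minimal sketch that decodes the report blocks of an RTCP Receiver Report as laid out in RFC 3550 section 6.4.2. It is an illustration of the packet format, not the parser used by the sniffer. The function name, the dictionary keys and the example packet are made up for this example.

```python
import struct


def parse_receiver_report(packet: bytes):
    """Decode the report blocks of one RTCP Receiver Report (packet type 201)."""
    first, packet_type, _length_words = struct.unpack_from("!BBH", packet, 0)
    version, report_count = first >> 6, first & 0x1F
    if version != 2 or packet_type != 201:
        raise ValueError("not an RTCP receiver report")
    blocks = []
    offset = 8  # 4-byte header plus 4-byte SSRC of the reporter
    for _ in range(report_count):
        ssrc, lost_word, ext_seq, jitter, _lsr, _dlsr = struct.unpack_from(
            "!IIIIII", packet, offset)
        total_lost = lost_word & 0xFFFFFF
        if total_lost & 0x800000:      # cumulative loss is a signed 24-bit value
            total_lost -= 0x1000000
        blocks.append({
            "ssrc": ssrc,
            "fraction_lost": (lost_word >> 24) / 256.0,  # 8-bit fixed point
            "total_lost": total_lost,
            "highest_seq": ext_seq,
            "jitter": jitter,          # in RTP timestamp units, not milliseconds
        })
        offset += 24
    return blocks


# Hand-built example: one report block, 64/256 fraction lost, 10 packets lost.
example = struct.pack("!BBHI", 0x81, 201, 7, 0x11111111) + struct.pack(
    "!IIIIII", 0x22222222, (64 << 24) | 10, 1000, 160, 0, 0)
print(parse_receiver_report(example))
```

The jitter field is expressed in RTP timestamp units, so for an 8000 Hz audio clock the value 160 in the example corresponds to 20ms.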


Average call duration (ACD)

The average call duration (ACD) is a measurement that reflects the average length of telephone calls.


Answer Seizure Ratio

ASR is a measure of network quality defined in ITU SG2 Recommendation E.411. It is calculated by taking the number of successfully answered calls and dividing it by the total number of calls attempted (seizures). Since busy signals and other rejections by the called number count as call failures, the calculated ASR value can vary depending on user behavior.
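The calculation itself is a simple ratio, usually quoted as a percentage. A minimal sketch follows; the disposition strings in the sample data are made up for illustration.

```python
def answer_seizure_ratio(answered_calls, seizures):
    """ASR in percent: answered calls divided by call attempts (seizures)."""
    if seizures <= 0:
        raise ValueError("ASR is undefined without any call attempts")
    return 100.0 * answered_calls / seizures


# Hypothetical outcomes of five call attempts.
dispositions = ["answered", "busy", "answered", "no answer", "rejected"]
answered = sum(1 for d in dispositions if d == "answered")
print(answer_seizure_ratio(answered, len(dispositions)))
```

Two answered calls out of five attempts give an ASR of 40%, even though the busy and unanswered calls say nothing about the quality of the network itself.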


Packet loss concealment (PLC)

Packet loss concealment (PLC) is a technique to mask the effects of packet loss in VoIP communications. Because the voice signal is sent as packets on a VoIP network, the packets may travel different routes to reach their destination. At the receiver, a packet might arrive very late, arrive corrupted, or simply not arrive at all. One case in which a packet never arrives is when it is rejected by a server which has a full buffer and cannot accept any more data. In a VoIP connection, error-control techniques such as ARQ are not feasible, so the receiver should be able to cope with packet loss.

PLC techniques

  • Zero insertion: the lost speech frames are replaced with zeros.
  • Waveform substitution: the missing gap is reconstructed by repeating a portion of already received speech. The simplest form of this would be to repeat the last received frame. Other techniques account for fundamental frequency, gap duration etc. Waveform substitution methods are popular because of their simplicity to understand and implement. An example of such an algorithm is proposed in ITU recommendation G.711 Appendix I.
  • Model-based methods: an increasing number of algorithms that take advantage of speech models of interpolating and extrapolating speech gaps are being introduced and developed.
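The first two techniques can be sketched in a few lines. This example covers zero insertion and the simplest form of waveform substitution (repeating the last frame). It is not the G.711 Appendix I algorithm, and the frame representation (lists of samples, with None marking a lost frame) is an assumption made for illustration.

```python
def conceal(frames, method="repeat", frame_len=160):
    """Fill lost frames (None) in a list of decoded audio frames.

    method "zero":   replace a lost frame with silence.
    method "repeat": replace a lost frame with the last played-out frame.
    frame_len: samples per frame (160 is 20 ms at 8 kHz).
    """
    if method not in ("zero", "repeat"):
        raise ValueError("unknown method: " + method)
    silence = [0] * frame_len
    last = silence  # nothing to repeat before the first received frame
    out = []
    for frame in frames:
        if frame is None:
            frame = list(silence) if method == "zero" else list(last)
        out.append(frame)
        last = frame
    return out


# Tiny two-sample frames to keep the output readable; the third frame is lost.
stream = [[1, 2], [3, 4], None, [5, 6]]
print(conceal(stream, "zero", frame_len=2))
print(conceal(stream, "repeat", frame_len=2))
```

Repeating the previous frame avoids the abrupt drop to silence, but over a long run of losses it repeats the same frame again and again, which is one reason more elaborate methods account for pitch and gap duration.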


Percentiles

Some values are expressed as %95 or %99, which denote the 95th and 99th percentile respectively. For example, if the MOS score %95 is 3, it means that at least 5% of all calls have a MOS score of 3 or worse. It is better to watch %95 or %99 than the average or min/max values, because the average, min and max do not show clearly that 5% of all calls are bad.

Here is an example of how the percentile is calculated for the MOS score. Consider 100 calls where only the last 5 calls have degraded MOS scores: 3.1, 2.5, 3.2, 1.0, 2.9.

  • order all calls from the best MOS score to the lowest (4.5, 4.5, ..., 3.2, 3.1, 2.9, 2.5, 1.0)
  • remove first 95% of all calls (3.2, 3.1, 2.9, 2.5, 1.0)
  • take the first number from the left of the remaining 5% which is 3.2

In this example the 95th percentile of the MOS score is 3.2. The average MOS score is 4.4, the minimum is 1.0 and the maximum is 4.5. As you can see, the average, min and max are not very useful, but the 95th percentile tells us that we have a problem with 5% of all calls.
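The three steps above translate directly into code. This is a minimal sketch of the procedure as described in this section, assuming the other 95 calls all have a MOS score of 4.5. The function name is hypothetical.

```python
def mos_percentile(scores, pct=95):
    """Order scores best to worst, drop the first pct percent, return the next one."""
    if not scores:
        raise ValueError("no scores given")
    ordered = sorted(scores, reverse=True)
    skip = len(ordered) * pct // 100
    # Clamp so that pct=100 returns the worst score instead of failing.
    return ordered[min(skip, len(ordered) - 1)]


# 95 good calls followed by the 5 degraded calls from the example.
scores = [4.5] * 95 + [3.1, 2.5, 3.2, 1.0, 2.9]
print("95th percentile:", mos_percentile(scores, 95))
print("average:", round(sum(scores) / len(scores), 1))
print("min / max:", min(scores), "/", max(scores))
```

Run on the example data, this reproduces the values quoted above: 3.2 for the 95th percentile, 4.4 for the average, and 1.0 / 4.5 for min / max.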