Internet DRAFT - draft-espelien-avt-common
Audio/Video Transport Working Group                         M. Espelien
Internet Draft: RTP Payload Common Format for                R. Gellens
                Vocoder Speech                            Qualcomm Inc.
Document: draft-espelien-avt-common-01.txt                 October 2001
RTP Payload Common Format for Vocoder Speech
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. Internet-Drafts are
working documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also
distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-
Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society 2001. All Rights Reserved.
Espelien & Gellens Expires April 2002 [Page 1]Internet Draft Common Payload Format October 2001
Table of Contents
1 Abstract
2 Conventions Used in this Document
3 Changes from Previous Revision
4 Introduction
5 Background and Motivation for Common Format
6 Common Characteristics
6.1 PureVoice Characteristics
6.2 EVRC Characteristics
6.3 SMV Characteristics
7 Common RTP Packet Format
7.1 Normal Format
7.2 TOC Entries
7.3 Bundling Codec Data Frames
7.3.1 Additional Bundling Restrictions on the Sender
7.4 Interleaving Codec Data Frames
7.4.1 Additional Interleaving Restrictions on the Sender
7.5 Finding Interleave Group Boundaries
7.6 Reconstructing Interleaved Speech
7.7 Receiving Invalid Values
7.8 Optimized Single Frame Format
7.9 Detecting Which Format
7.10 Codec Data Frame Format
7.10.1 PureVoice Codec Data Frame Format
7.10.2 EVRC or SMV Codec Data Frame Format
7.11 Adding New Codecs
8 Tardy Packets
9 Lost Packets
10 Implementation Issues
10.1 Interleaving Length
11 Security Considerations
12 Real Time and Storage Mode
12.1 RTP Mode
12.2 Storage Mode
13 IANA Considerations
13.1 Registration of MIME Media Type
13.1.1 audio/EVRC Media Type Registration
13.1.2 audio/SMV Media Type Registration
13.1.3 audio/qcelp-common Media Type Registration
13.2 Optional Media Type Parameters
14 Mapping to SDP Parameters
15 Acknowledgements
16 References
17 Authors' Addresses
1 Abstract
This document describes a common [RTP] payload format for speech
encoded using wireless vocoders which share certain common
characteristics (see section 6).
This is expected to be especially useful in wireless systems. For
example, CDMA networks use one of three vocoders: [PureVoice]
(Qcelp), [EVRC] (Enhanced Variable Rate Codec) and in the future
[SMV] (Selectable Mode Vocoder). All of these vocoders share a
number of common characteristics (see section 6) and can be
transmitted using the RTP payload format specified in this document.
New vocoders with such characteristics can easily be added to this
common format by following the steps in section 7.11.
An interleaved format is included to reduce the effect of packet
loss on speech quality, as well as a bundled format, and a format
optimized for header compression.
2 Conventions Used in this Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [KEYWORDS].
3 Changes from Previous Revision
This is the second version. Changes include:
+ make frame size more generic (previous version assumed 20 ms
frame size)
+ clarify null and erasure frames
+ correct grammatical errors.
4 Introduction
This document describes a generalized format for use as an [RTP]
payload type. Three CDMA vocoders are initially specified in this
common format; more can be added by following the procedures in
section 7.11. The [PureVoice] Qcelp vocoder and the [EVRC] vocoder
are already widely deployed in CDMA wireless networks. [SMV] is the
codec of choice for next generation CDMA wireless networks and is
likely to be widely deployed as next generation wireless networks
are rolled out.
Multiple codec data frames MAY be bundled together to reduce the
per-frame transmission overhead.
Codec data frames can be interleaved to reduce quality degradation
due to lost packets. The sender can choose various interleave
settings based on the importance of low end-to-end delay versus
greater tolerance for lost packets.
A format optimized for header compression is provided (see section
7.8).
5 Background and Motivation for Common Format
The Electronic Industries Association (EIA) and Telecommunications
Industry Association (TIA) have published three standards that
define the speech compression algorithms for CDMA applications:
PureVoice, EVRC and SMV.
The [SMV] codec is the preferred speech codec standard for CDMA2000.
The SMV will be deployed in third generation handsets, in addition
to PureVoice and EVRC codecs.
There are currently handsets that support two of these codecs, and
in the future handsets might support all three codecs. The
PureVoice and EVRC codecs are currently deployed in millions of
first and second generation CDMA handsets.
The format of the three codec (PureVoice, EVRC and SMV) frames is
very similar.
The similarities suggest that a common specification for
encapsulating these three wireless vocoders as well as potential
future wireless vocoders is possible and worth pursuing.
The environment (memory, processor speed, etc.) of wireless handsets
is constrained. A common RTP payload format for multiple vocoders
allows the handset to support these vocoders with a single, smaller
RTP implementation than would be needed for separate formats,
reducing code size and complexity, and therefore shortening time to
market, lowering costs, and improving quality. It also permits
saved handset resources to be spent on user features.
Since an RTP format for [EVRC] and [SMV] has not yet been approved,
a direct case can be made for a common format supporting at least
these two (plus future) codecs.
The situation with [PureVoice] is more complex. An RTP format
already exists [vnd.Qcelp] and is specified in [RFC2658]; therefore
it would be ideal for a common format supporting PureVoice as well
as EVRC and SMV to interoperate with existing implementations of
[vnd.Qcelp]. However, if interoperability is sacrificed,
significant benefits can be obtained by making better use of RTP
packet bits; for example, allowing for table-of-contents entries as
well as a frame count field, yet spending the same number of bits
(or fewer) per packet on average.
The common format specified here gives up interoperability with
[vnd.Qcelp] in order to gain packet optimization benefits.
6 Common Characteristics
The format of the three initial codec (PureVoice, EVRC and SMV)
frames is very similar. This specification is designed to transport
data frames of vocoders that have the following characteristics:
- are frame based
- null and erasure frames are allowed
- total number of rates < 17.
- maximum full rate frame can be transported in a single RTP
packet using this specific format.
Vocoders with characteristics that can be expressed in format type,
TOC entries and codec frames can easily be expressed in this common
format. New vocoders with such characteristics can be added to this
common format by following the steps in section 7.11.
6.1 PureVoice Characteristics
The Qcelp [PureVoice] codec compresses each 20 milliseconds of 8000
Hz sampled input speech into one of four different size output
frames: Rate 1 (266 bits), Rate 1/2 (124 bits), Rate 1/4 (54 bits)
or Rate 1/8 (20 bits). In addition, there are two zero bit vocoder
frame types (see PureVoice Table in section 7.2): null frames and
erasure frames. (Erasure frames are never transmitted; they are
substituted by the receiver for lost or damaged frames. Null frames
are produced as a result of the vocoder running at rate 0. Null
frames are zero bits long and are also not transmitted.)
6.2 EVRC Characteristics
The [EVRC] codec compresses each 20 milliseconds of 8000 Hz sampled
input speech into one of four different size output frames: Rate 1
(171 bits), Rate 1/2 (80 bits), Rate 1/4 (40 bits) or Rate 1/8 (16
bits). In addition, there are two zero bit vocoder frame types (see
EVRC Table in section 7.2): null frames and erasure frames.
(Erasure frames are never transmitted; they are substituted by the
receiver for lost or damaged frames. Null frames are produced as a
result of the vocoder running at rate 0. Null frames are zero bits
long and are also not transmitted.)
6.3 SMV Characteristics
Like the EVRC, the [SMV] codec also compresses each 20 milliseconds
of 8000 Hz sampled input speech into one of four different size
output frames: Rate 1 (171 bits), Rate 1/2 (80 bits), Rate 1/4 (40
bits) or Rate 1/8 (16 bits). In addition, there are two zero bit
vocoder frame types (see SMV Table in section 7.2): null frames and
erasure frames. (Erasure frames are never transmitted; they are
substituted by the receiver for lost or damaged frames. Null frames
are produced as a result of the vocoder running at rate 0. Null
frames are zero bits long and are also not transmitted.)
The SMV is more bandwidth efficient than the EVRC vocoder. The SMV
achieves lower average data rates (ADR) by spending a different
percentage of time at each rate, as shown in the table below. The
assumptions and details of noise levels and ADR are described in
Chapter 4 of [SMV]. The EVRC is equivalent in performance to SMV
mode 1.
The SMV codec operates in one of four modes. Each mode employs one
of the vocoders operating at the rates mentioned above. Each mode
operates in all rates (full to 1/8) for varying percentages of time,
based on desired average data rate specified, taking into account
characteristics of the speech samples.
[SMV] modes can be changed on a frame by frame basis. Note that the
[SMV] mode is not encapsulated in the RTP packet; only fields
defined in section 7.1 or 7.8 are sent as RTP payload. [SMV] modes
are included in this document for informational purposes only.
While each [SMV] mode can operate in all rates (full to 1/8) for
varying percentages of time, each mode achieves a different average
data rate (ADR). This is shown in the table below:
            Mode 0      Mode 1      Mode 2      Mode 3
-------------------------------------------------------------
Rate 1      68.90%      38.14%      15.43%      07.49%
Rate 1/2    06.03%      15.82%      38.34%      46.28%
Rate 1/4    00.00%      17.37%      16.38%      16.38%
Rate 1/8    25.07%      28.67%      29.85%      29.85%
-------------------------------------------------------------
ADR         7205 bps    5182 bps    4073 bps    3692 bps
The SMV codec chooses the output frame rate based on an analysis of
the input speech and the current operating mode (either normal or
one of three reduced rates). For typical speech patterns, this
results in an average output of 4.2k bits/second for normal mode and
lower for reduced rate modes.
7 Common RTP Packet Format
The RTP timestamp is in 1/8000 of a second units. The RTP payload
data for the common format is one of two types: normal (type 1) and
optimized single frame (type 2).
7.1 Normal Format
Normal packet format allows for multiple codec frames to be included
in each RTP packet. The sender chooses how many codec data frames
to include in each RTP packet. If more than one, the sender chooses
to bundle or interleave the frames. Bundling groups two or more
consecutive data frames in a single RTP packet. Interleaving groups
two or more non-consecutive frames in a packet. Interleaving can
mitigate the listener's perception of data loss.
The normal codec RTP payload data is formatted as follows:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP Header [RTP] |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|R|R| LLL | NNN |R|R|Frame Count| TOC | ... | TOC |padding|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|one or more codec data frames, one per TOC entry |
| .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The RTP header has the expected values as described in [RTP]. The
use of the marker bit in the RTP header is outside the scope of this
document. The use of the marker bit is defined by the application.
When multiple codec data frames are present in a single RTP packet,
the timestamp is, as always, that of the oldest data represented in
the RTP packet.
The assignment of an RTP payload type for this new packet format is
outside the scope of this document, and will not be specified here.
It is expected that the RTP profile for a particular class of
applications will assign a payload type for this encoding, or if
that is not done then a payload type in the dynamic range will be
chosen. [SDP] can be used to signal out of band the RTP payload type
(see example in section 14).
The fields following the RTP header have the following meaning:
1st bit: Reserved (R): 1 bit
MUST be set to zero by sender; SHOULD be ignored by receiver.
2nd bit: Reserved (R): 1 bit
MUST be set to zero by sender; SHOULD be ignored by receiver.
3rd-5th bit: Interleave (LLL): 3 bits
MUST be set to a value from 0 to 7. If this field is non-zero,
interleaving is enabled. All receivers MUST support
interleaving. Senders MAY support interleaving. Senders that
do not support interleaving MUST set fields LLL and NNN to zero.
6th-8th bit: Interleave Index (NNN): 3 bits
MUST have a value less than or equal to the value of LLL.
Values of NNN greater than the value of LLL are invalid.
More than one codec data frame MAY be included in a single RTP
packet. Multiple data frames are either bundled or interleaved.
Bundling is described in detail in Section 7.3, and interleaving in
Section 7.4.
If only one codec data frame is included in an RTP packet, the LLL
and NNN fields MUST be zero.
9th bit: Reserved (R): 1 bit
MUST be set to zero by sender; SHOULD be ignored by receiver.
10th bit: Reserved (R): 1 bit
MUST be set to zero by sender; SHOULD be ignored by receiver.
11th-16th bit: Frame Count (Count): 6 bits
MUST be set by sender to the number of codec data frames minus
one. Valid values range from 0 to 63. The frame count plus one
indicates how many TOC entries (and codec data frames) are
present in the RTP packet. A value of zero indicates one frame.
A value of 63 indicates 64 frames.
TOC entries are described in section 7.2. TOC entries provide
information about the encoding rate and length of the respective
codec frame. Codec frames are speech data encoded at various rates
(Full, 1/2, 1/4, or 1/8). Null and erasure frames are not played
out; each has zero length and a corresponding TOC entry indicating
the null or erasure frame type.
17th-20th bit: First Table of Contents (TOC): 4 bits
MUST be set by sender as described in section 7.2. There is one
TOC entry for each codec frame. The value can range from 0 to 5
as shown in the three tables below. Each value indicates to the
receiver the length of the corresponding codec data frame.
Padding (padding): 0 or 4 bits
If the number of TOC entries (the frame count field plus one) is
odd, then the sender MUST set 4 bits of padding following the
last TOC entry and preceding the first codec data frame to zero,
so that the first codec data frame starts on an octet boundary.
If the number of TOC entries is even, then no padding is used;
the first codec data frame immediately follows the last TOC
entry.
The receiver interprets the bits following the last TOC entry or
padding as the first codec data frame.
Codec Frame(s):
Length depends on codec and rate. See descriptions in section
7.2. Each codec frame uses zero or more bits, depending on the
rate specified by TOC and codec type specified by MIME type.
(For example, half rate EVRC and SMV codec frames are 80 bits
long, while half rate PureVoice codec frames are 124 bits
long.) The sender sets the TOC value, and associated codec
frame. The tables below correlate TOC values with valid codec
lengths for the initial three codecs; future codecs specify
mapping in their MIME registration, as per section 7.11.
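The field layout above can be sketched in code. The following Python
fragment is an informal illustration and not part of this
specification; the function name parse_normal_payload and the shape
of its return value are assumptions made for the example. It takes
the octets that follow the RTP header, extracts LLL, NNN and the TOC
entries, and skips the 4 bits of padding that are present when the
number of TOC entries is odd.

```python
def parse_normal_payload(payload: bytes):
    """Return (LLL, NNN, list of TOC values, codec frame octets)."""
    if len(payload) < 2:
        raise ValueError("payload shorter than the 2-octet header")
    lll = (payload[0] >> 3) & 0x07   # bits 3-5 of the first octet
    nnn = payload[0] & 0x07          # bits 6-8 of the first octet
    n_frames = (payload[1] & 0x3F) + 1   # Frame Count holds frames minus one
    toc_octets = (n_frames + 1) // 2     # 4 bits per entry, padded to an octet
    if len(payload) < 2 + toc_octets:
        raise ValueError("payload too short for its TOC entries")
    toc = []
    for i in range(n_frames):
        octet = payload[2 + i // 2]
        # Even-numbered entries sit in the high nibble, odd ones in the low.
        toc.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
    return lll, nnn, toc, payload[2 + toc_octets:]
```

Reserved bits are ignored here, as receivers are asked to do. Range
checks on NNN and on the TOC values are left to the caller.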
7.2 TOC Entries
TOC entries apply only to multiple frame (Type 1) format as
described in section 7.1. Each TOC entry is correlated with the
respective codec data frame. The TOC value indicates the rate set
and number of bits in the data frame. For PureVoice, EVRC and SMV
the following tables are used:
TOC                PureVoice
Value   Rate       Codec data frame size (in octets)
-----   -------    ----------------------------------------------
0       Blank        0   (0 bits)
1       1/8          3   (20 bits; 4 zero bits of padding at end)
2       1/4          7   (54 bits; 2 zero bits of padding at end)
3       1/2         16   (124 bits; 4 zero bits of padding at end)
4       1           34   (266 bits; 6 zero bits of padding at end)
5       Erasure      0   SHOULD NOT be transmitted by sender
6-15    n/a        n/a   Reserved. SHOULD NOT be transmitted
Note that the common frame format for PureVoice has TOC entries
instead of lead bytes. As a result, the PureVoice codec frame size
in the table indicates the size of the data itself, just as it does
for EVRC and SMV.
TOC                EVRC
Value   Rate       Codec data frame size (in octets)
-----   -------    --------------------------------------------
0       Blank        0   (0 bits)
1       1/8          2   (16 bits)
2       1/4          5   (40 bits)
3       1/2         10   (80 bits)
4       1           22   (171 bits; 5 padded at end with zeros)
5       Erasure      0   SHOULD NOT be transmitted by sender
6-15    n/a        n/a   Reserved. SHOULD NOT be transmitted
TOC                SMV
Value   Rate       Codec data frame size (in octets)
-----   -------    ---------------------------------------------
0       Blank        0   (0 bits)
1       1/8          2   (16 bits)
2       1/4          5   (40 bits)
3       1/2         10   (80 bits)
4       1           22   (171 bits; 5 padded at end with zeros)
5       Erasure      0   SHOULD NOT be transmitted by sender
6-15    n/a        n/a   Reserved. SHOULD NOT be transmitted
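As a hedged illustration of how a receiver might use these tables,
the Python sketch below derives each codec data frame size from the
bit counts listed above (rounded up to whole octets) and then slices
the codec data into frames, one per TOC entry. The names FRAME_BITS,
frame_octets and split_frames are hypothetical and chosen only for
this example.

```python
# Speech bits per frame for each TOC value, copied from the tables above.
# TOC 0 (Blank) and 5 (Erasure) carry no data; 6-15 are reserved.
FRAME_BITS = {
    "PureVoice": {0: 0, 1: 20, 2: 54, 3: 124, 4: 266, 5: 0},
    "EVRC":      {0: 0, 1: 16, 2: 40, 3: 80, 4: 171, 5: 0},
    "SMV":       {0: 0, 1: 16, 2: 40, 3: 80, 4: 171, 5: 0},
}

def frame_octets(codec, toc):
    """Octets occupied by one codec data frame, including zero padding."""
    bits = FRAME_BITS[codec].get(toc)
    if bits is None:
        raise ValueError(f"reserved TOC value {toc}")
    return (bits + 7) // 8   # round up to a whole number of octets

def split_frames(codec, toc_entries, data):
    """Cut the codec data into one bytes object per TOC entry."""
    frames, offset = [], 0
    for toc in toc_entries:
        size = frame_octets(codec, toc)
        if offset + size > len(data):
            raise ValueError("payload shorter than the TOC entries imply")
        frames.append(data[offset:offset + size])
        offset += size
    return frames
```

Rounding the bit counts up gives 3, 7, 16 and 34 octets for
PureVoice and 2, 5, 10 and 22 octets for EVRC and SMV.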
7.3 Bundling Codec Data Frames
Bundling codec data frames only applies to multiple frame format as
described in section 7.1. As indicated there, more than one
codec data frame MAY be included in a single RTP packet. Bundling
codec data frames means multiple data frames are included
consecutively in a packet (without interleaving). The bundling of
codec data frames is signaled by setting the frame count to a value
greater than 0 while leaving the LLL and NNN values both set to
zero.
Senders MAY support bundling. All receivers MUST support bundling.
Receivers MAY signal the maximum number of codec data frames they
can handle in a single RTP packet. This can be done using out of
band signaling (for example in [SDP] parameters). See also maxptime
in section 13.2.
7.3.1 Additional Bundling Restrictions on the Sender
Furthermore, senders have the following additional restrictions:
o MUST NOT include more codec data frames in a single RTP packet
than signaled by maxptime in section 13.2.
o To the extent that it is possible to determine the MTU of the
underlying transport, MUST NOT include more codec data frames in a
single RTP packet than will fit in the MTU. For the purpose of
computing the maximum bundling value, all codec data frames SHOULD
be assumed to have the Rate 1 size.
It is essential that a single codec full rate frame be sent in an
unfragmented single RTP packet. Note that optimized single frames
are sent 20 ms (milliseconds) at a time, one in each RTP packet.
Therefore for optimized single frame format, maxptime MUST be 20 ms,
for the currently supported vocoders; see section 14.
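The MTU restriction above amounts to a small calculation, sketched
here in Python as an informal example. The function name max_bundle
is hypothetical, and the default of 40 octets of lower-layer overhead
(20 for an IPv4 header without options, 8 for UDP, 12 for an RTP
header without CSRC entries or extensions) is an assumption that a
real sender would replace with its actual values. Every frame is
assumed to have the Rate 1 size, as recommended above.

```python
def max_bundle(mtu, rate1_octets, lower_layer_overhead=40):
    """Largest frame count (1..64) whose worst-case packet fits in mtu.

    Returns 0 if not even a single Rate 1 frame fits.
    """
    budget = mtu - lower_layer_overhead - 2   # 2 octets: LLL, NNN, Frame Count
    best = 0
    for n in range(1, 65):                    # Frame Count field allows 1..64
        toc_octets = (n + 1) // 2             # 4 bits per entry, padded
        if toc_octets + n * rate1_octets <= budget:
            best = n
    return best
```

With these assumptions a 1500-octet MTU allows the full 64 EVRC or
SMV frames (22 octets each) but only 42 PureVoice frames (34 octets
each). The result is an upper bound only; the maxptime limit still
applies.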
7.4 Interleaving Codec Data Frames
Interleaving is meaningful only when more than one codec data frame
is bundled into a single RTP packet.
All receivers MUST support interleaving. Senders MAY support
interleaving.
Interleaving of codec data frames is signaled by setting the LLL
bits to a value from 1 to 7 inclusive.
Receivers MAY signal the maximum number of bundles (maxinterleave)
they can handle in a single interleaving group. This can be done
using out of band signaling (for example in [SDP] parameters).
Section 13.2 describes the maxinterleave parameter.
Given a time-ordered sequence of output codec frames numbered 0..n,
a bundling value B, and an interleave value L where n = B * (L+1) -
1, the output frames are placed into RTP packets as follows (the
values of the fields LLL and NNN are indicated for each RTP packet):
First RTP Packet in Interleave group:
LLL=L, NNN=0
Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total
of B frames
Second RTP Packet in Interleave group:
LLL=L, NNN=1
Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
total of B frames
This continues to the last RTP packet in the interleave group:
(L+1)th RTP Packet in Interleave group:
LLL=L, NNN=L
Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
total of B frames
Senders MUST transmit in timestamp-increasing order. Furthermore,
within each interleave group, the RTP packets making up the
interleave group MUST be transmitted in value-increasing order of
the NNN field. While this does not guarantee reduced end-to-end
delay on the receiving end, when packets are delivered in order by
the underlying transport, delay is reduced to the minimum possible.
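The placement rule above can be expressed compactly. The following
Python sketch is illustrative only; the function name interleave and
the tuple layout (LLL, NNN, frames) are assumptions for the example.
Packet number N of the group carries frames N, N+(L+1), N+2(L+1) and
so on, which is exactly a stride of L+1 through the time-ordered
frame list.

```python
def interleave(frames, L, B):
    """Distribute B*(L+1) time-ordered frames over L+1 packets.

    Returns a list of (LLL, NNN, frames_in_packet) tuples in the
    order in which the packets are to be transmitted.
    """
    if not 0 <= L <= 7:
        raise ValueError("LLL is a 3-bit field")
    if len(frames) != B * (L + 1):
        raise ValueError("an interleave group holds exactly B*(L+1) frames")
    # Slice with stride L+1: packet N gets frames N, N+(L+1), N+2(L+1), ...
    return [(L, n, frames[n::L + 1]) for n in range(L + 1)]
```

For example, six frames with L=2 and B=2 yield three packets
carrying frames (0, 3), (1, 4) and (2, 5).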
7.4.1 Additional Interleaving Restrictions on the Sender
Additionally, senders have the following restrictions:
o Once beginning a session with a given maximum interleaving
value, the sender MUST NOT increase the interleaving to a value
that exceeds the maximum interleaving that was signaled. The
maximum interleaving value is signaled by maxinterleave in
section 13.2.
o MAY change the interleaving value only between interleave
groups.
7.5 Finding Interleave Group Boundaries
Given an RTP packet with sequence number S, interleave value (field
LLL) L, and interleave index value (field NNN) N, the interleave
group consists of RTP packets with sequence numbers from S-N to
S-N+L inclusive. In other words, the interleave group always
consists of L+1 RTP packets with sequential sequence numbers. The
bundling value for all RTP packets in an interleave group MUST be
the same.
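A minimal sketch of this boundary calculation follows, in Python.
The function name interleave_group_range is hypothetical. The one
detail added beyond the text above is the modulo-65536 arithmetic,
which is needed because RTP sequence numbers are 16 bits wide and
wrap around.

```python
def interleave_group_range(seq, lll, nnn):
    """Return (first, last) sequence numbers of the interleave group."""
    if nnn > lll:
        raise ValueError("NNN greater than LLL is invalid")
    first = (seq - nnn) & 0xFFFF    # S - N, wrapped to 16 bits
    last = (first + lll) & 0xFFFF   # S - N + L, wrapped to 16 bits
    return first, last
```

A packet with sequence number 0, LLL=2 and NNN=1 therefore belongs
to the group that runs from 65535 through 1.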
The receiver determines the expected bundling value for all RTP
packets in an interleave group by the number of codec data frames
bundled in the first RTP packet of the interleave group received.
Note that this might not be the first RTP packet of the interleave
group sent if packets are delivered out of order (or lost) by the
underlying transport.
On receipt of an RTP packet in an interleave group with other than
the expected bundling value, the receiver MAY discard codec data
frames off the end of the RTP packet or add erasure codec data
frames to the end of the packet in order to manufacture a substitute
packet with the expected bundling value. The receiver MAY instead
choose to discard the whole interleave group and play silence.
7.6 Reconstructing Interleaved Speech
Given an RTP sequence number ordered set of RTP packets in an
interleave group numbered 0..L, where L is the interleave value and
B is the bundling value, and codec data frames within each RTP
packet that are numbered in order from first to last with the
numbers 0..B-1, the original, time-ordered sequence of output frames
from the codec is reconstructed as follows:
First L+1 frames:
Frame 0 from packet 0 of interleave group
Frame 0 from packet 1 of interleave group
And so on up to...
Frame 0 from packet L of interleave group
Second L+1 frames:
Frame 1 from packet 0 of interleave group
Frame 1 from packet 1 of interleave group
And so on up to...
Frame 1 from packet L of interleave group
And so on up to...
Bth L+1 frames:
Frame B-1 from packet 0 of interleave group
Frame B-1 from packet 1 of interleave group
And so on up to...
Frame B-1 from packet L of interleave group
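The reconstruction order can be sketched as two nested loops. This
Python fragment is an informal example rather than a normative
algorithm. It counts frame positions within a packet from zero, so
the positions run from 0 to B-1. The function name deinterleave, the
dictionary keyed by NNN, and the use of a caller-supplied erasure
placeholder for packets that never arrived are all assumptions made
for the illustration.

```python
def deinterleave(packets, L, B, erasure=None):
    """Rebuild the time-ordered frame sequence of one interleave group.

    packets maps the interleave index NNN (0..L) to the list of B
    frames carried by that packet; missing packets are simply absent.
    """
    out = []
    for position in range(B):        # frame position within each packet
        for n in range(L + 1):       # packet index within the group
            pkt = packets.get(n)
            out.append(pkt[position] if pkt is not None else erasure)
    return out
```

Losing one packet of the group produces isolated erasures spaced
L+1 frames apart instead of a run of B consecutive lost frames,
which is the benefit interleaving is meant to provide.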
7.7 Receiving Invalid Values
On receipt of an RTP packet with an invalid value of the NNN field,
the RTP packet MUST be treated as lost by the receiver for the
purpose of generating erasure frames as described in section 9.
A codec data frame with a reserved value in the TOC field SHOULD
also be considered invalid. All codec frames in a packet after an
invalid TOC field SHOULD be considered invalid.
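One possible reading of these rules is sketched below in Python. The
function name usable_frame_count is hypothetical, and treating TOC
values above 5 as reserved follows the tables in section 7.2 for the
three initial codecs; a future codec could define a different limit.

```python
def usable_frame_count(lll, nnn, toc_entries, max_toc=5):
    """Number of leading codec data frames a receiver can use.

    Returns 0 when NNN is invalid, meaning the whole packet is
    treated as lost.
    """
    if nnn > lll:
        return 0
    for i, toc in enumerate(toc_entries):
        if toc > max_toc:
            # The size of this frame is unknown, so neither it nor
            # any later frame in the packet can be located.
            return i
    return len(toc_entries)
```

A receiver following this sketch would substitute erasure frames for
every frame position beyond the returned count.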
7.8 Optimized Single Frame Format
Optimized single frame format is designed for maximum efficiency in
transmission of codec data with certain forms of header compression.
Only one codec data frame is sent in each RTP packet, and there are
no frame count or TOC field entries, or other payload header fields.
The codec rate can be determined from the length of the codec frame,
since there is only one codec data frame in each RTP packet of this
type.
If two frame types have different rates, but are expressed in the
same number of codec frame bytes, there MUST be other signaling to
distinguish them. For example, the codec sender could encode the
rate in the frame data. This is a vocoder design issue and further
discussion is out of the scope of this document.
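For the three initial codecs every rate has a distinct frame size in
octets, so the length of the payload is enough to recover the rate.
The Python sketch below illustrates this lookup; the names
OCTETS_TO_RATE and rate_from_length are hypothetical, and the sizes
are the bit counts from section 6 rounded up to whole octets.

```python
# Payload length in octets -> rate, per codec.
OCTETS_TO_RATE = {
    "PureVoice": {3: "1/8", 7: "1/4", 16: "1/2", 34: "1"},
    "EVRC":      {2: "1/8", 5: "1/4", 10: "1/2", 22: "1"},
    "SMV":       {2: "1/8", 5: "1/4", 10: "1/2", 22: "1"},
}

def rate_from_length(codec, payload_len):
    """Infer the rate of an optimized single frame from its length."""
    try:
        return OCTETS_TO_RATE[codec][payload_len]
    except KeyError:
        raise ValueError(
            f"{payload_len} octets is not a valid {codec} frame size"
        ) from None
```

A codec with two rates that share a frame size could not use such a
table and would need the additional signaling mentioned above.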
The optimized single frame RTP payload data is formatted as follows:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP Header [RTP] |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| Only one codec data frame |
| .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
7.9 Detecting Which Format
All receivers MUST be able to process both types of packets. The
sender MAY choose to use one or both types of packets.
The packets of the two types can be distinguished by checking the
payload type field in the RTP header. The association of payload
type number with the packet type is done out-of-band, for example by
[SDP] during the setup of a session.
7.10 Codec Data Frame Format
The formats described in this section are applicable to both normal
and optimized single frame RTP payload formats as described in
sections 7.1 and 7.8.
Bits are laid out as they come out of the vocoder. This will be
referred to as native format. The native format for [PureVoice] is
LSB (least significant bit) first (see example in section 7.10.1).
The native format for [EVRC] and [SMV] is MSB (most significant bit)
first (see example in section 7.10.2).
7.10.1 PureVoice Codec Data Frame Format
The output of the PureVoice codec is converted into data frames for
inclusion in the RTP payload as follows:
The bits as numbered in the standard [PureVoice] from the highest to
the lowest are packed into octets. The highest numbered bit (bit
265 for Rate 1, bit 123 for Rate 1/2, bit 53 for Rate 1/4 and bit 19
for Rate 1/8) is placed in the most significant bit (Internet bit 0)
of the first octet (octet 0) of the codec data frame; the second
highest numbered bit (bit 264 for Rate 1... bit 18 for Rate 1/8) is
placed in the second most significant bit (Internet bit 1) of the
first octet (octet 0) of the codec data frame. This continues until
all of the bits have been placed in the codec data frame. Any
remaining unused bits of the last octet of the codec data frame MUST
be set to zero.
For example, the frame below shows in detail how a PureVoice Rate
1/8 codec frame is packed into a data frame:
A Rate 1/8 frame is 20 bits long and occupies a codec data frame of
3 octets. Bits 0
through 19 from the standard Rate 1/8 frame are placed as indicated
with bits marked with "Z" being set to zero. The Rate 1/4, 1/2 and
full rate frames are converted similarly (with padding) to align on
octet boundaries.
PureVoice Rate 1/8 codec data frame (octet 0 - 2)
0 1 2
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|1|1|1|1|1|1|1|1|1| | | | | | | | | | | | | | |
|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|Z|Z|Z|Z|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Internet bit 0 refers to the left-most bit of the left-most octet.
Internet bit 1 refers to the next bit (to the right) of the left-most
octet. [RFC2658] discusses network byte and internet byte order in
more detail.
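The PureVoice packing rule can be illustrated with a short Python
sketch. This is an informal example; the function name
pack_purevoice and the convention that bits[i] holds the value of
codec bit number i are assumptions made for the illustration.

```python
def pack_purevoice(bits):
    """Pack PureVoice codec bits into a codec data frame.

    bits[i] is the value (0 or 1) of codec bit number i. The highest
    numbered bit goes into Internet bit 0 of octet 0; unused trailing
    bits of the last octet stay zero.
    """
    n = len(bits)
    out = bytearray((n + 7) // 8)
    for pos, bit_no in enumerate(range(n - 1, -1, -1)):
        if bits[bit_no]:
            out[pos // 8] |= 0x80 >> (pos % 8)   # 0x80 is Internet bit 0
    return bytes(out)
```

For a Rate 1/8 frame (20 bits), codec bit 19 lands in the most
significant bit of octet 0 and codec bit 0 lands in the fourth bit
of octet 2, followed by four zero bits.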
7.10.2 EVRC or SMV Codec Data Frame Format
The output of the EVRC or SMV codec is converted into data frames
for inclusion in the RTP payload as follows:
The bits as numbered in the standards [EVRC] and [SMV] from the
lowest to the highest are packed into octets. The lowest numbered
bit (bit 1) is
placed in the most significant bit (Internet bit 0) of the first
octet of the codec data frame; the second lowest bit is placed in
the second most significant bit of the first octet, the third lowest
in the third most significant bit of the first octet, and so on.
This continues until all of the bits have been placed in the codec
data frame. Any remaining unused bits of the last octet of the
codec data frame MUST be set to zero (note that this is only
applicable to rate 1 frames as the others fit completely into a
whole number of octets).
For example, the frames below show in detail how an EVRC or SMV Rate
1 (full rate) codec frame is packed into a data frame:
EVRC or SMV Rate 1 codec data frame (octet 0 - 3)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Rate 1 codec data frame (octet 18 - 21)
 1           1                   1                   1
 4           5                   6                   7
 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
|4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The codec data frame for a Rate 1 frame is 22 octets long. Bits 1
through 171 from the standard Rate 1 frame are placed as indicated
with bits marked with "Z" being set to zero. The Rate 1/8, 1/4, and
1/2 frames are converted similarly but do not require zero padding
because they align on octet boundaries.
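The packing rule above can be sketched in a few lines of Python. This is only an illustration, not part of the format definition: the function name pack_codec_bits and the representation of a codec frame as a list of 0/1 values (element 0 holding codec bit 1) are assumptions made for the example.

```python
def pack_codec_bits(bits):
    """Pack codec bits (element 0 is codec bit 1) into octets, MSB first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            # Codec bit i+1 goes to Internet bit i: octet i // 8,
            # counting from the most significant bit of that octet.
            out[i // 8] |= 0x80 >> (i % 8)
    # Bits of the last octet that were never written remain zero.
    return bytes(out)


rate1 = pack_codec_bits([1] * 171)  # a Rate 1 frame with every bit set
print(len(rate1), hex(rate1[-1]))   # prints: 22 0xe0
```

With all 171 bits set, the last octet carries codec bits 169 through 171 in its three most significant positions and five zero padding bits, matching the "Z" positions in the diagram.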
7.11 Adding New Codecs
Codecs that share the characteristics in section 6 can be added to
this common format by following the steps below:
1. Register a new MIME type.
2. In the MIME type registration, specify that when transported in
RTP, this common format is used.
3. Provide a mapping of TOC values to the rate and frame size of the
codec payload (as shown in section 7.2).
8 Tardy Packets
Assume that the receiver has begun playing frames from an interleave
group. The time has come to play frame x from packet n of the
interleave group. Further assume that packet n of the interleave
group has not been received.
Now, assume that packet n of the interleave group arrives before
frame x+(L+1) of that packet is needed. Receivers SHOULD use frame
x+(L+1) of the newly received packet n rather than substituting an
erasure frame. In other words, just because packet n wasn't
available the first time it was needed to reconstruct the
interleaved speech, the receiver SHOULD NOT assume that the packet
is not available when the same packet is subsequently needed for
interleaved speech reconstruction.
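One way to honor this recommendation is to look the packet up again at every play time rather than recording it as lost. The sketch below illustrates that idea only; the class InterleaveGroupBuffer, its method names, the "slot" index (position of a frame inside its packet), and the use of None as an erasure marker are all hypothetical.

```python
ERASURE = None  # stand-in for whatever erasure frame the codec expects


class InterleaveGroupBuffer:
    """Holds the packets of one interleave group as they arrive."""

    def __init__(self):
        self.packets = {}  # packet index n -> list of codec data frames

    def on_packet(self, n, frames):
        # A late packet is stored exactly like a timely one.
        self.packets[n] = frames

    def frame_for_playout(self, n, slot):
        # Consult the buffer at every play time; never cache the fact
        # that packet n was missing on an earlier lookup.
        frames = self.packets.get(n)
        if frames is None or slot >= len(frames):
            return ERASURE
        return frames[slot]


buf = InterleaveGroupBuffer()
first = buf.frame_for_playout(n=1, slot=0)   # packet 1 not here yet
buf.on_packet(1, [b"f0", b"f1"])             # packet 1 arrives late
second = buf.frame_for_playout(n=1, slot=1)  # its next frame is usable
```

Here the first lookup yields an erasure, while the second returns the real frame from the tardy packet.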
9 Lost Packets
Codecs transported using this format support the notion of erasure
frames. These are frames that for whatever reason are not
available. When reconstructing interleaved speech or playing back
non-interleaved speech, erasure frames MUST be fed to the codec for
all missing packets.
Receivers MAY use the timestamp clock to determine how many codec
data frames are missing. For vocoders with 20 ms frames and 8 kHz
sampling rate (such as the vocoders defined in section 7.10), each
codec data frame advances the timestamp clock EXACTLY 160 (20ms x 8
kHz) counts.
Since the bundling/interleaving value can vary, the timestamp clock
is the only reliable way to calculate exactly how many codec data
frames are missing when a packet is dropped.
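As a rough illustration of the timestamp calculation, the sketch below counts the frames lost between two consecutively received packets. It assumes a non-interleaved stream in which each packet's timestamp refers to its first frame and the frames in a packet are consecutive; the function name and parameters are hypothetical.

```python
SAMPLES_PER_FRAME = 160  # 20 ms x 8 kHz, as stated above


def missing_frames(prev_ts, prev_frame_count, next_ts):
    """Frames lost between two consecutively received packets."""
    # RTP timestamps are 32 bits wide, so the difference is taken
    # modulo 2**32 to survive wraparound.
    delta = (next_ts - prev_ts) % (1 << 32)
    return delta // SAMPLES_PER_FRAME - prev_frame_count


# Previous packet held 3 frames; the next one starts 8 frames later.
lost = missing_frames(1000, 3, 1000 + 8 * SAMPLES_PER_FRAME)  # 5 frames
```

Each of the missing frames would then be replaced by an erasure frame before being fed to the codec.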
Specifically when reconstructing interleaved speech, a missing RTP
packet in the interleave group SHOULD be treated as containing B
erasure codec data frames where B is the bundling value for that
interleave group.
10 Implementation Issues
10.1 Interleaving Length
All wireless codecs interpolate the missing speech content when
given an erasure frame. However, consecutive erasure frames reduce
the listener's perception of voice quality. This makes interleaving
desirable over bundling as it increases speech quality in the
presence of lost packets.
On the other hand, interleaving can greatly increase the end-to-end
delay. Where an interactive session is desired, an interleave value
(field LLL) of 0 to 2 is RECOMMENDED.
When end-to-end delay is not a concern, an interleaving value (field
LLL) of 4 or 5 is RECOMMENDED, subject to the maxinterleave
parameter (see section 13.2).
The parameters maxptime and maxinterleave at the initial setup of
the session guarantee that the receiver can allocate a well-known
amount of buffer space at the beginning of the session that will be
sufficient for all future reception in that session. Less buffer
space could be needed at some point in the future if the sender
decreases the bundling value or interleaving value, but never more
buffer space. This prevents the receiver from needing to allocate
more buffer space (with the possible result that none is available).
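A back-of-the-envelope bound on that buffer space might be computed as follows. This is a sketch under stated assumptions: an interleave group is taken to contain (interleave value + 1) packets, payload header and TOC overhead are ignored, and the default of 22 octets per frame is the EVRC/SMV Rate 1 size from section 7.10.2 (other codecs would need their own maximum). The function name and parameters are hypothetical.

```python
def receive_buffer_octets(maxptime_ms, maxinterleave, frame_ms=20,
                          max_frame_octets=22):
    """Upper bound on codec frame storage for one interleave group."""
    frames_per_packet = maxptime_ms // frame_ms  # largest bundle allowed
    packets_per_group = maxinterleave + 1        # assumed group size
    return packets_per_group * frames_per_packet * max_frame_octets


default_bound = receive_buffer_octets(200, 5)  # 6 * 10 * 22 = 1320 octets
```

With the default parameter values (200 ms and 5), the bound comes to 1320 octets of frame data per interleave group.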
11 Security Considerations
RTP packets using the payload format defined in this specification
are subject to the security considerations discussed in the RTP
specification [RTP], and any appropriate profile (for example,
[PROFILE]).
This implies that confidentiality of the media streams can be
achieved by encryption. Because the data compression used with this
payload format is applied end-to-end, encryption can be performed
after compression so there is no conflict between the two
operations.
A potential denial-of-service threat exists for data encodings using
compression techniques that have non-uniform receiver-end
computational load. The attacker can inject pathological datagrams
into the stream which are complex to decode and cause the receiver
to be overloaded. However, this encoding does not exhibit any
significant non-uniformity.
As with any IP-based protocol, in some circumstances, a receiver can
be overloaded simply by the receipt of too many packets, either
desired or undesired. Network-layer authentication can be used to
discard packets from undesired sources, but the processing cost of
the authentication itself might be too high. In a multicast
environment, pruning of specific sources might be implemented in
future versions of IGMP [IGMP] and in multicast routing protocols to
allow a receiver to select which sources are allowed to reach it.
12 Real Time and Storage Mode
12.1 RTP Mode
RTP mode is used to transmit codec frames in real time and
interactive fashion (as opposed to playing a static stored file, as
described in section 12.2). RTP mode uses RTP headers with SDP
negotiation (section 14) to describe the MIME media type and the RTP
ptype format.
Speech frames lost in transmission and non-received frames MUST be
played out as erasure frames (see definition in Section 9) to keep
synchronization with the original media.
12.2 Storage Mode
Storage mode is used for storing speech frames, for example, as a
file, email attachment, or web link.
When stored as a file, the first few octets of the file are a "magic
number" that identifies the file type. See sections 13.1.1, 13.1.2,
and 13.1.3 for EVRC, SMV, and PVC, respectively, for more details.
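A reader of stored files might recognize the file type along the following lines. The magic number strings are those given in sections 13.1.1 through 13.1.3; the function name and its return convention are assumptions made for this sketch.

```python
MAGIC_NUMBERS = {
    b"#!EVRC\n": "audio/EVRC",          # 0x2321455652430a
    b"#!SMV\n": "audio/SMV",            # 0x2321534d560a
    b"#!PVC\n": "audio/qcelp-common",   # 0x23215056430a
}


def identify_stored_file(data):
    """Return (media type, offset of first group), or None if unknown."""
    for magic, media_type in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return media_type, len(magic)
    return None
```

The returned offset marks where the first stored group begins, immediately after the magic number octets.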
All files are stored in normal mode groups (section 7.1). It is
optional for the application to translate between normal mode format
and optimized mode format. The codec data frames are stored in
groups, preceded by group header information identical to payload
header information as specified in section 7. That is, the R, LLL,
NNN, TOC entries, etc. are present. Since there is no RTP header,
and hence no timestamp, packets must be in order.
Following the magic number octets, the file is formatted as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|R|R|0|0|0|0|0|0|R|R|Frame Count|  TOC  |  ...  |  TOC  |padding|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       one or more codec data frames, one per TOC entry        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The meaning of the fields is specified in section 7.1. The LLL and
NNN fields MUST both be zero. The format of the frames, including
any padding, is identical to the normal mode specified in 7.1.
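As an illustration, the fixed part of a stored group header could be examined as below. The bit positions are read off the diagram above under the assumption that LLL occupies Internet bits 2-4, NNN bits 5-7, and Frame Count bits 10-15; the function name is hypothetical, and the Frame Count value is returned as stored, with its interpretation left to section 7.1.

```python
def parse_group_header(group):
    """Split the first two octets of a stored group into its fields."""
    if len(group) < 2:
        raise ValueError("group header needs at least two octets")
    lll = (group[0] >> 3) & 0x07   # Internet bits 2-4 (assumed position)
    nnn = group[0] & 0x07          # Internet bits 5-7 (assumed position)
    frame_count = group[1] & 0x3F  # Internet bits 10-15 (assumed position)
    if lll != 0 or nnn != 0:
        raise ValueError("LLL and NNN must be zero in storage mode")
    return {"LLL": lll, "NNN": nnn, "frame_count": frame_count}
```

Rejecting a non-zero LLL or NNN early gives a cheap sanity check that the reader is still aligned on a group boundary.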
This format, while more complex than other designs, makes it easy
for an implementation to receive speech frames using RTP and store
them, more or less as-is, in a file. Conversely, it is simple for
an implementation to read frames out of a file and transmit them
using RTP.
Speech frames lost in transmission and non-received frames MUST be
stored as erasure frames (see definition in Section 9) to keep
synchronization with the original media.
13 IANA Considerations
This document registers three new MIME media types. The
registration forms appear below.
The MIME media type names for the supported codecs are allocated
from the IETF tree since the PureVoice and EVRC codecs are already
widely deployed, and SMV is expected to be a widely used codec for
voice-over-IP applications.
The RTP format is described earlier (see sections 7.1 and 7.8).
13.1 Registration of MIME Media Type
13.1.1 audio/EVRC Media Type Registration
Media Type Name: audio
Media Subtype Name: EVRC
Required Parameters: none
Optional Parameters:
ptype: See Section 13.2.
maxptime: See Section 13.2.
maxinterleave: See Section 13.2.
Optional parameters for storage mode: none
Encoding considerations for RTP mode: see Section 13.2.
Encoding considerations for storage mode: see Section 13.2.
Security considerations: see Section 11.
Public specification: This document.
Additional information for storage mode (see also section 12.2):
Magic number (network byte order):
ASCII character string "#!EVRC\n", that is, 0x2321455652430a
in hexadecimal.
File extensions: EVC, evc
Macintosh file type code: not specified
Object identifier or OID: none
Intended usage: COMMON.
It is expected that many VoIP applications (as well as mobile
applications) will use this type.
Person & email address to contact for further information:
The authors of this document.
Author/Change controller:
The IESG.
13.1.2 audio/SMV Media Type Registration
Media Type Name: audio
Media Subtype Name: SMV
Required Parameters: none
Optional Parameters:
ptype: See Section 13.2.
maxptime: See Section 13.2.
maxinterleave: See Section 13.2.
Optional parameters for storage mode: none
Encoding considerations for RTP mode: see Section 13.2.
Encoding considerations for storage mode: see Section 13.2.
Security considerations: see Section 11.
Public specification: This document.
Additional information for storage mode (see also section 12.2):
Magic number (network byte order):
ASCII character string "#!SMV\n", that is, 0x2321534d560a in
hexadecimal.
File extensions: smv, SMV
Macintosh file type code: not specified
Object identifier or OID: none
Intended usage: COMMON. It is expected that many VoIP applications
(as well as mobile applications) will use this type.
Person & email address to contact for further information:
The authors of this document.
Author/Change controller:
The IESG.
13.1.3 audio/qcelp-common Media Type Registration
Media Type Name: audio
Media Subtype Name: qcelp-common
Required Parameters: none
Optional Parameters:
ptype: See Section 13.2.
maxptime: See Section 13.2.
maxinterleave: See Section 13.2.
Optional parameters for storage mode: none
Encoding considerations for RTP mode: see Section 13.2.
Encoding considerations for storage mode: see Section 13.2.
Security considerations: see Section 11.
Public specification: This document.
Additional information for storage mode (see also section 12.2):
Magic number (network byte order):
ASCII character string "#!PVC\n", that is, 0x23215056430a in
hexadecimal.
File extensions: pvc, PVC
Macintosh file type code: not specified
Object identifier or OID: none
Intended usage: COMMON. It is expected that many VoIP applications
(as well as mobile applications) will use this type.
Person & email address to contact for further information:
The authors of this document.
Author/Change controller:
The IESG.
13.2 Optional Media Type Parameters
These parameters are applicable to all three media subtypes
described above.
Optional parameters for RTP mode:
ptype:
Ptype indicates the type of RTP/media subtype packet. The
default value is 1. Valid values are 1 or 2. Ptype value 1
indicates normal format (see section 7.1), while ptype value
2 indicates optimized header compressed codec format (see
section 7.8).
maxptime:
The maximum amount of media which can be encapsulated in
each packet, expressed as time in milliseconds. The time
SHALL be calculated as the sum of the time the media present
in the packet represents. The time SHOULD be a multiple of
the frame size. If not signaled, the default maxptime value
is ten times the native codec frame length (in milliseconds);
for vocoders with 20 ms frames, this is 200 ms.
maxinterleave:
Maximum number for interleaving value. The interleaving
values used in the entire session MUST NOT exceed this
maximum value. If not signaled, the default maxinterleave
value is 5.
Optional parameters for storage mode: none
Encoding considerations for RTP mode: see Section 7, in particular
Sections 7.3 and 7.4 of this document.
Encoding considerations for storage mode:
Storage mode is identical to RTP mode. A stored file is
essentially a sequence of RTP packets without their RTP, UDP, and
lower-layer headers.
Normal (type 1) encoded speech frames MUST be stored in RTP
sequence number order. Furthermore, missing frames and
non-received frames during non-speech periods MUST be
encapsulated into a compound codec payload as blank frames or
erasures. Each receiving entity that accepts this MIME type
MUST be able to decode all codec coding modes.
For normal codec frames, bundling and interleaving information
is included in each grouping.
Security considerations: see Section 11.
Public specification: This document.
Intended usage: COMMON. It is expected that many VoIP applications
(as well as mobile applications) will use this type.
Person & email address to contact for further information:
The authors of this document.
Author/Change controller:
The IESG.
14 Mapping to SDP Parameters
Please note that this section applies to packets transmitted using
RTP.
Parameters are mapped to [SDP] as usual.
Example usage in SDP, for PureVoice vocoder run in normal format:
m=audio 49120 RTP/AVP 97
a=rtpmap:97 qcelp-common/8000
a=fmtp:97 ptype=1; maxptime=80
Example usage in SDP, for SMV vocoder run in optimized single frame
format:
m=audio 49120 RTP/AVP 98
a=rtpmap:98 SMV/8000
a=fmtp:98 ptype=2; maxptime=20
Since all optimized single frames (ptype = 2) for the currently
supported vocoders are 20 ms long, maxptime MUST be 20 ms. If a new
vocoder is added with a different frame duration, maxptime for that
vocoder MUST equal the vocoder's frame time.
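A receiver might turn the fmtp parameter string into values roughly as follows. This is a sketch, not a complete SDP parser: the function name parse_fmtp is hypothetical, the defaults are those given in section 13.2, and the behavior when ptype=2 is signaled without maxptime is not specified above, so the sketch simply leaves the default in place in that case.

```python
DEFAULTS = {"ptype": 1, "maxptime": 200, "maxinterleave": 5}


def parse_fmtp(params, frame_ms=20):
    """Parse 'name=value; name=value' and fill in the defaults."""
    result = dict(DEFAULTS)
    signaled = set()
    for item in params.split(";"):
        if not item.strip():
            continue
        name, _, value = item.partition("=")
        name = name.strip()
        # Keep only the leading number so that "80" and "80 ms" both work.
        result[name] = int(value.split()[0])
        signaled.add(name)
    if result["ptype"] not in (1, 2):
        raise ValueError("ptype must be 1 or 2")
    # Optimized single-frame format: a signaled maxptime must be one frame.
    if (result["ptype"] == 2 and "maxptime" in signaled
            and result["maxptime"] != frame_ms):
        raise ValueError("ptype=2 requires maxptime equal to one frame")
    return result


normal = parse_fmtp("ptype=1; maxptime=80")
```

In this example the result carries ptype 1, maxptime 80, and the default maxinterleave of 5.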
15 Acknowledgements
This document heavily borrows from "RTP Payload Format for
PureVoice(tm) Audio" by Kyle McKay (RFC 2658, August 1999).
Material has also been used from "An RTP Payload Format for EVRC
Speech", Adam Li (editor), a work in progress. The authors and
others who contributed to these two documents made this document
possible.
The authors thank the following colleagues for contributing to this
document: Rusty Sanders, Trevor Bourget, Eric Rosen, Harleen Gill,
Kirti Gupta.
16 References
[PureVoice] TIA/EIA/IS-733, "High Rate Speech Service Option for
Wideband Spread Spectrum Communication Systems", January 1997. May
be ordered online at http://www.eia.tia.org/eng.
[EVRC] TIA/EIA/IS-127, "Enhanced Variable Rate Codec, Speech Service
Option 3 for Wideband Spread Spectrum Digital Systems", January
1997.
[SMV] TIA/EIA/IS-893, "Selectable Mode Vocoder", August 2001
published as PNSP-4575.
[RTP] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
"RTP: A Transport Protocol for Real-Time Applications", RFC 1889,
January 1996.
[KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[PROFILE] Schulzrinne, H., "RTP Profile for Audio and Video
Conferences with Minimal Control", RFC 1890, January 1996.
[RFC 2658] McKay, K., "RTP Payload Format for PureVoice(tm) Audio",
RFC 2658, August 1999.
[SDP] Handley, M. and V. Jacobson, "SDP: Session Description
Protocol", RFC 2327, April 1998.
[IGMP] Deering, S., "Host Extensions for IP Multicasting", STD 5,
RFC 1112, August 1989.
17 Authors' Addresses
Magdalena L. Espelien
QUALCOMM Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
USA
Phone: +1 858 651-6733
Email: magda@qualcomm.com
Randall Gellens
QUALCOMM Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
USA
Phone: +1 858 651-5115
Email: rg+ietf@qualcomm.com