Internet DRAFT - draft-guillemot-avt-mpeg4visual
Internet Engineering Task Force Audio Visual Transport WG
Internet-Draft C.Guillemot, P.Christ
draft-guillemot-avt-mpeg4visual-00.txt INRIA / Univ. Stuttgart - RUS
March 1, 2000
Expires: September 1, 2000
RTP payload format for MPEG-4 Visual Advanced Profiles
(scalable, core, main, N-bits)
STATUS OF THIS MEMO
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document describes a payload format for the transport of MPEG-4
visual Elementary Streams, applicable to multimedia applications
that are not restricted to the simple visual profile. It is an
application of the format described in [1] to specialized cases of
video which are applicable in the H.323 context (using, for example,
MPEG-4 layered streams). It is therefore intended to support
advanced MPEG-4 visual profiles (simple scalable, core, main and
N-bits profiles), by
allowing protection against loss of key segments of the elementary
streams. The simple scalable profile supports temporal and spatial
scalability, features important for rate or congestion control on
the Internet, especially in multicast. The core and the main visual
profiles target multimedia applications on the Internet and allow,
in addition to scalability, the usage of sprite objects and of
arbitrary shape objects.
C. Guillemot, P. Christ [Page 1]
Internet-Draft Payload format for MPEG-4 visual streams March 1, 2000
1 Introduction
An MPEG-4 scene is composed of media objects. The MPEG-4 dynamic
scene description framework, which defines the spatio-temporal
relation of the media objects as well as their contents, is inspired
by VRML. The compressed binary representation of the scene
description
is called BIFS (Binary Format for Scenes) [2]. The compressed scene
description is conveyed through one or more Elementary Streams (ES).
A compression layer produces the compressed representations of the
audio-visual objects that will be inserted into the scene. These
compressed representations are organized into Elementary Streams
(ES). Elementary Stream Descriptors provide information relative to
the stream, such as the compression scheme used. Elementary stream
data is partitioned into Access Units. The delineation of an Access
Unit is completely determined by the compression layer that
generates the elementary stream. An Access Unit is the smallest data
entity to which timing information can be attributed. Two Access
Units shall never refer to the same point in time.
Natural and animated synthetic objects may refer to an Object
Descriptor (OD), which points to one or more Elementary Streams that
carry the coded representation of the object or its animation data.
An OD serves as a grouping of one or more Elementary Stream
Descriptors that refer to a single media object. The OD also defines
the hierarchical relations and properties of the Elementary Stream
Descriptors. The Object Descriptors are conveyed through one or more
Elementary Streams. By conveying the session (or resource)
description as well as the scene description through their own
Elementary
Streams, it becomes possible to change portions of scenes and/or
properties of media streams separately and dynamically at well-known
instants of time.
In order to allow effective implementations of the standard, subsets
of the MPEG-4 Systems, Visual, and Audio tool sets that can be used
for specific applications have been identified. Profiles
exist for various types of media content (audio, visual, and
graphics) and for scene descriptions. The visual part of the
standard defines five profiles for natural video: the simple
profile, the simple scalable profile, the core profile, the main
profile and the N-bits profile [3].
Considering the visual elementary streams, an important entry point
in the elementary stream data is the VideoObjectPlane(), which starts
with the corresponding configuration information. Depending on the
visual profile, different sets of parameters will be present in the
header of the VideoObjectPlane(). These parameters are essential to
configure the decoders and are not covered by the HEC-based error
resilience mechanism.
After analysing the impact of the coding options provided by the
different profiles with respect to loss resilience, this document
specifies an RTP payload format as an application of the generic
format proposed in [1], specialized to cases of video, applicable
e.g. in the H.323 context (see recommendation H.323 annex B). The
document defines packetization rules as well as protocol support for
protection of key segments of MPEG-4 visual streams. The design
goals of this RTP payload format are to provide the following:
- a unified solution for all the visual profiles, with protection
against loss of key segments of the elementary streams.
- a solution independent of the usage or the non-usage of the
MPEG-4 OD framework.
- protection against packet loss with a protocol support easily
adaptable to varying network conditions, for both 'live' and
'pre-recorded' visual contents.
- flexible support of a range of error control mechanisms, from no
protection to redundant data (key segments) and FEC.
The list of key segments (VisualObjectsequence header, VisualObject
header, VisualObjectLayer header, Group_of_videoobjectplane header,
VideoObjectPlane header) included in the payload header as redundant
data, as well as any additional protection schemes supported, will
be announced via out-of-band signaling at the beginning of the
session, using for example SDP [4]. The protection scheme used at a
specific instant during the session will be signaled via the
extension type (XT) field in the payload header.
2 MPEG-4 visual profiles
Five profiles have been defined for natural video content [3]:
@ The Simple Visual Profile provides efficient, error resilient
coding of rectangular video objects, suitable for applications on
mobile networks, such as PCS and IMT2000.
@ The Simple Scalable Visual Profile adds support for coding of
temporal and spatial scalable objects to the Simple Visual
Profile. It is useful for applications which provide services at
more than one level of quality due to bit-rate or decoder resource
limitations.
@ The Core Visual Profile adds support for coding of arbitrary-
shaped and temporally scalable objects to the Simple Visual
Profile. It is useful for applications such as those providing
relatively simple content-interactivity (Internet multimedia
applications).
@ The Main Visual Profile adds support for coding of interlaced,
semi-transparent, and sprite objects to the Core Visual Profile.
It is useful for interactive and entertainment-quality broadcast
and DVD applications.
@ The N-Bit Visual Profile adds support for coding video objects
having pixel-depths ranging from 4 to 12 bits to the Core Visual
Profile. It is suitable for use in surveillance applications.
The profiles for synthetic and synthetic/natural hybrid visual
content are:
@ The Simple Facial Animation Visual Profile provides a simple means
to animate a face model, suitable for applications such as
audio/video presentation for the hearing impaired.
@ The Scalable Texture Visual Profile provides spatial scalable
coding of still image (texture) objects useful for applications
needing multiple scalability levels, such as mapping texture onto
objects in games, and high-resolution digital still cameras.
@ The Basic Animated 2-D Texture Visual Profile provides spatial
scalability, SNR scalability, and mesh-based animation for still
image (textures) objects and also simple face object animation.
@ The Hybrid Visual Profile combines the ability to decode
arbitrary-shaped and temporally scalable natural video objects (as
in the Core Visual Profile) with the ability to decode several
synthetic and hybrid objects, including simple face and animated
still image objects. It is suitable for various content-rich
multimedia applications.
3 Impact of the profiles on the MPEG-4 visual syntax
A set of error resilience tools has been defined in the MPEG-4
visual syntax in order to recover corrupted headers [3]. In
particular, the VideoObjectPlane data is structured in video
packets, the entry point being defined by the function
video_packet_header(), and delimited by resync_markers. Header
information is inserted at the start of a video packet. This header
contains the information necessary to restart the decoding process
(provided key parameters from the VOP header have been correctly
received). Following the quant_scale is the Header Extension Code
(HEC). HEC is a bit used to indicate whether additional information
is available. If the HEC is equal to one, then basic configuration
parameters can be inserted in the packet header. This section
analyses the parameters which can potentially be protected by the
HEC mechanism and highlights the coding options, and therefore the
visual profiles, not addressed by that mechanism.
Depending on the different visual profiles, different sets of
parameters will be present in the header of the VideoObjectPlane().
In the simple profiles, essential VOP header parameters are:
vop_coding_type, modulo_time_base, marker_bit, vop_time_increment,
the fcodes when the vop_coding_type is P or B, and
vop_reduced_resolution if reduced_resolution_vop_enable is equal
to 1.
Let us now consider the simple scalable profile supporting temporal
and spatial scalability. Scalable or layered coding is very
interesting for rate or congestion control on the Internet,
especially in multicast. A key parameter in order to be able to
decode a VOP in an enhancement layer is "ref_select_code" which
signals the VOP that has been taken as a reference for the
prediction.
The core and the main visual profiles target multimedia applications
on the Internet and allow the usage of sprite objects and of
arbitrary shape objects, in addition to the scalable features
provided by the simple scalable profile. Sprite decoding operates in
two modes: basic and low-latency. The low-latency mode allows the
sprite to be updated, or new pieces of the sprite to be transmitted,
which can then be used as reference information for decoding
subsequent S-VOPs and for constructing subsequent parts of sprites.
It is therefore important to be able to protect this information by
allowing repetition of sprite data in consecutive packets. The
decoding of an arbitrary shape VOP requires the dimensions of its
bounding rectangle, its horizontal and vertical spatial position, as
well as the shape coding type. This information is provided
respectively by the parameters "vop_width" and "vop_height",
"vop_horizontal_mc_spatial_ref" and "vop_vertical_mc_spatial_ref",
and "vop_shape_coding_type" in the VOP header. The parameter
"change_conv_ratio_disable" is also needed to decode the video
packet properly. The "vop_constant_alpha" parameter, as well as the
"vop_constant_alpha_value" (if vop_constant_alpha==1), the scaling
factor applied to the decoded VOP before display, also need
protection.
When scalability is applied on arbitrary shape objects, extra
parameters need to be protected. For the shape decoding these
parameters are "load_backward_shape", "backward_shape_width",
"backward_shape_height", "backward_shape_horizontal_mc_spatial_ref",
"backward_shape_vertical_mc_spatial_ref", "load_forward_shape",
"forward_shape_width", "forward_shape_height",
"forward_shape_horizontal_mc_spatial_ref",
"forward_shape_vertical_mc_spatial_ref". Another important
parameter is "background_composition", which signals the usage of
background composition in conjunction with scalability. A final
parameter that needs to be protected is "vop_rounding_type", which
signals the rounding mechanism used in the pixel value interpolation
in motion compensation for P and S(GMC)-VOPs.
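As a minimal sketch of the analysis above, the parameters named in
this section can be grouped by the coding option that introduces
them. Only the parameter names are taken from the text; the grouping
keys and the helper name parameters_to_protect() are assumptions
made for illustration and are not part of the MPEG-4 syntax.

```python
# Illustrative grouping of the VOP header parameters named in section 3.
# The dictionary keys and the helper name are assumptions of this sketch.
# Conditional parameters of the simple profiles (the fcodes and
# vop_reduced_resolution) are left out for brevity.
KEY_VOP_PARAMETERS = {
    "simple": [
        "vop_coding_type", "modulo_time_base", "marker_bit",
        "vop_time_increment",
    ],
    "scalability": ["ref_select_code"],
    "arbitrary_shape": [
        "vop_width", "vop_height",
        "vop_horizontal_mc_spatial_ref", "vop_vertical_mc_spatial_ref",
        "vop_shape_coding_type", "change_conv_ratio_disable",
        "vop_constant_alpha", "vop_constant_alpha_value",
    ],
    "scalable_arbitrary_shape": [
        "load_backward_shape", "backward_shape_width",
        "backward_shape_height",
        "backward_shape_horizontal_mc_spatial_ref",
        "backward_shape_vertical_mc_spatial_ref",
        "load_forward_shape", "forward_shape_width",
        "forward_shape_height",
        "forward_shape_horizontal_mc_spatial_ref",
        "forward_shape_vertical_mc_spatial_ref",
        "background_composition",
    ],
    "rounding": ["vop_rounding_type"],
}


def parameters_to_protect(options):
    """Return, in order and without duplicates, the parameter names
    associated with the given coding options."""
    names = []
    for option in options:
        for name in KEY_VOP_PARAMETERS[option]:
            if name not in names:
                names.append(name)
    return names
```

For example, a sender using the simple scalable profile might call
parameters_to_protect(["simple", "scalability"]) to decide which
fields to repeat as redundant data.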
4 Design Considerations
The syntax of the visual bitstreams defines two types of
information: the configuration information and the elementary stream
data [2]. The configuration information includes:
@ the global configuration information referring to the whole group
of visual objects (visualobjectsequence()),
@ the object configuration information referring to a single visual
object (visualobject()),
@ and the object layer configuration information
(visualobjectlayer()).
Two modes of transmission of configuration and elementary stream
information are specified. The separate mode consists of
transmitting the configuration information in 'containers' provided
by MPEG-4 Systems (ODs). The combined mode consists of transmitting
the configuration information together with the elementary stream
data.
The solution recommended in draft-jnb-mpeg4av-rtp-01.txt, when using
the combined mode, consists of transporting this configuration
information in separate RTP packets, and possibly repeating the
corresponding RTP packets periodically if needed for protection
purposes. However, since this vital information is restricted to a
few bytes, transporting it in separate RTP packets leads to
unnecessary overhead. More efficient transport can be achieved by
grouping this data with elementary stream data inside packets. The
same remark applies to the Group_of_VideoObjectPlane() entry point
and to its corresponding header or configuration information.
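The overhead argument can be illustrated with a short calculation.
The sketch below assumes transport over IPv4 and UDP with a fixed
12-byte RTP header (no CSRC entries, no IP options); the function
name is hypothetical and the payload sizes used with it are
illustrative values, not figures taken from this document.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes
RTP_HEADER = 12   # bytes, fixed RTP header without CSRC entries


def header_overhead(payload_bytes):
    """Fraction of the packet occupied by IPv4/UDP/RTP headers."""
    headers = IPV4_HEADER + UDP_HEADER + RTP_HEADER
    return headers / (headers + payload_bytes)
```

Under these assumptions the headers amount to 40 bytes, so a packet
carrying only 40 bytes of configuration information spends half of
its size on headers, whereas the same headers are a small fraction
of a packet that also carries elementary stream data.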
The compression layer organizes the ES data in Access Units (AU).
The AUs are the smallest entities that can be attributed individual
timestamps. The timestamps may be obtained directly, through the
ESI. If the SLConfigDescriptor indicates that timestamps are
absent, the timestamps may be obtained indirectly, for example, by
using the frame rate.
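One possible way of deriving timestamps indirectly from the frame
rate is sketched below. The 90 kHz clock rate, the function name
and the parameter names are assumptions made for this example; this
document does not mandate a clock rate. The timestamp wraps at 32
bits, and the initial value would normally be chosen at random by
the sender.

```python
from fractions import Fraction


def rtp_timestamp(au_index, frame_rate, clock_rate=90000, initial=0):
    """Timestamp of the AU with the given index, derived from the
    frame rate (an int or a Fraction such as Fraction(30000, 1001))."""
    ticks = Fraction(au_index) * clock_rate / Fraction(frame_rate)
    return (initial + round(ticks)) % (1 << 32)
```

With these assumptions, consecutive AUs of a 25 frames per second
stream are 3600 ticks apart.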
The compression layer passes full or partial Access Units, together
with indications of AU boundaries, random access points, and desired
timing information, directly to the network adaptation layer or
indirectly via the sync layer. It is however preferable, for
implementation efficiency, to pass the ES data directly to the
network adaptation layer, i.e. to avoid producing the full SL
packets. Partial AUs or typed segments are, in terms of the encoding
syntax, syntactically and semantically meaningful parts of an AU
(cf. [1], 7.2.3: "Such partial AUs may have significance for
improved error resilience").
Depending on the visual profiles, different sets of key parameters
will be present in the header of the VideoObjectPlane(), as
described above. Key parameters for the simple scalable, main and
core profiles are not covered by the error resilience tools defined
in MPEG-4. This document advocates the need for protection support
in the packetization format, that would be applicable for the
different visual profiles, independently of the usage or non-usage
of the OD framework.
Although the protection support would most benefit the simple
scalable, core, main and N-bits profiles, and therefore a large
range of multimedia applications, it is also applicable to simple
videotelephony and videoconferencing applications relying on the
simple profile (despite the existence of the HEC mechanism).
Including the redundancy in the payload header, instead of inserting
it at the level of the video packet, brings more flexibility in the
redundancy insertion (it avoids, for example, parsing the different
video packets in the RTP packets) and in the adaptation of the level
of redundancy to the network characteristics, especially in the case
of pre-encoded streams. Several video packets can be transmitted in
the same RTP packet.
The payload format also specifies a mechanism for grouping an AU or
a partial AU together with protection data (redundant data, FEC).
This mechanism makes it possible to adapt the protection of the
different partial AUs to varying network conditions during the
session.
Consecutive segments (e.g. video packets [3]) of the same type will
be packed consecutively in the same RTP payload without using the
grouping mechanism.
The compression layer should provide partial AUs of a size small
enough so that the resulting RTP packet can fit the MTU size. RTP
packets that transport fragments belonging to the same AU will have
their RTP timestamp set to the same value.
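The packing rules above can be sketched as follows for a single AU.
This is an illustration only: the function name, the tuple layout
and the 1400-byte payload budget are assumptions, and the sketch
ignores the bytes taken by the payload header and by any extension
data.

```python
def pack_access_unit(segments, timestamp, max_payload=1400):
    """Greedily pack the segments (bytes objects) of one AU into RTP
    payloads of at most max_payload bytes.

    Returns a list of (timestamp, marker, payload) tuples. All packets
    share the same timestamp; marker is 1 only on the last packet."""
    payloads = []
    current = b""
    for segment in segments:
        if len(segment) > max_payload:
            # The compression layer is expected to deliver partial AUs
            # that are small enough to fit into one packet.
            raise ValueError("segment larger than max_payload")
        if current and len(current) + len(segment) > max_payload:
            payloads.append(current)
            current = b""
        current += segment
    if current:
        payloads.append(current)
    last = len(payloads) - 1
    return [(timestamp, int(i == last), p) for i, p in enumerate(payloads)]
```

Consecutive segments are placed in the same payload as long as they
fit, which mirrors the rule that segments of the same type are
packed consecutively without using the grouping mechanism.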
5 Payload Format specification
The packet will consist of an RTP header followed by possibly
multiple payloads.
5.1 RTP Header Usage
Each RTP packet starts with a fixed RTP header. The following fields
of the fixed RTP header are used:
- Marker bit (M bit): The marker bit of the RTP header is set to 1
when the current packet carries the end of an access unit (AU), or
the last fragment of an AU.
- Payload Type (PT): The payload type shall be set to a value
assigned to this format; otherwise, a payload type in the dynamic
range should be chosen.
- Timestamp: The RTP timestamp encodes the presentation time of the
first AU contained in the packet. The RTP timestamp may be the
same on successive packets if an AU occupies more than one packet.
If the packet contains only 'extension' data objects (see below),
then the RTP timestamp is set at the value of the presentation
time of the AU to which the first extension data object (e.g. FEC
or redundant data) applies.
The RTP timestamp is set to the composition timestamp (CTS), if its
presence is indicated by the SLConfigDescriptor, and if its length
is not more than 32 bits. Otherwise, the RTP timestamp should be
set to the sampling instant of the first AU contained in the packet.
- SSRC: A mapping between the ES identifiers and the SSRCs should be
provided via out-of-band signaling (e.g. SDP).
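The use of the fixed RTP header fields listed above can be sketched
as follows. The layout is the 12-byte fixed header of the RTP
specification [5], with version 2 and no padding, header extension
or CSRC entries; the function name is hypothetical and the values
used with it (for instance payload type 96 from the dynamic range)
are illustrative assumptions.

```python
import struct


def build_rtp_header(payload_type, sequence_number, timestamp, ssrc,
                     marker=False):
    """Build a 12-byte fixed RTP header (V=2, P=0, X=0, CC=0)."""
    if not 0 <= payload_type <= 127:
        raise ValueError("payload type must fit in 7 bits")
    byte0 = 2 << 6                               # version 2, no P/X/CC
    byte1 = (int(marker) << 7) | payload_type    # M bit, then PT
    return struct.pack("!BBHII", byte0, byte1,
                       sequence_number & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)
```

A sender would set marker=True only on the packet that carries the
end of an AU, and would keep the same timestamp on all packets that
carry fragments of that AU.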
5.2 Payload Header
The payload header is always present, with a variable length, and is
defined as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|G|E|    XT     |         LENGTH          |EBITS|               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               .
+          Extension data                       +-+-+-+-+-+-+-+-+
.                                               |G|E|    res    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            LENGTH             |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               .
.                     Media Payload                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 1 RTP payload format.
G (Group) (1 bit): If this field is 1, it indicates that the object
associated to the current header is followed by another
object.
E (Extension) (1 bit): If its value is 1 then the next object
contains Extension data. If its value is 0, then the next object
contains AU data (full AU or partial AU).
LENGTH (13 bits): this field specifies the length in bytes of the
next object. If the object is an AU or partial AU object (E=0), then
the field is not present. If the object is the last object of the
payload (G=0) then this field is not present.
EBITS (3 bits): Indicates the number of bits that shall be ignored
in the last byte of the extension data. If the object is the last
object of the payload (G=0) then this field is not present.
res (Reserved) (6 bits): This field is only present if the E field
is 0, so that the first byte always contains either {G,E=1,XT} or
{G,E=0,res}.
XT (Extension type) (6 bits): This field is only present if E is
set to 1. It then specifies the type of extension data. Examples of
types will be the different headers (VisualObjectsequence header,
VisualObject header, VisualObjectLayer header,
Group_of_videoobjectplane header, VideoObjectPlane header) or
possibly FEC data with the specification of the FEC coding scheme
(parity codes, block codes such as Reed-Solomon codes, ...).
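The field layout described above can be sketched as follows. This
is a minimal sketch based on one reading of the field definitions:
an extension header (E=1) occupies three bytes {G,E,XT,LENGTH,EBITS}
when G=1 and one byte {G,E,XT} when G=0, and an AU header (E=0)
occupies one byte {G,E,res}. The LENGTH field drawn after {G,E,res}
in Figure 1 is not handled here. The function names are
hypothetical, and no XT code points are defined by this document,
so any XT value used with the sketch is illustrative.

```python
def encode_extension_header(xt, length=None, ebits=0, group=True):
    """Encode the header preceding an extension object (E=1)."""
    if not 0 <= xt < (1 << 6):
        raise ValueError("XT must fit in 6 bits")
    first = (int(group) << 7) | (1 << 6) | xt
    if not group:
        # Last object of the payload: LENGTH and EBITS are not present.
        return bytes([first])
    if length is None or not 0 <= length < (1 << 13):
        raise ValueError("LENGTH must fit in 13 bits")
    if not 0 <= ebits < (1 << 3):
        raise ValueError("EBITS must fit in 3 bits")
    tail = (length << 3) | ebits          # 13-bit LENGTH, 3-bit EBITS
    return bytes([first, tail >> 8, tail & 0xFF])


def decode_header(data):
    """Decode the header at the start of data into a dict of fields.
    The 'size' entry gives the number of header bytes consumed."""
    first = data[0]
    fields = {"G": first >> 7, "E": (first >> 6) & 1, "size": 1}
    if fields["E"] == 0:
        fields["res"] = first & 0x3F
        return fields
    fields["XT"] = first & 0x3F
    if fields["G"] == 1:
        tail = (data[1] << 8) | data[2]
        fields["LENGTH"] = tail >> 3
        fields["EBITS"] = tail & 0x7
        fields["size"] = 3
    return fields
```

A receiver would call decode_header() repeatedly, skipping LENGTH
bytes of extension data after each three-byte header, until it
reaches a header with G=0.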
6 Multiplexing
MPEG-4 applications can involve a large number of ESs, and thus also
a large number of RTP sessions. A multiplexing scheme allowing
selective bundling of ESs may therefore be necessary for some
applications. The multiplexing problem is outside the scope of this
payload format.
7 Security Considerations
RTP packets transporting information with the proposed payload
format are subject to the security considerations discussed in the
RTP specification [5]. This implies that confidentiality of the
media streams is achieved by encryption.
If the entire stream (extension data and AU data) is to be secured
and all the participants are expected to have the keys to decode the
entire stream, then the encryption is performed in the usual manner,
and there is no conflict between the two operations (encapsulation
and encryption).
The need for a portion of the stream (e.g. extension data) to be
encrypted with a different key, or not to be encrypted, would require
application level signaling protocols to be aware of the usage of
the XT field, and to exchange keys and negotiate their usage on the
media and extension data separately.
8 Authors' Addresses
Christine Guillemot
INRIA
Campus Universitaire de Beaulieu
35042 RENNES Cedex, FRANCE
email: Christine.Guillemot@irisa.fr
Paul Christ
Computer Center - RUS University of Stuttgart
Allmandring 30
D-70550 Stuttgart, Germany
email: Paul.Christ@rus.uni-stuttgart.de
9 References
[1] C. Guillemot, P. Christ, S. Wesner, A. Klemets, 'RTP Payload
format for MPEG-4 with scaleable and flexible error
resiliency', draft-guillemot-avt-genrtp-02.txt, March 2000.
[2] ISO/IEC 14496-1 FDIS, 'MPEG-4 Systems', November 1998.
[3] ISO/IEC 14496-2 FDIS, 'MPEG-4 Visual', November 1998.
[4] Mark Handley, Van Jacobson, 'SDP: Session Description
Protocol', draft-ietf-mmusic-sdp-07.txt, 2nd Apr 1998.
[5] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson 'RTP: A
Transport Protocol for Real Time Applications', RFC 1889,
Internet Engineering Task Force, January 1996.
[6] J. Rosenberg, H. Schulzrinne, 'An RTP Payload format for
Generic Forward Error Correction', draft-ietf-avt-fec-05.txt,
26 Feb. 1999.