Network Working Group                                      C.F. Harrison
Internet-Draft                                 Far Field Associates, LLC
Expires: August 23, 2001                               February 22, 2001
Audiovisual Transport with Precision Timing
draft-harrison-avt-precision-av-00
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on August 23, 2001.
Copyright Notice
Copyright (C) The Internet Society (2001). All Rights Reserved.
Abstract
This memo discusses methods for transporting audiovisual content
over the Internet while meeting professional-quality temporal
performance. This memo gives information about the timing
requirements and synchronization practices in professional
audiovisual production and exhibition. It is intended to initiate a
discussion which may result in new or modified IETF standards which
address the needs of this field.
Harrison Expires August 23, 2001 [Page 1]
Internet-Draft Precision AVT Timing February 2001
1. Background
When audiovisual content is acquired or rendered by a computer
system, there are strong "real-time" requirements on the process.
For example, in order to provide a satisfactory listening
experience, the timing jitter in audio acquisition and playback
clocks must be small; this is especially true for high-fidelity or
professional applications. Furthermore, when two or more signal
sources ("tracks") must be combined (e.g. audio mixing, video
cross-fades, soundtrack-to-picture "lip sync"), the system must
ensure that time synchronization is maintained among all the tracks.
The degree of required precision varies from tens of milliseconds
(general purpose lip-sync) to tens of microseconds (musical audio)
to a few nanoseconds (broadcast video).
Professional audiovisual equipment has achieved this level of
accuracy by a variety of means over the last century. Usually a
dedicated isochronous transport method is used, and a separate
channel (e.g. "black burst" or "SMPTE time code") is used as a
timing reference. These methods have historically been specialized
and application-specific. Heroic efforts are sometimes required to
obtain synchronization between sound and picture elements which
originated under incompatible systems.
It is now possible to carry audiovisual streams over general-purpose
network transport protocols such as IP. In such networks,
non-deterministic delays occur between transmitter and receiver. The
transport-delay jitter can be removed by providing adequate buffer
capacity at the receiving terminal. The correct signal timing can be
recovered by means of timestamp information embedded in the data
stream.
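As an illustration of the buffering idea above, the following is a
minimal sketch of a receiver playout buffer. It is not taken from
any standard; the class name PlayoutBuffer, its parameters, and the
fixed playout delay are assumptions made for this example, and
timestamp wraparound is ignored for brevity.

```python
import heapq


class PlayoutBuffer:
    """Reorders packets by media timestamp and releases them after a fixed delay.

    Hypothetical helper for illustration only; timestamp wraparound is ignored.
    """

    def __init__(self, clock_rate_hz, playout_delay_s):
        self.clock_rate_hz = clock_rate_hz
        self.playout_delay_s = playout_delay_s
        self._heap = []      # (timestamp, payload) tuples, smallest timestamp first
        self._base = None    # (first timestamp seen, its local arrival time)

    def push(self, timestamp, payload, arrival_s):
        """Store a packet; the first packet anchors the media clock to local time."""
        if self._base is None:
            self._base = (timestamp, arrival_s)
        heapq.heappush(self._heap, (timestamp, payload))

    def playout_time(self, timestamp):
        """Local time at which a sample with this timestamp should be rendered."""
        ts0, t0 = self._base
        return t0 + self.playout_delay_s + (timestamp - ts0) / self.clock_rate_hz

    def pop_due(self, now_s):
        """Return every packet whose playout time has arrived, in timestamp order."""
        due = []
        while self._heap and self.playout_time(self._heap[0][0]) <= now_s:
            due.append(heapq.heappop(self._heap))
        return due
```

The playout delay must exceed the worst transport-delay variation
that is expected; packets arriving later than that are late
regardless of how they are ordered.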
General-purpose networks also suffer from errors, congestion, and
packet loss. However, these concerns are outside the scope of this
memo, which discusses synchronization and timing.
RTP [1] was developed for multimedia teleconference applications
and provides a flexible framework for transporting multimedia
content over the Internet. In the existing RTP model, each source
contains an internal free-running timebase, from which its video or
audio sampling is derived. When an audio channel and a video channel
must maintain lip sync, the receiver must correlate the independent
media timestamps on the two streams. It can do so by referring to
the "wallclock" timestamps which each source periodically provides
in its RTCP SR (Sender Report) packets.
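The correlation described above can be sketched as follows. Each SR
pairs a wallclock timestamp with the media timestamp of the same
instant, so any later media timestamp can be projected onto the
wallclock timeline. The function name and the use of floating-point
seconds are assumptions for this example; a precision
implementation would likely keep the 64-bit fixed-point wallclock
value rather than a float.

```python
def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_wallclock_s, clock_rate_hz):
    """Project a 32-bit media timestamp onto the sender's wallclock timeline.

    sr_rtp_ts and sr_wallclock_s are the pair carried in the most recent
    Sender Report. Floating-point seconds are a simplification.
    """
    # Signed 32-bit difference, so that timestamp wraparound is handled.
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:
        delta -= 0x100000000
    return sr_wallclock_s + delta / clock_rate_hz


# Example values (made up): a 48 kHz audio stream and a 90 kHz video stream.
audio_s = rtp_to_wallclock(1048000, 1000000, 100.0, 48000)
video_s = rtp_to_wallclock(5045000, 5000000, 100.5, 90000)
# Both packets map to the same wallclock instant and should be rendered together.
```

The precision of this method is limited by the accuracy of each
source's wallclock and by how often SR packets are sent.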
RTP has provided acceptable service for its original "live
teleconference" application. However, RTP is being applied in a
range of applications, including playback of prerecorded streaming
content, for which the original RTP timing model is insufficient. It
is possible to build on the existing RTP framework and thereby
support audiovisual transport timing at full professional precision.
This memo discusses overall design considerations to achieve this
goal, and speculates on some specific implementations. It suggests
that new profiles and messages should be added to the existing RTP
toolkit. In principle, two new concepts need to be incorporated:
o multiple, concurrent timebases referring to a single stream
o the ability for source timebases to speed up or slow down in
response to external commands
2. Concurrent Timebases
In a traditional teleconference situation, the content is transient:
it is created, transported, and rendered in real time, then lost. In
this situation, the idea of a single timeline -- "now" -- is
adequate for most purposes. However, when content has lasting value,
it is likely to be recorded, edited, and played back many times.
Immediately, four timebases become apparent:
1. Capture time (wallclock time during original recording).
2. Program time (offset from the start of this album or show).
3. Presentation time (wallclock time during this playback).
4. Sampleclock time (numerical count of samples, with arbitrary
zero reference).
Depending on the application, additional concurrent timebases (e.g.
offset from the start of this song) may be relevant as well.
We may assume that, by definition, a "timebase" advances uniformly
and monotonically. Usually, then, the four or more timebases
attached to a particular stream are moving in lock-step, and their
relationship can be described by constant offsets. Only at certain
instants -- e.g. at the end of one song and the beginning of another
-- will the offsets change. This suggests that a low-bandwidth
information stream carrying this offset data would be adequate to
express everything that needs to be known here about the stream
timing. This timing information stream (TIS) can provide precise
synchronization information among the several media streams in a
session by correlating the sampleclock time of each media stream to
a common program timeline.
There are certain situations in which prerecorded programs are
intentionally played out off-speed. For example, material originated
on film at 24 fps may be played in a European television
environment at 25 fps, or in a U.S. television environment at about
23.976 fps (24000/1001). In these situations the timing information
stream would carry offset information which changes slowly over time
as the presentation timebase drifts uniformly relative to the
capture and/or program timebase.
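To give a sense of scale, the drift in the off-speed case can be
worked out with exact rational arithmetic. The function name below
is an assumption for this example; the frame rates are the ones
discussed above, with 24000/1001 fps taken as the U.S. television
rate.

```python
from fractions import Fraction


def presentation_offset(program_s, native_fps, playout_fps):
    """Presentation time elapsed minus program time elapsed, in seconds.

    program_s seconds of material shot at native_fps are played at playout_fps.
    """
    presentation_s = Fraction(program_s) * Fraction(native_fps) / Fraction(playout_fps)
    return presentation_s - program_s


# One hour of 24 fps film played at 25 fps finishes 144 seconds early.
pal_offset = presentation_offset(3600, 24, 25)
# The same hour played at 24000/1001 fps runs 3.6 seconds long.
ntsc_offset = presentation_offset(3600, 24, Fraction(24000, 1001))
```

In both cases the offset grows linearly, so a timing information
stream would only need to report it occasionally.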
A proposed implementation of the concurrent timebase model, using
the RTP framework, would support a new media stream type: TIS. A
single TIS stream can carry information about several timebases. A
timebase which belongs to an RTP media stream may be identified by
its SSRC (Synchronization Source) identifier. "Virtual" timebases,
like program time, may be identified by a label, unique within that
TIS. A typical message within a TIS stream would state, effectively,
"point A on timebase X is coincident with point B on timebase Y."
These messages should allow fractional clock resolution, so that a
coincidence point need not fall on a whole clock tick.
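One possible in-memory form of such a message is sketched below.
This memo does not define a wire format; the class name, the field
names, the identifier strings, and the inclusion of a tick rate for
each timebase are all assumptions made for this example. Rational
numbers are used so that fractional ticks are represented exactly.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Coincidence:
    """'Point A on timebase X is coincident with point B on timebase Y.'

    Hypothetical structure; the rates (ticks per second) are an added
    assumption that lets one message convert any point, not only A.
    """
    timebase_x: str
    point_x: Fraction
    rate_x: Fraction
    timebase_y: str
    point_y: Fraction
    rate_y: Fraction

    def x_to_y(self, tick_x):
        """Convert a point on timebase X to the matching point on timebase Y."""
        elapsed_s = (Fraction(tick_x) - self.point_x) / self.rate_x
        return self.point_y + elapsed_s * self.rate_y


# Sample 96000 of a 48 kHz audio stream coincides with program time 10 s.
msg = Coincidence("ssrc:0x1234abcd", Fraction(96000), Fraction(48000),
                  "program", Fraction(10), Fraction(1))
program_s = msg.x_to_y(144000)   # one second later on the program timeline
```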
It is highly desirable that new sources be able to join an ongoing
session and synchronize properly. That is one reason that multiple
concurrent TIS streams should be supported. A reference to the label
of a virtual timebase may be made unique within a session by pairing
it with the SSRC of the corresponding TIS stream.
While genuine sampleclock timebases are constrained by the RTP
standard to move smoothly forward in time, this is not generally the
case with the virtual timebases which appear in a TIS. For example,
the "time offset within song" timebase will jump back to zero at the
beginning of each song. It may be useful to append an arbitrary
instance identifier to each virtual-timebase label, so that this
type of event is treated as the termination of one timebase instance
and the initiation of another timebase instance. This is one way to
retain monotonicity. Messages about the upcoming initiation and
termination of timebases could be embedded in the TIS data stream.
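The instance-identifier idea might be sketched as follows. The
function name and the "label#instance" notation are assumptions for
this example only; the memo does not specify how an instance
identifier would be written.

```python
def label_instances(label, readings):
    """Attach an instance number to each reading of a virtual timebase.

    A new instance begins whenever a reading moves backward, so every
    (label, instance) pair is monotonic when considered on its own.
    """
    instance = 0
    previous = None
    labelled = []
    for value in readings:
        if previous is not None and value < previous:
            instance += 1          # e.g. a new song has started
        labelled.append((f"{label}#{instance}", value))
        previous = value
    return labelled


# "Time offset within song" readings, in seconds, across a song boundary.
result = label_instances("song", [0, 60, 120, 0, 30])
```

A real sender would presumably announce the change of instance
explicitly rather than leave receivers to infer it from a backward
jump; the inference here only keeps the example short.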
A particularly important timebase in multimedia applications is the
presentation timebase. A presentation timebase exists at each
location where content is seen or heard. Sometimes there are several
separate pieces of playback hardware at the same location; such
cases can lead to critical requirements for inter-equipment
synchronization. For example, two broadcast-video streams might be
brought back to a TV studio, and converted to analog video by two
separate workstations. The two analog video signals are connected to
a video switcher, where an operator may perform wipes, crossfades,
or cuts. This functionality requires that the two video signals be
synchronized within a few tens of nanoseconds. In practice, each
workstation will receive a reference signal (black burst) which
serves as a presentation timebase for this studio. Such a reference
signal may carry absolute time code in accordance with SMPTE
standards and this time code can be referenced as the presentation
timebase in a TIS data stream. Any existing professional audiovisual
"hardwired" synchronization scheme can be linked with an RTP session
in a similar way.
3. Controllable Source Timebases
When several sources contribute to a single session, source
synchronization becomes a concern. If the individual source
timebases free-run, in practice, they will drift in phase relative
to each other. At a subsequent stage of digital mixing, some signals
will therefore need to be resampled. The resampling process
interpolates new data at time points between the original signal
samples. Resampling is surprisingly difficult when professional
quality standards are to be maintained. This is particularly true
for video signals.
There is a wide range of applications in which existing resampling
techniques are adequate. Teleconferencing falls in this category.
However, in the professional audiovisual world resampling is always
a last resort, and considerable effort and ingenuity are expended in
avoiding it. Primarily, this means controlling the timebases of all
sources -- speeding some up slightly, or slowing others down -- so
that all sources are sampling in a phase-coherent way. It matters
less that every source is precisely "on spec" (e.g. 48000.000
samples/sec for digital audio) than that all clocks run together.
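A short calculation shows how quickly free-running clocks part
company. The 50 parts-per-million mismatch used below is an assumed
figure chosen for illustration, not a value taken from this memo or
from any equipment specification, and the function names are
likewise hypothetical.

```python
def slipped_samples(nominal_rate_hz, mismatch_ppm, duration_s):
    """Samples by which two clocks diverge after duration_s seconds."""
    return nominal_rate_hz * mismatch_ppm * 1e-6 * duration_s


def seconds_per_sample_slip(nominal_rate_hz, mismatch_ppm):
    """Time taken for the two clocks to drift apart by one whole sample."""
    return 1.0 / (nominal_rate_hz * mismatch_ppm * 1e-6)


# Two nominally 48 kHz sources whose clocks differ by an assumed 50 ppm.
slip_per_hour = slipped_samples(48000, 50, 3600)    # about 8640 samples
first_slip_s = seconds_per_sample_slip(48000, 50)   # well under one second
```

Under that assumption the two sources are a full sample apart in
less than half a second, which is why a mixer must either resample
or have the source clocks steered.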
For this reason, professional audio and video gear provides some
type of speed controllability. An edit controller or chase
controller connects (often over a proprietary interface) to the guts
of the tape deck, and provides the "hooks" that allow a room full of
equipment to operate in perfect sync. A similar functionality,
provided over a generic network transport, would be very useful. In
essence, we need to support messages to a source, commanding it to
speed up or slow down slightly. Such messages might be carried over
the existing RTCP port assignment, by adding a new packet type to
the RTCP[1] standard.
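As a sketch only, a speed-change command could be encoded along the
following lines. No such packet type is defined today; this example
borrows the application-defined APP packet (PT=204) from RFC 1889
as a stand-in for the new packet type suggested above. The
four-character name "TBCT", the payload layout, and the
parts-per-billion unit are all hypothetical.

```python
import struct

RTCP_VERSION = 2
RTCP_PT_APP = 204      # APP packet type from RFC 1889
APP_NAME = b"TBCT"     # hypothetical name, not registered anywhere


def pack_speed_command(sender_ssrc, target_ssrc, adjust_ppb):
    """Build an APP packet asking target_ssrc to change speed by adjust_ppb.

    adjust_ppb is a signed adjustment in parts per billion (assumed unit).
    """
    body = struct.pack("!I4sIi", sender_ssrc, APP_NAME, target_ssrc, adjust_ppb)
    length_words = (4 + len(body)) // 4 - 1    # RTCP length: 32-bit words minus one
    first_byte = RTCP_VERSION << 6             # padding bit 0, subtype 0
    return struct.pack("!BBH", first_byte, RTCP_PT_APP, length_words) + body


def unpack_speed_command(packet):
    """Parse a packet built by pack_speed_command, validating its header."""
    first_byte, packet_type, length_words = struct.unpack("!BBH", packet[:4])
    if first_byte >> 6 != RTCP_VERSION or packet_type != RTCP_PT_APP:
        raise ValueError("not an RTCP APP packet")
    if len(packet) != 4 * (length_words + 1):
        raise ValueError("length field does not match packet size")
    sender_ssrc, name, target_ssrc, adjust_ppb = struct.unpack("!I4sIi", packet[4:])
    if name != APP_NAME:
        raise ValueError("unexpected APP name")
    return sender_ssrc, target_ssrc, adjust_ppb


packet = pack_speed_command(0x11111111, 0x22222222, -250)
```

A receiver of such a command would also need a policy for deciding
which senders it is willing to obey.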
It is useful to distinguish two situations in which timebase control
is used: digital playback and digital recording. In the first
situation, playing back prerecorded digital material, there is
little need for precise short-term control of the playback speed. A
typical RTP implementation is designed with a large buffer which
removes the effect of jitter, regardless of whether it occurs at the
playback device or in the network. Timing alignment can be
sample-perfect, guided by timestamps, at the output of the buffer.
In this case, relatively crude control of the source timebase can be
perfectly satisfactory, provided that the buffer does not over- or
underflow.
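The "relatively crude control" mentioned above could be as simple as
a clamped proportional rule driven by buffer occupancy. The
following is a sketch under assumed values: the function name, the
target fill level, the gain, and the limit are all illustrative and
not recommendations.

```python
def speed_correction_ppm(fill_fraction, target=0.5, gain_ppm=200.0, limit_ppm=100.0):
    """Speed change to request from the source, in parts per million.

    fill_fraction is the receive buffer occupancy from 0.0 (empty) to
    1.0 (full). A positive result asks the source to speed up.
    """
    error = target - fill_fraction      # positive when the buffer is too empty
    correction = gain_ppm * error
    return max(-limit_ppm, min(limit_ppm, correction))


# A buffer that has drained to one quarter full asks for a modest speed-up.
request = speed_correction_ppm(0.25)
```

Because sample-accurate alignment is restored at the buffer output,
the correction only has to keep the occupancy away from the two
extremes.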
In the second case, the recording timebase is being used for
digitization of a real-world, analog signal. In particular, when
audio signals are being digitized, the sampling timebase must have
very low jitter. A few hundred picoseconds of random sampling jitter
can introduce audible distortion. Thus, the clock generating the
sampling timebase must respond very smoothly to speed-change
commands. Obtaining such performance is the responsibility of the
manufacturer of the recording equipment; similar problems have been
successfully faced in the manufacture of digital studio microphones.
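The sensitivity to sampling jitter can be estimated with the usual
rule of thumb for a full-scale sine wave sampled with random
jitter, SNR = -20 log10(2 pi f t_j). The sketch below applies it; the
300 ps figure is an assumed reading of "a few hundred picoseconds",
the function names are hypothetical, and the 6.02 N + 1.76 dB
expression is the standard ideal quantization limit for an N-bit
converter.

```python
import math


def jitter_limited_snr_db(signal_hz, rms_jitter_s):
    """Best achievable SNR for a full-scale sine at signal_hz with random jitter."""
    return -20.0 * math.log10(2.0 * math.pi * signal_hz * rms_jitter_s)


def ideal_quantization_snr_db(bits):
    """Ideal SNR of an N-bit converter for a full-scale sine wave."""
    return 6.02 * bits + 1.76


# A 20 kHz tone sampled with an assumed 300 ps of RMS jitter.
jitter_snr = jitter_limited_snr_db(20000, 300e-12)
sixteen_bit_snr = ideal_quantization_snr_db(16)
```

Under these assumptions the jitter-limited figure falls below the
ideal 16-bit figure, so the sampling clock rather than the converter
word length sets the noise floor at high audio frequencies.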
4. Security Considerations
The proposals in this memo present few new security considerations.
It is possible that a defective or malicious application could
disrupt the performance of a signal source by means of source
timebase control messages.
References
[1] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
"RTP: A Transport Protocol for Real-Time Applications", RFC
1889, January 1996.
Author's Address
Chuck Harrison
Far Field Associates, LLC
18815 111th Pl SE
Snohomish, WA 98290
US
Phone: +1 360 863 8340
EMail: chuck_harrison@iname.com
Full Copyright Statement
Copyright (C) The Internet Society (2001). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgement
Funding for the RFC editor function is currently provided by the
Internet Society.