Internet DRAFT - draft-hoene-avt-acs-requirements

draft-hoene-avt-acs-requirements



AVT                                                     Christian Hoene 
Internet Draft                                  University of Tuebingen 
Intended status: Informational                          August 17, 2009 
Expires: February 2010 
                                    
 
                                      
            Requirements of an Audio Communication System (ACS) 
                  draft-hoene-avt-acs-requirements-00.txt 


Status of this Memo 

   This Internet-Draft is submitted to IETF in full conformance with the 
   provisions of BCP 78 and BCP 79. This document may not be modified, 
   and derivative works of it may not be created, except to publish it 
   as an RFC and to translate it into languages other than English. 

   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that 
   other groups may also distribute working documents as Internet-
   Drafts. 

   Internet-Drafts are draft documents valid for a maximum of six months 
   and may be updated, replaced, or obsoleted by other documents at any 
   time.  It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 

   The list of current Internet-Drafts can be accessed at 
   http://www.ietf.org/ietf/1id-abstracts.txt 

   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html 

   This Internet-Draft will expire on February 17, 2010. 

Copyright Notice 

   Copyright (c) <insert year> IETF Trust and the persons identified as 
   the document authors.  All rights reserved.    

   Copyright (c) 2009 IETF Trust and the persons identified as the 
   document authors. All rights reserved. 

   This document is subject to BCP 78 and the IETF Trust's Legal 
   Provisions Relating to IETF Documents in effect on the date of 
   publication of this document (http://trustee.ietf.org/license-info).  

 
 
 
Hoene                 Expires February 17, 2010                [Page 1] 

Internet-Draft           Requirements of ACS                August 2009 
    

   Please review these documents carefully, as they describe your rights 
   and restrictions with respect to this document. 

Abstract 

   This document describes the requirements of an audio communication 
   system (ACS) for acoustic content, especially speech and music. The 
   ACS consists of all components above the IP layer and below a digital 
   PCM audio interface. These include codec, jitter buffer, and 
   transport. 

   The goal of the ACS is to provide a bidirectional acoustic 
   communication between any two Internet hosts at a good quality, 
   constrained only by the available resources at the hosts and the 
   characteristics of the transmission path between both hosts. 

   The intention of the document is to provide the requirements for a 
   codec that is solely intended for the Internet, to provide the 
   requirements for the codec's payload specification, and to define the 
   requirements on the transport protocol. 

Table of Contents 

   1. Introduction...................................................3 
      1.1. Basics Architectural Guidelines of the Public Internet....4 
      1.2. Problem Statement.........................................5 
   2. Usage Scenarios................................................7 
      2.1. Scenario 1: Person-to-person calls (VoIP).................8 
      2.2. Scenario 2: High quality interactive audio transmissions 
      (AoIP).........................................................8 
      2.3. Scenario 3: Ensembles performing over a network (MMoIP)...9 
      2.4. Scenario 3: Push-to-talk like service (PTT)...............9 
   3. High-Level Requirements.......................................10 
      3.1. Low cost and licensing free..............................10 
      3.2. Reliable on the Internet.................................11 
      3.3. Quality..................................................11 
   4. Technical Requirements........................................12 
      4.1. Audio content............................................12 
      4.2. Quality..................................................12 
      4.3. Reliability and congestion control.......................12 
      4.4. Coding bit rate..........................................13 
      4.5. Sampling rate............................................13 
      4.6. Complexity...............................................13 
      4.7. Latency..................................................14 
      4.8. Packet rate..............................................14 
      4.9. Packet loss resilience...................................15 
      4.10. Frame erasure concealment...............................15 
 
 
Hoene                 Expires February 17, 2010                [Page 2] 

Internet-Draft           Requirements of ACS                August 2009 
    

      4.11. Jitter compensation and playout buffer..................15 
      4.12. Playout adjustments.....................................16 
      4.13. Concealment of mode switches............................16 
      4.14. Extrapolation...........................................16 
      4.15. Interpolation...........................................17 
      4.16. DTX.....................................................17 
      4.17. Testing.................................................17 
      4.18. Licensing and source code...............................17 
      4.19. Versioning and software updates.........................18 
      4.20. RFC Type................................................18 
      4.21. Side channel............................................18 
      4.22. Layered coding..........................................18 
      4.23. Interoperability with PSTN..............................19 
      4.24. Conferencing and speech recognition.....................19 
      4.25. Self-testing support....................................19 
      4.26. Self-awareness..........................................19 
   5. Out of scope..................................................19 
      5.1. Multichannel.............................................19 
      5.2. Repacketization..........................................19 
      5.3. Support for circuit-switched transmissions...............19 
      5.4. Support of packet networks other than the Internet.......20 
      5.5. Support of streaming.....................................20 
      5.6. Random packet losses.....................................20 
      5.7. Packet loss differentiation..............................20 
      5.8. Robustness against bit errors............................20 
      5.9. IRS and other kind of bandwidth filters..................20 
      5.10. Support of voice band data, fax and DTMF................20 
      5.11. Idle noise..............................................21 
      5.12. Tandem coding...........................................21 
      5.13. FEC.....................................................21 
   6. Security Considerations.......................................21 
   7. IANA Considerations...........................................21 
   8. References....................................................21 
      8.1. Normative References.....................................21 
      8.2. Informative References...................................22 
   9. Acknowledgments...............................................23 
    
1. Introduction 

   This document is based mainly on the discussions on the Codec BOF 
   mailing list, which took place in 2009. It also based on the internal 
   requirement documents of ITU-T G.718 [SG16 314-WP3], on the ITU-T 
   G.719 standard, on the 3GPP document [TS26.114-830], and on existing 
   IETF codec drafts. 

   It is intended as basis of a requirement document that should lead to 
   the design of an audio codec for the Internet. However, this document 
 
 
Hoene                 Expires February 17, 2010                [Page 3] 

Internet-Draft           Requirements of ACS                August 2009 
    

   address the requirements of the entire system not only of a single 
   component because we want to ensure that the system as a hole works 
   well not only some parts of it.  

   We introduce the term audio communication system to describe the 
   parts of an IP based telephone which are care for the bidirectional 
   transmission of acoustic content between two Internet hosts. These 
   include the encoder, the payload encapsulation, guide lines on how to 
   use transport protocols (RTP, UDP, TCP, DCCP), the playout buffer, 
   the decoder, the concealment of packet loss, time adjustments, 
   changes of encoding parameter, and various mechanisms to manage, 
   control and monitor the acoustic transmission. 

   The ACS is intended mainly for the use on the public Internet and 
   should be as easily distributable as most other Internet protocols 
   that run on virtually all kind of devices and on all kind of 
   communication links. Also, the ACS shall be affordable by all humans 
   that have Internet access. If possible, it should be royalty free and 
   available as open-source software. If these requirements are given, 
   then the ACS can fulfill its goal of providing acoustic transmission 
   between _any_ two Internet hosts.  

1.1. Basics Architectural Guidelines of the Public Internet 

   The ACS is intended for the public Internet and follows similar 
   architectural design guidelines as those which are valid for other 
   Internet protocols, too. These include: 

   o  End-to-end semantics saying that transport protocol units are 
      transmitted from one end (an Internet host) to the other end 
      without any intermediate changes.  

   o  Network neutrality.  

   o  Best effort service that try to transmit packets as good as 
      possible but that cannot guaranty any minimal transmission 
      bandwidth or maximal transmission delay. Instead one has to cope 
      with any end-to-end transmission quality that is provided. 

   o  Congestion control to prevent congestion collapse of the Internet 
      (such as TCP or DCCP). Typically, TCP controls the number of 
      packets that are sent during periods of congestion. Thus, one has 
      to consider that the number of packets per second might be an 
      important constraining limitation and not only the bits per 
      second. 


 
 
Hoene                 Expires February 17, 2010                [Page 4] 

Internet-Draft           Requirements of ACS                August 2009 
    

   o  Internet protocols are scalable to wide degree. They work on links 
      having a very low bandwidth (in the order of bits per second) and 
      with very high bandwidth (in the order of gigabits per second). 
      The transmission latency can range from microsecond up to seconds. 
      Also, the Internet hosts might have very low processing and memory 
      capabilities (such as an 8-bit micro controller). However, even 
      then they can communicate with any other hosts. Flow control (such 
      as in TCP) is used to cope with hosts that have limited resources. 

   o  Functions to help monitoring the communication (such as the 
      features provided ICMP) 

   o  The most important Internet protocols can be used without paying 
      royalties. 

   o  The public Internet allows global communication between any two 
      hosts connected to the public Internet. Typically, the user only 
      has to pay for getting access to the public Internet not for the 
      distance that the IP packets have to travel. 

   o  Internet standards should be as simple as possible (but no 
      simpler). 

1.2. Problem Statement 

   The ACS should enable an acoustic communication between any two 
   Internet hosts considering the features of the Internet as described 
   above. We see the need for designing the ACS because we see the 
   following weaknesses in the existing codec and VoIP designs. 

   o  Many standardized speech and audio codecs require the payment of 
      royalty fees. Only codecs such as G.711, G.722, G722.1, and 
      G.722.1C that have mediocre performances can be used license free. 
      Thus, one cannot ensure that a good codec can be afforded by all 
      owners of all Internet hosts. 

   o  All known codecs have a small operational range, in addition they 
      do not adapt to a wide range of bandwidth. For example, AMR 
      support bit rates between 4.75 and 12.2 kbps and ITU G.719 support 
      rates between 32 kbps and 128 kbps. 







 
 
Hoene                 Expires February 17, 2010                [Page 5] 

Internet-Draft           Requirements of ACS                August 2009 
    

   o  An acoustic communication at superb transmission quality is not 
      supported. Especially, if the latency is very low and the 
      bandwidth is very high, we do not have a standardized codec that 
      support hifi quality at ultra low delays. Only the SBC audio codec 
      standardized by Bluetooth SIG [A2DPV10] can be considered for the 
      usage scenario.  
      Ultra-low delay transmissions at hifi quality are especially 
      useful for distributed ensemble performances or distributed 
      choruses. 

   o  Similar, if the transmission quality is very bad, no standardized 
      audio codec supports a grateful degradation. If the loss rate 
      becomes too high then all speech and audio codecs become useless. 
      However, in those cases one can use half-duplex, push-to-talk like 
      transmission of short audio segments that would still allow a very 
      slow communication at very low bitrates. 

   o  Frequently, a PSTN call needs to be transcoded. Transcoding 
      reduces the speech quality and increases latency. Thus, most 
      codecs are designed to work well in conditions of transcoding. 
      However, in case end-to-end IP transmission, the need for 
      transcoding vanishes. It might only be needed for teleconferencing 
      applications or for connecting to the PSTN network. 

   o  The quality of a PSTN call has hardly increased during the last 
      decade. Often, it is even worse because of IP based 
      interconnections and support of cellular networks. Even those 
      support of wideband speech transmission system have been 
      developed, the lack of the willingness of users to pay more has 
      limited the introduction of wideband speech. Also on the Internet 
      we do not expect users to pay more for high quality phone calls. 
      However, we believe that they will be delighted if they can 
      communicate at nearly perfect quality. 

   o  Neither a single standardized codec nor its RTP payload RFC 
      specifies how to cope with time varying bandwidth and latency nor 
      this is considered as required feature. This hinders the wide 
      spread use of an adaptive coding mode selection and thus reduces 
      the quality of many Internet phone calls. 

   o  Not a single standardized codec supports varying complexities to 
      support devices with low resources.   





 
 
Hoene                 Expires February 17, 2010                [Page 6] 

Internet-Draft           Requirements of ACS                August 2009 
    

   o  Standardized codecs do not support any functionality for self-
      observation and self-monitoring. Also, they do not provide 
      information about how well they encoded and decoded the audio 
      content under a given set of coding parameters and packet loss 
      rates. However, this information is important for the transport in 
      order to rightly adapt the codec's transmission parameters. 

   o  Packet losses occur in the Internet, the transmission time of 
      packets and the playout time varies and the coding mode is changed 
      in response to changed available transmission bandwidths. All 
      these things cause the audio stream to be temporally distorted. 
      The codec shall support concealment algorithms to limit the 
      perceptual distortion. However, none existing standardized codec 
      support the concealment of the adjustment of the playout time. 
      Also, standardized PLC work on extrapolation of previous audio 
      segments and do not support the interpolation. Lastly, often one 
      cannot distinguish between delaying the playout time and packet 
      loss because the missing packet might still arrive. Thus, an 
      algorithm that uses the same extrapolation for packet loss 
      concealment and time stretching might be beneficial. 

   o  None of the standardized interactive speech and audio codec 
      supports mechanisms to decrease the packet rate. Usually, packet 
      rates are reduced by putting multiple speech frames into on RTP 
      packet. However, the codecs do not take advantage of the high 
      algorithmic delay that can be utilized then. Thus, they work less 
      efficient in situations of congestion.  

2. Usage Scenarios 

   The ACS should be optimized towards real-time communications over the 
   Internet. It should support applications like collaborative network 
   music performance, high-quality teleconferencing, wireless audio 
   equipment, low-delay links for broadcast applications, network sound 
   servers for using multimedia applications remotely, telepresence 
   (enterprise) and the digital living room (consumer), and other.  

   The ACS shall be general enough to support multiple and quite diverse 
   network conditions. For example, if network latency is low and 
   bandwidth is plenty, it can be used for quasi-simultaneous music 
   transmissions allowing distributed ensemble performances. It is also 
   applicable interactive hifi quality audio transmission. If the 
   network connection worsens, the transmission quality degrades to 
   (wide-band) interactive speech transmission. As a last resort, it 
   emulates a high-delay, half-duplex push-to-talk like communication 
   service. 

 
 
Hoene                 Expires February 17, 2010                [Page 7] 

Internet-Draft           Requirements of ACS                August 2009 
    

   In the following, we enlist four main scenarios and describe their 
   quality requirements. 

2.1. Scenario 1: Person-to-person calls (VoIP) 

   The classic scenario is that of the phone usage to which we will 
   refer in this document as Voice over IP (VoIP). Human speech is 
   transmitted interactively between two Internet hosts. Typically, 
   beside speech some background noise is present, too. 

   The quality of a telephone call is traditionally judged with 
   subjective tests such as those describe in [ITU-T P.800]. The ACR 
   scale used in MOS-LQS sometimes might not be very suitable for high 
   quality, then - for example - the MUSHRA [ITU-T BS.1534-1] rating can 
   be applied.  

   A telephone call is considered good if it has a maximal mouth-to-ear 
   delay of 150ms [ITU-T G.107] and a speech quality of MOS-LQS 4 or 
   above. However, interhuman communication is still possible if the 
   delay is much larger.  

   This scenario does not include the use case of using a VoIP-PSTN 
   gateway to connected to legacy telephone systems. In those cases, the 
   gateway would make an audio conversion from broadband Internet voice 
   to the frugal 1930's 3.1 kHz audio bandwidth. Interconnections to the 
   PSTN will most likely stick with its legacy codecs to avoid 
   transcoding. 

2.2. Scenario 2: High quality interactive audio transmissions (AoIP) 

   In this first scenario we consider a telephone call having a very 
   good audio quality at modest acoustic one-way latencies ranging from 
   50 and 150 ms [ITU-T G.107], so that music can be listened over the 
   telephone while two persons talk interactively.  

   The Absolute Category Rating (ACR) (refer to ITU-T P.800) can be 
   used, too. However, it might be more efficient to measure quality 
   with the MUSHRA tests given in [ITU-T BS.1534-1], which is intended 
   for intermediate audio qualities. 

   Also, for today's teleconferencing and videoconferencing systems 
   there is a strong and increasing demand for audio coding providing 
   the full human auditory bandwidth of 20 Hz to 20 kHz. This rising 
   demand for high quality audio is due to the following:  



 
 
Hoene                 Expires February 17, 2010                [Page 8] 

Internet-Draft           Requirements of ACS                August 2009 
    

   o  Conferencing systems are increasingly used for more elaborate 
      presentations, often including music and sound effects which 
      occupy a wider audio bandwidth than that of speech. For example, 
      Web conferences such as WebEx, GoToMeeting, Adobe Acrobat Connect 
      are based on an IP based transmission and benefit from a IP 
      optimized ACS. 

   o  The new "Telepresence" video conferencing systems, providing High 
      Definition video and audio quality to the user, are giving the 
      experience of being in the same room by introducing high quality 
      media delivery (such as from Cisco).  

   o  The emerging Digital Living Rooms will likely be interconnected 
      and might require a constant acoustic transmission at high 
      qualities. 

2.3. Scenario 3: Ensembles performing over a network (MMoIP) 

   In some usage scenarios, users want to act simultaneously and not 
   just interactively. For example, if persons sing in a chorus, if 
   musicians jam, or if e-sportsmen play computer games in a team 
   together they need to acoustically communicate. We call it the Make 
   Music Over IP (MMoIP) scenario. 

   In this scenario, the latency requirements are much harder than for 
   interactive usages. For example, if two musicians are placed more 
   than 10 meters apart, they can hardly keep synchronized. Empirical 
   studies [Gurevich2004] have shown that if ensembles playing over 
   networks, the optimal acoustic latency is around 11.5 ms with 
   targeted range from 10 to 25 ms.  

   In addition to the MUSHRA tests, the recommendation [ITU-R BS.1116] 
   can be used for audio transmissions that just have minor impairments. 

2.4. Scenario 4: Push-to-talk like service (PTT) 

   In spite of the development of broadband access (xDSL), a lot of 
   users would only have service access via PSTN modems or mobile links. 
   Also, on these links the available bandwidth might be shared among 
   multiple flows and is subjected to congestion. Then, even low coding 
   rates at about 8 kbps are too high. 

   If transmission capacity hardly exists, one still can degrade the 
   quality of a telephone call to something like a push-to-talk (PTT) 
   like service having very high latencies. Technically, this scenario 
   takes advantage of bandwidth gains due to disruptive transmission 

 
 
Hoene                 Expires February 17, 2010                [Page 9] 

Internet-Draft           Requirements of ACS                August 2009 
    

   (DTX) modes and very large packets containing multiple speech frames 
   causing a very low packetization overhead. 

   The quality requirements of a push to talk like service have been 
   hardly studied. The OMA lists as a requirement of a Push To Talk over 
   Cellular service a transmission delay of 1.6 s and a MOS values of 
   above 3.0 that typically should be kept [OMAPoCReq]. However, as long 
   as an understandable transmission of speech is possible, the delay 
   can be even higher. For example, [OMAPoCReq] allows a delay of 
   typically up 4s for the first talk-burst. 

   Also, [OMAPoCReq] describes a maximum duration of speaking. If a 
   participant speaking reaches the time limit, the participant's right-
   to-speak shall be automatically revoked. 

   If the quality of a telephone call is very low, then instead of 
   listening-only speech quality the degree of understandability can be 
   chosen as performance metric. For example, objective tests of the 
   understandability use automatic speech recognition (ASR) systems and 
   measure the amount of correctly detected words. 

   In any case, the participant shall be informed about the quality of 
   connection, the presence of high delays, the half-duplex style of 
   communication, and its (limited) right-to-speak. For example this can 
   be achieved by a simulated talker echo. 

3. High-Level Requirements 

   Based on the four scenarios, we list the following high-level 
   requirements that the ACS should fulfill. 

3.1. Low cost and licensing free 

   The codec shall be affordable by all humans having Internet access.  

   Thus, one of the key requirements is patent/licensing free 
   technology. However, it cannot be seen as "legally binding 
   requirement" but rather as a desired working goal. Typically, one 
   cannot verify 100% whether a codec is totally free of unknown IPRs. 
   Some patents may be overlooked. It can also be assured that the known 
   IPRs are "license-free" and "free from the need to sign licensing 
   agreement(s) before use" (The ability for any user to get the codec 
   and use it without signing any paperwork).  

   If one is practicing potentially patented technologies, there is no 
   real mechanism to protect oneself from a patent troll at claims 
   license fee for a standardized ACS. We have to assume that there is a 
 
 
Hoene                 Expires February 17, 2010               [Page 10] 

Internet-Draft           Requirements of ACS                August 2009 
    

   certain probability that the designed ACS is covered by patents what 
   the IETF is not aware of. Thus, one has to define proper procedures 
   on how to cope with IPR claims even if the ACS is already 
   standardized. 

   Because of the lack of financial income, the codecs design, testing 
   and standardization process must be cost effective, too. A cheap 
   approach is needed to characterize the ACS, which might include tests 
   having volunteer participants. For example, codecs can be provided to 
   thousands of users in public to test them. Also, potential 
   performance comparisons must not be as precise and proven as beyond 
   any doubts because nobody wins or loses IPR fees if one solution wins 
   or fails. 

3.2. Reliable on the Internet 

   The ACS must be optimized towards acoustic real-time communications 
   over the Internet, and must have the flexibility to adjust to the 
   environment it operates in. Based on the quality of the end-to-end 
   speech packet transmission, the codec should adapt its quality and 
   delay to achieve an optimal benefit for the user.  

   As most Internet transport, it should be used with a wide range of 
   condition allowing a high reliability regardless the networking 
   condition. The reliability of the audio transmission should be high, 
   even in cases of low and varying bandwidth. This implies that the 
   codec is used on top of a transport protocol that implements a 
   congestion control algorithm and that the ACS adapts to changes of 
   available bandwidth. For example, if the available transmission 
   bandwidth is too low to allow the codec to transmit audio at a high 
   quality, the application can lower the sampling, bit or frame rate of 
   the stream at the cost of higher algorithmic delay or a degraded 
   audio quality.  

3.3. Quality 

   The ACS must provide a quality/bitrate trade-off that is competitive 
   with other state-of-the-art codecs. Also, the codec must have a very 
   low algorithmic delay so that it can support the typical requirements 
   of its users. 

   The speech and audio quality of the ACS should not be significantly 
   worse than existing standardized codecs, if measures on the ACR 
   scale. 



 
 
Hoene                 Expires February 17, 2010               [Page 11] 

Internet-Draft           Requirements of ACS                August 2009 
    

4. Technical Requirements 

4.1. Audio content 

   At all bitrates the ACS must deliver speech in any language at good 
   quality. The ACS must be tested for different speakers and at least 
   with two languages and should support tonal languages as well. 

   Frequently, speech needs to be transmitted not only without 
   background noise but also at conditions including car, office and 
   street noise. Background signals shall be considered not as the noise 
   but as a part of the signals that convey information. Background 
   signal can include background music at a SNR of 25 dB, office noise 
   at a SNR of 20 dB, car noise at a SNR of 15 dB, babble Noise at a SNR 
   of 25 dB, interfering talker at a SNR of 15 dB and street noise at a 
   SNR of 20 dB. 

   At high bitrates the quality must be excellent for any audio signal, 
   especially music. Stereo is considered as a must. Also, for high 
   quality audio conferencing, reverberant input signals should be 
   considered for testing the modes. 

   The speech and audio signals might have varying loudness. The 
   transmission shall support a wide range of dynamics. The nominal 
   input level of -36 dB, -26 dB and -16dB with respect to the 
   overlapping bandwidth limit (OVL) point (-20 dBm0). 

4.2. Quality 

   At a given operational mode, the ACS must not have perfect quality 
   and must not perform better than any other standardized codec. 
   However, considering the most common network conditions, the ACS 
   shall perform better than any combination of existing codecs most of 
   the time. 

4.3. Reliability and congestion control 

   The acoustic transmission should be reliable and robust. The ACS 
   shall be not only robust against packet losses but also for periods 
   of low bandwidth.  

   The mean availability of the audio transmissions, calculated over all 
   users, might be one of the metrics for assessing the performance of 
   an Internet audio codec.  

   The ACS should adapt to the current network situation. Also, the 
   codecs of ACS themselves must be adaptable, because switching among 
 
 
Hoene                 Expires February 17, 2010               [Page 12] 

Internet-Draft           Requirements of ACS                August 2009 
    

   multiple codecs is difficult to negotiate and unlikely to work well 
   in situations of inter-operation.  

   Responding to congestion is a more complex issue and out of the scope 
   of this document. However, it shall be defined on how to use existing 
   congestion control protocols like DCCP and TCP. The ACS shall provide 
   the mechanisms that congestion control requires from the codec (i.e. 
   bitrate/framerate adaptability).  

   Because of the interactive nature of the acoustic transmission, the 
   bidirectional transmission of audio content can be used for 
   transmitting the required feedback and implementing a control loop. 
   As such, it can be considered as a requirement that the acoustic 
   transmission should be always bidirectional--even if the backward 
   channel just sends "compressed silence". 

4.4. Coding bit rate 

   The ACS must be capable of running at bitrates below 10 kbps. At low 
   bitrates it must deliver good quality for clean, noisy or hands-free 
   speech in any language. At high bitrates the quality must be 
   excellent for any audio signal, including music. The bitrate must be 
   adjustable in real-time. The bit rate can go up to 128 kbit/s per 
   channel or more. The bitrate must be adjustable in real-time and at a 
   fine granularity. 

   Variable bit rates depending on the content should be supported. 

4.5. Sampling rate 

   The codec must support multiple sampling rates, ranging from 8 kHz to 
   full band. Switching between sampling rates must be carried out in 
   real-time. 

4.6. Complexity 

   The ACS should have a complexity that is adjustable in real-time, 
   where a higher complexity setting improves the quality/bitrate trade-
   off. 

   As a lower limit, the ACS shall run on hosts that common in 
   developing countries. These may include OLPC XO-1s or other low-end 
   (refurbished) computers (refer to Computer Aid International) and 
   smart phones like those based on Texas Instruments Open Multimedia 
   Application Platform (OMAP), which include both a host ARM CPU and 
   one or more DSP. 

 
 
Hoene                 Expires February 17, 2010               [Page 13] 

Internet-Draft           Requirements of ACS                August 2009 
    

   On those devices, the ACS must not be capable of running at highest 
   quality but at least at 8 kHz sampling rate.  

4.7. Latency 

   To maintain a good quality of services requiring interactivity, it is 
   necessary to maintain the overall delay as low as possible. But the 
   delay requirement tends to have less importance in applications 
   involving VoIP, possibly combined with other media and/or in 
   heterogeneous network environment. A trade-off must be found between 
   low delays and flexibility (scalability, ability to operate in 
   various conditions with many types of signals etc.). 

   In interactive scenarios, the codec should be capable of running with 
   an algorithmic delay of no more than 30 milliseconds. 

   For the making music scenario, the algorithmic delay must be between 
   3 to 9 ms. Still, given the speed of light as the fundamental limit 
   of speed of information exchange, distributed ensembles can perform 
   only regionally if latency budget of 25 ms must be kept. Typically, 
   an optical fiber has a refractive index of 1.46 and thus in an 
   optical fiber bits travel about 5136 km one-way in 25 ms. 

   The total codec delay consists of the algorithmic delay and the 
   processing delay. Algorithmic delay includes the frame size delay 
   plus any other delays inherent in the algorithm (look-ahead, noise 
   suppression and error correcting codes for algorithm purposes and any 
   algorithmic decoding delay). Processing delay is the additional delay 
   caused by implementation with a finite speed processor.  

4.8. Packet rate 

   The ACS must support a variable and dynamic changeable packet rate. 
   Putting several frames into one packet is useful for packet grouping, 
   which in turn is very useful for bandwidth adaptation and network 
   usage efficiency.  

   This is because of the fact that a lot of bandwidth is used for 
   protocol packet headers like those of Ethernet, IP, UDP, and RTP and 
   thus to overhead at the MAC layer. If even IP header compression is 
   applied, still many layer 2 protocols introduce an additional 
   overhead that is not compressed [Hoene2005]. 

   Classically, it is usually specified in the RTP payload 
   specification, not in the codec specification itself. In general, a 
   codec can take advantage of a larger frame size. This is especially 
   true for a transform codec, where a larger frame means better 
 
 
Hoene                 Expires February 17, 2010               [Page 14] 

Internet-Draft           Requirements of ACS                August 2009 
    

   frequency resolution. The gain is somewhat smaller time-domain codec 
   especially for > 20 ms frames. However, in larger packets the inter-
   frame dependencies can be adjusted on the fly to choose a trade-off 
   between bitrate and amount of error propagation. It may even be 
   possible to just make use of more inter-frame correlation for frames 
   2...N in a packet of N frames and get most of the benefits it would 
   get from a larger frame size. Thus, the ACS codec should support 
   large frame sizes (up to a MTU). 

4.9. Packet loss resilience 

   The codec must be capable of running with little error propagation, 
   meaning that the decoded signal after one or more packet losses is 
   close to the decoded signal without packet losses after no more than 
   two additional packets.  The codec must have a packet loss resilience 
   that is adjustable in real-time, where a lower packet loss resilience 
   setting improves the quality/bitrate trade-off. 

   Also, the codec may add inter-frame redundancies to achieve better 
   loss robustness. 

4.10. Frame erasure concealment 

   The ACS must have a packet loss concealment algorithm. The PLC must 
   be standardized to know how well the decoder can cope with packet 
   losses in cases when the transmission parameters must be adjusted. 
   However, the ACS may implement a PLC that performs better than the 
   standardized PLC. 

   The purpose of standardizing the PLC (and the other concealment 
   algorithms) is to guarantee a certain quality level over a range of 
   conditions. For good results, a PLC operates on decoder-internal 
   parameters and states, which requires tight algorithmic integration. 
   So the PLC is as much part of a decoder as any other decoder module. 
   The above also applies to time compression/stretching methods for 
   handling network jitter and other kind of concealment algorithms (as 
   mentioned below). 

4.11. Jitter compensation and playout buffer  

   The ACS must cope with jitter. It must be able to receive the out of 
   order de-packetized frames and present them in order for decoder 
   consumption. It must be able to receive duplicate speech frames and 
   only present unique speech frames for decoder. It must be able to 
   handle clock drift between the encoding and decoding end-points. 


 
 
Hoene                 Expires February 17, 2010               [Page 15] 

Internet-Draft           Requirements of ACS                August 2009 
    

   The playout buffer should minimize the buffering time at all times 
   while still conforming to the minimum performance requirements. If 
   the limit of jitter induced concealment operations cannot be met, it 
   is always preferred to increase the buffering time in order to avoid 
   growing jitter induced concealment operations. 

4.12. Playout adjustments 

   The ACS should support time scale modifications especially for jitter 
   compensations such as time stretching and time shrinking because on 
   the Internet jitter is the norm not a special case.  

   Because the operations going on in time scale modification algorithms 
   are similar as those for the PLC, these operations should be combined 
   into a single algorithm.  

   Also, the ACS shall be able to determine a desired length of a time 
   scale modification (so it can e.g. leave out or add one or more pitch 
   periods), to keep a 'backup' decoder state of the previous frame or 
   to add one more frame length of decoding latency - otherwise you can 
   not compress the voice of the previous packet and for stretching its 
   suboptimal.  

   In general, the use of a high-quality time scaling algorithm is 
   recommended. The amount of scaling should be as low as possible, 
   scaling should be applied as infrequently as possible, and 
   oscillating behavior is not allowed. 

4.13. Concealment of mode switches 

   The ACS should also support the concealment of distortions caused by 
   switching coding modes [Hoene2005]. Also, the negative effect of 
   switching the coding mode shall be low. 

   For example, the transmission and coding mode might change several 
   times (up to 5Hz) per second after getting feedback from the decoder. 

4.14. Extrapolation 

   Sometimes, it is not possible to distinguish between a packet that 
   arrives too late and packet that is lost and needs to be concealed. 
   The decision on whether to conceal the loss or whether to conduct 
   time stretching cannot be made yet. Thus, the ACS should support a 
   general extrapolation of the audio signal which allows for late 
   decision on whether to play out a delayed packet or whether to use a 
   loss concealment operation 

 
 
Hoene                 Expires February 17, 2010               [Page 16] 

Internet-Draft           Requirements of ACS                August 2009 
    

4.15. Interpolation 

   If a packet n has not arrived but the previous packet n-1 and the 
   following packet n+1, when the packet n shall be interpolated using 
   the frame of the previous and following packets. 

4.16. DTX 

   The codec must be capable of using Discontinuous Transmission (DTX) 
   where packets are sent at a reduced rate when the input signal 
   contains only background noise. 

4.17. Testing 

   The testing of ACS and the quality characterization shall be 
   performed with real network profiles such as with [TIA-921] or those 
   given in the appendix [TS.26114-830], not with fixed set of "average 
   distributed errors and losses". Later do not clearly reflect the 
   Internet nature.  

   Also, test vectors might be provided to check the correctness of the 
   implementations. 

4.18. Licensing and source code 

   The usage of ACS should not require paying royalties and signing NDA. 
   At the time of standardization it should be available for royalty 
   free (RF) and at reasonable and non-discriminatory terms (RAND). The 
   codec should be available as open source allowing implementation 
   under BSD, LGPL and/or GPL. 

   The codec specification description and implementation shall be based 
   on a bit-exact fixed-point modular ANSI-C code using basic operators 
   set provided in the ITU-T Software Tool Library to follow. In 
   addition, an interoperable floating-point implementation can be 
   provided. 

   The source code shall be normative because of a number of 
   reasons. One is ease of implementation (either using the reference 
   code directly, or being able to use it to validate the ported code).  
   Another is that it assures that the characterization tests actually 
   measure the standard's performance. Even if it is not officially 
   normative, readily available reference code becomes de facto 
   normative, since most implementers will simply use the code and 
   ignore the text in the RFC. 
    

 
 
Hoene                 Expires February 17, 2010               [Page 17] 

Internet-Draft           Requirements of ACS                August 2009 
    

    
    

4.19. Versioning and software updates 

   In order to cope with changes in the bitstream format, which might be 
   required due to errors in the specification or - more important - due 
   to newly claimed IPR, it must be possible to update the ACS online. 

   Also, it must be indicated, which bitstream format is going to be 
   used. 

4.20. RFC Type 

   It should become a standard, not an experimental RFC.  

4.21. Side channel 

   Congestion control should be must for all Internet applications also 
   for the ACS. [RFC3550] suggests in Chapter 10 somewhere that the RTP 
   profile should care for rate adaptation. Thus, the ACS should take 
   advantage of a feedback loop for variable coding parameter control in 
   order to allow a wide range of operation and to adapt to the the 
   current available bandwidth and processing power. 

   Congestion control per se is outside the review of this group, but 
   providing the hooks for a congestion-control mechanism to interact 
   with the codec is quite important. For example, running this codec on 
   a TFRC-enabled or DCCP RTP stream - TFRC and DCCP need to be able to 
   adjust (via the application) the bitrate of the codec in order to 
   implement congestion control and perhaps adjust packetization 
   periods/packet-rates.  

   A side channel for adaptation can be added. This would make sense 
   because in usage scenarios audio is always transmitted in both 
   directions. Adding a control channel would give a real advantage to 
   existing codec designs. Alternatively, such as side channel can be 
   also added with alternative solutions, such as handling that 
   communication in SIP/SDP and in RTP/RTCP.  

4.22. Layered coding 

   The ACS can support a layered encoding like in G.729.1 and G.718. 
   Layered coding can be seen as a method for computationally efficient 
   transcoding. Layered coding make sense in the conferencing 
   environment as such stripping should be done at the sender after 

 
 
Hoene                 Expires February 17, 2010               [Page 18] 

Internet-Draft           Requirements of ACS                August 2009 
    

   encoding. Then, for all receivers the encoding has to be done only 
   once.  

   However, for bidirectional transmissions, you do not need layered 
   encoding as most codecs now are VBR, its enough already to adapt 
   codec (at the source) to the bandwidth. Also, layered coding comes at 
   additional cost (about 10% of the coding rate) 

4.23. Interoperability with PSTN 

   The ACS might be developed to be interoperability to existing PSTN 
   systems. Especially interoperability with 2G and 3G mobile radio 
   systems is desirable. Also, the interoperability with G.722.2 @ 12,65 
   kb/s and with G.722 (for DECT devices) are of particular interest.  

4.24. Conferencing and speech recognition 

   A teleconference server should be able to mix the audio signals at 
   lower complexity than decoding + encoding. The ACS shall be capable 
   of support automatic speech recognition. 

4.25. Self-testing support 

   ACS should support means of testing the quality of a connection by 
   feedback loops and quality feedbacks. 

4.26. Self-awareness 

   The ACS should be aware on how well it can transmit acoustic content 
   at various coding parameters and packet loss rates. 

5. Out of scope 

5.1. Multichannel 

   5.1 is worth supporting but that would most likely be through 
   multiple independent channels/pairs, so that's probably not that much 
   of an issue.  

5.2. Repacketization 

   The ACS needs not to support repacketization in a network because 
   this would violate the end-to-end semantic of the Internet.  

5.3. Support for circuit-switched transmissions 

   The ACS needs not to support circuit-switched transmission. 
 
 
Hoene                 Expires February 17, 2010               [Page 19] 

Internet-Draft           Requirements of ACS                August 2009 
    

5.4. Support of packet networks other than the Internet 

   The ACS needs not to support other packet networks (VoATM, private 
   networks) beside the Internet. 

5.5. Support of streaming 

   The ACS needs not to support multimedia streaming (e.g. video + audio 
   involving bit-rate tradeoff), multicast content distribution 
   (offline/online) and message retrieval systems. 

5.6. Random packet losses 

   The usage of random packet losses to measure the concealment 
   performance is meaningless because it does not reflect the nature of 
   the Internet. Thus, the codec needs not be optimized nor tested using 
   these criteria. Instead, real packet loss and delay traces should be 
   considered. Also, short and long bursts of packet losses, which occur 
   during due to handoffs, fast fading, congestion events, and route 
   changes, should be considered. 

5.7. Packet loss differentiation 

   The ACS cannot assume that the quality of packet transmission changes 
   one per packet basis. For example, in layered coding the core layers 
   cannot expect to be less subjected to packet losses than enhancement 
   layers. 

5.8. Robustness against bit errors 

   The ACS needs not to support bit errors because they are quite seldom 
   on top of Ethernet. This is especially true as long as UDP-Lite is 
   not supported widely. 

5.9. IRS and other kind of bandwidth filters 

   The ACS must not consider bandwidth filters like the IRS because they 
   are based on the traditions of circuit-switched connections. 

5.10. Support of voice band data, fax and DTMF 

   The ACS needs not to support voice band data such as fax or DTMF. 
   Instead, alternative ways of communication or other RTP payload 
   format should be considered. 



 
 
Hoene                 Expires February 17, 2010               [Page 20] 

Internet-Draft           Requirements of ACS                August 2009 
    

5.11. Idle noise 

   The generation of idle channel noise should not be used to indicate 
   that the call is still active. Instead, in case of transmission 
   problems an acoustic notification can be given. 

5.12. Tandem coding 

   The ACS needs not to be optimized for tandem coding conditions 
   because one can assume an end-to-end transmission of IP packets.  

   Tandem coding might only be used for PSTN gateways and for conference 
   bridges. 

5.13. FEC 

   RTP support of Forward Error Correction (FEC) needs not to be 
   considered. Also, support of adding "redundant speech frames", which 
   have been transmitted in preceding packets, in a RTP packet is not 
   required. Instead, the redundancy can be added by the encoder which 
   does this in a more efficient way. 

6. Security Considerations 

   To do. 

7. IANA Considerations 

   To do. 

8. References 

8.1. Normative References 

   [ITU-T BS.1534-1] "BS.1534 : Method for the subjective assessment of 
             intermediate quality levels of coding systems", ITU-T 
             Recommendation BS.1534-1 (01/03). 

   [ITU-T G.107] "G.107 : The E-model, a computational model for use in 
             transmission planning", ITU-T Recommendation G.107 (04/09). 

   [ITU-T P.800] "P.800 : Methods for subjective determination of 
             transmission quality", ITU-T Recommendation P.800 (08/96). 

   [ITU-R BS.1116] "BS.1116 : Methods for the subjective assessment of 
             small impairments in audio systems including multichannel 
             sound systems", ITU-R Recommendation BS.1116 (10/97). 
 
 
Hoene                 Expires February 17, 2010               [Page 21] 

Internet-Draft           Requirements of ACS                August 2009 
    

   [OMAPoCReq] "Push to talk over Cellular Requirements", Open Mobile 
             Alliance, Approved Version 1.0, 09 Jun 2006, OMA-RD-PoC-
             V1_0-20060609-A.pdf   

   [TIA-921] TIA-921-A Document Information: "Network Model for 
             Evaluating Multimedia Transmission Performance Over 
             Internet Protocol", Publisher: Telecommunications Industry 
             Association, Publication Date: Jun 18, 2008  

   [TS26.114-830] 3GPP TS 26.114 V8.3.0, "IP Multimedia Subsystem (IMS); 
             Multimedia telephony; Media handling and interaction", 
             Rapporteur: Per Froejdh, Version 8.3.0, 2009-06-12,  
             RTS/TSGS-0426114v830. 

8.2. Informative References 

   [A2DPV10] Bluetooth SIG, "Advanced Audio Distribution Profile", Audio 
             Video WG, adopted specification, revision V1.0, May 22th, 
             2003.  

   [celt-draft] J-M. Valin, T. Terriberry, G. Maxwell, C. Montgomery, 
             "Constrained-Energy Lapped Transform (CELT) Codec", 
             Internet draft, draft-valin-celt-codec-01, work in 
             progress, July 13, 2009. 

   [Gurevich2004] Gurevich, M., Chafe, C., Leslie, G., and Tyan, S., 
             "Simulation of Networked Ensemble Performance with Varying 
             Time Delays: Characterization of Ensemble Accuracy", 
             Proceedings of the 2004 International Computer Music 
             Conference, Miami, USA, 2004.  

   [Hoene2005] Hoene, C., and Karl, H., and Wolisz, A., "A perceptual 
             quality model intended for adaptive VoIP applications", 
             International Journal of Communication Systems, Wiley, 
             August 2005. 

   [SG16 314-WP3] ITU-T SG16, "Agenda and list of documents for Q9/16", 
             Temporary Document 314-WP3, Received on 2008-04-22 From 
             Rapporteur Q9/16. 

   [silk-draft] K. Vos, S. Jensen, K. Soerensen, "SILK Speech Codec", 
             Internet draft, draft-vos-silk-00.txt, work in progress, 
             July 6, 2009. 

   [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 
             Jacobson, "RTP: A Transport Protocol for Real-Time 
             Applications", STD 64, RFC 3550, July 2003. 
 
 
Hoene                 Expires February 17, 2010               [Page 22] 

Internet-Draft           Requirements of ACS                August 2009 
    

9. Acknowledgments 

   The authors like to thank the various contributors taking part at the 
   discussion on the Codec BOF mailing list in the period till September 
   2009. Also, this document is based on the SILK [silk-draft] and CELT 
   drafts, the internal requirement documents of ITU-T G.718 [SG16 314-
   WP3] and the 3GPP document [TS26.114-830].  

   The author likes to thank Henry Sinnreich for his valuable feedback 
   and support.  

   Funding for this draft has been provided by the University of 
   Tuebingen within the "Projektfoerderung fuer Nachwuchswissen-
   schaftler".   

   This document was prepared using 2-Word-v2.0.template.dot. 































 
 
Hoene                 Expires February 17, 2010               [Page 23] 

Internet-Draft           Requirements of ACS                August 2009 
    

Author's Address 

   Christian Hoene 
   University of Tuebingen 
   WSI-RI 
   Sand 13 
   72076 Tuebingen 
   Germany 
       
   Phone: +49 7071 2970532 
   Email: hoene@ieee.org 
    



































 
 
Hoene                 Expires February 17, 2010               [Page 24]