   CCAMP Working Group                                  P. Czezowski (FLA)
   Internet Draft                                         T. Soumiya (FLL)
   draft-czezowski-optical-recovery-reqs-01.txt                  (Editors)
   Expires: August 2003 
                                                             February 2003
 
 
               Optical Network Failure Recovery Requirements 
 
               draft-czezowski-optical-recovery-reqs-01.txt 
 
 
Status of this Memo 
 
   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of RFC2026 [1].  
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that      
   other groups may also distribute working documents as Internet-
   Drafts. 
    
   Internet-Drafts are draft documents valid for a maximum of six months 
   and may be updated, replaced, or obsoleted by other documents at any 
   time.  It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 
    
   The list of current Internet-Drafts can be accessed at 
        http://www.ietf.org/ietf/1id-abstracts.txt 
   The list of Internet-Draft Shadow Directories can be accessed at 
        http://www.ietf.org/shadow.html. 
 
 
Abstract 
    
   This draft presents requirements for control plane-based recovery 
   from data plane failures in pre-OTN networks.  pre-OTN networks are 
   transport networks that have a GMPLS-based control plane and various 
   transport plane technologies (such as Optical Cross Connects and 
   Optical Add/Drop Multiplexers).  An important feature of these 
   networks is timely recovery from failures, using either a protection 
   or a restoration scheme.  However, achieving recovery under strict 
   time constraints is a difficult problem.  Shared mesh-based recovery 
   is especially desirable because it reduces spare capacity and allows 
   more flexible recovery scenarios than ring-based networks.  
   Following a brief overview and discussion, the requirements are 
   presented as an itemized list in Section 3.4 of this document. 
    
 
 
Czezowski & Soumiya (Eds.)    Expires - August 2003           [Page 1] 
             draft-czezowski-optical-recovery-reqs-01.txt February 2003 
 
 
 
Conventions used in this document 
    
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in RFC-2119 [2]. 
    
Table of Contents 
    
   1. Introduction 
   2. Glossary of Terms Used 
   3. Failure Recovery Requirements 
      3.1 Overview of Recovery Requirements 
      3.2 Shared Mesh-based Recovery 
      3.3 Failure Notification Mechanisms 
      3.4 pre-OTN Network Failure Recovery Requirements 
   4. Security Considerations 
   5. Conclusions 
   References 
   Acknowledgments 
   Editors' Addresses 
   Contributing Authors 
    
    
1. Introduction 
    
   This draft describes requirements for control plane-based recovery 
   from data plane failures in pre-OTN Networks.  pre-OTN Networks are 
   transport networks that have a GMPLS-based [3] control plane and 
   various transport plane technologies (such as Optical Cross Connects 
   (OXC) and Optical Add/Drop Multiplexers (OADM)).  Service recovery 
   from failures, using either a protection or restoration scheme, is an 
   important feature of these networks to ensure high availability and 
   uninterrupted service.  Achieving service recovery under strict time 
   constraints is a difficult problem.  Several mechanisms for recovery 
   in mesh and ring topologies have been devised.  Protection and 
   restoration algorithms can be used for local repair (around failed 
   spans or nodes) or edge-to-edge recovery of an LSP.  Shared mesh-
   based recovery is especially desirable for reducing spare capacity 
   requirements and achieving flexible service recovery scenarios. 
          
   While edge-to-edge based recovery has the potential for efficient 
   redundancy requirements, it also entails the potentially lengthy 
   delay incurred in notifying all nodes along the recovery path of the 
   failure of a remote resource.  For some applications, recovery paths 
   must be chosen carefully to meet strict recovery time requirements 
   (e.g., 50ms).  

   There are currently several Internet Drafts in the Sub-IP Area 
   related to recovery in GMPLS networks. They cover the topics of 
   terminology [4], functional specification [5] and mechanisms analysis 
   [6] for recovery in GMPLS-based networks, and survivability 
   requirements and considerations for traffic engineered or 
   hierarchical networks [7,8].  As a set, these documents provide their 
   readers with detailed descriptions of the concepts and mechanisms 
   used in network recovery.  However, the list of requirements for 
   control plane-based recovery has not been specifically detailed in 
   any one document. 
 
 
2. Glossary of Terms Used 
    
   The following acronyms are used in this document: 
    
      o  GMPLS:   Generalized Multiprotocol Label Switching [3] 
      o  LMP:     Link Management Protocol [9] 
      o  LSP:     Label Switched Path  
      o  LSR:     Label Switching Router 
      o  OADM:    Optical Add/Drop Multiplexer 
      o  OTN:     Optical Transport Network 
      o  OXC:     Optical Cross-Connect 
      o  RSVP-TE: Resource Reservation Protocol-Traffic Engineering [10] 
 
   The terminology for GMPLS-based recovery is documented in [4]. These 
   terms are borrowed from a work in progress at the ITU-T [11]. Here, 
   we use the following terms from that document: 
 
      o  Detecting Entity (Failure Detection): An entity that detects a 
         failure or group of failures, thus providing a non-correlated 
         list of failures. 
      o  Reporting Entity (Failure Correlation and Notification): An 
         entity that can make an intelligent decision on fault 
         correlation and report the failure to the deciding entity. 
      o  Deciding Entity (part of the failure recovery decision 
         process): An entity that makes the recovery decision or selects 
         the recovery resources.  This entity communicates the decision 
         regarding the recovery actions to be performed to the impacted 
         LSPs/spans. 
      o  Recovery Entity (part of the failure recovery activation 
         process): Any entity that participates in the recovery of the 
         LSPs/spans. 
      o  Bridge: A bridge is the function that connects the normal 
         traffic and extra traffic to the working and recovery LSP/span, 
         respectively.  There are three types of bridges (Permanent 
         Bridge, Broadcast Bridge and Selector Bridge). 
      o  Selector: A selector is the function that extracts the normal 
         traffic either from the working or the recovery LSP/span. 
         There are two types of selectors (Selective Selector and 
         Merging Selector). 
      o  Recovery phases: 1. Failure Detection, 2. Failure Localization 
         and Isolation, 3. Failure Notification, 4. Recovery (Protection 
         or Restoration), 5. Reversion (Normalization) 
 
 
3. Failure Recovery Requirements 
    
   Even though some requirements for fault recovery have been discussed 
   in working groups of the Sub-IP area, several additional aspects 
   should be examined and mentioned regarding recovery in pre-OTN 
   networks.  In this section, we describe the fault recovery 
   requirements that we see.  For purposes of completeness, we do not 
   try to avoid restatement of requirements listed in other drafts. 
    
3.1 Overview of Recovery Requirements 
    
   This subsection summarizes the survivability requirements for pre-OTN 
   networks.  Greater detail on the requirements is provided in the 
   subsequent subsections. 
    
   The following classes (types) of recovery are required for span, LSP 
   segment, and LSP recovery: 
    
      o  Protection 
         - pre-computed route and pre-selected (i.e., cross- 
           connected) resources 
      o  Restoration 
         - pre-computed route and on-demand selection of resources 
         - on-demand route and on-demand selection of resources 
    
   A recovery scheme uses either protection or restoration (or both), 
   together with failure detection and notification mechanisms and 
   protocols.  Depending on the service specification, the timing bounds 
   for the recovery schemes range from 50ms (for local repair of 
   services carrying voice calls) to seconds (for low priority path-
   based repair). 
    
   For multi-layered networks, hold-off timers are required to allow 
   recovery at lower layers, and escalation must be supported.  Support 
   for horizontal hierarchy must also be included, because large 
   networks are usually segmented [7]. 
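
   As a rough illustration of the hold-off and escalation behavior 
   described above, the following Python sketch decides which layers 
   start their own recovery after a failure.  The layer names, timer 
   values, and the function name layers_that_act are hypothetical and 
   are not taken from any referenced specification. 

```python
from typing import List, Optional, Tuple

# (layer name, hold-off in ms before the layer acts,
#  time in ms the layer needs to recover, or None if it cannot recover)
Layer = Tuple[str, int, Optional[int]]


def layers_that_act(layers: List[Layer]) -> Tuple[List[str], Optional[int]]:
    """Return the layers that start recovery and when traffic is recovered.

    Layers are listed bottom-up with non-decreasing hold-off values.  All
    times are measured from the moment the failure is detected.
    """
    acted: List[str] = []
    recovered_at: Optional[int] = None
    for name, hold_off_ms, recovery_ms in layers:
        # A lower layer finished before this layer's timer expired,
        # so this layer (and every layer above it) stays idle.
        if recovered_at is not None and recovered_at <= hold_off_ms:
            break
        acted.append(name)
        if recovery_ms is not None:
            done = hold_off_ms + recovery_ms
            recovered_at = done if recovered_at is None else min(recovered_at, done)
    return acted, recovered_at


# Hypothetical stack: the optical layer recovers well within the
# hold-off of the layer above it, so no escalation takes place.
print(layers_that_act([("optical", 0, 40), ("sonet", 100, 50), ("ip", 1000, 2000)]))

# Escalation: the optical layer cannot recover, so the next layer acts
# once its hold-off timer expires.
print(layers_that_act([("optical", 0, None), ("sonet", 100, 50), ("ip", 1000, 2000)]))
```

   The sketch also shows why hold-off values need care: if an upper 
   layer's timer is shorter than the lower layer's recovery time, both 
   layers act on the same failure. 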
    
   In general, recovery schemes are required to operate in a stable and 
   cooperative manner to maximize the network's reliability and 
   availability.  Such requirements entail that the recovery schemes 
   also be resource efficient and as flexible as possible with respect 
   to types of failures, service classes, and the network operator's 
   policies. 
    
   A temporal model of fault recovery is shown in Figure 1 below.  The 
   diagram is adapted from [11]. 
 
 
          +-Network Impairment   
          |    +-Fault Detection   
          |    |    +-Start of Fault Notification 
          |    |    |    +-Start of Traffic Switching 
          |    |    |    |    +-Recovery Operation Complete 
          |    |    |    |    |    +-Traffic Recovered 
          |    |    |    |    |    | 
          v    v    v    v    v    v 
         -----------------------------------------------> 
          | T1 | T2 | T3 | T4 | T5 |               time 
 
   Figure 1. Recovery temporal model. 
    
    
   The five recovery phases shown in the figure are (using the terms 
   from [4]): 
 
      1. Failure Detection - The time between the network impairment and 
         the detection at the control plane (via a technology-dependent 
         interface of the transport plane). 
 
      2. Failure Localization and Isolation - The time between when the 
         detecting entity has detected a fault, and when the reporting 
         entity starts the fault-recovery process.  This time assumes 
         that the fault-recovery process at a given layer may wait for 
         restoration or recovery to occur at another layer.  The 
         reporting entity also performs failure correlation to reduce 
         the number of notifications to be sent to the deciding entity. 
    
      3. Failure Notification - The time between when the reporting 
         entity starts the notifications and when all the necessary 
         deciding and recovering entities have received the failure 
         notifications. 
 
      4. Recovery (Protection or Restoration) - The time between the 
         first and last recovery actions, after which the recovery path 
         is carrying traffic. 

      5. Reversion (Normalization) - The time (after recovery) until the 
         original working path has been repaired and begins to carry the 
         traffic again. 
 
   Together, phases 1 and 2 are called Fault Management.  Because 
   notifications may have to reach every node along a recovery path, 
   the Failure Notification phase is the critical component in 
   guaranteeing the time constraints for service recovery.  A recovery 
   scheme should follow these phases.  The scheme should also allow the 
   network operator to choose whether or not reversion is performed. 
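
   The temporal model of Figure 1 can be read as a simple time budget: 
   traffic is recovered after T1 + T2 + T3 + T4 + T5.  The Python 
   sketch below is one way to check such a budget.  The class and 
   function names, the per-hop notification delay model, and all 
   numeric values are illustrative assumptions, not measurements. 

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RecoveryTimeline:
    """Durations (ms) of the intervals T1..T5 in Figure 1."""
    t1_detection: int      # impairment -> fault detection
    t2_hold: int           # detection -> start of fault notification
    t3_notification: int   # notification start -> start of traffic switching
    t4_switching: int      # switching start -> recovery operation complete
    t5_completion: int     # operation complete -> traffic recovered

    def total_ms(self) -> int:
        """Time from network impairment until traffic is recovered."""
        return sum(getattr(self, f.name) for f in fields(self))

    def meets(self, budget_ms: int) -> bool:
        """True if the whole recovery fits in the given budget."""
        return self.total_ms() <= budget_ms

    def dominant_interval(self) -> str:
        """Name of the longest interval (first one wins on a tie)."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


def notification_ms(hops: int, per_hop_ms: int) -> int:
    """Assumed model: notification delay grows linearly with hop count."""
    return hops * per_hop_ms


# Hypothetical values: a 4-hop versus an 8-hop notification path.
short_path = RecoveryTimeline(10, 5, notification_ms(4, 5), 10, 3)
long_path = RecoveryTimeline(10, 5, notification_ms(8, 5), 10, 3)

for label, timeline in (("4 hops", short_path), ("8 hops", long_path)):
    print(label, timeline.total_ms(), timeline.meets(50),
          timeline.dominant_interval())
```

   With these assumed numbers the notification interval dominates in 
   both cases, and only the shorter notification path fits a 50ms 
   budget. 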
 
3.2 Shared Mesh-based Recovery 
    
   In non-WDM optical networks, such as Synchronous Optical Network /  
   Synchronous Digital Hierarchy (SONET/SDH), conventional protection 
   techniques are currently the most commonly used.  These techniques 
   are based on linear and ring network topologies.  Linear protection 
   can be categorized as 1+1 and 1:N protection.  Ring protection can be 
   categorized as uni-directional path switched ring (UPSR) and bi-
   directional line switched ring (BLSR). 
    
   However, linear 1+1 protection requires 100% redundancy in the spare 
   resources for every working path.  For ring-based protection, the 
   available topology is restricted to a ring, and it requires 100% 
   redundancy in the spare resources for every working path.  Even with 
   1:N based link protection, it is difficult to select different routes 
   flexibly.  From this point of view, 1+1 and 1:N protection are 
   extravagant in resource usage and have low flexibility, even though 
   the level and speed of recovery from a failure can be assured.  For 
   reasons of efficiency and flexibility, pre-OTN network recovery 
   schemes should support shared mesh-based recovery.  
    
   Shared mesh recovery can save resources by sharing recovery capacity 
   among multiple working paths.  This approach increases the system 
   flexibility, because the possibility of sharing recovery resources 
   may allow for more options when routing working paths and recovery 
   paths.  Furthermore, this flexibility facilitates fast recovery 
   because the shared mesh provides more (suitable intermediate) nodes 
   for the routing of the recovery paths.  Having more candidates 
   increases the chances of finding shorter recovery paths, which 
   reduces the notification time. 
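
   The resource saving can be made concrete with a small calculation.  
   The Python sketch below compares the spare capacity needed on each 
   link when recovery capacity is dedicated per LSP with the capacity 
   needed when it is shared.  It assumes that only one link fails at a 
   time, so recovery paths whose working paths have no link in common 
   may share capacity.  The LSPs, link names, and demands are 
   hypothetical. 

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple


class Lsp(NamedTuple):
    name: str
    demand: int                      # capacity units carried by the LSP
    working_links: FrozenSet[str]    # links used by the working path
    recovery_links: FrozenSet[str]   # links used by the recovery path


def dedicated_spare(lsps: List[Lsp]) -> Dict[str, int]:
    """Spare capacity per link when every LSP reserves its own."""
    spare: Dict[str, int] = defaultdict(int)
    for lsp in lsps:
        for link in lsp.recovery_links:
            spare[link] += lsp.demand
    return dict(spare)


def shared_spare(lsps: List[Lsp]) -> Dict[str, int]:
    """Spare capacity per link under the single-link-failure assumption.

    For each recovery link, add up the demand that moves onto it for
    each possible failed working link, then keep the worst case.
    """
    per_failure: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for lsp in lsps:
        for failed in lsp.working_links:
            for link in lsp.recovery_links:
                per_failure[link][failed] += lsp.demand
    return {link: max(by_failure.values())
            for link, by_failure in per_failure.items()}


# Hypothetical LSPs: lsp1 and lsp3 share working link A-B, lsp2 does not.
lsps = [
    Lsp("lsp1", 2, frozenset({"A-B"}), frozenset({"A-E", "E-B"})),
    Lsp("lsp2", 3, frozenset({"C-D"}), frozenset({"C-E", "E-B"})),
    Lsp("lsp3", 1, frozenset({"A-B"}), frozenset({"A-E", "E-B"})),
]
print(dedicated_spare(lsps))
print(shared_spare(lsps))
```

   In this example, link E-B needs 6 units with dedicated recovery 
   capacity but only 3 units when the capacity is shared, because the 
   failures of A-B and C-D are assumed not to occur together. 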
 
3.3 Failure Notification Mechanisms 
    
   In general, there are two alternatives for control plane based 
   failure notification: 
    
      o  Failure notification messages based on modified GMPLS signaling 
      o  Controlled flooding of failure notification messages 
    
   The GMPLS signaling protocol, RSVP-TE [10], supports notification 
   using a Notify message.  Under this scheme, the deciding entity pre-
   arranges to receive the notifications by sending a Notify Request 
   object in the Path or Resv messages.  Since additional (extra) Notify 
   Request objects in an RSVP-TE message are ignored, a detecting (or 
   reporting) entity sends Notify messages to only one deciding entity 
   per LSP. 
    
   The recovery process uses a two- or three-phase method.  In the first 
   phase, the reporting entity sends the notification to the deciding 
   entity.  The deciding entity then begins a one- or two-phase 
   signaling exchange down (or down and back) the recovery LSP. 
    
   The controlled flooding of fiber link failure notification messages 
   on the control plane, perhaps by extending LMP [9], is another 
   alternative for failure notification.  Flooding the notifications in 
   one shot to an appropriate portion of the network ensures their 
   timely delivery.  This supports recovery schemes that require policy 
   or priority-based decisions at multiple deciding entities that may be 
   distributed, within the network, off the working path. 
    
   To meet the time constraints for recovery, two delays must be 
   minimized: the time the reporting entity spends on failure 
   correlation and aggregation, and the time that elapses before all 
   entities involved in the recovery have received a failure 
   notification (or recovery action) signal.  Flooded messages take the 
   shortest available paths to all of these entities. 
    
    
                              +---+  
                         .....| E |.............. 
                         :    +---+             : 
                         :                      : 
              +---+    +---+   \ /   +---+    +---+ 
           ===| A |====| B |====X====| C |====| D |=== 
              +---+    +---+   / \   +---+    +---+ 
                :                               : 
                :      +---+         +---+      : 
                :......| F |.........| G |......: 
                       +---+         +---+ 
    
   Figure 2. Multiple (partial) recovery paths protecting against the 
             failure of link BC. 
    
    
   Figure 2 above shows a network in which a failure occurs on link BC.  
   The working LSPs follow the route ABCD, and two (dotted) recovery 
   paths have been reserved, but not activated.  Recovery paths BED and 
   AFGD are each responsible for recovering a portion of the working 
   capacity on link BC.  In this case, nodes A, B, D, E, F, and G must 
   all receive a notification of the failure and perform reconfiguration 
   actions.  A flooding-based approach to fault notification not only 
   has the benefit of reaching all recovery nodes in the shortest time 
   possible, but also has the beneficial side effect that all nodes in 
   the vicinity of the failure receive the notification.  Therefore, it 
   is possible for other nodes in the neighborhood of the failure (say, 
   Node H and Node I, not shown in Figure 2) to use this information in 
   making policy or priority-based decisions, such as dynamically 
   rerouting low-priority LSPs around the neighborhood to free up 
   capacity, or blocking new LSP requests that do not have a high enough 
   priority value. 
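
   A minimal Python sketch of controlled flooding on the topology of 
   Figure 2 follows.  It assumes that control plane adjacencies mirror 
   the links drawn in the figure, that the failed link BC carries no 
   control traffic, and that node B originates the notification.  The 
   hop limit and the duplicate suppression rule are illustrative 
   choices, and the function name flood is hypothetical. 

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def flood(links: Iterable[Tuple[str, str]], origin: str,
          max_hops: int) -> Dict[str, int]:
    """Return the hop count at which each reached node first hears the
    notification.  Each node forwards a notification only once, and
    nodes at the hop limit do not forward it any further."""
    neighbors: Dict[str, Set[str]] = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue                      # scope limit reached
        for peer in sorted(neighbors.get(node, ())):
            if peer not in hops:          # duplicate suppression
                hops[peer] = hops[node] + 1
                queue.append(peer)
    return hops


# Links of Figure 2 without the failed link BC (assumed control topology).
figure2_links = [("A", "B"), ("C", "D"), ("B", "E"), ("E", "D"),
                 ("A", "F"), ("F", "G"), ("G", "D")]

reached = flood(figure2_links, origin="B", max_hops=3)
for node in sorted(reached, key=lambda n: (reached[n], n)):
    print(node, reached[node])
```

   Under these assumptions a hop limit of 3 reaches every node that 
   must reconfigure (A, B, D, E, F, and G), whereas a hop limit of 2 
   would leave out G. 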
    
3.4 pre-OTN Network Failure Recovery Requirements 
    
   This is our list of recovery requirements: 
    
   o  Requirements on the efficiency of working and recovery bandwidth 
    
      (1) A recovery scheme SHOULD allow efficient use of working LSP 
   bandwidth using such measures as route optimization, taking into 
   account route dependencies between a working path and its recovery 
   path. 
    
      (2) A recovery scheme SHOULD allow efficient use of recovery LSP 
   bandwidth using such measures as route optimization, taking into 
   account route dependencies between a working path and its recovery 
   path. 
    
      (3) A recovery scheme SHOULD, when possible, allow sharing of 
   recovery bandwidth among multiple recovery paths to enable efficient 
   use of recovery bandwidth. 
 
   o  Requirements on recovery actions 
    
      (4) A recovery scheme SHOULD allow suppression of fault 
   notification messages, so that spurious fault notification messages 
   and recovery action messages are suppressed and are not broadcast 
   within the network, ensuring scalability of the fault recovery 
   mechanism. 
    
      (5) A recovery scheme SHOULD ensure reliable transmission of fault 
   recovery messages, provided that the control plane remains connected. 
    
      (6) A recovery scheme SHOULD allow fallback operations of its 
   recovery actions.  For example, when the system encounters a fault 
   class (e.g., multiple simultaneous failures) which was not 
   anticipated, the system should execute a best-effort recovery, such 
   that as many working paths as possible are restored under the 
   circumstances. 
    
      (7) A recovery scheme SHOULD allow the network operator to choose 
   whether or not the reversion actions are to be performed. 
 
      (8) A recovery scheme SHOULD support recovery within bounded time 
   constraints and MAY be compliant with generally used recovery times 
   like 50ms for SONET/SDH protection. 
    
      (9) A recovery scheme SHOULD allow testing and verification of the 
   availability of the recovery path before its actual use.  This 
   testing may occur when the recovery path is provisioned, or after it 
   is provisioned but before actual recovery action occurs, causing the 
   path to be used. 
    
      (10) A recovery scheme SHOULD guarantee that recovery actions 
   correctly deliver traffic from working paths to the respective 
   recovery paths, such that the recovery actions do not result in any 
   unintended connections or unintended diversion of traffic. 
 
   o  Requirements on recovery schemes 
    
      (11) A recovery scheme SHOULD support and be compliant with 
   generally used protection schemes such as 1+1, 1:1, 1:N, M:N, and 
   unprotected. 
    
      (12) A recovery scheme SHOULD support recovery of failed LSPs even 
   if the LSPs have different endpoints. 
    
      (13) A recovery scheme SHOULD support priority-based recovery of 
   failed LSPs.  This means that path restoration should be ordered 
   according to each LSP's recovery priority.  
    
   o  Requirements on recovery priority of service classes 
    
      (14) A recovery scheme SHOULD allow recovery of service classes 
   based on their recovery priority, which is a continuous spectrum from 
   lowest priority (best effort) to the highest priority (guaranteed), 
   based on the service class usage and a carrier's agreements with its  
   customers. 
    
      (15) A recovery scheme SHOULD allow support of service classes 
   with different recovery time guarantees.  For example, the authors 
   estimate that a service class carrying voice calls requires a 
   recovery time of less than 50ms to avoid loss of connections, whereas 
   a service class carrying private lines requires a recovery time on 
   the order of several seconds. 
    
   o  Requirements on recovery granularity 
    
      (16) A recovery scheme SHOULD allow recovery of traffic on an 
   aggregated basis, ensuring scalability. 
    
   o  Requirements on failure notification delivery 
    
      (17) A recovery scheme SHOULD be equipped with a failure 
   notification mechanism that guarantees prompt and reliable delivery 
   of notification of faults in the data plane to a deciding entity that 
   is in charge of recovering the fault. 
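
   To illustrate how requirements (6) and (13) could interact, the 
   Python sketch below restores failed LSPs in priority order on shared 
   recovery capacity and falls back to best effort when the capacity 
   does not suffice for all of them.  The convention that a lower 
   number means a higher recovery priority, the LSP names, and the 
   capacities are assumptions made for this example only. 

```python
from typing import Dict, List, NamedTuple, Tuple


class FailedLsp(NamedTuple):
    name: str
    priority: int                     # assumption: lower value = restored first
    demand: int                       # capacity units needed on each link
    recovery_links: Tuple[str, ...]   # distinct links of the recovery path


def restore_by_priority(failed: List[FailedLsp],
                        spare: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Restore LSPs in priority order while spare capacity lasts.

    Returns (restored, not_restored).  An LSP that does not fit is
    skipped, so lower-priority LSPs that still fit are restored
    (best-effort fallback).
    """
    remaining = dict(spare)           # do not modify the caller's view
    restored: List[str] = []
    not_restored: List[str] = []
    for lsp in sorted(failed, key=lambda l: (l.priority, l.name)):
        fits = all(remaining.get(link, 0) >= lsp.demand
                   for link in lsp.recovery_links)
        if fits:
            for link in lsp.recovery_links:
                remaining[link] -= lsp.demand
            restored.append(lsp.name)
        else:
            not_restored.append(lsp.name)
    return restored, not_restored


# Hypothetical failure: three LSPs compete for 3 units on path B-E-D.
spare = {"B-E": 3, "E-D": 3}
failed = [
    FailedLsp("bulk", 2, 1, ("B-E", "E-D")),
    FailedLsp("voice", 0, 2, ("B-E", "E-D")),
    FailedLsp("private-line", 1, 2, ("B-E", "E-D")),
]
print(restore_by_priority(failed, spare))
```

   Skipping an LSP that does not fit, rather than stopping at it, is a 
   design choice of this sketch: it restores as many working paths as 
   possible, at the cost of a lower-priority LSP being restored ahead 
   of a higher-priority one that could not be accommodated. 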
 
 
4. Security Considerations 
    
   This draft does not introduce any new security issues. 
    
    
5. Conclusions 
    
   This draft describes requirements for control plane-based recovery 
   from data plane failures in pre-OTN networks.  While there are 
   currently several Internet Drafts in the Sub-IP Area related to 
   service recovery in GMPLS networks, the list of requirements for 
   control plane-based recovery has not been specifically detailed in 
   any one document.  We identify the most important requirements as 
   meeting potentially strict timing constraints, enabling flexible 
   recovery schemes, and facilitating the efficient use of resources.  
   Seventeen requirements are listed in Section 3.4. 
    
    
References
                     
   [1]  Bradner, S., "The Internet Standards Process -- Revision 3", BCP 
        9, RFC 2026, October 1996. 
    
   [2]  Bradner, S., "Key words for use in RFCs to Indicate Requirement 
        Levels", BCP 14, RFC 2119, March 1997. 
    
   [3]  Mannie, E. (Ed.), "Generalized Multi-Protocol Label Switching 
        (GMPLS) Architecture", Internet Draft, work in progress, draft-
        ietf-ccamp-gmpls-architecture-03.txt, August 2002. 
    
   [4]  Mannie, E. and D. Papadimitriou (Eds.), "Recovery (Protection 
        and Restoration) Terminology for GMPLS", Internet Draft, work in 
        progress, draft-ietf-ccamp-gmpls-recovery-terminology-01.txt, 
        November 2002. 
    
   [5]  Lang, J.P. and B. Rajagopalan (Eds.), "Generalized MPLS Recovery 
        Functional Specification", Internet Draft, work in progress, 
        draft-ietf-ccamp-gmpls-recovery-functional-00.txt, January 2003. 
    
   [6]  Papadimitriou, D. and E. Mannie (Eds.), "Analysis of Generalized 
        MPLS-based Recovery Mechanisms (including Protection and 
        Restoration)", Internet Draft, work in progress, draft-ietf-
        ccamp-gmpls-recovery-analysis-00.txt, January 2003. 
    
   [7]  Lai, W.S., and D. McDysan (Eds.), "Network Hierarchy and 
        Multilayer Survivability", RFC 3386, November 2002. 
    
   [8]  Owens, K., et al., "Network Survivability Considerations for 
        Traffic Engineered IP Networks", Internet Draft, work in 
        progress, draft-owens-te-network-survivability-03.txt, May 2002. 
    
   [9]  Lang, J. (Ed.), "Link Management Protocol (LMP)", Internet 
        Draft, work in progress, draft-ietf-ccamp-lmp-07.txt, November 
        2002. 
    
   [10] Berger, L. (Ed.), "Generalized MPLS Signaling - RSVP-TE 
        Extensions", Internet Draft, work in progress, draft-ietf-mpls-
        generalized-rsvp-te-09.txt, September 2002. 
    
   [11] ITU-T Draft Recommendation G.gps, "Generic Protection 
        Switching", work in progress, April 2002. 
 
 
Acknowledgments 
    
   The following individuals provided valuable input to this draft: 
   Richard Rabbat, Ching-Fong Su and Takafumi Chujo of Fujitsu Labs of 
   America, Inc., Norihiko Shinomiya and Akira Chugo of Fujitsu 
   Laboratories, Ltd. 
 
 
Editors' Addresses 
    
   Peter Czezowski                  Toshio Soumiya 
   Fujitsu Labs of America, Inc.    Fujitsu Laboratories Ltd. 
   595 Lawrence Expressway          1-1, Kamikodanaka 4-Chome 
   Sunnyvale, CA 94085              Nakahara-ku, Kawasaki 
   United States of America         211-8588, Japan 
   Phone: +1-408-530-4516           Phone: +81-44-754-2765 
   Email: peterc@fla.fujitsu.com    Email: soumiya.toshio@jp.fujitsu.com 
 
 
Contributing Authors 
    
   Peter Czezowski (see address information above) 
    
   Toshio Soumiya (see address information above) 
    
   Kohei Shiomoto                          
   NTT Network Innovation Laboratories                 
   Midori-machi 3-9-11, Musashino-shi 
   Tokyo, Japan 180-8585 
   Phone: +81-422-59-4402 
   Email: Shiomoto.Kohei@lab.ntt.co.jp 
    
   Shoichiro Seno 
   Mitsubishi Electric Corporation 
   5-1-1 Ofuna, Kamakura 
   Kanagawa, Japan 247-8501 
   Phone: +81-467-41-2430 
   Email: senos@isl.melco.co.jp 