CCAMP Working Group                                   P. Czezowski (FLA)
Internet Draft                                          T. Soumiya (FLL)
draft-czezowski-optical-recovery-reqs-01.txt                   (Editors)
Expires: August 2003
                                                           February 2003
Optical Network Failure Recovery Requirements
draft-czezowski-optical-recovery-reqs-01.txt
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026 [1].
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This draft presents requirements for control plane-based recovery
from data plane failures in pre-OTN networks. Pre-OTN networks are
transport networks that have a GMPLS-based control plane and various
transport plane technologies (such as Optical Cross Connects and
Optical Add/Drop Multiplexers). An important feature of these
networks is timely recovery from failures, using either a protection
or restoration scheme. However, achieving recovery under strict time
constraints is a difficult problem. Shared mesh-based recovery is
especially desirable because it reduces spare capacity requirements
and allows more flexible recovery scenarios than ring-based networks.
Following a brief overview and discussion, the requirements are
presented as an itemized list in section 3.4 of this document.
Czezowski & Soumiya (Eds.) Expires - August 2003 [Page 1]
draft-czezowski-optical-recovery-reqs-01.txt February 2003
Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC-2119 [2].
Table of Contents
1. Introduction
2. Glossary of Terms Used
3. Failure Recovery Requirements
   3.1 Overview of Recovery Requirements
   3.2 Shared Mesh-based Recovery
   3.3 Failure Notification Mechanisms
   3.4 pre-OTN Network Failure Recovery Requirements
4. Security Considerations
5. Conclusions
References
Acknowledgments
Editors' Addresses
Contributing Authors
1. Introduction
This draft describes requirements for control plane-based recovery
from data plane failures in pre-OTN networks. Pre-OTN networks are
transport networks that have a GMPLS-based [3] control plane and
various transport plane technologies, such as Optical Cross Connects
(OXC) and Optical Add/Drop Multiplexers (OADM). Service recovery
from failures, using either a protection or restoration scheme, is an
important feature of these networks to ensure high availability and
uninterrupted service. Achieving service recovery under strict time
constraints is a difficult problem. Several mechanisms for recovery
in mesh and ring topologies have been devised. Protection and
restoration algorithms can be used for local repair (around failed
spans or nodes) or edge-to-edge recovery of an LSP. Shared
mesh-based recovery is especially desirable for reducing spare
capacity requirements and achieving flexible service recovery
scenarios. While edge-to-edge recovery has the potential to use
redundant resources efficiently, it also entails a potentially
lengthy delay in notifying all nodes along the recovery path of the
failure of a remote resource. For some applications, recovery paths
must be chosen carefully to meet strict recovery time requirements
(e.g., 50ms).
There are currently several Internet Drafts in the Sub-IP Area
related to recovery in GMPLS networks. They cover the topics of
terminology [4], functional specification [5] and mechanisms analysis
[6] for recovery in GMPLS-based networks, and survivability
requirements and considerations for traffic engineered or
hierarchical networks [7,8]. As a set, these documents provide their
readers with detailed descriptions of the concepts and mechanisms
used in network recovery. However, the list of requirements for
control plane-based recovery has not been specifically detailed in
any one document.
2. Glossary of Terms Used
The following acronyms are used in this document:
o GMPLS: Generalized Multiprotocol Label Switching [3]
o LMP: Link Management Protocol [9]
o LSP: Label Switched Path
o LSR: Label Switched Router
o OADM: Optical Add/Drop Multiplexer
o OTN: Optical Transport Network
o OXC: Optical Cross-Connect
o RSVP-TE: Resource Reservation Protocol-Traffic Eng. [10]
The terminology for GMPLS-based recovery is documented in [4]. These
terms are borrowed from a work in progress at the ITU-T [11]. Here,
we use the following terms from that document:
o Detecting Entity (Failure Detection): An entity that detects a
failure or group of failures, thus providing a non-correlated
list of failures.
o Reporting Entity (Failure Correlation and Notification): An
entity that can make an intelligent decision on fault
correlation and report the failure to the deciding entity.
o Deciding Entity (part of the failure recovery decision
process): An entity that makes the recovery decision or selects
the recovery resources. This entity communicates the decision
regarding the recovery actions to be performed to the impacted
LSPs/spans.
o Recovery Entity (part of the failure recovery activation
process): Any entity that participates in the recovery of the
LSPs/spans.
o Bridge: A bridge is the function that connects the normal
traffic and extra traffic to the working and recovery LSP/span,
respectively. There are three types of bridges (Permanent
Bridge, Broadcast Bridge and Selector Bridge).
o Selector: A selector is the function that extracts the normal
traffic either from the working or the recovery LSP/span.
There are two types of selectors (Selective Selector and
Merging Selector).
o Recovery phases: 1. Failure Detection, 2. Failure Localization
and Isolation, 3. Failure Notification, 4. Recovery (Protection
or Restoration), 5. Reversion (Normalization)
3. Failure Recovery Requirements
Even though some requirements for fault recovery have been discussed
in working groups of the Sub-IP area, several additional aspects
should be examined and mentioned regarding recovery in pre-OTN
networks. In this section, we describe the fault recovery
requirements that we see. For purposes of completeness, we do not
try to avoid restatement of requirements listed in other drafts.
3.1 Overview of Recovery Requirements
This subsection summarizes the survivability requirements for pre-OTN
networks. Further detail on the requirements is provided in the
subsequent subsections.
The following classes (types) of recovery are required for span, LSP
segment, and LSP recovery:
o Protection
- pre-computed route and pre-selected (i.e., cross-
connected) resources
o Restoration
- pre-computed route and on-demand selection of resources
- on-demand route and on-demand selection of resources
A recovery scheme uses either protection or restoration (or both),
together with failure detection and notification mechanisms and
protocols. Depending on the service specification, the timing bounds
for the recovery schemes range from 50ms (for local repair of
services carrying voice calls) to seconds (for low priority path-
based repair).
For multi-layered networks, hold-off timers are required to allow
recovery at lower layers, and escalation must be supported. Support
for horizontal hierarchy must also be included, because large
networks are usually segmented [7].
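As a rough illustration of the hold-off and escalation behaviour
described above, the following Python sketch models a higher layer
that waits for a hold-off interval before starting its own recovery.
The function name, the parameter names, and the millisecond values
are assumptions made for this example; they are not taken from any
specification.

```python
from typing import Optional


def higher_layer_action(hold_off_ms: float,
                        lower_layer_recovery_ms: Optional[float]) -> str:
    """Decide what a higher layer does after it sees a failure.

    hold_off_ms: how long the higher layer waits before acting
        (hypothetical value chosen by the operator).
    lower_layer_recovery_ms: time the lower layer needed to recover,
        or None if the lower layer could not recover at all.
    """
    if (lower_layer_recovery_ms is not None
            and lower_layer_recovery_ms <= hold_off_ms):
        # The lower layer repaired the fault within the hold-off window,
        # so the higher layer does not start its own recovery.
        return "no action"
    # The hold-off timer expired without lower-layer recovery: escalate.
    return "escalate"
```

For example, with a 100ms hold-off timer, a lower layer that recovers
in 50ms suppresses higher-layer recovery, whereas a lower layer that
never recovers causes escalation.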
In general, recovery schemes are required to operate in a stable and
cooperative manner to maximize the network's reliability and
availability. Such requirements entail that the recovery schemes
also be resource efficient and as flexible as possible with respect
to types of failures, service classes, and the network operator's
policies.
A temporal model of fault recovery is shown in Figure 1 below. The
diagram is adapted from [11].
+-Network Impairment
|    +-Fault Detection
|    |    +-Start of Fault Notification
|    |    |    +-Start of Traffic Switching
|    |    |    |    +-Recovery Operation Complete
|    |    |    |    |    +-Traffic Recovered
|    |    |    |    |    |
v    v    v    v    v    v
----------------------------------------------->
| T1 | T2 | T3 | T4 | T5 |                  time
Figure 1. Recovery temporal model.
The five recovery phases shown in the figure are (using the terms
from [4]):
1. Failure Detection - The time between the network impairment and
the detection at the control plane (via a technology-dependent
interface of the transport plane).
2. Failure Localization and Isolation - The time between when the
detecting entity has detected a fault, and when the reporting
entity starts the fault-recovery process. This time assumes
that the fault-recovery process at a given layer may wait for
restoration or recovery to occur at another layer. The
reporting entity also performs failure correlation to reduce
the number of notifications to be sent to the deciding entity.
3. Failure Notification - The time between when the reporting
entity starts the notifications and when all the necessary
deciding and recovering entities have received the failure
notifications.
4. Recovery (Protection or Restoration) - The time between the
first and last recovery actions, after which the recovery path
is carrying traffic.
5. Reversion (Normalization) - The time (after recovery) until the
original working path has been repaired and begins to carry the
traffic again.
Together, phases 1 and 2 are called Fault Management. The critical
component in guaranteeing the time constraints for service recovery
is the Failure Notification phase. A recovery scheme should proceed
through the phases listed above. The scheme should also allow the
network operator to choose whether or not reversion is performed.
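The temporal model of Figure 1 can be read as a simple time budget:
the time from network impairment to traffic recovered is the sum of
the intervals T1 through T5, and that sum has to fit within the
recovery time bound of the service. The Python sketch below
illustrates this reading. The interval durations are hypothetical
values chosen only for the example, and the function names are our
own.

```python
from typing import Dict

# Hypothetical durations, in milliseconds, for the intervals of Figure 1.
INTERVALS_MS: Dict[str, float] = {
    "T1": 10.0,  # network impairment -> fault detection
    "T2": 5.0,   # fault detection -> start of fault notification
    "T3": 20.0,  # start of notification -> start of traffic switching
    "T4": 8.0,   # start of switching -> recovery operation complete
    "T5": 2.0,   # recovery operation complete -> traffic recovered
}


def time_to_recover_ms(intervals: Dict[str, float]) -> float:
    """Total time from network impairment to traffic recovered."""
    return sum(intervals[name] for name in ("T1", "T2", "T3", "T4", "T5"))


def meets_bound(intervals: Dict[str, float], bound_ms: float) -> bool:
    """True when the summed intervals fit within the recovery time bound."""
    return time_to_recover_ms(intervals) <= bound_ms


def largest_interval(intervals: Dict[str, float]) -> str:
    """Name of the interval that consumes the most of the budget."""
    return max(intervals, key=lambda name: intervals[name])
```

With these assumed values the total is 45ms, which fits a 50ms bound,
and the notification interval T3 is the largest single contributor.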
3.2 Shared Mesh-based Recovery
In non-WDM optical networks, such as Synchronous Optical Network /
Synchronous Digital Hierarchy (SONET/SDH), conventional protection
techniques are currently the most commonly used. These techniques
are based on linear and ring network topologies. Linear protection
can be categorized as 1+1 and 1:N protection. Ring protection can be
categorized as uni-directional path switched ring (UPSR) and bi-
directional line switched ring (BLSR).
However, linear 1+1 protection requires 100% redundancy in the spare
resources for every working path. For ring-based protection, the
available topology is restricted to a ring, and it requires 100%
redundancy in the spare resources for every working path. Even with
1:N based link protection, it is difficult to select different routes
flexibly. From this point of view, 1+1 and 1:N protection are
extravagant in resource usage and have low flexibility, even though
the level and speed of recovery from a failure can be assured. For
reasons of efficiency and flexibility, pre-OTN network recovery
schemes should support shared mesh-based recovery.
Shared mesh recovery can save resources by sharing recovery capacity
among multiple working paths. This approach increases the system
flexibility, because the possibility of sharing recovery resources
may allow for more options when routing working paths and recovery
paths. Furthermore, this flexibility facilitates fast recovery
because the shared mesh provides more (suitable intermediate) nodes
for the routing of the recovery paths. Having more candidates
increases the chances of finding shorter recovery paths, which
reduces the notification time.
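The resource saving from sharing can be made concrete with a small
calculation. The Python sketch below compares the spare capacity
that one link needs when every recovery path gets dedicated capacity
with the capacity it needs when recovery capacity is shared. It
assumes that only one link fails at a time, so that recovery paths
whose working paths have no link in common never need the shared
capacity simultaneously. The data model, the link names, and the
bandwidth values are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Demand:
    bandwidth: int
    working: FrozenSet[str]   # links on the working path
    recovery: FrozenSet[str]  # links on the recovery path


def dedicated_spare(demands: List[Demand], link: str) -> int:
    """Spare capacity on `link` if no recovery capacity is shared."""
    return sum(d.bandwidth for d in demands if link in d.recovery)


def shared_spare(demands: List[Demand], link: str) -> int:
    """Spare capacity on `link` with sharing, assuming single-link failures.

    For each possible failed link, add up the demands that lose their
    working path and recover over `link`; the worst case is the answer.
    """
    users = [d for d in demands if link in d.recovery]
    candidate_failures = set().union(*(d.working for d in users))
    return max(
        (sum(d.bandwidth for d in users if failed in d.working)
         for failed in candidate_failures),
        default=0,
    )


# Hypothetical demands: the first and third share working link "A-B".
DEMANDS = [
    Demand(10, frozenset({"A-B"}), frozenset({"E-F"})),
    Demand(10, frozenset({"C-D"}), frozenset({"E-F"})),
    Demand(5, frozenset({"A-B", "B-C"}), frozenset({"E-F", "F-G"})),
]
```

In this example, link "E-F" needs 25 units of dedicated spare
capacity but only 15 units when the capacity is shared, because the
worst single failure (link "A-B") affects only two of the three
demands.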
3.3 Failure Notification Mechanisms
In general, there are two alternatives for control plane based
failure notification:
o Failure notification messages based on modified GMPLS signaling
o Controlled flooding of failure notification messages
The GMPLS signaling protocol, RSVP-TE [10], supports notification
using a Notify message. Under this scheme, the deciding entity pre-
arranges to receive the notifications by sending a Notify Request
object in the Path or Resv messages. Since additional (extra) Notify
Request objects in an RSVP-TE message are ignored, a detecting (or
reporting) entity sends Notify messages to only one deciding entity
per LSP.
The recovery process uses a two- or three-phase method. In the first
phase, the reporting entity sends the notification to the deciding
entity. The deciding entity then begins a one- or two-phase signaling
exchange down (or down and back) the recovery LSP.
The controlled flooding of fiber link failure notification messages
on the control plane, perhaps by extending LMP [9], is another
alternative for failure notification. Flooding the notifications in
one shot to an appropriate portion of the network ensures their
timely delivery. This supports recovery schemes that require policy-
or priority-based decisions at multiple deciding entities that may be
distributed within the network, off the working path.
To meet the time constraints for recovery, failure correlation/
aggregation time for the computations to be performed at the
reporting entity must be minimized, and the time that elapses prior
to all entities involved in the recovery receiving a failure
notification (or recovery action) signal must also be minimized. The
flooded messages will take the shortest available paths to all these
entities.
                   +---+
              .....| E |..............
              :    +---+             :
              :                      :
   +---+    +---+   \ /   +---+    +---+
===| A |====| B |====X====| C |====| D |===
   +---+    +---+   / \   +---+    +---+
     :                               :
     :      +---+         +---+      :
     :......| F |.........| G |......:
            +---+         +---+
Figure 2. Multiple (partial) recovery paths protecting against the
failure of link BC.
Figure 2 above shows a network in which a failure has occurred on link
BC. The working LSPs follow the route ABCD, and two (dotted) recovery
paths have been reserved, but not activated. Recovery paths BED and
AFGD are each responsible for recovering a portion of the working
capacity on link BC. In this case, nodes A, B, D, E, F, and G must
all receive a notification of the failure and perform reconfiguration
actions. A flooding-based approach to fault notification not only
has the benefit of reaching all recovery nodes in the shortest time
possible, but also has the beneficial side effect that all nodes in
the vicinity of the failure receive the notification. Therefore, it
is possible for other nodes in the neighborhood of the failure (say,
Node H and Node I, not shown in Figure 2) to use this information in
making policy- or priority-based decisions, such as dynamically
rerouting low-priority LSPs around the neighborhood to free up
capacity, or blocking new LSP requests that do not have a high enough
priority value.
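The reach of a flooded notification in the Figure 2 example can be
sketched as a hop-limited breadth-first search. The Python sketch
below makes several assumptions that this draft does not specify:
the control plane adjacencies mirror the links of Figure 2, the
control channel on the failed link BC is unavailable, nodes B and C
both detect the failure and originate the flood, and "controlled"
flooding is modelled as a hop limit with duplicate suppression.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

# Assumed control plane adjacencies, mirroring Figure 2 without link BC.
LINKS: List[Tuple[str, str]] = [
    ("A", "B"), ("C", "D"),              # surviving working-path links
    ("B", "E"), ("E", "D"),              # recovery path BED
    ("A", "F"), ("F", "G"), ("G", "D"),  # recovery path AFGD
]


def flood(links: Iterable[Tuple[str, str]],
          sources: Iterable[str],
          hop_limit: int) -> Dict[str, int]:
    """Return the hop count at which each reached node first hears the flood."""
    neighbors: Dict[str, List[str]] = {}
    for u, v in links:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    hops = {source: 0 for source in sources}
    queue = deque(hops)
    while queue:
        node = queue.popleft()
        if hops[node] == hop_limit:
            continue  # scope limit reached: do not forward any further
        for nxt in neighbors.get(node, []):
            if nxt not in hops:  # duplicate copies are discarded
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops
```

Under these assumptions, a flood originated at B and C with a hop
limit of 2 reaches A, D, and E after one hop and F and G after two
hops, so every node on recovery paths BED and AFGD is notified. A
hop limit of 1 would leave F and G without a notification.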
3.4 pre-OTN Network Failure Recovery Requirements
This is our list of recovery requirements:
o Requirements on the efficiency of working and recovery bandwidth
(1) A recovery scheme SHOULD allow efficient use of working LSP
bandwidth using such measures as route optimization, taking into
account route dependencies between a working path and its recovery
path.
(2) A recovery scheme SHOULD allow efficient use of recovery LSP
bandwidth using such measures as route optimization, taking into
account route dependencies between a working path and its recovery
path.
(3) A recovery scheme SHOULD, when possible, allow sharing of
recovery bandwidth among multiple recovery paths to enable efficient
use of recovery bandwidth.
o Requirements on recovery actions
(4) A recovery scheme SHOULD allow suppression of fault
notification messages, so that spurious fault notification messages
and recovery action messages are suppressed and are not broadcast
within the network, ensuring scalability of the fault recovery
mechanism.
(5) A recovery scheme SHOULD ensure reliable transmission of fault
recovery messages, provided that the control plane is connected.
(6) A recovery scheme SHOULD allow fallback operations of its
recovery actions. For example, when the system encounters a fault
class (e.g., multiple simultaneous failures) which was not
anticipated, the system should execute a best-effort recovery, such
that as many working paths as possible are restored under the
circumstances.
(7) A recovery scheme SHOULD allow the network operator to choose
whether or not the reversion actions are to be performed.
(8) A recovery scheme SHOULD support recovery within bounded time
constraints and MAY be compliant with generally used recovery times
like 50ms for SONET/SDH protection.
(9) A recovery scheme SHOULD allow testing and verification of the
availability of the recovery path before its actual use. This
testing may occur when the recovery path is provisioned, or after it
is provisioned but before actual recovery action occurs, causing the
path to be used.
(10) A recovery scheme SHOULD guarantee that recovery actions
correctly deliver traffic from working paths to the respective
recovery paths, such that the recovery actions do not result in any
unintended connections or unintended diversion of traffic.
o Requirements on recovery schemes
(11) A recovery scheme SHOULD support and be compliant with
generally used protection schemes such as 1+1, 1:1, 1:N, M:N, and
unprotected.
(12) A recovery scheme SHOULD support recovery of failed LSPs even
if the LSPs have different endpoints.
(13) A recovery scheme SHOULD support priority-based recovery of
failed LSPs. This means that path restoration should be ordered
according to each LSP's recovery priority.
o Requirements on recovery priority of service classes
(14) A recovery scheme SHOULD allow recovery of service classes
based on their recovery priority, which is a continuous spectrum from
lowest priority (best effort) to the highest priority (guaranteed),
based on the service class usage and a carrier's agreements with its
customers.
(15) A recovery scheme SHOULD allow support of service classes
with different recovery time guarantees. For example, the authors
estimate that a service class carrying voice calls requires a
recovery time of less than 50ms to avoid loss of connections, whereas
a service class carrying private lines requires a recovery time on
the order of several seconds.
o Requirements on recovery granularity
(16) A recovery scheme SHOULD allow recovery of traffic on an
aggregated basis, ensuring scalability.
o Requirements on failure notification delivery
(17) A recovery scheme SHOULD be equipped with a failure
notification mechanism that guarantees prompt and reliable delivery
of notification of faults in the data plane to a deciding entity that
is in charge of recovering the fault.
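Requirements (6), (13), and (14) together suggest a priority-ordered,
best-effort recovery decision. The Python sketch below shows one way
a deciding entity might order failed LSPs when the shared spare
capacity is not sufficient for all of them. It is an illustration
under stated assumptions and not a mechanism defined by this draft:
priority is modelled as an integer where a larger value means the LSP
is recovered first, spare capacity is modelled as a single pool, and
all names and values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class FailedLsp(NamedTuple):
    name: str
    priority: int   # assumed convention: larger value is recovered first
    bandwidth: int


def plan_recovery(failed: List[FailedLsp],
                  spare_capacity: int) -> Tuple[List[str], List[str]]:
    """Assign spare capacity to failed LSPs in order of recovery priority.

    Returns the names of the LSPs that can be recovered and the names
    of those that cannot, given the available spare capacity.
    """
    recovered: List[str] = []
    unrecovered: List[str] = []
    # Highest priority first; ties are broken by name for determinism.
    for lsp in sorted(failed, key=lambda l: (-l.priority, l.name)):
        if lsp.bandwidth <= spare_capacity:
            spare_capacity -= lsp.bandwidth
            recovered.append(lsp.name)
        else:
            # Best effort: skip this LSP but keep trying smaller ones.
            unrecovered.append(lsp.name)
    return recovered, unrecovered


# Hypothetical failed LSPs competing for 20 units of spare capacity.
FAILED = [
    FailedLsp("best-effort", 0, 10),
    FailedLsp("voice", 7, 10),
    FailedLsp("private-line", 4, 10),
]
```

With 20 units of spare capacity, the "voice" and "private-line" LSPs
are recovered in that order and the "best-effort" LSP is left
unrecovered.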
4. Security Considerations
This draft does not introduce any new security issues.
5. Conclusions
This draft describes requirements for control plane-based recovery
from data plane failures in pre-OTN networks. While there are
currently several Internet Drafts in the Sub-IP Area related to
service recovery in GMPLS networks, the list of requirements for
control plane-based recovery has not been specifically detailed in
any one document. We identify the most important requirements as
meeting potentially strict timing constraints, enabling flexible
recovery schemes, and facilitating the efficient use of resources.
Seventeen requirements are listed in section 3.4.
References
[1] Bradner, S., "The Internet Standards Process -- Revision 3", BCP
9, RFC 2026, October 1996.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
[3] Mannie, E. (Ed.), "Generalized Multi-Protocol Label Switching
(GMPLS) Architecture", Internet Draft, work in progress, draft-
ietf-ccamp-gmpls-architecture-03.txt, August 2002.
[4] Mannie, E. and D. Papadimitriou (Eds.), "Recovery (Protection
and Restoration) Terminology for GMPLS", Internet Draft, work in
progress, draft-ietf-ccamp-gmpls-recovery-terminology-01.txt,
November 2002.
[5] Lang, J.P. and B. Rajagopalan (Eds.), "Generalized MPLS Recovery
Functional Specification", Internet Draft, work in progress,
draft-ietf-ccamp-gmpls-recovery-functional-00.txt, January 2003.
[6] Papadimitriou, D. and E. Mannie (Eds.), "Analysis of Generalized
MPLS-based Recovery Mechanisms (including Protection and
Restoration)", Internet Draft, work in progress, draft-ietf-
ccamp-gmpls-recovery-analysis-00.txt, January 2003.
[7] Lai, W.S., and D. McDysan (Eds.), "Network Hierarchy and
Multilayer Survivability", RFC 3386, November 2002.
[8] Owens, K., et al., "Network Survivability Considerations for
Traffic Engineered IP Networks", Internet Draft, work in
progress, draft-owens-te-network-survivability-03.txt, May 2002.
[9] Lang, J. (Ed.), "Link Management Protocol (LMP)", Internet
Draft, work in progress, draft-ietf-ccamp-lmp-07.txt, November 2002.
[10] Berger, L. (Ed.), "Generalized MPLS Signaling - RSVP-TE
Extensions", Internet Draft, work in progress, draft-ietf-mpls-
generalized-rsvp-te-09.txt, September 2002.
[11] ITU-T Draft Recommendation G.gps, "Generic Protection
Switching", work in progress, April 2002.
Acknowledgments
The following individuals provided valuable input to this draft:
Richard Rabbat, Ching-Fong Su and Takafumi Chujo of Fujitsu Labs of
America, Inc., Norihiko Shinomiya and Akira Chugo of Fujitsu
Laboratories, Ltd.
Editors' Addresses
Peter Czezowski
Fujitsu Labs of America, Inc.
595 Lawrence Expressway
Sunnyvale, CA 94085
United States of America
Phone: +1-408-530-4516
Email: peterc@fla.fujitsu.com
Toshio Soumiya
Fujitsu Laboratories Ltd.
1-1, Kamikodanaka 4-Chome
Nakahara-ku, Kawasaki
211-8588, Japan
Phone: +81-44-754-2765
Email: soumiya.toshio@jp.fujitsu.com
Contributing Authors
Peter Czezowski (see address information above)
Toshio Soumiya (see address information above)
Kohei Shiomoto
NTT Network Innovation Laboratories
Midori-machi 3-9-11, Musashino-shi
Tokyo, Japan 180-8585
Phone: +81-422-59-4402
Email: Shiomoto.Kohei@lab.ntt.co.jp
Shoichiro Seno
Mitsubishi Electric Corporation
5-1-1 Ofuna, Kamakura
Kanagawa, Japan 247-8501
Phone: +81-467-41-2430
Email: senos@isl.melco.co.jp