Network Working Group T. Asveren
Internet-Draft Sonus Networks
Expires: June 13, 2008 U. Bodin
Operax
December 11, 2007
Diameter State Recovery Considerations
draft-asveren-dime-state-recovery-02.txt
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on June 13, 2008.
Copyright Notice
Copyright (C) The IETF Trust (2007).
Abstract
This document discusses parameters to consider, different approaches
and design strategies to synchronize and/or recover state in Diameter
applications after failure of an active instance.
Asveren & Bodin Expires June 13, 2008 [Page 1]
Internet-Draft Diameter State Recovery Considerations December 2007
Table of Contents
1. Introduction
2. Terminology
3. Session State and the Need for Recovery
4. Proprietary Mechanisms
5. Protocol Assisted State Recovery
5.1. Service Models
5.2. Parameters to Consider
5.2.1. Notification of the Peer About Failure
5.2.2. Transfer of Session Data
5.2.3. Backup Server Selection
5.2.4. Timing of State Reconstruction
5.3. Approaches
5.3.1. Using a New Session
5.3.2. Backup Instance Triggered Recovery
6. IANA Considerations
7. Security Considerations
8. Acknowledgments
9. Normative References
Authors' Addresses
Intellectual Property and Copyright Statements
1. Introduction
A variety of Diameter applications are defined to perform
different tasks. For some of these applications, e.g. the Diameter
Credit Control Application, synchronizing and/or recovering state for
ongoing sessions after failure of a Diameter endpoint is desirable.
The recovery could be achieved by a proprietary mechanism, could be
assisted by protocol mechanisms or could be a combination thereof.
This document focuses on issues associated with protocol assisted
state recovery.
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [2].
The following terms are used to describe entities and their
functionality in this document.
Ongoing Session
A Diameter session, for which at least the first transaction has
been completed but not the last transaction according to the
application message flow.
Terminated Session
A Diameter session that existed in the past, for which the last
transaction according to the application message flow has been
completed.
Initial message
A Diameter message used to create a new Diameter session.
Mid-session message
A Diameter message used to refresh or modify an existing Diameter
session.
Service Instance
An instance of service provided by a Diameter application to
another entity, e.g. charging, authentication services.
Diameter Transaction
A Diameter request/answer pair.
3. Session State and the Need for Recovery
Some Diameter applications make use of sessions consisting of
multiple transactions. The context necessary to be able to process/
trigger further messages in an ongoing session constitutes the
session state.
In multi-transaction sessions, it is possible that one of the
endpoints fails during a session. Depending on the application, it
may not be possible or desirable to terminate the corresponding
service instance. In such a case, it is necessary either to utilize a
backup node that can process messages for the ongoing session or to
use a new session without terminating the service instance.
Diameter       Active       Backup
 Peer         Instance     Instance
   |              |            |
   |----REQ1----->|            |
   | (session1)   |            |
   |              |            |
   |<---ANS1------|            |
   | (session1)   |            |
   |              |            |
   |           Active          |
   |          Instance         |
   |            Fails          |
   |              |            |
   |----REQ2------------------>|
   | (session1)   |            |
   |              |            |
   |<---ANS2-------------------|
   | (session1)   |            |
   |              |            |
Figure 1: Session Failover to Backup Instance
Another important aspect related to failing instances is the
possibility of hanging resources on the peer Diameter entity. This
could happen if the peer Diameter entity does not clean up session
state unless the session is terminated according to the expected
application message flow. It should be noted that while state
recovery is a desirable feature for certain applications, hanging
resources are unacceptable for all applications. Hence, although some
of the mechanisms described in this document could be used to prevent
the occurrence of such a case, it is recommended that application
layer mechanisms, e.g. application layer timers, be used for this
purpose. Nonetheless, certain strategies mentioned in this
document could be used to expedite session state cleanup after
failovers.
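The application layer timer recommendation above can be illustrated
with a minimal sketch. It assumes a peer that records the time of the
last message per session and releases sessions that stay silent longer
than a configured lifetime. The class name, the lifetime parameter and
the explicit "now" arguments are hypothetical and are not taken from
any Diameter application.

```python
class SessionTable:
    """Hypothetical per-peer session table guarded by an idle timer."""

    def __init__(self, lifetime):
        self.lifetime = lifetime      # seconds a session may stay idle
        self.last_seen = {}           # Session-Id -> time of last message

    def touch(self, session_id, now):
        # Called for every initial or mid-session message.
        self.last_seen[session_id] = now

    def terminate(self, session_id):
        # Normal termination according to the application message flow.
        self.last_seen.pop(session_id, None)

    def expire(self, now):
        # Release sessions whose peer has been silent for too long.
        stale = [sid for sid, seen in self.last_seen.items()
                 if now - seen > self.lifetime]
        for sid in stale:
            del self.last_seen[sid]
        return stale


table = SessionTable(lifetime=30)
table.touch("session1", now=0)
table.touch("session2", now=20)
expired = table.expire(now=40)    # only session1 has been idle too long
```

With such a timer, resources are released even if the failed instance
never completes the expected message flow.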
4. Proprietary Mechanisms
Proprietary mechanisms do not assume any specific behavior from their
peers. They usually rely on some form of state replication between
active and backup instances.
+---------+               +----------+
| Diameter|<------------->|  Active  |
|   Peer  |    Session    | Instance |
+---------+   Messaging   +----------+
                               ^
                               | Session
                               | State
                               | Replication
                               V
                          +----------+
                          |  Backup  |
                          | Instance |
                          +----------+
Figure 2: Data Replication with a Proprietary Mechanism
It should be noted that Figure 2 is just an abstract representation
of proprietary data replication between active and backup instances.
Actual implementations may vary depending on the mechanisms used.
Proprietary state synchronization is a common technique utilized by
Public Switched Telephone Network equipment vendors to provide
"five nines" (99.999%)
reliability. There are also initiatives to define a standard set of
APIs for platforms/middleware providing data synchronization
services, e.g. Application Interface Specification of Service
Availability Forum.
Proprietary data replication between active and backup instances may
be asynchronous in nature. This means that it may not provide
loss-less state replication at all times. Hence, after a failover to
a backup instance, some session states may have been lost and other
states may be wrongly kept by the backup instance. That is, states
may have been terminated through session signalling to the initially
active instance, but the removal of the corresponding session state
was not properly reflected in the data replication process.
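The effect of asynchronous replication can be shown with a small
simulation. This is only a sketch under assumed names: the
ActiveInstance class, its pending queue and the flush() step are
hypothetical stand-ins for whatever proprietary mechanism is used.

```python
class ActiveInstance:
    """Hypothetical active instance that replicates changes lazily."""

    def __init__(self):
        self.sessions = {}     # authoritative state on the active side
        self.pending = []      # changes not yet sent to the backup

    def create(self, session_id, state):
        self.sessions[session_id] = state
        self.pending.append(("put", session_id, state))

    def terminate(self, session_id):
        del self.sessions[session_id]
        self.pending.append(("del", session_id, None))

    def flush(self, backup):
        # Asynchronous replication step; it may lag behind signalling.
        for op, session_id, state in self.pending:
            if op == "put":
                backup[session_id] = state
            else:
                backup.pop(session_id, None)
        self.pending.clear()


backup = {}
active = ActiveInstance()
active.create("s1", {"units": 10})
active.create("s2", {"units": 5})
active.flush(backup)              # backup now knows s1 and s2
active.terminate("s1")            # not yet replicated
active.create("s3", {"units": 7})
# The active instance fails here, before the next flush.
lost = set(active.sessions) - set(backup)     # state the backup lacks
stale = set(backup) - set(active.sessions)    # state wrongly kept
```

Session s3 is lost and session s1 is wrongly kept by the backup, which
are the two kinds of mismatch described above.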
5. Protocol Assisted State Recovery
Protocol assisted state recovery relies on contents of the messages
exchanged between Diameter entities.
5.1. Service Models
For each Diameter session, Diameter messaging takes place between a
client and a server. Although it is not a sender or receiver of
Diameter messages, the physical service/resource provided is also a
parameter to consider when designing state recovery mechanisms. The
physical resource/service is application dependent and could be, for
example, bandwidth allocated on a router for a QoS application or
voice transfer resources used for a prepaid voice call application.
Depending on the Diameter application, the physical resource/service
could be at the client or the server side. For example, for the
Diameter Credit Control Application the physical resource is
controlled by the client, whereas for a QoS application with a push
scenario it is controlled by the server.
In case a proprietary data replication mechanism which is not loss-
less is used between active and backup instances to support failover,
it may be desirable to make use of the data present in the physical
resource/service. This case can benefit from a synchronization phase
before session data is transferred for purposes of rebuilding lost
state.
Physical resource/service could be used to extract some information
regarding session state to be reconstructed. For certain scenarios
this information could be enough for state reconstruction or could be
used in addition to information obtained via other means, e.g. in a
proprietary data replication mechanism, failovers could be followed
by a synchronization phase based on information obtained from the
physical resource/service.
The following conceptual diagram shows DCCA client side state
recovery utilizing the state kept by the service control logic.
          +-----+
          |  +-------+
          |  |  (2)  |
---(1)--->|  |Service|
 Service  |  | Data-1|
 Start    |  +-------+           +---------+
 Request  |     |                |         |
          |     |-----(3)------->|         |
          |     | Credit Control |  DCCA   |
          |     |  Request for   | Client  |---(4)----->
          |     | Service Data-1 |  Logic  | CCR(Initial)
          |     |                | (Active)|
          |     |                |         |<---(5)------
          |     |<-----(6)-------|         | CCA(Initial)
          |     | Grant Service  +---------+
          |     |
          |  S  |     (7)
          |  e  |     DCCA Client
          |  r  |     Logic (Active)
          |  v  |     fails
          |  i  |
          |  c  |     (8)
          |  e  |     DCCA Client
          |     |     Logic (Standby)
          |  C  |     detects failure
          |  o  |
          |  n  |                +---------+
          |  t  |<-----(9)-------|         |
          |  r  |  Request for   |         |
          |  o  | State Retrieval|  DCCA   |
          |  l  |                | Client  |
          |     |-------(10)---->|  Logic  |
          |     | Credit Control |(Standby)|---(11)---->
          |     |  Request for   |         | CCR(Initial)
          |     | Service Data-1 |         |
          |     |                |         |<---(12)-----
          |     |                |         | CCA(Initial)
          |     |                |         |
          |     |                |         |---(13)---->
          |     |                |         | CCR(Update)
          |     |                |         |
          |     |                |         |<---(14)-----
---(15)-->|     |                |         | CCA(Update)
 Service  |     |                |         |
 End      |     |                |         |---(16)---->
 Request  |     |                |         | CCR(Terminate)
          |     |                |         |
          |     |                |         |<---(17)-----
          +-----+                +---------+ CCA(Terminate)
Figure 3: Using Service Information for DCCA Client Side State
Recovery
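The flow in Figure 3 can be sketched roughly as follows. This is an
illustration only: the ServiceControl and DccaClientLogic classes, the
Session-Id format and the send callback are hypothetical, and message
encoding and answers are left out.

```python
class ServiceControl:
    """Hypothetical service control logic that owns the service data."""

    def __init__(self):
        self.running = {}                  # service id -> service data

    def start(self, service_id, data):
        self.running[service_id] = data

    def retrieve_state(self):
        # Steps (9) and (10): hand over the data of all running services.
        return dict(self.running)


class DccaClientLogic:
    """Hypothetical DCCA client logic; send() stands in for transport."""

    def __init__(self, name, send):
        self.name = name
        self.send = send
        self.sessions = {}                 # service id -> Session-Id

    def open_session(self, service_id, data):
        session_id = f"{self.name};{len(self.sessions) + 1}"
        self.sessions[service_id] = session_id
        self.send(("CCR(Initial)", session_id, data))

    def recover_from(self, service_control):
        # Step (11): open a new session for every running service.
        for service_id, data in service_control.retrieve_state().items():
            self.open_session(service_id, data)


sent = []
control = ServiceControl()
control.start("call-1", {"service": "voice"})
active = DccaClientLogic("active", sent.append)
active.open_session("call-1", control.running["call-1"])
# The active client logic fails; the standby rebuilds from service control.
standby = DccaClientLogic("standby", sent.append)
standby.recover_from(control)
```

The standby never sees the Session-Id used by the failed instance; it
relies only on the data still held by the service control logic.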
5.2. Parameters to Consider
There are several aspects which may be important for a protocol
assisted session state recovery mechanism. They may or may not be
part of the design choices for a protocol assisted session state
recovery mechanism, depending on the strategy utilized.
5.2.1. Notification of the Peer About Failure
Usually it is necessary for the remote peer to be informed about the
failure of the active instance in the context of protocol assisted
state recovery. This could be achieved in different ways:
Application Layer Timers
Application layer timers could be utilized to send new requests
periodically. Lack of a new request, lack of an answer to a sent
request, or receipt of a DIAMETER_UNABLE_TO_DELIVER error answer
could indicate that the peer Diameter entity has failed.
Notification from Standby Instance
After failure of the active instance, the standby instance can send
a message to the remote Diameter peer to inform it about the failure
of the active instance. This method requires the standby instance to
know the identities of the remote Diameter peers with which the
failed active instance had ongoing sessions. This information
could be exchanged by a proprietary data replication mechanism.
Alternatively, the standby instance could have a configured list of
remote peers and notify all of them.
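The timer based detection described above could look roughly like the
following sketch. The function name, its parameters and the two status
values are assumptions made for illustration; a real implementation
would also have to deal with retransmission and transport failures.

```python
from enum import Enum


class PeerStatus(Enum):
    ALIVE = "alive"
    SUSPECT = "suspect"


def classify_peer(now, last_request_sent, last_answer_received,
                  answer_timeout, unable_to_deliver=False):
    """Guess the peer status from application layer timers."""
    if unable_to_deliver:
        # An error answer indicating the request could not be delivered.
        return PeerStatus.SUSPECT
    waiting = last_answer_received < last_request_sent
    if waiting and now - last_request_sent > answer_timeout:
        # A request has been outstanding for longer than allowed.
        return PeerStatus.SUSPECT
    return PeerStatus.ALIVE


answered = classify_peer(now=10, last_request_sent=5,
                         last_answer_received=6, answer_timeout=3)
timed_out = classify_peer(now=10, last_request_sent=5,
                          last_answer_received=2, answer_timeout=3)
undeliverable = classify_peer(now=10, last_request_sent=5,
                              last_answer_received=6, answer_timeout=3,
                              unable_to_deliver=True)
```

A suspect peer is only a hint; the application decides whether to
start recovery towards a backup instance.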
5.2.2. Transfer of Session Data
For protocol assisted recovery it is necessary to supply enough
information to the backup instance so that session state can be
constructed. What constitutes session state data needs to be defined
on a per application basis. Also, in certain cases (e.g. when a
separate mechanism for state replication is used in combination with
protocol assisted state recovery) the transfer of session data may be
preceded by a state synchronization phase. For example, a generic
message providing a list of all active sessions could be used for
such a synchronization phase.
Some approaches to transfer session data include:
Using a New Session
Upon detection of the failure of the active instance, remote
Diameter peer may start a new session without terminating the
service instance.
Using Application Messages
Data necessary to reconstruct the session state may be transferred
in an application defined message by AVP(s) specifically defined
for that purpose. Alternatively, an AVP may be used to flag that
all data carried in the message is sent for the purposes of state
synchronization.
Using a Generic Message
Data necessary to reconstruct session state may be transferred in a
message specifically defined for that purpose. Such a message may
carry state information for one or multiple sessions.
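As a rough sketch of the generic message approach, the following code
packs state for multiple sessions into bounded messages and installs
them on a backup. The StateTransferMessage type is hypothetical and is
not a defined Diameter command; what belongs in the per-session state
is application specific.

```python
from dataclasses import dataclass, field


@dataclass
class SessionRecord:
    session_id: str
    state: dict


@dataclass
class StateTransferMessage:
    """Hypothetical generic message; not a defined Diameter command."""
    origin_host: str
    records: list = field(default_factory=list)


def build_transfer(origin_host, sessions, batch_size):
    """Split the peer's view of ongoing sessions into bounded messages."""
    items = sorted(sessions.items())
    return [
        StateTransferMessage(
            origin_host,
            [SessionRecord(sid, st) for sid, st in items[i:i + batch_size]],
        )
        for i in range(0, len(items), batch_size)
    ]


def apply_transfer(backup_sessions, message):
    """Install every record carried by one message on the backup."""
    for record in message.records:
        backup_sessions[record.session_id] = dict(record.state)


peer_view = {"s1": {"granted": 10}, "s2": {"granted": 4},
             "s3": {"granted": 8}}
messages = build_transfer("peer.example.com", peer_view, batch_size=2)
rebuilt = {}
for message in messages:
    apply_transfer(rebuilt, message)
```

Batching keeps individual messages small when a peer holds state for
many sessions.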
5.2.3. Backup Server Selection
A Diameter peer needs to know the identity of the backup instance, so
that it can send the necessary data to reconstruct session state.
Furthermore, load balancing of the ongoing sessions across different
backup instances may be necessary as well, to prevent overloading of
backup entities.
Active Instance Guided Selection
Active instance could communicate the identity of the backup
instance(s) to the peer Diameter entity with an AVP. Information
about how the load should be distributed among multiple backup
instances could be communicated as well.
Backup Instance Guided Selection
If the notification of the peer Diameter entity about the failure
of the active instance is performed via a message sent by the
standby instance, the identity of the backup instance would be
known to the peer Diameter entity. This message could carry
information about other backup instances and load sharing
information too.
Selection Based on Configuration
The Diameter peer may know the identities of backup servers
through configuration and try to load share ongoing sessions based
on a locally defined algorithm. For requests that are rejected
by a standby instance with a DIAMETER_TOO_BUSY error answer, another
standby instance could be tried.
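Selection based on configuration could be sketched as below. The
hash-based rotation is one possible locally defined algorithm, chosen
here only for illustration; the send callback and the host names are
hypothetical, and result codes are modelled as plain strings.

```python
import zlib

TOO_BUSY = "DIAMETER_TOO_BUSY"
SUCCESS = "DIAMETER_SUCCESS"


def candidate_order(session_id, backups):
    """Rotate the configured backup list by a stable hash of the id."""
    start = zlib.crc32(session_id.encode()) % len(backups)
    return backups[start:] + backups[:start]


def send_with_failover(session_id, backups, send):
    """Try backups in order until one does not answer with TOO_BUSY."""
    for backup in candidate_order(session_id, backups):
        if send(backup, session_id) != TOO_BUSY:
            return backup
    return None


busy = {"backup-a.example.com"}


def fake_send(backup, session_id):
    # Stand-in for sending a request and reading the Result-Code.
    return TOO_BUSY if backup in busy else SUCCESS


configured = ["backup-a.example.com", "backup-b.example.com"]
chosen = send_with_failover("session1", configured, fake_send)
```

Hashing the Session-Id keeps all requests of one session on the same
backup while spreading different sessions over the configured list.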
5.2.4. Timing of State Reconstruction
When state reconstruction should happen may vary depending on the
application. The following two models are foreseen:
State Reconstruction After Failure
It may be necessary to reconstruct the state after the backup
instance detects failure of the active instance. This model is
useful when the state for ongoing sessions is necessary to
generate answers for requests belonging to new sessions. Care
should be taken when determining the necessary information for
such cases. What is needed may be some cumulative data based on
session states rather than the per-session information, and this
could impact the design choices to recover/replicate the data or
even the choice between a proprietary mechanism and protocol
assisted recovery.
Another use case is when autonomous requests need to be generated
from the side on which the active instance has failed. In such a
situation, the backup instance needs to know the ongoing sessions
immediately after it detects failure of the active instance so
that it can generate such requests.
If state reconstruction after failure is needed, notification of
the Diameter peer about failure should be done by the backup
instance.
State Reconstruction Upon Receipt of a Request
For certain applications, it could be enough if a backup server
can reply to requests for ongoing sessions after the failure of
the active instance. In such scenarios, state information
contained in the new requests for ongoing sessions (i.e. mid-
session messages) could be used to reconstruct session state on
the standby instance.
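State reconstruction upon receipt of a request might be handled along
the following lines. The request layout (a dictionary with "type",
"state" and an optional "state-sync" flag) is an assumption made for
this sketch and does not correspond to defined AVPs.

```python
def handle_request(sessions, request):
    """Process one request on a standby instance (illustrative only)."""
    session_id = request["Session-Id"]
    if session_id in sessions:
        # Known session: regular mid-session processing.
        sessions[session_id].update(request["state"])
        return "DIAMETER_SUCCESS"
    if request["type"] == "initial":
        sessions[session_id] = dict(request["state"])
        return "DIAMETER_SUCCESS"
    if request.get("state-sync"):
        # Unknown mid-session message carrying enough state to rebuild.
        sessions[session_id] = dict(request["state"])
        return "DIAMETER_SUCCESS"
    return "DIAMETER_UNKNOWN_SESSION_ID"


standby_sessions = {}
rebuilt_result = handle_request(standby_sessions, {
    "Session-Id": "s1", "type": "update",
    "state": {"used": 3}, "state-sync": True,
})
unknown_result = handle_request(standby_sessions, {
    "Session-Id": "s2", "type": "update", "state": {"used": 1},
})
```

Only sessions for which the peer actually sends a request are rebuilt,
so the standby never learns about sessions that stay silent.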
5.3. Approaches
The choice between a proprietary and protocol assisted state recovery
mechanism is not a straightforward one. Depending on the application
and the reliability level required a detailed analysis needs to be
done to justify usage of one of the methods.
If it is desired to use protocol assisted recovery, the parameters
discussed in Section 5.2 need to be considered. It should be noted
that choices made for different parameters are not always independent
of each other, e.g. if state reconstruction immediately after failure
detection is necessary, the strategy of using a new session to
transfer session data cannot be utilized. Below, two different
approaches are discussed in detail.
5.3.1. Using a New Session
As mentioned in Section 5.2.2, a new session can be used to rebuild
state after failure. This approach can be sufficient if immediate
state reconstruction after failure is not needed, that is, if
knowledge of the history of the session is not needed to continue
providing the service of the failed-over Diameter node. An example
diagram is given in Figure 3. It focuses on events happening on the
client side for a DCCA session. On the server side, the sessions
which were created by the active instance are cleaned up after expiry
of the Tcc timer.
A variant of using a new session for rebuilding state is to use
application messages. For example, regular mid-session messages
maintaining soft state can be used if they contain enough information
for the desired state reconstruction. Such messages could contain an
AVP carrying a flag indicating that the message is a mid-session
message and not an initial message issued to create a completely new
session. The ability to distinguish between recreated sessions and
new sessions can be important to some applications. For example, it
may be desirable to give recreated sessions preference over new
sessions for resources controlled by a Diameter server.
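One way to give recreated sessions preference is sketched below. The
"recreated" field stands for whatever flag an application would carry
in its messages, and the fixed capacity is a simplification of real
resource control.

```python
def admit(requests, capacity):
    """Admit requests up to capacity, serving recreated sessions first."""
    # sorted() is stable, so arrival order is kept within each class.
    ordered = sorted(requests, key=lambda r: not r["recreated"])
    admitted = [r["session_id"] for r in ordered[:capacity]]
    rejected = [r["session_id"] for r in ordered[capacity:]]
    return admitted, rejected


arrivals = [
    {"session_id": "new-1", "recreated": False},
    {"session_id": "old-1", "recreated": True},
    {"session_id": "new-2", "recreated": False},
    {"session_id": "old-2", "recreated": True},
]
admitted, rejected = admit(arrivals, capacity=3)
```

Here both recreated sessions are admitted before any new session, and
the last new session is rejected once the capacity is exhausted.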
5.3.2. Backup Instance Triggered Recovery
In case immediate state reconstruction is desired or strictly needed
by a backup Diameter instance, this instance may need to trigger
transfer of session data to recover state. This requires session
data to be available and reachable to the backup Diameter instance.
Possible locations of such data include the physical resource/service
controlled by the failed over Diameter instance and the entities
utilizing the service offered by the Diameter instance (i.e. entities
issuing Diameter requests for the offered service).
As mentioned in Section 5.2.2, application messages or a
generic message can be used to transfer session data for state
reconstruction. Application messages or a generic message
transferring the desired session data could be preceded by a generic
synchronization message providing the backup Diameter instance with a
complete list of all active sessions. With such a list, the backup
Diameter
instance can distribute the recovery of session data over time. This
may be useful if this instance is to start providing its service
immediately instead of waiting until the state reconstruction process
is completed. Requesting session data in parallel with answering
service requests requires, however, that a period with incomplete
session state after the backup Diameter instance starts providing the
service is acceptable.
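Distributing recovery over time could be paced as in the following
sketch. The PacedRecovery class, the per-interval budget and the
prioritize() hook are hypothetical; they only illustrate requesting
session data in small batches while service requests are answered.

```python
from collections import deque


class PacedRecovery:
    """Hypothetical scheduler that spreads state requests over time."""

    def __init__(self, session_ids, per_tick):
        self.pending = deque(session_ids)  # from the synchronization list
        self.per_tick = per_tick           # requests allowed per interval
        self.requested = []                # order in which data was asked

    def tick(self):
        """Return the Session-Ids to request during this interval."""
        batch = []
        while self.pending and len(batch) < self.per_tick:
            batch.append(self.pending.popleft())
        self.requested.extend(batch)
        return batch

    def prioritize(self, session_id):
        """Move a session to the front when a request for it arrives."""
        if session_id in self.pending:
            self.pending.remove(session_id)
            self.pending.appendleft(session_id)


recovery = PacedRecovery(["s1", "s2", "s3", "s4", "s5"], per_tick=2)
first = recovery.tick()
recovery.prioritize("s5")      # a service request for s5 came in
second = recovery.tick()
third = recovery.tick()
```

Sessions that receive traffic are recovered first, which shortens the
period during which requests hit incomplete state.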
A generic synchronization message can also be useful in a combined
solution using both a proprietary mechanism for state replication and
protocol assisted state recovery. The complete list of all active
sessions provided in such a message can be compared with the list of
sessions replicated through a proprietary mechanism. Thereby a
potential mismatch can be identified and missing session data can be
explicitly requested by the backup Diameter instance.
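The comparison described above amounts to a set difference. A minimal
sketch, with the function name and result keys chosen only for
illustration:

```python
def reconcile(replicated, reported):
    """Compare replicated Session-Ids with the list sent by the peer."""
    replicated, reported = set(replicated), set(reported)
    return {
        "missing": sorted(reported - replicated),  # request data for these
        "stale": sorted(replicated - reported),    # drop these locally
        "in_sync": sorted(replicated & reported),
    }


result = reconcile(replicated=["s1", "s2"], reported=["s2", "s3"])
```

Sessions listed as missing are the ones for which the backup Diameter
instance would explicitly request session data.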
6. IANA Considerations
This document does not require any IANA action.
7. Security Considerations
Certain procedures in protocol assisted state recovery, e.g.
notification of the Diameter peer about failure of an active instance
by the standby instance, could introduce security risks. The use of
IPsec or TLS together with a transitive trust model is expected to
address these concerns.
8. Acknowledgments
9. Normative References
[1] Calhoun, P., Loughney, J., Guttman, E., Zorn, G., and J. Arkko,
"Diameter Base Protocol", RFC 3588, September 2003.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
Authors' Addresses
Tolga Asveren
Sonus Networks
4400 Route 9 South
Freehold, NJ, 07728
USA
Email: tasveren@sonusnet.com
Ulf Bodin
Operax
Aurorum Science Park 8
SE-977 75 Lulea
Sweden
Email: uffe@operax.com
Full Copyright Statement
Copyright (C) The IETF Trust (2007).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgment
Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).