Internet DRAFT - draft-dicecco-vitcp
Internet-Draft S. DiCecco
<draft-dicecco-vitcp-01.txt> J. Williams
Expires May 2001 Giganet, Inc.
Bill Terrell
TROIKA Networks, Inc.
John Scott
Network Appliance, Inc.
C. Sapuntzakis
Cisco Systems
November 17, 2000
VI / TCP (Internet VI)
Status of this memo
This document is an Internet-Draft and is offered in full accordance
with all provisions of Section 10 of RFC 2026. Internet-Drafts are
working documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also
distribute working documents as Internet-Drafts.
Internet-Draft documents are valid for a maximum of six months and may
be updated, replaced, or rendered obsolete by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress".
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/lid-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
memo are to be interpreted as described in RFC2119.
DiCecco, Williams, Terrell, Scott, Sapuntzakis [Page 1]
Internet-Draft VI / TCP (Internet VI) November 17, 2000
Table of Contents
1 Abstract
2 Overview
2.1 VI Architectural Components
2.2 VI/TCP
2.2.1 Extensions to VI
2.2.2 VI/TCP Overview
2.2.2.1 Basic VI Components
2.2.2.2 Introduction to VI/TCP
2.2.2.2.1 VI/TCP Addressing
2.2.2.2.2 VI/TCP Connection Management
2.2.2.2.3 VI/TCP Protocol Messaging
2.2.2.2.4 TCP/IP Options and VI/TCP
2.2.2.2.5 VI/TCP Retransmissions
2.2.2.2.6 Note on Outstanding RDMA Reads
3 The VI/TCP Protocol
3.1 VI/TCP Segment Format
3.2 VI/TCP Segment Header
3.3 VI/TCP Connection Establishment (CE) Header
3.4 VI/TCP RDMA Header
3.5 VI Trailer
3.6 CRC option
3.7 Urgent Marker
4 VI/TCP Connection Establishment
4.1 Basic Connection Establishment Timeline
4.2 Connection Establishment - Active
4.3 Connection Establishment - Passive
5 Security Considerations
6 Intellectual Property
7 References
8 Author's Addresses
1. Abstract
The Virtual Interface (VI) architecture [VIAR] describes a high
performance design for interfacing distributed applications to
accelerated protocol processing. VI seeks to improve the performance
of such applications by reducing the latency and overheads associated
with standard communications protocol stack processing. VI greatly
reduces the processing overhead associated with traditional network
architectures by providing applications a protected, directly
accessible interface to network hardware - a Virtual Interface.
This memo describes extensions to the VI Architecture designed to
facilitate operation over TCP/IP. These extensions take the form of
enhancements to the VI Provider Library API defined in the VI
Architecture Developer's Guide [VIDG], and a "VI Protocol" which
supports VI functionality during operation over TCP/IP.
The extensions to the VI Architecture which support operation over
TCP/IP are intended to be fully compliant with the VI Architecture
[VIAR] and its associated Developer's Guide [VIDG].
2. Overview
This section contains a brief overview of VI components and a
functional overview of VI operation over TCP/IP.
2.1. VI Architectural Components
VI is comprised of four architectural components - Virtual
Interfaces, Completion Queues, VI Providers, and VI Consumers.
Virtual Interfaces (VIs) are the mechanisms that allow VI Consumers
direct access to the data transfer services of VI Providers. VI
Consumers post data transfer requests, in the form of Descriptors,
directly to the VI Provider. Descriptors are structures that contain
the information necessary for the VI Provider to process the data
transfer (e.g., data location). Descriptors are posted to Work Queues
(send and receive) associated with the VI. Facilities are provided to
signal VI Descriptor postings to the network adapter. Processing of
posted Descriptors is asynchronous and descriptors are marked when
processing completes. VI Consumers remove completed descriptors from
Work Queues for reuse in subsequent requests.
Completion Queues provide a facility whereby VI Consumers can create a
single point of notification for processing completed Descriptors.
Once a Work Queue is associated with a Completion Queue, all
completions for that Work Queue are handled via that Completion Queue.
The VI Provider consists of a physical network interface (NIC) and
driver functionality. The VI NIC implements the Virtual Interfaces and
Completion Queues, and directly performs data transfers. VI NIC
drivers provide the control and resource management functions to
maintain the VI between consumers and VI NICs.
VI Consumers are typically application programs and their supporting
operating system functions. VI Consumers represent the users of a
Virtual Interface. Access to the Virtual Interface is through a
library referred to as the VI Provider Library [VIDG]. The VI Provider
Library provides an application programming interface for hardware
connection, endpoint creation and destruction, connection management,
memory handling, data transfer, queue management, informational
queries, name services, and error handling.
2.2. VI/TCP
This section introduces the fundamentals of VI operation over TCP.
2.2.1. Extensions to VI
The proposed protocol supports the VI Architecture as currently
defined. In addition, the protocol supports certain enhancements to
VI. Extensions to the API defined in [VIDG] would be required to
exploit such enhancements. Proposed enhancements are as follows:
- Descriptor Flow Control: Transmit descriptors may be posted in
advance of the corresponding receive descriptors. The VI Provider will
supply flow control.
- Attribute Negotiation: VI Architecture requires that incoming
connection establishment attempts be rejected unless the calling and
called VI Attributes match (e.g., Maximum Transfer Unit Size). The
protocol permits downward negotiation of MTU sizes. The smaller
of the two VI MTU sizes proposed by the two ends at connection setup
is used for the connection.
2.2.2. VI/TCP Overview
This section provides an overview of how the components of a
Virtual Interface are created, managed, and destroyed, and also
introduces the data transfer models.
2.2.2.1. Basic VI Components
Operations on basic VI architectural components remain largely
unchanged with VI/TCP. VI functionality is invoked by a VI Consumer
through the API defined in [VIDG]. Access to a VI NIC is achieved by
opening a handle to the driver representing the NIC. This handle is
used in subsequent operations. All memory used in data transfer is
"registered" with the VI Provider. Memory handles are used to identify
the region and to qualify virtual memory addresses. VIs are created by
the VI Provider upon request by the VI Consumer. Connections are not
established by creation of a VI and no data transfer can occur until
the VI is connected to another. VI Work Queues may be associated with
Completion Queues to provide a single handling point for completed VI
Descriptors. VI provides a connection-oriented data transfer service.
Newly created VIs are not pre-associated with other VIs; a VI must be
explicitly connected to another to enter its data transfer phase. VI
provides two types of data transfers - traditional Send/Receive, and
Remote Direct Memory Access (RDMA).
2.2.2.2. Introduction to VI/TCP
This section serves as an introduction to VI operation over
TCP/IP.
2.2.2.2.1. VI/TCP Addressing
The VI Architecture defines a generic "VI Network Address" format
consisting of an "address" portion and a "discriminator" portion. When
operating VI/TCP, the address portion contains an IP address and the
discriminator is per the VI Architecture [VIAR]. One transport layer
port is reserved for passive connection establishment. All incoming VI
connections are through this port and VI applications distinguish
themselves by the VI Network Address discriminator. For active
connection establishment, multiple transport layer ports are used.
2.2.2.2.2. VI/TCP Connection Management
With VI/TCP, a VI connection is implemented over an underlying
TCP connection. The VI/TCP connection establishment process requires
an underlying TCP connection over which VI/TCP protocol messages may
be exchanged. VI connections have a one-to-one correspondence with TCP
connections. This is referred to as the VI/TCP connection. When a VI
connection is closed, the underlying TCP connection must be closed.
Similarly, when a TCP connection is closed, the associated VI
connection must be closed. VI Providers handling VipConnectRequest
primitives [VIDG] first request TCP to establish its connection and
then perform VI/TCP protocol messaging over this underlying
connection. VI Providers must have accepted an underlying TCP
connection before the associated VI connection is accepted.
VI/TCP Providers MUST check that addresses, handles, and attributes
are valid for the underlying connection. From the perspective of a
VI/TCP Provider, TCP connection setup is an atomic operation that
either succeeds or fails. If the operation succeeds, VI connection
establishment is initiated; otherwise, the VI connection is rejected.
2.2.2.2.3. VI/TCP Protocol Messaging
VI/TCP functionality is invoked by a VI Consumer through the API
defined in [VIDG]. The VI Provider supplies this functionality
through use of the VI Protocol. The VI Protocol defines "messages" to
implement these VI functions (e.g., connection establishment).
Typically, there is
one message per Transmit Descriptor. Each message has a type (e.g.,
RDMA Write).
VI messages are divided into "segments". These segments are sent, in
order, over the associated TCP connection. It is recommended, but not
required, that there be exactly zero or one VI segment for each TCP
segment and that VI segments not be fragmented to span multiple TCP
segments. All segments for one VI message will be transmitted before
the next message is started. An exception is provided in that RDMA
Read Response segments may be interleaved with segments of any message
type other than another RDMA Read Response.
Example of a valid sequence of VI segments contained in the TCP
stream:
type                  message number      data offset
----------------------------------------------------------
RdmaWrite             0x123               0x0
Send                  0x124               0x0
RdmaReadResponse      0x777660            0x0
Send                  0x124               0x500
RdmaReadResponse      0x777660            0x333
Send                  0x124               0xA00
RdmaReadResponse      0x777660            0x666
RdmaReadRequest       0x125               0x0
RdmaReadResponse      0x777660            0x999
Send                  0x126               0x0
RdmaReadResponse      0x777661            0x0
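The ordering rule can be checked mechanically. The following Python
sketch is illustrative only and is not part of the protocol. The
function name and the tuple representation of a segment are
assumptions of this example. It treats RDMA Read Responses as one
stream and all other message types as a second stream, and requires
that within each stream a new message starts at data offset zero only
after the previous message is no longer being continued.

```python
def is_valid_sequence(segments):
    """Check the message ordering rule for a list of VI segments.

    Each segment is a (type, message_number, data_offset) tuple; this
    representation is hypothetical and chosen only for the example.
    """
    current = {}                      # stream -> (type, number, last offset)
    started = set()                   # (type, number) pairs already begun
    for seg_type, msg_no, offset in segments:
        # RDMA Read Responses may interleave with every other type.
        stream = "response" if seg_type == "RdmaReadResponse" else "other"
        cur = current.get(stream)
        if cur is not None and cur[:2] == (seg_type, msg_no):
            # Continuation of the message in progress: offset must grow.
            if offset <= cur[2]:
                return False
        else:
            # A new message must start at offset zero and must not
            # resume a message that was already left behind.
            if offset != 0 or (seg_type, msg_no) in started:
                return False
            started.add((seg_type, msg_no))
        current[stream] = (seg_type, msg_no, offset)
    return True


EXAMPLE = [
    ("RdmaWrite", 0x123, 0x0),
    ("Send", 0x124, 0x0),
    ("RdmaReadResponse", 0x777660, 0x0),
    ("Send", 0x124, 0x500),
    ("RdmaReadResponse", 0x777660, 0x333),
    ("Send", 0x124, 0xA00),
    ("RdmaReadResponse", 0x777660, 0x666),
    ("RdmaReadRequest", 0x125, 0x0),
    ("RdmaReadResponse", 0x777660, 0x999),
    ("Send", 0x126, 0x0),
    ("RdmaReadResponse", 0x777661, 0x0),
]
```

Applied to the sequence shown in the example, the check succeeds. A
sequence in which a Send is interrupted by an RdmaWrite and then
resumed would be rejected.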
2.2.2.2.4. TCP/IP Options and VI/TCP
It is strongly recommended that TCP connections supporting VI/TCP
implement the timestamp option for PAWS (protection against wrapped
sequence numbers) as defined in RFC1323, TCP Extensions for High
Performance [PAWS].
2.2.2.2.5. VI/TCP Retransmissions
VI/TCP will retransmit dropped segments, as required. All
retransmission is handled at the TCP layer. It is recommended that
retransmitted segments contain the same data as the original dropped
segment. In certain circumstances, this will not be possible without
undue burden on an implementation. The following exceptions are
noted:
- Retransmission is required, but data access results in an access
violation and retransmission cannot occur.
- Retransmission is required, but cannot occur because the VI
connection has been closed.
- A posted application buffer has changed. This is not allowed
per VI architecture and therefore constitutes an error.
If a VI NIC is unable to retransmit original data, it may pad
(substituting zero or arbitrary data for the original but maintaining
the correct size) and should set the "Transmit Error" bit in the
"Type" field of the "VI Segment Header".
With the exception of these error cases, the retransmitted data
MUST always be the same as the original data including
all VI layer headers and trailers.
2.2.2.2.6. Note on Outstanding RDMA Reads
For each RDMA Read Request received, memory allocated for the
request must be held until the response is acknowledged. The number of
outstanding RDMA Reads must be limited to control resource exhaustion.
Discarding excessive RDMA Reads pending completions of outstanding
requests does not seem viable in the absence of a deadlock avoidance
mechanism. The VI/TCP Protocol provides negotiation of the number of
outstanding RDMA Reads during connection establishment. This number
represents a per VI limit and the negotiated value remains for the
lifetime of the VI/TCP connection.
3. The VI/TCP Protocol
This section provides the VI/TCP protocol data unit formats. All
multibyte formats are to be represented in network byte order (i.e.,
big-endian). Each VI PDU contains a VI Segment Header. Optionally,
an RDMA Header or CE (connection establishment) Header may be present.
The VI Segment Header provides sufficient features to support non-
RDMA send/receives. The RDMA Header must be included for RDMA
transfers. The CE Header must be included for connection
establishment. The TCP layer provides a reliable data stream
connection and the VI segments are placed in this stream.
3.1 VI Segment Format
+---------------+---------------+---------------+---------------+
|                                                               |
|                       VI Segment Header                       |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                          RDMA Header                          |
|              (Included in RdmaRead and RdmaWrite              |
|                        segments only.)                        |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                           CE Header                           |
|                (Included in ConnectRequest and                |
|                 ConnectAccept segments only.)                 |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                        VI Payload Data                        |
|               (Included in Send, RdmaWrite and                |
|               RdmaReadResponse segments only.)                |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                          VI Trailer                           |
|                                                               |
+---------------+---------------+---------------+---------------+
3.2. VI Segment Header
The VI Segment Header is defined as follows.
|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|    Version    |  Type/Flags   |        Segment Length         |
+---------------+---------------+---------------+---------------+
|                          Data Offset                          |
+---------------+---------------+---------------+---------------+
|                        Immediate Data                         |
+---------------+---------------+---------------+---------------+
|                        Message Number                         |
+---------------+---------------+---------------+---------------+
|                          Message ACK                          |
+---------------+---------------+---------------+---------------+
|     Rx Descriptors Posted     |       Remote Error Code       |
+---------------+---------------+---------------+---------------+
Version
This is an 8-bit field indicating the VI/TCP version.
This document describes version one, and this field should
contain the value 0x1.
Type/Flags
This is an 8-bit field which contains a 5-bit field indicating
the packet type and three one-bit flags. The defined values of the
type field are as follows:
- 0 : Send
- 1 : RdmaWrite
- 2 : RdmaReadRequest
- 3 : RdmaReadResponse
- 4 : NOP
- 5 : ConnectRequest
- 6 : ConnectAccept
- 7 : ConnectReject
- 8 : ConnectNoMatch
The flag bits are defined as follows (bit 7 is MSB, bit 0 is LSB):
- BIT 7 : End of Message
Indicates the current segment is the last of a message
- BIT 6 : Immediate Data Valid
As defined by the VI Architecture [VIAR]. This bit
MUST be correctly set in each segment of a message.
If "Immediate Data Valid" does not apply to a particular
message type, it MUST be set to zero by the sender and
ignored by the receiver.
- BIT 5 : Transmit Error
Indicates either transmit length or protection error.
If the "Transmit Error" bit is set in any segment of
a VI message, the receiver MUST regard the entire message
as in error, and notify the receiving application
accordingly.
   7     6     5     4     3     2     1     0
+-----+-----+-----+-----+-----+-----+-----+-----+
| Eom | IdV | TrE |            Type             |
+-----+-----+-----+-----+-----+-----+-----+-----+
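As an informal illustration of the Type/Flags layout, the following
Python sketch packs and unpacks the octet. The names SegmentType,
pack_type_flags and unpack_type_flags are hypothetical and appear
nowhere in the VI specifications; only the bit positions and type
values are taken from the definitions above.

```python
from enum import IntEnum


class SegmentType(IntEnum):
    """Type values defined for the 5-bit type field."""
    SEND = 0
    RDMA_WRITE = 1
    RDMA_READ_REQUEST = 2
    RDMA_READ_RESPONSE = 3
    NOP = 4
    CONNECT_REQUEST = 5
    CONNECT_ACCEPT = 6
    CONNECT_REJECT = 7
    CONNECT_NO_MATCH = 8


END_OF_MESSAGE = 0x80        # bit 7
IMMEDIATE_DATA_VALID = 0x40  # bit 6
TRANSMIT_ERROR = 0x20        # bit 5
TYPE_MASK = 0x1F             # bits 4..0


def pack_type_flags(seg_type, end_of_message=False,
                    immediate_data_valid=False, transmit_error=False):
    """Build the Type/Flags octet from a type and three flags."""
    value = int(SegmentType(seg_type))   # raises ValueError if undefined
    if end_of_message:
        value |= END_OF_MESSAGE
    if immediate_data_valid:
        value |= IMMEDIATE_DATA_VALID
    if transmit_error:
        value |= TRANSMIT_ERROR
    return value


def unpack_type_flags(octet):
    """Return (type, Eom, IdV, TrE); raises ValueError on unknown types."""
    return (SegmentType(octet & TYPE_MASK),
            bool(octet & END_OF_MESSAGE),
            bool(octet & IMMEDIATE_DATA_VALID),
            bool(octet & TRANSMIT_ERROR))
```

How a receiver should treat an undefined type value is not stated
here; raising an error is simply the choice made by this sketch.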
Segment Length
Segment Length is a 16-bit field containing the length of the VI
segment including the VI Segment Header and VI Segment trailer.
This length can be added to the byte location (within the TCP stream)
of the first byte of this segment to get the first byte position of
the next segment. The total length exceeds the value contained in the
Segment Length by the length of the trailer.
Data Offset
For the initial segment of any message, this 32-bit field will contain
zero. For subsequent segments, it will contain the number of bytes
already transferred for this message in prior segments. Only segment
payload is included in this count; headers are specifically not
included. A Send, RdmaWrite, or RdmaReadResponse message may be
divided into multiple VI segments as they carry VI consumer data
which may be up to 4GB in size. All other VI messages MUST consist
of a single VI segment.
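The relationship between Data Offset and segmentation can be sketched
as follows. This Python fragment is illustrative only: the function
name and the max_payload parameter are assumptions of the example, and
the treatment of a zero-length message as a single empty segment is
likewise an assumption rather than something this memo states.

```python
def segment_message(payload, max_payload):
    """Yield (data_offset, chunk, end_of_message) for one VI message.

    data_offset counts only payload bytes sent in prior segments;
    headers and trailers are not included in the count.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    total = len(payload)
    if total == 0:
        # Assumption: an empty message is one segment at offset zero.
        yield 0, b"", True
        return
    for offset in range(0, total, max_payload):
        chunk = payload[offset:offset + max_payload]
        yield offset, chunk, offset + len(chunk) == total
```

With a 0xC00-byte payload and 0x500 bytes of payload per segment, the
offsets produced are 0x0, 0x500 and 0xA00, the same progression shown
for Send message 0x124 in the example of Section 2.2.2.2.3.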
Immediate Data
May hold 32-bits of optional user data as described by the VI
Architecture [VIAR]. Each segment of a message must contain the
correct value for the immediate data. If "Immediate Data Valid"
is set to zero for a message, the sender MAY place the immediate
data from the send (or RDMA write) descriptor in this field.
Otherwise the sender MUST place zero in this field. If the message
type does not support immediate data, the sender must place zero
in this field.
A receiving end point must ignore the contents of the Immediate
Data field if the message type does not support immediate data.
If the message type does support immediate data and the
"Immediate Data Valid" bit is not set, then the receiver MAY
deliver the contents of the Immediate Data field to the
user. If the "Immediate Data Valid" bit is set, then
the Immediate Data must be delivered to the user.
Message Number
Messages are sequentially numbered by the VI/TCP Provider. The initial
Message Number may be varied by an implementation. For RDMA Read
Responses, Message Number carries the message number of the
corresponding RDMA Read Request. Any two segments with the same type
are part of the same message if and only if their message numbers are
equal.
Rx Descriptors Posted
Indicates the number of receive Descriptors, modulo 2^16, that have
been posted during the lifetime of the VI/TCP connection. If
Descriptor flow control is in effect, the VI/TCP provider must delay
any transmission which would consume receive Descriptors until receive
descriptors complete and become available.
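One possible way for a sender to use this field is sketched below in
Python. The sketch is not normative. It assumes that the sender keeps
its own modulo 2^16 count of receive Descriptors it has consumed, and
that fewer than 2^16 Descriptors are ever outstanding, so that the
difference of the two counters is unambiguous. The class and method
names are hypothetical.

```python
MODULUS = 1 << 16


class SendCredits:
    """Track how many remote receive Descriptors may still be consumed."""

    def __init__(self):
        self.peer_posted = 0   # last Rx Descriptors Posted value received
        self.consumed = 0      # Descriptors consumed by our sends, mod 2^16

    def update(self, rx_descriptors_posted):
        """Record the Rx Descriptors Posted field of a received segment."""
        self.peer_posted = rx_descriptors_posted % MODULUS

    def available(self):
        """Number of posted receive Descriptors not yet consumed."""
        return (self.peer_posted - self.consumed) % MODULUS

    def try_consume(self):
        """Claim one Descriptor; return False if the send must be delayed."""
        if self.available() == 0:
            return False
        self.consumed = (self.consumed + 1) % MODULUS
        return True
```

Because both counters wrap at 2^16, the subtraction remains correct
across wraparound: a posted count of 1 with a consumed count of 0xFFFF
leaves two Descriptors available.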
Notes on NOP
NOPs are used to send a Message ACK or update RxDescriptorsPosted
when a VI has no data to transmit. If a VI has data to transmit,
the Message ACK and RxDescriptorsPosted values are included in the
transferred segments. However, if a transmitter is idle, a NOP is
utilized to permit conveyance of this information. An implementation
need not send a NOP to notify the remote end of each Rx descriptor
as it is posted; however, sufficient notification SHOULD be done
so as not to unnecessarily impede the flow of data.
NOPs do not constitute a VI message and therefore do not occupy
space in the message numbering sequence. The message number
field of a NOP must contain the message number of the last (non
RDMA read response) message sent.
Message ACK
Message ACK is valid only for VI/TCP connections at the Reliable
Reception Level [VIAR]. Message ACK is used in conjunction with Remote
Error Code to provide information relating to memory protection or VI
Descriptor errors and also to provide facilities for implementation
specific error handling. If the VI Error subfield of the Remote Error
Code (Remote Error Codes, next section) indicates "No Error", then
Message ACK contains the Message Number from the last VI Message
received without error. If VI Error is OTHER THAN "No Error", then
Message ACK contains the Message Number of the message segment in error.
Message ACK should not be indicated for messages until both message
data has been written to host memory and the associated completion
information has been written (if applicable). Message ACKs may be
included in any VI Message Segment including that of a NOP message.
When a VI/TCP connection is supporting Reliable Reception level, the
Message ACK field must be valid and will be used to determine when
transmit Descriptors will be completed. RDMA Reads are completed upon
receipt of a valid response. Message ACKs are also indicated for
messages received in error. In this case, the VI Error Type field of
Remote Error Code is set to reflect the appropriate VI error. Remote
Error Code is defined in the following paragraph. Message ACK is
invalid on subsequent messages.
When a VI/TCP connection is supporting the Reliable Delivery or
Unreliable Delivery level, the contents of Message ACK are undefined
and must
be ignored by a receiver. The sender may, for simplicity, choose
to send ACKs in a manner identical to Reliable Reception. Otherwise
the sender should set the value to zero.
Remote Error Code
Remote Error Code is comprised of two subfields - the VI Error Type,
and the IS Error Code. Both VI Error Type and IS Error Code apply to
the VI message identified by the Message ACK field. These are defined in
the following paragraphs.
IS Error Code
IS Error Code is an Implementation Specific error code; its semantics
are implementation dependent and considered outside the scope of this
document. If the IS Error Code is set, Message ACK must be set to
indicate the VI Message on which the error occurred. Note that VI
errors (see next paragraph) and local errors need not be mutually
exclusive and this field may be used to provide supplemental status
information.
VI Error Type
VI Error Type contains bits indicating specific error conditions.
- Bit 0 : RDMA Memory Protection Error
- Bit 1 : VI Descriptor Error
- Bit 2 : Unrecoverable Transport Error
When a VI/TCP connection is supporting Reliable Reception level, the VI
Error Type field must be valid and is used to update the Status of the
VI Descriptor's Control Segment.
When a VI/TCP connection is supporting Reliable Delivery level or
Unreliable Delivery, VI Error Type is undefined and must be ignored
by the receiver. The sender, for simplicity, may choose to set
the VI Error Type as done for Reliable Reception, and should
set it to zero otherwise.
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| IS error code | reserved |UTE|VDE|MPE|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
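Taken together, the fields of the VI Segment Header occupy 24 octets
in network byte order. The following Python sketch shows one way to
serialize and parse the header with the standard struct module. It is
a sketch only: the names ViSegmentHeader, pack_header, unpack_header
and vi_error_bits are hypothetical, and the Remote Error Code is
treated as an opaque 16-bit value apart from the three VI Error Type
bits, whose positions are given above.

```python
import struct
from collections import namedtuple

# Version, Type/Flags, Segment Length, Data Offset, Immediate Data,
# Message Number, Message ACK, Rx Descriptors Posted, Remote Error Code.
HEADER = struct.Struct("!BBHIIIIHH")

ViSegmentHeader = namedtuple("ViSegmentHeader", [
    "version", "type_flags", "segment_length", "data_offset",
    "immediate_data", "message_number", "message_ack",
    "rx_descriptors_posted", "remote_error_code",
])


def pack_header(header):
    """Serialize a ViSegmentHeader into 24 octets, big-endian."""
    return HEADER.pack(*header)


def unpack_header(buffer):
    """Parse the first 24 octets of buffer as a VI Segment Header."""
    return ViSegmentHeader(*HEADER.unpack_from(buffer))


def vi_error_bits(remote_error_code):
    """Decode the VI Error Type bits (bits 0..2) of Remote Error Code."""
    return {
        "memory_protection": bool(remote_error_code & 0x1),        # bit 0
        "descriptor": bool(remote_error_code & 0x2),               # bit 1
        "unrecoverable_transport": bool(remote_error_code & 0x4),  # bit 2
    }
```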
3.3 VI/TCP Connection Establishment (CE) Header
The VI/TCP CE Header is defined as follows.
|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|      Calling Attributes       | Calling Discriminator Length  |
+---------------+---------------+---------------+---------------+
|                           MTU Size                            |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                                                             -+
|                                                               |
+-              Calling Discriminator (64 bytes)               -+
|                                                               |
+-                                                             -+
|                                                               |
+---------------+---------------+---------------+---------------+
|   Calling RDMA Read Window    |  Called Discriminator Length  |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                                                             -+
|                                                               |
+-               Called Discriminator (64 bytes)               -+
|                                                               |
+-                                                             -+
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                           Options                           -+
|                                                               |
+---------------+---------------+---------------+---------------+
Calling Attributes
The Calling Attributes field contains the following flag bits.
Bit 15 is MSB, bit 0 is LSB.
- Bit 0 : Unreliable
- Bit 1 : Reliable Delivery
- Bit 2 : Reliable Reception
- Bit 3 : RDMA Write Enable
- Bit 4 : RDMA Read Enable
- Bit 5 : Descriptor Flow Control Enabled
- Bit 6 : Peer-to-peer Connection Establishment
- Bits 7-15 : reserved
 15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|             Reserved              |P2P|DFC|RRE|RWE| RR| RD| UR|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
The reliability level of the two ends of a connection must match,
and the peer-to-peer bit must match. However the RDMA Write Enable,
RDMA Read Enable and Descriptor Flow Control Enabled bits do not need
to match and may be specified independently by each end of a connection.
If one end sets the "Descriptor Flow Control Enabled" bit, then that
end expects to support flow control by not processing sends until
the remote receive descriptor is available. The end receiving the
"Descriptor Flow Control Enabled" bit must use the "Rx Descriptors
Posted" field in the VI header to notify the other end when and if
sends (or RDMA writes with immediate data) may be done. If the
"Descriptor Flow Control Enabled" bit is not set in a received CE
header, the end receiving MAY, but is not required to, notify the
other end of receive descriptors posted via the "Rx Descriptors Posted"
field of the VI header.
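The matching rule for Calling Attributes can be illustrated with a
short Python sketch. The flag values follow the bit assignments above.
The names CallingAttributes, MUST_MATCH and attributes_compatible are
hypothetical and are introduced only for this example.

```python
from enum import IntFlag


class CallingAttributes(IntFlag):
    UNRELIABLE = 1 << 0
    RELIABLE_DELIVERY = 1 << 1
    RELIABLE_RECEPTION = 1 << 2
    RDMA_WRITE_ENABLE = 1 << 3
    RDMA_READ_ENABLE = 1 << 4
    DESCRIPTOR_FLOW_CONTROL = 1 << 5
    PEER_TO_PEER = 1 << 6


# Bits that both ends of a connection must set identically.
MUST_MATCH = int(CallingAttributes.UNRELIABLE
                 | CallingAttributes.RELIABLE_DELIVERY
                 | CallingAttributes.RELIABLE_RECEPTION
                 | CallingAttributes.PEER_TO_PEER)


def attributes_compatible(calling, called):
    """True if reliability level and peer-to-peer bits agree.

    The RDMA enable bits and the Descriptor Flow Control bit are
    deliberately ignored, since each end may set them independently.
    """
    return (int(calling) & MUST_MATCH) == (int(called) & MUST_MATCH)
```

For example, one end requesting Reliable Delivery with RDMA Write
enabled is compatible with a peer requesting Reliable Delivery with
RDMA Read enabled, but not with a peer requesting Reliable Reception.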
Calling/Called Discriminator and Discriminator Lengths:
These fields are as defined by the VI Architecture [VIAR].
Calling/Called Discriminator:
These fields contain the discriminators as defined by the VI
Architecture. Although the actual length of the discriminator is
determined by the associated length field, a 64 byte field is used
to hold the discriminators thereby setting a maximum length of 64
bytes that may be used.
MTU Size
MTU Size is "proposed" in Connect Request PDUs and is considered an
"agreed" value in a Connect Accept. The agreed value must be the
lesser of the called/calling VI/TCP Provider's MTU capability.
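The fixed portion of the CE Header, which precedes the Options field,
occupies 140 octets. The Python sketch below packs that fixed portion
and computes the agreed MTU. It is illustrative only. The function
names are hypothetical, and padding the unused part of each 64-byte
discriminator field with zero octets is an assumption of this example,
since this memo does not specify the padding value.

```python
import struct

# Calling Attributes, Calling Discriminator Length, MTU Size,
# Calling Discriminator, Calling RDMA Read Window,
# Called Discriminator Length, Called Discriminator.
CE_FIXED = struct.Struct("!HHI64sHH64s")
MAX_DISCRIMINATOR = 64


def pack_ce_fixed(attributes, mtu_size, calling_discriminator,
                  rdma_read_window, called_discriminator):
    """Pack the CE Header fields that precede the Options field."""
    for disc in (calling_discriminator, called_discriminator):
        if len(disc) > MAX_DISCRIMINATOR:
            raise ValueError("discriminator longer than 64 bytes")
    # struct pads "64s" fields with zero octets (assumed padding value).
    return CE_FIXED.pack(attributes, len(calling_discriminator), mtu_size,
                         calling_discriminator, rdma_read_window,
                         len(called_discriminator), called_discriminator)


def agreed_mtu(proposed_mtu, local_mtu):
    """The agreed MTU is the lesser of the two Providers' values."""
    return min(proposed_mtu, local_mtu)
```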
Options:
The option field contains zero or more options. Each option
has the following format:
Byte Byte Byte Byte Byte ......... Byte
| 0 | 1 | 2 | 3 | 4 | | N-1 |
+------+------+------+------+------+------------+--------+
| Type | Length | Data |
+------+------+------+------+------+------------+--------+
The options defined as of this revision are as follows:
End of Option List      Type = 0
                        (no length or data included)
CRC option:             Type = 1
                        Length = 4
                        (no data)
Urgent Marker Option:   Type = 2
                        Length = 4
                        (no data)
A receiver MUST ignore any unsupported or unknown options. The option
list is terminated by the end of the containing VI segment or by
the "End of option list". If the CRC option is specified, the
segment must have a valid CRC and therefore the "End Of Option List"
must be explicitly included so the CRC is not interpreted as an
option.
3.4 VI/TCP RDMA Header
The VI/TCP RDMA Header is defined as follows.
|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|                                                               |
+-                        RDMA Address                         -+
|                                                               |
+---------------+---------------+---------------+---------------+
|                   Registered Memory Handle                    |
+---------------+---------------+---------------+---------------+
|                          RDMA Length                          |
+---------------+---------------+---------------+---------------+
RDMA Address
The RDMA Address field contains the 64-bit data address of the first
data segment from the VI Descriptor.
Registered Memory Handle
The Registered Memory Handle field contains the Memory Handle returned
when the region of memory containing the data segment was registered
with the VI Provider. This is the same memory handle required by the
VI Descriptor.
RDMA Length
The RDMA Length field contains the length field from the VI Descriptor
that indicates the total number of bytes to be transferred across all
segments of a message.
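For illustration, the 16-octet RDMA Header can be serialized in the
same manner as the other headers. The Python names below are
hypothetical. The field widths are those shown in the diagram above: a
64-bit address, a 32-bit memory handle and a 32-bit length.

```python
import struct

# RDMA Address (64 bits), Registered Memory Handle, RDMA Length.
RDMA_HEADER = struct.Struct("!QII")


def pack_rdma_header(address, memory_handle, length):
    """Serialize the RDMA Header in network byte order."""
    return RDMA_HEADER.pack(address, memory_handle, length)


def unpack_rdma_header(buffer, offset=0):
    """Return (address, memory_handle, length) read at offset."""
    return RDMA_HEADER.unpack_from(buffer, offset)
```

In a segment that carries an RDMA Header, the header follows the
24-octet VI Segment Header, so a parser along these lines would read
it at offset 24 of the segment.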
3.5 VI Trailer
The format of the VI trailer is as follows. The CRC immediately
follows the last byte of the VI segment payload, or header
if there is no payload. It is not necessarily word aligned.
|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|                              CRC                              |
+---------------+---------------+---------------+---------------+
The VI trailer is included in CE segments (segments of type
ConnectRequest and ConnectAccept) if and only if the CRC option
is included in that segment. A trailer MAY be included in a
ConnectReject or ConnectNoMatch VI segment at the option of the
sender.
All other segments MUST include a trailer CRC if and only if the
ConnectRequest and ConnectAccept messages which established the
connection both included the CRC option.
3.6 CRC Option
VI/TCP allows for an optional CRC to be included in each segment.
In order for this option to be enabled, both ends of a connection
must include the CRC option in the option field of the ConnectRequest
and ConnectAccept messages.
A ConnectRequest segment will contain a CRC if and only if it contains
the CRC option in the options section.
A ConnectAccept MAY contain the CRC option only if the associated
ConnectRequest contained the CRC option, and MUST contain a computed
CRC if and only if it contains the CRC option.
ConnectReject and ConnectNoMatch segments MAY contain a CRC. Since
these types of segments contain no payload, a receiver can
determine by means of the segment length if there is a CRC
included.
All other segment types MUST contain a CRC if and only if the CRC
option was specified in both the ConnectRequest and ConnectAccept
segments which established the connection. Otherwise they MUST
NOT contain a CRC.
The CRC-32 is calculated across the entire VI segment (but does
not cover other segments of the same message, or lower level protocol
headers such as TCP). The algorithm used to calculate the CRC is
exactly that used for the ethernet CRC except that a different
generator polynomial is used. The generator polynomial for the VI/TCP
CRC is
x^32 + x^31 + x^30 + x^28 + x^27 + x^25 + x^24 + x^22 +
x^21 + x^20 + x^16 + x^10 + x^9 + x^6 + 1.
This polynomial is the standard ethernet polynomial with a left-right
reversal. (Or mathematically, substitute y = x^-1 and multiply by
y^32). In hex format with the x^32 term removed, this is 0xDB710641.
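The relationship between the two polynomials can be verified with a
few lines of Python. The helper names are hypothetical. The exponent
list for the standard ethernet polynomial (0x04C11DB7 with the x^32
term removed) is the only input.

```python
ETHERNET_EXPONENTS = (32, 26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0)


def reciprocal(exponents, degree=32):
    """Left-right reversal: map each term x^k to x^(degree - k)."""
    return tuple(sorted((degree - e for e in exponents), reverse=True))


def hex_without_top_term(exponents, degree=32):
    """Integer value of the polynomial with the x^degree term removed."""
    return sum(1 << e for e in exponents if e != degree)


VI_TCP_EXPONENTS = reciprocal(ETHERNET_EXPONENTS)
```

VI_TCP_EXPONENTS contains exactly the exponents listed in the
generator polynomial above, and hex_without_top_term applied to it
gives 0xDB710641.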
The CRC computation is described mathematically as follows.
a) Start with the VI segment with zero inserted in the CRC field.
This is the entire VI segment including all VI headers and
trailers.
b) Complement the first 32 bits of the segment.
c) The n bits of the segment are then considered to be the
coefficients of a polynomial M(x) of degree n-1.
d) M(x) is divided by G(x), the generator polynomial defined
above, producing a remainder R(x) of degree less than or
equal to 31.
e) The bit sequence of R(x) is complemented and the result is
the CRC and placed in the CRC field of the VI segment.
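Steps a) through e) can be sketched as a bit-at-a-time polynomial
division. The Python below is a minimal illustration, not a
reference implementation. It assumes that the caller has already
zeroed the CRC field and that the most significant bit of each byte
is the higher-order coefficient; the text above does not state the
bit order within a byte, so that choice is an assumption. The name
vitcp_crc is hypothetical.

```python
G = 0x1DB710641  # generator polynomial, x^32 term included


def vitcp_crc(segment: bytes) -> int:
    """Compute the CRC over a VI segment whose CRC field is zero."""
    if len(segment) < 4:
        raise ValueError("segment shorter than 32 bits")
    data = bytearray(segment)
    # b) complement the first 32 bits of the segment
    for i in range(4):
        data[i] ^= 0xFF
    # c), d) treat the bits as M(x) and reduce modulo G(x)
    rem = 0
    for byte in data:
        for bit in range(7, -1, -1):  # assumed order: MSB first
            rem = (rem << 1) | ((byte >> bit) & 1)
            if rem & (1 << 32):
                rem ^= G
    # e) complement the 32-bit remainder
    return rem ^ 0xFFFFFFFF
```

A production implementation would normally use a table-driven
variant; the loop above is written to follow the steps literally.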
If a VI segment is received with an incorrect value in the
CRC field, one of the following two actions MUST be taken.
1. Drop the segment and do not send a TCP ack covering the
bad data. The TCP layer will then attempt to retransmit.
This can only be done if the implementation merges the VI
and TCP layers.
2. Deliver the data to the application with status indicating
transport error. In this case the connection must be
closed immediately if the mode is Reliable Delivery or
Reliable Reception. The connection may continue if in
Unreliable mode. In Reliable Reception mode, a message
ack MUST be sent to the remote end indicating an
"Unrecoverable Transport Error".
3.7 Urgent Marker Option
Either end may specify the Urgent Marker Option. The end receiving
the Urgent Marker Option MAY designate the first byte of any VI
segment as TCP urgent data. As specified by RFC 1122, the urgent
pointer in the TCP header must point to the urgent byte (the first
byte of the VI segment) and not the byte following the urgent
byte as some implementations mistakenly do. (If an implementation
can't guarantee this, it MUST NOT designate any urgent data.)
A VI/TCP implementation MUST NOT designate any data other than the
first byte of a VI segment as urgent. Unless and until the Urgent
Marker Option was received from the remote end of the VI/TCP
connection, no TCP data may be designated as urgent.
4. VI/TCP Connection Establishment
This section contains the state machines governing VI/TCP
connection establishment. Both active and passive (e.g., listens)
scenarios are presented. For peer-to-peer connection establishment,
the same connection establishment mechanism is used with one
end using active connection establishment and one end using passive.
The end of the connection with the "higher" address does an active
establishment and sends the ConnectRequest message. Addresses are
compared by treating each end's IP address as an unsigned binary
number (32 bits for IPv4, 128 bits for IPv6) and performing an
ordinary numerical comparison.
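The address comparison for peer-to-peer establishment can be
sketched with Python's standard ipaddress module. The function name
is_active_side is hypothetical. The handling of equal addresses and
of mixed IPv4/IPv6 pairs is not covered by the text above, so this
sketch simply rejects both cases; that behavior is an assumption.

```python
import ipaddress


def is_active_side(local: str, remote: str) -> bool:
    """Return True if the local end has the "higher" address and
    therefore sends the ConnectRequest."""
    l = ipaddress.ip_address(local)
    r = ipaddress.ip_address(remote)
    if l.version != r.version:
        raise ValueError("cannot compare IPv4 with IPv6")  # assumption
    if l == r:
        raise ValueError("addresses are equal")            # assumption
    # int() yields the address as an unsigned 32- or 128-bit number.
    return int(l) > int(r)
```

For example, is_active_side("192.0.2.10", "192.0.2.9") returns True,
so the end at 192.0.2.10 performs the active establishment.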
Receiving a Connect No Match VI message during peer connection
establishment results in repeated attempts for a period specified by
the VI Consumer's connection timeout value.
4.1. Basic Connection Establishment Timeline
VIPL API [VIDG] | VI/TCP Protocol | VIPL API
------------------------------------------------------------------
| |
| | VipConnectWait
VipConnectRequest | | <-----------------
-----------------> | |
| setup TCP connection |
| |
| Connect Request |
| -------------------> | VipConnectWait(ret)
| | ----------------->
| |
| | VipConnectAccept
| Connect Accept | <----------------
VipConnectReq (ret) | <------------------- |
<----------------- | or Connect Reject |
| or Connect No Match |
4.2. Connection Establishment - Active
The state machine governing active VI/TCP connection establishment
is as follows:
+----------------+ (Legend: event - action)
| Disconnected | <--------------------------------<+
+----------------+ ^
| VipConnectRequest |
| - Setup TCP connection |
| |
\|/ |
+----------------+ TCP setup fail ^
+------>| Connecting +>--------------------------------->+
| +----------------+ ^
| TCP Closes | TCP connection established |
| - | - ConnectRequest |
| Reestablish | |
| \|/ ConnectReject or |
| +----------------+ ConnectNoMatch or Timeout ^
+------<| Pending Accept |>--------------------------------->+
+----------------+ - close TCP connect. ^
| |
| ConnectAccept |
\|/ |
+----------------+ Vip or TCP disconnect ^
| Connected |>--------------------------------->+
+----------------+ - close TCP connection
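The active state machine above can also be written as a transition
table. The Python below is a sketch of one reading of the diagram;
the event strings, the State enumeration and the step function are
hypothetical names chosen for illustration.

```python
from enum import Enum


class State(Enum):
    DISCONNECTED = "Disconnected"
    CONNECTING = "Connecting"
    PENDING_ACCEPT = "Pending Accept"
    CONNECTED = "Connected"


CLOSE = "close TCP connection"

# (state, event) -> (next state, action or None)
ACTIVE_TRANSITIONS = {
    (State.DISCONNECTED, "VipConnectRequest"):
        (State.CONNECTING, "set up TCP connection"),
    (State.CONNECTING, "TCP setup fail"):
        (State.DISCONNECTED, None),
    (State.CONNECTING, "TCP connection established"):
        (State.PENDING_ACCEPT, "send ConnectRequest"),
    (State.PENDING_ACCEPT, "TCP closes"):
        (State.CONNECTING, "reestablish TCP connection"),
    (State.PENDING_ACCEPT, "ConnectReject"): (State.DISCONNECTED, CLOSE),
    (State.PENDING_ACCEPT, "ConnectNoMatch"): (State.DISCONNECTED, CLOSE),
    (State.PENDING_ACCEPT, "Timeout"): (State.DISCONNECTED, CLOSE),
    (State.PENDING_ACCEPT, "ConnectAccept"): (State.CONNECTED, None),
    (State.CONNECTED, "Vip disconnect"): (State.DISCONNECTED, CLOSE),
    (State.CONNECTED, "TCP disconnect"): (State.DISCONNECTED, CLOSE),
}


def step(state: State, event: str):
    """Return (next state, action) or raise on an undefined event."""
    try:
        return ACTIVE_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {event!r} in "
                         f"{state.value}") from None
```

Events that the diagram does not show for a given state raise
ValueError here; how an implementation treats them is left open.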
4.3. Connection Establishment - Passive
The state machine governing passive VI/TCP connection
establishment is as follows:
+----------------+ +--------------+
| Listening on | | Disconnected |
| VI/TCP | +--------------+
| Well Known Port| ^
+----------------+ |
| |
| incoming TCP connection - Accept TCP connection |
| (Legend: event - action) |
\|/ |
+----------------+ Timeout - close TCP connection ^
| Incoming |---------------------------------------------->+
+----------------+ TCP connection closes ^
| |
| incoming Connect Request |
\|/ |
+----------------+ No Matching Discriminator ^
| Matching +---------------------------------------------->+
+----------------+ - Send Connect NoMatch; close TCP connection ^
| |
| Discriminator match |
\|/ |
+----------------+ ConnectReject - VipConnectReject, close TCP ^
| Pending Accept |---------------------------------------------->+
+----------------+ ^
| |
| VipConnectAccept - ConnectAccept |
\|/
+----------------+ Vip or TCP disconnect - close TCP connection ^
| Connected |---------------------------------------------->+
+----------------+
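The passive state machine above can be expressed the same way, as a
table keyed by state and event. This Python sketch is one reading of
the diagram. It assumes that the reject transition out of Pending
Accept is triggered by the local VipConnectReject call and that the
resulting action is to send a ConnectReject segment. All identifiers
are hypothetical.

```python
from enum import Enum


class PassiveState(Enum):
    LISTENING = "Listening"
    INCOMING = "Incoming"
    MATCHING = "Matching"
    PENDING_ACCEPT = "Pending Accept"
    CONNECTED = "Connected"
    DISCONNECTED = "Disconnected"


P = PassiveState
CLOSE = "close TCP connection"

# (state, event) -> (next state, action or None)
PASSIVE_TRANSITIONS = {
    (P.LISTENING, "incoming TCP connection"):
        (P.INCOMING, "accept TCP connection"),
    (P.INCOMING, "Timeout"): (P.DISCONNECTED, CLOSE),
    (P.INCOMING, "TCP connection closes"): (P.DISCONNECTED, None),
    (P.INCOMING, "incoming Connect Request"): (P.MATCHING, None),
    (P.MATCHING, "No Matching Discriminator"):
        (P.DISCONNECTED, "send Connect NoMatch; " + CLOSE),
    (P.MATCHING, "Discriminator match"): (P.PENDING_ACCEPT, None),
    (P.PENDING_ACCEPT, "VipConnectReject"):       # assumed reading
        (P.DISCONNECTED, "send ConnectReject; " + CLOSE),
    (P.PENDING_ACCEPT, "VipConnectAccept"):
        (P.CONNECTED, "send ConnectAccept"),
    (P.CONNECTED, "Vip disconnect"): (P.DISCONNECTED, CLOSE),
    (P.CONNECTED, "TCP disconnect"): (P.DISCONNECTED, CLOSE),
}


def passive_step(state: PassiveState, event: str):
    """Return (next state, action) or raise on an undefined event."""
    try:
        return PASSIVE_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {event!r} in "
                         f"{state.value}") from None
```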
5. Security Considerations
No special security considerations exist at this time.
6. Intellectual Property
The existence of the following US patents is acknowledged:
5,991,818 and 6,094,712. The authors offer no opinion regarding
these patents.
7. References
[VIAR] "Virtual Interface Architecture Specification", Compaq
Computer Corp., Intel Corporation, Microsoft Corporation,
1997.
[VIDG] "Intel Virtual Interface (VI) Architecture Developer's
Guide", Intel Corporation, September 1998.
[PAWS] Jacobson, Braden, Borman, "TCP Extensions for High
Performance", RFC 1323, May 1992.
8. Authors' Addresses
Stephen DiCecco
James Williams
Giganet, Inc.
Concord Office Center
2352 Main Street
Concord, Massachusetts 01742
978.461.0402 (tel)
978.461.0430 (fax)
www.giganet.com
Email:
sdicecco@giganet.com
jimw@giganet.com
Bill Terrell
TROIKA Networks, Inc.
2829 Townsgate Road, Suite 200
Westlake Village, CA 91361
805.370.2612 (tel)
805.371.1344 (fax)
www.TroikaNetworks.com
Email:
terrell@TroikaNetworks.com
John A. Scott
Network Appliance, Inc.
627 Davis Drive, Suite 200
Morrisville, NC 27560
919.993.5626 (tel)
919.993.5604 (fax)
www.netapp.com
Email:
jscott@netapp.com
Costa Sapuntzakis
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134, USA
Phone: +1 408 525 5497
www.cisco.com
Email: csapuntz@cisco.com