Internet-Draft                                              S. DiCecco
<draft-dicecco-vitcp-01.txt>                               J. Williams
Expires May 2001                                         Giganet, Inc.
                   
                                                          Bill Terrell
                                                  TROIKA Networks, Inc.

                                                            John Scott
                                                Network Appliance, Inc.

                                                        C. Sapuntzakis 
                                                         Cisco Systems

                                                     November 17, 2000

                         VI / TCP (Internet VI)

Status of this memo

This document is an Internet-Draft and is offered in full accordance
with all provisions of Section 10 of RFC 2026.  Internet-Drafts are
working documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups.  Note that other groups may also 
distribute working documents as Internet-Drafts.

Internet-Draft documents are valid for a maximum of six months and may
be updated, replaced, or rendered obsolete by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/lid-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
memo are to be interpreted as described in RFC2119.

















DiCecco, Williams, Terrell, Scott, Sapuntzakis                 [Page 1]

Internet-Draft           VI / TCP (Internet VI)       November 17, 2000

Table of Contents

1  Abstract                                                        3
2  Overview                                                        3
2.1  VI Architectural Components                                   3
2.2  VI/TCP                                                        4
2.2.1  Extensions to VI                                            4
2.2.2  VI/TCP Overview                                             4
2.2.2.1  Basic VI Components                                       5
2.2.2.2  Introduction to VI/TCP                                    5
2.2.2.2.1  VI/TCP Addressing                                       5
2.2.2.2.2  VI/TCP Connection Management                            6
2.2.2.2.3  VI/TCP Protocol Messaging                               6
2.2.2.2.4  TCP/IP Options and VI/TCP                               7
2.2.2.2.5  VI/TCP Retransmissions                                  7
2.2.2.2.6  Note on Outstanding RDMA Reads                          8
3  The VI/TCP Protocol                                             8
3.1  VI/TCP Segment Format                                         8
3.2  VI/TCP Segment Header                                         9
3.3  VI/TCP Connection Establishment (CE) Header                  13
3.4  VI/TCP RDMA Header                                           16
3.5  VI Trailer                                                   17
3.6  CRC option                                                   18
3.7  Urgent Marker                                                19
4  VI/TCP Connection Establishment                                20
4.1  Basic Connection Establishment Timeline                      20
4.2  Connection Establishment - Active                            21
4.3  Connection Establishment - Passive                           22
5  Security Considerations                                        23
6  Intellectual Property                                          23
7  References                                                     23
8  Author's Addresses                                             24























1.  Abstract

     The Virtual Interface (VI) architecture [VIAR] describes a high
performance design for interfacing distributed applications to 
accelerated protocol processing.  VI seeks to improve the performance 
of such applications by reducing the latency and overheads associated 
with standard communications protocol stack processing.  VI greatly 
reduces the processing overhead associated with traditional network 
architectures by providing applications a protected, directly 
accessible interface to network hardware - a Virtual Interface.

This memo describes extensions to the VI Architecture designed to 
facilitate operation over TCP/IP.  These extensions take the form of 
enhancements to the VI Provider Library API defined in the VI 
Architecture Developer's Guide [VIDG], and a "VI Protocol" which 
supports VI functionality during operation over TCP/IP.

The extensions to the VI Architecture which support operation over
TCP/IP are intended to be fully compliant with the VI Architecture
[VIAR] and its associated Developer's Guide [VIDG].

2.  Overview

     This section contains a brief overview of VI components and a 
functional overview of VI operation over TCP/IP.


2.1.  VI Architectural Components

     VI is comprised of four architectural components - Virtual 
Interfaces, Completion Queues, VI Providers, and VI Consumers.

Virtual Interfaces (VIs) are the mechanisms that allow VI Consumers
direct access to the data transfer services of VI Providers.  VI 
Consumers post data transfer requests, in the form of Descriptors, 
directly to the VI Provider.  Descriptors are structures that contain 
the information necessary for the VI Provider to process the data 
transfer (e.g., data location).  Descriptors are posted to Work Queues 
(send and receive) associated with the VI.  Facilities are provided to 
signal VI Descriptor postings to the network adapter.  Processing of 
posted Descriptors is asynchronous and descriptors are marked when 
processing completes.  VI Consumers remove completed descriptors from 
Work Queues for reuse in subsequent requests.

Completion Queues provide a facility whereby VI Consumers can create a
single point of notification for processing completed Descriptors.  
Once a Work Queue is associated with a Completion Queue, all 
completions for that Work Queue are handled via that Completion Queue.







The VI Provider consists of a physical network interface (NIC) and
driver functionality.  The VI NIC implements the Virtual Interfaces and
Completion Queues, and directly performs data transfers.  VI NIC 
drivers provide the control and resource management functions to 
maintain the VI between consumers and VI NICs.

VI Consumers are typically application programs and their supporting
operating system functions.  VI Consumers represent the users of a 
Virtual Interface.  Access to the Virtual Interface is through a 
library referred to as the VI Provider Library [VIDG].  The VI Provider 
Library provides an application programming interface for hardware 
connection, endpoint creation and destruction, connection management, 
memory handling, data transfer, queue management, informational
queries, name services, and error handling.


2.2.  VI/TCP

     This section introduces the fundamentals of VI operation over TCP.

2.2.1.  Extensions to VI

     The proposed protocol supports the VI Architecture as currently
defined.  In addition, the protocol supports certain enhancements to 
VI. Extensions to the API defined in [VIDG] would be required to 
exploit such enhancements.  Proposed enhancements are as follows:

- Descriptor Flow Control:  Transmit descriptors may be posted in
advance of the corresponding receive descriptors.  The VI Provider will
supply flow control.

- Attribute Negotiation:    VI Architecture requires that incoming 
connection establishment attempts be rejected unless the calling and 
called VI Attributes match (e.g., Maximum Transfer Unit Size).  The 
protocol permits downward negotiation of MTU sizes.  The smaller
of the two VI MTU sizes proposed by the two ends at connection setup
is used for the connection.


2.2.2.  VI/TCP Overview

     This Section provides an overview of how the components of a 
Virtual Interface are created, managed, and destroyed, and also 
introduces the data transfer models.











2.2.2.1.  Basic VI Components

     Operations on basic VI architectural components remain largely
unchanged with VI/TCP. VI functionality is invoked by a VI Consumer
through the API defined in [VIDG].  Access to a VI NIC is achieved by
opening a handle to the driver representing the NIC.  This handle is
used in subsequent operations.  All memory used in data transfer is
"registered" with the VI Provider.  Memory handles are used to identify
the region and to qualify virtual memory addresses. VIs are created by
the VI Provider upon request by the VI Consumer.  Connections are not
established by creation of a VI and no data transfer can occur until 
the VI is connected to another.  VI Work Queues may be associated with 
Completion Queues to provide a single handling point for completed VI
Descriptors.  VI provides a connection-oriented data transfer service.
Newly created VIs are not pre-associated with other VIs; a VI must be
explicitly connected to another to enter its data transfer phase. VI
provides two types of data transfers - traditional Send/Receive, and
Remote Direct Memory Access (RDMA).


2.2.2.2.  Introduction to VI/TCP

     This Section serves as an introduction to VI operation over 
TCP/IP.

2.2.2.2.1.  VI/TCP Addressing

     The VI Architecture defines a generic "VI Network Address" format
consisting of an "address" portion and a "discriminator" portion.  When
operating VI/TCP, the address portion contains an IP address and the
discriminator is per the VI Architecture [VIAR].  One transport layer
port is reserved for passive connection establishment.  All incoming VI
connections are through this port and VI applications distinguish 
themselves by the VI Network Address discriminator.  For active 
connection establishment, multiple transport layer ports are used.




















2.2.2.2.2.  VI/TCP Connection Management

     With VI/TCP, a VI connection is implemented over an underlying
TCP connection.  The VI/TCP connection establishment process requires 
an underlying TCP connection over which VI/TCP protocol messages may 
be exchanged.  VI connections have a one-to-one correspondence with 
TCP connections; each such pairing is referred to as a VI/TCP 
connection.  When a VI connection is closed, the underlying TCP 
connection must be closed.  Similarly, when a TCP connection is 
closed, the associated VI connection must be closed.  VI Providers 
handling VipConnectRequest primitives [VIDG] first request TCP to 
establish its connection and then perform VI/TCP protocol messaging 
over this underlying connection.  VI Providers must have accepted an 
underlying TCP connection before the associated VI connection is 
accepted.  VI/TCP Providers MUST check that address, handles, and 
attributes are valid for the underlying connection.  From the 
perspective of a VI/TCP Provider, TCP connection setup is an atomic 
operation that either succeeds or fails.  If the operation succeeds, 
VI connection establishment is initiated; otherwise, the VI 
connection is rejected.


2.2.2.2.3.  VI/TCP Protocol Messaging

     VI/TCP functionality is invoked by a VI Consumer through the API
defined in [VIDG].  The VI Provider supplies this functionality.  The 
VI Provider, through use of the VI Protocol, supports this VI/TCP 
functionality.  The VI Protocol defines "messages" to implement 
these VI functions (e.g., connection establishment).  Typically, there is 
one message per Transmit Descriptor.  Each message has a type (e.g., 
RDMA Write).

VI messages are divided into "segments".  These segments are sent, in
order, over the associated TCP connection.  It is recommended, but not
required, that there be exactly zero or one VI segment for each TCP 
segment and that VI segments not be fragmented to span multiple TCP 
segments.  All segments for one VI message will be transmitted before 
the next message is started.  An exception is provided in that RDMA 
Read Response segments may be interleaved with segments of any message 
type other than another RDMA Read Response.
















Example of a valid sequence of VI segments contained in the TCP
stream:

        type               message number          data offset
        ----------------------------------------------------------
        RdmaWrite          0x123                   0x0
        Send               0x124                   0x0
        RdmaReadResponse   0x777660                0x0
        Send               0x124                   0x500
        RdmaReadResponse   0x777660                0x333
        Send               0x124                   0xA00
        RdmaReadResponse   0x777660                0x666
        RdmaReadRequest    0x125                   0x0
        RdmaReadResponse   0x777660                0x999
        Send               0x126                   0x0
        RdmaReadResponse   0x777661                0x0
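
The ordering rule above can be sketched as a small checker.  The
following Python is illustrative only: the function name
is_valid_sequence, the tuple representation of a segment, and the
requirement that data offsets strictly increase within a message are
assumptions of this sketch and are not defined by this memo.  Message
Number wraparound is ignored.

```python
RDMA_READ_RESPONSE = "RdmaReadResponse"

def is_valid_sequence(segments):
    """Return True if (type, message number, data offset) tuples obey the rule."""
    # Lane True holds RDMA Read Responses; lane False holds everything else.
    # Within a lane, a message must finish before the next one starts.
    current = {True: None, False: None}      # message in progress per lane
    last_offset = {True: 0, False: 0}        # last data offset seen per lane
    finished = {True: set(), False: set()}   # messages already left per lane
    for seg_type, msg_num, offset in segments:
        lane = seg_type == RDMA_READ_RESPONSE
        if msg_num == current[lane]:
            # Continuation of the message in progress (assumed: offsets grow).
            if offset <= last_offset[lane]:
                return False
        else:
            # A new message must start at offset zero and must not resume
            # a message that this lane has already moved past.
            if msg_num in finished[lane] or offset != 0:
                return False
            if current[lane] is not None:
                finished[lane].add(current[lane])
            current[lane] = msg_num
        last_offset[lane] = offset
    return True

# The example sequence from the table above.
EXAMPLE = [
    ("RdmaWrite", 0x123, 0x0),
    ("Send", 0x124, 0x0),
    ("RdmaReadResponse", 0x777660, 0x0),
    ("Send", 0x124, 0x500),
    ("RdmaReadResponse", 0x777660, 0x333),
    ("Send", 0x124, 0xA00),
    ("RdmaReadResponse", 0x777660, 0x666),
    ("RdmaReadRequest", 0x125, 0x0),
    ("RdmaReadResponse", 0x777660, 0x999),
    ("Send", 0x126, 0x0),
    ("RdmaReadResponse", 0x777661, 0x0),
]
```

Treating RDMA Read Responses as a separate lane mirrors the exception
stated above: they may interleave with other message types but not
with one another.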

2.2.2.2.4.  TCP/IP Options and VI/TCP

     It is strongly recommended that TCP connections supporting VI/TCP
implement the timestamp option for PAWS (protection against wrapped
sequence numbers) as defined in RFC1323, TCP Extensions for High 
Performance [PAWS].


2.2.2.2.5.  VI/TCP Retransmissions

     VI/TCP will retransmit dropped segments, as required.  All
retransmission is handled at the TCP layer.  It is recommended that
retransmitted segments contain the same data as the original dropped
segment.  In certain circumstances, this will not be possible without
undue burden on an implementation.  The following exceptions are
noted:

- Retransmission is required, but data access results in an access 
violation and retransmission cannot occur.

- Retransmission is required, but cannot occur because the VI 
connection has been closed.

- A posted application buffer has changed.  This is not allowed
per VI architecture and therefore constitutes an error.

If a VI NIC is unable to retransmit original data, it may pad
(substituting zero or arbitrary data for the original but maintaining
the correct size) and should set the "Transmit Error" bit in the
"Type" field of the "VI Segment Header".

With the exception of these error cases, the retransmitted data
MUST always be the same as the original data including
all VI layer headers and trailers.




2.2.2.2.6.  Note on Outstanding RDMA Reads

     For each RDMA Read Request received, memory allocated for the
request must be held until the response is acknowledged.  The number of
outstanding RDMA Reads must be limited to control resource exhaustion.
Discarding excessive RDMA Reads pending completions of outstanding
requests does not seem viable in the absence of a deadlock avoidance
mechanism.  The VI/TCP Protocol provides negotiation of the number of
outstanding RDMA Reads during connection establishment.  This number
represents a per VI limit and the negotiated value remains for the
lifetime of the VI/TCP connection.
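
One way a requester might honor the negotiated limit is a simple
per-VI counter, sketched below.  The class name RdmaReadWindow and its
methods are hypothetical; this memo does not prescribe how an
implementation tracks outstanding RDMA Reads, and the sketch assumes
the limit is enforced on the requesting side.

```python
class RdmaReadWindow:
    """Per-VI bound on outstanding RDMA Reads (illustrative sketch)."""

    def __init__(self, limit):
        if limit < 0:
            raise ValueError("limit must be non-negative")
        self.limit = limit        # value negotiated at connection setup
        self.outstanding = 0      # requests issued but not yet completed

    def try_issue(self):
        """Reserve a slot for a new RDMA Read Request, if one is free."""
        if self.outstanding >= self.limit:
            return False          # caller must wait for a completion
        self.outstanding += 1
        return True

    def complete(self):
        """Release a slot when a valid RDMA Read Response has completed."""
        if self.outstanding == 0:
            raise RuntimeError("no outstanding RDMA Read to complete")
        self.outstanding -= 1
```

Because the limit is fixed for the lifetime of the connection, the
counter never needs to be resized after construction.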

3.  The VI/TCP Protocol

     This section provides the VI/TCP protocol data unit formats.  All
multibyte formats are to be represented in network byte order (i.e.,
big-endian).  Each VI PDU contains a VI Segment Header. Optionally, 
an RDMA Header or CE (connection establishment) Header may be present. 
The VI Segment Header provides sufficient features to support non-
RDMA send/receives.  The RDMA Header must be included for RDMA 
transfers.  The CE Header must be included for connection 
establishment.   The TCP layer provides a reliable data stream
connection and the VI segments are placed in this stream.

3.1.  VI Segment Format

 +---------------+---------------+---------------+---------------+
 |                                                               |
 |                        VI Segment Header                      |
 |                                                               |
 +---------------+---------------+---------------+---------------+
 |                                                               |
 |                        RDMA Header                            |
 |             (Included in RdmaRead and RdmaWrite               |
 |                      segments only.)                          |
 |                                                               |
 +---------------+---------------+---------------+---------------+
 |                                                               |
 |                        CE Header                              |
 |             (Included in ConnectRequest and                   |
 |                 ConnectAccept segments only.)                 |
 |                                                               |
 +---------------+---------------+---------------+---------------+
 |                                                               |
 |                       VI Payload Data                         |
 |             (Included in Send, RdmaWrite and                  |
 |               RdmaReadResponse segments only.)                |
 |                                                               |
 +---------------+---------------+---------------+---------------+
 |                                                               |
 |                          VI Trailer                           |
 |                                                               |
 +---------------+---------------+---------------+---------------+


3.2.  VI Segment Header

     The VI Segment Header is defined as follows.


 |    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
 |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
 +---------------+---------------+---------------+---------------+
 |    Version    |  Type/Flags   |         Segment Length        |
 +---------------+---------------+---------------+---------------+
 |                          Data Offset                          |
 +---------------+---------------+---------------+---------------+
 |                         Immediate Data                        |
 +---------------+---------------+---------------+---------------+
 |                         Message Number                        |
 +---------------+---------------+---------------+---------------+
 |                          Message ACK                          |
 +---------------+---------------+---------------+---------------+
 |     Rx Descriptors Posted     |       Remote Error Code       |
 +---------------+---------------+---------------+---------------+
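
As a rough illustration of the layout above, the following Python
sketch packs and unpacks the 24-byte VI Segment Header in network byte
order using the standard struct module.  The names SegmentHeader,
pack_header, and unpack_header are hypothetical and are not defined by
this memo or by [VIDG].

```python
import struct
from collections import namedtuple

# "!" selects network (big-endian) byte order with no padding.
# B = 8 bits, H = 16 bits, I = 32 bits, in the order shown in the diagram.
SEGMENT_HEADER = struct.Struct("!BBHIIIIHH")

SegmentHeader = namedtuple("SegmentHeader", [
    "version", "type_flags", "segment_length", "data_offset",
    "immediate_data", "message_number", "message_ack",
    "rx_descriptors_posted", "remote_error_code",
])

def pack_header(header):
    """Serialize a SegmentHeader into its 24-byte wire form."""
    return SEGMENT_HEADER.pack(*header)

def unpack_header(data):
    """Parse the first 24 bytes of data into a SegmentHeader."""
    return SegmentHeader(*SEGMENT_HEADER.unpack(data[:SEGMENT_HEADER.size]))
```

For example, pack_header(SegmentHeader(1, 0x80, 24, 0, 0, 0x123, 0, 0, 0))
yields a buffer whose first byte is the version and whose second byte
is the Type/Flags octet.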




Version

This is an 8-bit field indicating the VI/TCP version.
This document describes version one, and this field should
contain the value 0x1.


























Type/Flags

This is an 8-bit field which contains a 5-bit field indicating
the packet type and three one-bit flags.  The defined values of the
type field are as follows:

    -  0 : Send
    -  1 : RdmaWrite
    -  2 : RdmaReadRequest
    -  3 : RdmaReadResponse
    -  4 : NOP
    -  5 : ConnectRequest
    -  6 : ConnectAccept
    -  7 : ConnectReject
    -  8 : ConnectNoMatch

The flag bits are defined as follows (bit 7 is MSB, bit 0 is LSB):

- BIT 7 :  End of Message
           Indicates the current segment is the last of a message

- BIT 6 :  Immediate Data Valid
           As defined by the VI Architecture [VIAR].  This bit
           MUST be correctly set in each segment of a message.
           If "Immediate Data Valid" does not apply to a particular
           message type, it MUST be set to zero by the sender and
           ignored by the receiver.

- BIT 5 :  Transmit Error
           Indicates either transmit length or protection error.
           If the "Transmit Error" bit is set in any segment of
           a VI message, the receiver MUST regard the entire message
           as in error, and notify the receiving application 
           accordingly.

       7     6     5     4     3     2     1     0
    +-----+-----+-----+-----+-----+-----+-----+-----+
    | Eom | IdV | TrE |            Type             |
    +-----+-----+-----+-----+-----+-----+-----+-----+
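
The bit assignments above can be expressed with a few masks.  The
Python below is a sketch; the identifiers SegmentType,
encode_type_flags, and decode_type_flags are invented for this
illustration.  Decoding a type value that this memo does not define
raises ValueError in this sketch, which is an implementation choice
rather than a protocol requirement.

```python
from enum import IntEnum

class SegmentType(IntEnum):
    SEND = 0
    RDMA_WRITE = 1
    RDMA_READ_REQUEST = 2
    RDMA_READ_RESPONSE = 3
    NOP = 4
    CONNECT_REQUEST = 5
    CONNECT_ACCEPT = 6
    CONNECT_REJECT = 7
    CONNECT_NO_MATCH = 8

EOM = 0x80        # bit 7: End of Message
IDV = 0x40        # bit 6: Immediate Data Valid
TRE = 0x20        # bit 5: Transmit Error
TYPE_MASK = 0x1F  # bits 4..0: packet type

def encode_type_flags(seg_type, eom=False, idv=False, tre=False):
    """Build the Type/Flags octet from a type and three flags."""
    value = int(seg_type) & TYPE_MASK
    if eom:
        value |= EOM
    if idv:
        value |= IDV
    if tre:
        value |= TRE
    return value

def decode_type_flags(octet):
    """Split a Type/Flags octet into (type, eom, idv, tre)."""
    return (SegmentType(octet & TYPE_MASK),
            bool(octet & EOM), bool(octet & IDV), bool(octet & TRE))
```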

Segment Length

Segment Length is a 16-bit field containing the length of the VI
segment including the VI Segment Header and VI Segment trailer.
This length can be added to the byte location (within the TCP stream)
of the first byte of this segment to get the first byte position of
the next segment.  The total length exceeds the value contained in the
Segment Length by the length of the trailer.







Data Offset

For the initial segment of any message, this 32-bit field will contain
zero.  For subsequent segments, it will contain the number of bytes
already transferred for this message in prior segments.  Only segment
payload is included in this count; headers are specifically not
included.  A Send, RdmaWrite, or RdmaReadResponse message may be 
divided into multiple VI segments, as these messages carry VI consumer
data which may be up to 4GB in size.  All other VI messages MUST
consist of a single VI segment.
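
To make the Data Offset rule concrete, the sketch below splits a
message payload into segments and reports the Data Offset each segment
would carry.  The function name segment_offsets and the max_payload
parameter are hypothetical; how an implementation chooses the payload
size per segment is not specified here.  Emitting a single empty
segment for a zero-length message is likewise an assumption.

```python
def segment_offsets(message_length, max_payload):
    """Return a list of (data_offset, payload_size, end_of_message)."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    if message_length == 0:
        # Assumed: an empty message still occupies one segment.
        return [(0, 0, True)]
    segments = []
    offset = 0
    while offset < message_length:
        size = min(max_payload, message_length - offset)
        # Data Offset counts payload bytes of prior segments only;
        # headers are not included.
        segments.append((offset, size, offset + size == message_length))
        offset += size
    return segments
```

With a 0xC00-byte payload and 0x500 bytes per segment, the offsets are
0x0, 0x500, and 0xA00, matching the Send message 0x124 in the example
sequence of Section 2.2.2.2.3.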

Immediate Data

May hold 32-bits of optional user data as described by the VI
Architecture [VIAR].  Each segment of a message must contain the
correct value for the immediate data.  If "Immediate Data Valid"
is set to zero for a message, the sender MAY place the immediate
data from the send (or RDMA write) descriptor in this field;
if it does not do so, the sender MUST place zero in this field.
If the message type does not support immediate data, the sender
must place zero in this field.

A receiving end point must ignore the contents of the Immediate
Data field if the message type does not support immediate data.
If the message type does support immediate data and the
"Immediate Data Valid" bit is not set, then the receiver MAY
deliver the contents of the Immediate Data field to the 
user.  If the "Immediate Data Valid" bit is set, then
the Immediate Data must be delivered to the user.


Message Number

Messages are sequentially numbered by the VI/TCP Provider.  The initial
Message Number may be varied by an implementation.  For RDMA Read
Responses, Message Number carries the message number of the 
corresponding RDMA Read Request.  Any two segments with the same type 
are part of the same message if and only if their message numbers are 
equal.

Rx Descriptors Posted

Indicates the number of receive Descriptors, modulo 2^16, that have 
been posted during the lifetime of the VI/TCP connection.  If 
Descriptor flow control is in effect, the VI/TCP provider must delay 
any transmission which would consume receive Descriptors until receive 
descriptors complete and become available.
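
Because the counter is carried modulo 2^16, a sender that tracks
credits has to compare counts with wraparound in mind.  The following
sketch shows one possible calculation.  The function name
available_credits and the idea of a local "consumed" counter are
assumptions of this example; the result is only meaningful while the
true difference stays below 2^16.

```python
MODULUS = 1 << 16  # Rx Descriptors Posted is a 16-bit counter

def available_credits(rx_descriptors_posted, consumed):
    """Receive descriptors the peer has posted that we have not yet used.

    rx_descriptors_posted: latest value received from the peer.
    consumed: local count, modulo 2^16, of transmissions that each
              consumed one receive descriptor at the peer.
    """
    return (rx_descriptors_posted - consumed) % MODULUS
```

For instance, if the peer reports 5 after its counter has wrapped and
the local consumed count is 65533, the difference is 8 available
descriptors.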








Notes on NOP

NOPs are used to send a Message ACK or update RxDescriptorsPosted
when a VI has no data to transmit.  If a VI has data to transmit,
the Message ACK and RxDescriptorsPosted number are included in the
transferred segments.  However, if a transmitter is idle, a NOP is
utilized to permit conveyance of this information.  An implementation
need not send a NOP to notify the remote end of each Rx descriptor
as it is posted; however, sufficient notification SHOULD be done 
so as not to unnecessarily impede the flow of data.

NOPs do not constitute a VI message and therefore do not occupy
space in the message numbering sequence.  The message number
field of a NOP must contain the message number of the last (non
RDMA read response) message sent.

Message ACK

Message ACK is valid only for VI/TCP connections at the Reliable 
Reception Level [VIAR].  Message ACK is used in conjunction with Remote 
Error Code to provide information relating to memory protection or VI 
Descriptor errors and also to provide facilities for implementation 
specific error handling.  If the VI Error subfield of the Remote Error 
Code (Remote Error Codes, next section) indicates "No Error", then
Message ACK contains the Message Number from the last VI Message
received without error.  If VI Error is OTHER THAN "No Error", then
Message ACK contains the Message Number of the message segment in error.
Message ACK should not be indicated for messages until both message
data has been written to host memory and the associated completion
information has been written (if applicable).  Message ACKs may be
included in any VI Message Segment including that of a NOP message.

When a VI/TCP connection is supporting Reliable Reception level, the
Message ACK field must be valid and will be used to determine when 
transmit Descriptors will be completed.  RDMA Reads are completed upon 
receipt of a valid response.  Message ACKs are also indicated for 
messages received in error.  In this case, the VI Error Type field of 
Remote Error Code is set to reflect the appropriate VI error.  Remote 
Error Code is defined in the following paragraph.  Message ACK is 
invalid on subsequent messages.

When a VI/TCP connection is supporting the Reliable Delivery or 
Unreliable Delivery level, the contents of Message ACK are undefined 
and must be ignored by a receiver.  The sender may, for simplicity, 
choose to send ACKs in a manner identical to Reliable Reception.  
Otherwise the sender should set the value to zero.

Remote Error Code

Remote Error Code is comprised of two subfields - the VI Error Type, 
and the IS Error Code.  Both VI Error Type and IS Error Code apply to 
the VI message identified by the Message ACK field.  These are defined in 
the following paragraphs.


IS Error Code

IS Error Code is an Implementation Specific error code; its semantics
are implementation dependent and considered outside the scope of this
document.  If the IS Error Code is set, Message ACK must be set to 
indicate the VI Message on which the error occurred.  Note that VI 
errors (see next paragraph) and local errors need not be mutually 
exclusive and this field may be used to provide supplemental status 
information.

VI Error Type

VI Error Type contains bits indicating specific error conditions.

    -  Bit  0 : RDMA Memory Protection Error
    -  Bit  1 : VI Descriptor Error
    -  Bit  2 : Unrecoverable Transport Error

When a VI/TCP connection is supporting Reliable Reception level, the VI
Error Type field must be valid and is used to update the Status of the
VI Descriptor's Control Segment.

When a VI/TCP connection is supporting Reliable Delivery level or 
Unreliable Delivery, VI Error Type is undefined and must be ignored
by the receiver.  The sender, for simplicity, may choose to set
the VI Error Type as done for Reliable Reception, and should
set it to zero otherwise.

     15  14  13  12  11  10  9   8   7   6   5   4   3   2   1   0
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |         IS error code         |       reserved    |UTE|VDE|MPE|
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
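
The 16-bit Remote Error Code shown above can be assembled and split
with shifts and masks.  The Python below is a sketch under the bit
layout in the diagram; the names pack_remote_error and
unpack_remote_error are hypothetical, and the sketch always sends the
reserved bits as zero.

```python
MPE = 0x0001  # bit 0: RDMA Memory Protection Error
VDE = 0x0002  # bit 1: VI Descriptor Error
UTE = 0x0004  # bit 2: Unrecoverable Transport Error
VI_ERROR_MASK = MPE | VDE | UTE

def pack_remote_error(is_error_code, vi_error_type):
    """Combine the IS Error Code (bits 15..8) and VI Error Type (bits 2..0)."""
    if not 0 <= is_error_code <= 0xFF:
        raise ValueError("IS Error Code must fit in 8 bits")
    if vi_error_type & ~VI_ERROR_MASK:
        raise ValueError("only bits 0..2 are defined for VI Error Type")
    return (is_error_code << 8) | vi_error_type

def unpack_remote_error(value):
    """Return (is_error_code, vi_error_type); reserved bits are dropped."""
    return (value >> 8) & 0xFF, value & VI_ERROR_MASK
```

A VI Error Type of zero corresponds to the "No Error" case referred to
in the Message ACK description.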























3.3.  VI/TCP Connection Establishment (CE) Header

     The VI/TCP CE Header is defined as follows.



     |    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
     |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
     +---------------+---------------+---------------+---------------+
     |     Calling Attributes        | Calling Discriminator Length  |
     +---------------+---------------+---------------+---------------+
     |                           MTU Size                            |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                    Calling Discriminator (64 bytes)         -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |   Calling RDMA Read Window    |  Called Discriminator Length  |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                    Called Discriminator (64 bytes)          -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     +-                      Options                                -+
     |                                                               |
     +---------------+---------------+---------------+---------------+
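
The fixed portion of the CE Header above occupies 140 bytes ahead of
the Options field.  The sketch below packs and unpacks that fixed
portion with the Python struct module.  The function names are
hypothetical, Options are left out, and padding unused discriminator
bytes with zeros is an assumption of this example rather than a stated
requirement.

```python
import struct

# Attributes, Calling Discriminator Length, MTU Size, Calling
# Discriminator, Calling RDMA Read Window, Called Discriminator Length,
# Called Discriminator; network byte order, no padding.
CE_FIXED = struct.Struct("!HHI64sHH64s")
MAX_DISCRIMINATOR = 64

def pack_ce_header(attributes, mtu, calling_disc, rdma_read_window,
                   called_disc):
    """Serialize the fixed part of a CE Header (Options not included)."""
    for disc in (calling_disc, called_disc):
        if len(disc) > MAX_DISCRIMINATOR:
            raise ValueError("discriminator exceeds 64 bytes")
    # struct pads a short "64s" value with zero bytes (assumed padding).
    return CE_FIXED.pack(attributes, len(calling_disc), mtu, calling_disc,
                         rdma_read_window, len(called_disc), called_disc)

def unpack_ce_header(data):
    """Parse the fixed part; discriminators are trimmed to their lengths."""
    (attributes, calling_len, mtu, calling_disc, rdma_read_window,
     called_len, called_disc) = CE_FIXED.unpack(data[:CE_FIXED.size])
    return (attributes, mtu, calling_disc[:calling_len],
            rdma_read_window, called_disc[:called_len])
```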




















Calling Attributes

The Calling Attributes field contains the following flag bits.
Bit 15 is MSB, bit 0 is LSB.

    - Bit  0 : Unreliable
    - Bit  1 : Reliable Delivery
    - Bit  2 : Reliable Reception
    - Bit  3 : RDMA Write Enable
    - Bit  4 : RDMA Read Enable
    - Bit  5 : Descriptor Flow Control Enabled
    - Bit  6 : Peer-to-peer Connection Establishment
    - Bits 7-15 : reserved

     15  14  13  12  11  10  9   8   7   6   5   4   3   2   1   0
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |         Reserved                  |P2P|DFC|RRE|RWE| RR| RD| UR|
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The reliability level of the two ends of a connection must match,
and the peer-to-peer bit must match.  However, the RDMA Write Enable,
RDMA Read Enable, and Descriptor Flow Control Enabled bits do not need
to match and may be specified independently by each end of a connection.

If one end sets the "Descriptor Flow Control Enabled" bit, then that
end expects to support flow control by not processing sends until
the remote receive descriptor is available.  The end receiving the
"Descriptor Flow Control Enabled" bit must use the "Rx Descriptors
Posted" field in the VI header to notify the other end when and if
sends (or RDMA writes with immediate data) may be done.  If the
"Descriptor Flow Control Enabled" bit is not set in a received CE
header, the receiving end MAY, but is not required to, notify the
other end of receive descriptors posted via the "Rx Descriptors Posted"
field of the VI header.
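
The Calling Attributes flag bits and the matching rule described above
can be sketched as follows.  This is a minimal illustration, not part
of the protocol definition; the constant and function names are
hypothetical, and the 16-bit attribute word is assumed to have already
been extracted from the CE header as an integer.

```python
# Bit positions from the Calling Attributes flag list (bit 0 is the LSB).
UNRELIABLE         = 1 << 0
RELIABLE_DELIVERY  = 1 << 1
RELIABLE_RECEPTION = 1 << 2
RDMA_WRITE_ENABLE  = 1 << 3
RDMA_READ_ENABLE   = 1 << 4
DESC_FLOW_CONTROL  = 1 << 5
PEER_TO_PEER       = 1 << 6

# Bits that must be identical at both ends: reliability level and P2P.
MUST_MATCH_MASK = (UNRELIABLE | RELIABLE_DELIVERY | RELIABLE_RECEPTION
                   | PEER_TO_PEER)


def attributes_compatible(calling: int, called: int) -> bool:
    """Return True if the two 16-bit attribute words may be paired."""
    return (calling & MUST_MATCH_MASK) == (called & MUST_MATCH_MASK)


def peer_requires_rx_notifications(received_attrs: int) -> bool:
    """True if the peer set Descriptor Flow Control Enabled, so this end
    must report posted receive descriptors in the VI header."""
    return bool(received_attrs & DESC_FLOW_CONTROL)
```

The RDMA and flow control bits are deliberately left out of the mask,
since each end may set them independently.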

Calling/Called Discriminator and Discriminator Lengths:

These fields are as defined by the VI Architecture [VIAR].

Calling/Called Discriminator:

These fields contain the discriminators as defined by the VI
Architecture.  Although the actual length of each discriminator is
determined by the associated length field, a fixed 64-byte field is
used to hold it, which limits a discriminator to at most 64 bytes.

MTU Size

MTU Size is "proposed" in Connect Request PDUs and is considered an
"agreed" value in a Connect Accept.  The agreed value must be the
lesser of the calling and called VI/TCP Providers' MTU capabilities.



Options:

The option field contains zero or more options.  Each option
has the following format:


     Byte   Byte   Byte   Byte   Byte    .........   Byte
   |  0   |  1   |  2   |  3   |  4   |            |  N-1   |
   +------+------+------+------+------+------------+--------+
   |    Type     |    Length   |           Data             |
   +------+------+------+------+------+------------+--------+

The options defined as of this revision are as follows:

    End of Option List:      Type = 0
                             (no length or data included)

    CRC Option:              Type = 1
                             Length = 4
                             (no data)

    Urgent Marker Option:    Type = 2
                             Length = 4
                             (no data)

A receiver MUST ignore any unsupported or unknown options.  The option
list is terminated by the end of the containing VI segment or by
the "End of Option List" option.  If the CRC option is specified, the
segment must have a valid CRC, and therefore the "End of Option List"
option must be explicitly included so that the CRC is not interpreted
as an option.
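
The option rules above can be sketched as a small parser.  This is an
illustration under stated assumptions rather than a normative
algorithm: the Type and Length fields are taken to be 16-bit values in
network byte order, Length is taken to count the Type and Length fields
themselves (consistent with Length = 4 for options carrying no data),
and "End of Option List" is taken to occupy only its 2-byte Type field.
The function and constant names are hypothetical.

```python
import struct

OPT_END, OPT_CRC, OPT_URGENT_MARKER = 0, 1, 2
KNOWN_OPTIONS = {OPT_CRC, OPT_URGENT_MARKER}


def parse_options(buf: bytes) -> dict:
    """Return {type: data} for the recognised options found in buf."""
    found = {}
    off = 0
    while off + 2 <= len(buf):
        (opt_type,) = struct.unpack_from(">H", buf, off)
        if opt_type == OPT_END:
            break  # explicit terminator: no Length or Data follows
        if off + 4 > len(buf):
            raise ValueError("truncated option header")
        (length,) = struct.unpack_from(">H", buf, off + 2)
        if length < 4 or off + length > len(buf):
            raise ValueError("bad option length")
        if opt_type in KNOWN_OPTIONS:
            found[opt_type] = buf[off + 4:off + length]
        # Unknown or unsupported option types are skipped silently.
        off += length
    return found
```

Stopping at the terminator is what keeps a trailing CRC from being
misread as another option.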


3.4  VI/TCP RDMA Header

     The VI/TCP RDMA Header is defined as follows.


  |    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
  |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
  +---------------+---------------+---------------+---------------+
  |                                                               |
  +-                         RDMA Address                        -+
  |                                                               |
  +---------------+---------------+---------------+---------------+
  |                   Registered Memory Handle                    |
  +---------------+---------------+---------------+---------------+
  |                          RDMA Length                          |
  +---------------+---------------+---------------+---------------+






RDMA Address

The RDMA Address field contains the 64-bit data address of the first
data segment from the VI Descriptor.


Registered Memory Handle

The Registered Memory Handle field contains the Memory Handle returned
when the region of memory containing the data segment was registered
with the VI Provider.  This is the same memory handle required by the 
VI Descriptor.

RDMA Length

The RDMA Length field contains the length field from the VI Descriptor
that indicates the total number of bytes to be transferred across all
segments of a message.
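
As an illustration of the layout above, the 16-byte RDMA header can be
packed and unpacked as shown below.  The field widths (64, 32, and 32
bits) follow the header diagram; network byte order is an assumption
made for this sketch, and the class and method names are hypothetical.

```python
import struct
from typing import NamedTuple

# Assumed big-endian: 64-bit address, 32-bit handle, 32-bit length.
_RDMA_HEADER = struct.Struct(">QII")


class RdmaHeader(NamedTuple):
    rdma_address: int    # address of the first data segment
    memory_handle: int   # handle returned at memory registration
    rdma_length: int     # total bytes across all segments of a message

    def pack(self) -> bytes:
        return _RDMA_HEADER.pack(*self)

    @classmethod
    def unpack(cls, buf: bytes) -> "RdmaHeader":
        return cls(*_RDMA_HEADER.unpack_from(buf))
```

A round trip through pack() and unpack() returns an equal header,
which is a convenient self-check when experimenting with the format.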

3.5  VI Trailer

The format of the VI trailer is as follows.  The CRC immediately
follows the last byte of the VI segment payload, or header
if there is no payload.  It is not necessarily word aligned.

  |    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
  |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
  +---------------+---------------+---------------+---------------+
  |                             CRC                               |
  +---------------+---------------+---------------+---------------+

The VI trailer is included in CE segments (segments of type
ConnectRequest and ConnectAccept) if and only if the CRC option
is included in that segment.  A trailer MAY be included in a
ConnectReject or ConnectNoMatch VI segment at the option of the
sender.

All other segments MUST include a trailer CRC if and only if the
ConnectRequest and ConnectAccept messages that established the
connection both included the CRC option.















3.6  CRC Option

VI/TCP allows for an optional CRC to be included in each segment.
In order for this option to be enabled, both ends of a connection 
must include the CRC option in the option field of the ConnectRequest
and ConnectAccept messages.

A ConnectRequest segment will contain a CRC if and only if it contains
the CRC option in the options section.

A ConnectAccept MAY contain the CRC option only if the associated
ConnectRequest contained the CRC option, and MUST contain a computed
CRC if and only if it contains the CRC option.

ConnectReject and ConnectNoMatch segments MAY contain a CRC.  Since
these types of segments contain no payload, a receiver can 
determine by means of the segment length if there is a CRC
included.

All other segment types MUST contain a CRC if and only if the CRC
option was specified in both the ConnectRequest and ConnectAccept
segments which established the connection.  Otherwise they MUST
NOT contain a CRC.
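
The negotiation rule above reduces to a simple predicate.  The sketch
below is illustrative only; it assumes each side's options have already
been parsed into a set of option type numbers, and the function name is
hypothetical.

```python
OPT_CRC = 1  # CRC option type from the option list


def crc_negotiated(request_options: set, accept_options: set) -> bool:
    """Return True if data segments on this connection carry a CRC."""
    if OPT_CRC in accept_options and OPT_CRC not in request_options:
        # A ConnectAccept may carry the CRC option only if the
        # associated ConnectRequest carried it.
        raise ValueError("ConnectAccept has CRC option without request")
    return OPT_CRC in request_options and OPT_CRC in accept_options
```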

The CRC-32 is calculated across the entire VI segment (but does
not cover other segments of the same message, or lower-level protocol
headers such as TCP).  The algorithm used to calculate the CRC is
exactly that used for the Ethernet CRC, except that a different
generator polynomial is used.  The generator polynomial for the VI/TCP
CRC is

   x^32 + x^31 + x^30 + x^28 + x^27 + x^25 + x^24 + x^22 +
   x^21 + x^20 + x^16 + x^10 + x^9  + x^6  + 1.

This polynomial is the standard Ethernet polynomial with a left-right
reversal.  (Mathematically, substitute y = x^-1 and multiply by
y^32.)  In hex format with the x^32 term removed, this is 0xDB710641.


















The CRC computation is described mathematically as follows.

    a)  Start with the VI segment with zero inserted in the CRC field.
        This is the entire VI segment including all VI headers and
        trailers.

    b)  Complement the first 32 bits of the segment.

    c)  The n bits of the segment are then considered to be the 
        coefficients of a polynomial M(x) of degree n-1.

    d)  M(x) is divided by G(x), the generator polynomial defined
        above, producing a remainder R(x) of degree less than or
        equal to 31.

    e)  The bit sequence of R(x) is complemented; the result is the
        CRC, which is placed in the CRC field of the VI segment.
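
Steps a) through e) can be sketched as a bit-at-a-time polynomial
division.  This is an illustration, not a reference implementation: it
assumes that the bits of each byte are taken most significant bit
first when forming M(x), that the CRC field is the last four bytes of
the segment, and that the CRC is stored most significant byte first.
None of these orderings is stated in the steps above, so they should
be checked against an interoperating implementation.  The function
names are hypothetical.

```python
POLY = 0xDB710641  # G(x) with the x^32 term removed, as given in the text


def vi_crc(segment: bytes) -> int:
    """CRC of a VI segment whose trailing 4-byte CRC field is zeroed."""
    if len(segment) < 4:
        raise ValueError("segment must contain at least the CRC field")
    data = bytearray(segment)
    for i in range(4):              # step b: complement the first 32 bits
        data[i] ^= 0xFF
    rem = 0
    for byte in data:               # steps c-d: divide M(x) by G(x)
        for shift in range(7, -1, -1):
            carry = rem >> 31       # coefficient that becomes x^32
            rem = ((rem << 1) & 0xFFFFFFFF) | ((byte >> shift) & 1)
            if carry:
                rem ^= POLY
    return rem ^ 0xFFFFFFFF         # step e: complement the remainder


def append_crc(body: bytes) -> bytes:
    """Step a plus placement: zero the CRC field, compute, fill it in."""
    crc = vi_crc(body + b"\x00" * 4)
    return body + crc.to_bytes(4, "big")


def crc_ok(segment: bytes) -> bool:
    """Recompute the CRC with the field zeroed and compare."""
    if len(segment) < 4:
        return False
    stored = int.from_bytes(segment[-4:], "big")
    return vi_crc(segment[:-4] + b"\x00" * 4) == stored
```

A table-driven variant would be faster; the bitwise form is used here
because it maps directly onto the mathematical description.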

If a VI segment is received with an incorrect value in the
CRC field, one of the following two actions MUST be taken.

    1.  Drop the segment and do not send a TCP ack covering the 
        bad data.  The TCP layer will then attempt to retransmit.
        This can only be done if the implementation merges the VI
        and TCP layers.

    2.  Deliver the data to the application with status indicating
        transport error.  In this case the connection must be 
        closed immediately if the mode is Reliable Delivery or
        Reliable Reception.  The connection may continue if in
        Unreliable mode.  In Reliable Reception mode, a message
        ack MUST be sent to the remote end indicating an
        "Unrecoverable Transport Error".

3.7  Urgent Marker Option

Either end may specify the Urgent Marker Option.  The end receiving
the Urgent Marker Option MAY designate the first byte of any VI
segment as TCP urgent data.  As specified by RFC 1122, the urgent
pointer in the TCP header must point to the urgent byte (the first
byte of the VI segment) and not to the byte following the urgent
byte, as some implementations mistakenly do.  (If an implementation
cannot guarantee this, it MUST NOT designate any urgent data.)

A VI/TCP implementation MUST NOT designate any data other than the
first byte of a VI segment as urgent.  Unless and until the Urgent
Marker Option has been received from the remote end of the VI/TCP
connection, no TCP data may be designated as urgent.







4.  VI/TCP Connection Establishment

     This section contains the state machines governing VI/TCP
connection establishment.  Both active and passive (e.g., listening)
scenarios are presented.  For peer-to-peer connection establishment,
the same connection establishment mechanism is used, with one
end using active connection establishment and the other using passive.
The end of the connection with the "higher" address does an active
establishment and sends the ConnectRequest message.  Addresses are
compared by treating each end's IP address as an unsigned binary
number (32 bits for IPv4, 128 bits for IPv6) and doing a normal
numerical comparison.
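
The address comparison for peer-to-peer establishment can be sketched
with Python's standard ipaddress module.  The function name is
hypothetical.  The text above does not say what happens when the two
addresses are equal or belong to different address families, so this
sketch simply raises an error in those cases.

```python
import ipaddress


def peer_to_peer_role(local: str, remote: str) -> str:
    """Return "active" if the local end must send the ConnectRequest."""
    a = ipaddress.ip_address(local)
    b = ipaddress.ip_address(remote)
    if a.version != b.version:
        raise ValueError("mixed IPv4/IPv6 comparison is not defined here")
    if a == b:
        raise ValueError("equal addresses: no tie-break is defined here")
    # int() yields the address as an unsigned 32-bit or 128-bit number.
    return "active" if int(a) > int(b) else "passive"
```

For example, an end at 192.0.2.10 peering with 192.0.2.9 takes the
active role and sends the ConnectRequest.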

Receiving a Connect No Match VI message during peer connection
establishment results in repeated attempts for a period specified by
the VI Consumer's connection timeout value.

4.1.  Basic Connection Establishment Timeline




  VIPL API [VIDG]      |   VI/TCP Protocol    |             VIPL API
  ------------------------------------------------------------------
                       |                      |
                       |                      |       VipConnectWait
  VipConnectRequest    |                      | <-----------------
    -----------------> |                      |
                       | setup TCP connection |
                       |                      |
                       |   Connect Request    |
                       | -------------------> |  VipConnectWait(ret)
                       |                      | ----------------->
                       |                      |
                       |                      |     VipConnectAccept
                       |    Connect Accept    |  <----------------
  VipConnectReq (ret)  | <------------------- |
    <----------------- |  or Connect Reject   |
                       |  or Connect No Match |
















4.2.  Connection Establishment - Active

     The state machine governing active VI/TCP connection establishment
is as follows:



             +----------------+  (Legend: event - action)
             |  Disconnected  | <--------------------------------<+
             +----------------+                                   ^
                     |  VipConnectRequest                         |
                     |  - Setup TCP connection                    |
                     |                                            |
                    \|/                                           |
             +----------------+   TCP setup fail                  ^
     +------>|   Connecting   +>--------------------------------->+
     |       +----------------+                                   ^
     | TCP Closes    |  TCP connection established                |
     | -             |  - ConnectRequest                          |
     | Reestablish   |                                            |
     |              \|/          ConnectReject or                 |
     |       +----------------+  ConnectNoMatch or Timeout        ^
     +------<| Pending Accept |>--------------------------------->+
             +----------------+  - close TCP connect.             ^
                     |                                            |
                     | ConnectAccept                              |
                    \|/                                           |
             +----------------+  Vip or TCP disconnect            ^
             |   Connected    |>--------------------------------->+
             +----------------+  - close TCP connection
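
The active state machine above can be written as a transition table.
The sketch below is one reading of the diagram, with event and action
strings paraphrased from its labels; the identifiers are hypothetical
and the table is illustrative rather than normative.

```python
DISCONNECTED, CONNECTING = "Disconnected", "Connecting"
PENDING_ACCEPT, CONNECTED = "Pending Accept", "Connected"

CLOSE_TCP = "close TCP connection"

# (state, event) -> (next state, action or None)
ACTIVE_TRANSITIONS = {
    (DISCONNECTED, "VipConnectRequest"):
        (CONNECTING, "set up TCP connection"),
    (CONNECTING, "TCP setup fail"): (DISCONNECTED, None),
    (CONNECTING, "TCP established"):
        (PENDING_ACCEPT, "send ConnectRequest"),
    (PENDING_ACCEPT, "TCP closes"):
        (CONNECTING, "re-establish TCP connection"),
    (PENDING_ACCEPT, "ConnectReject"): (DISCONNECTED, CLOSE_TCP),
    (PENDING_ACCEPT, "ConnectNoMatch"): (DISCONNECTED, CLOSE_TCP),
    (PENDING_ACCEPT, "Timeout"): (DISCONNECTED, CLOSE_TCP),
    (PENDING_ACCEPT, "ConnectAccept"): (CONNECTED, None),
    (CONNECTED, "Vip disconnect"): (DISCONNECTED, CLOSE_TCP),
    (CONNECTED, "TCP disconnect"): (DISCONNECTED, CLOSE_TCP),
}


def step(state: str, event: str):
    """Return (next_state, action); raise on an event the diagram omits."""
    try:
        return ACTIVE_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"no transition from {state!r} on {event!r}") from None
```

Feeding VipConnectRequest, TCP established, and ConnectAccept through
step() in turn moves a connection from Disconnected to Connected.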

























4.3.  Connection Establishment - Passive

     The state machine governing passive VI/TCP connection 
establishment is as follows:



   +----------------+                                +--------------+
   |  Listening on  |                                | Disconnected |
   |     VI/TCP     |                                +--------------+
   | Well Known Port|                                               ^
   +----------------+                                               |
           |                                                        |
           |  incoming TCP connection - Accept TCP connection       |
           |          (Legend: event - action)                      |
          \|/                                                       |
   +----------------+  Timeout - close TCP connection               ^
   |    Incoming    |---------------------------------------------->+
   +----------------+  TCP connection closes                        ^
           |                                                        |
           |  incoming Connect Request                              |
          \|/                                                       |
   +----------------+  No Matching Discriminator                    ^
   |    Matching    +---------------------------------------------->+
   +----------------+  - Send Connect NoMatch; close TCP connection ^
           |                                                        |
           |  Discriminator match                                   |
          \|/                                                       |
   +----------------+  VipConnectReject - ConnectReject, close TCP  ^
   | Pending Accept |---------------------------------------------->+
   +----------------+                                               ^
           |                                                        |
           |  VipConnectAccept - ConnectAccept                      |
          \|/                                                       |
   +----------------+  Vip or TCP disconnect - close TCP connection ^
   |   Connected    |---------------------------------------------->+
   +----------------+
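
The passive state machine above can likewise be sketched as a table
that is driven by a list of events.  Event and action strings are
paraphrased from the diagram, and all identifiers are hypothetical.
The reject arc out of Pending Accept is read here as the VI Consumer
calling VipConnectReject, which is answered with a ConnectReject
message; that reading is an interpretation of the diagram.

```python
CLOSE_TCP = "close TCP connection"

# (state, event) -> (next state, list of actions)
PASSIVE_TRANSITIONS = {
    ("Listening", "incoming TCP connection"):
        ("Incoming", ["accept TCP connection"]),
    ("Incoming", "Timeout"): ("Disconnected", [CLOSE_TCP]),
    ("Incoming", "TCP connection closes"): ("Disconnected", []),
    ("Incoming", "Connect Request"): ("Matching", []),
    ("Matching", "no matching discriminator"):
        ("Disconnected", ["send ConnectNoMatch", CLOSE_TCP]),
    ("Matching", "discriminator match"): ("Pending Accept", []),
    # Read here as: the consumer rejects, and a ConnectReject is sent.
    ("Pending Accept", "VipConnectReject"):
        ("Disconnected", ["send ConnectReject", CLOSE_TCP]),
    ("Pending Accept", "VipConnectAccept"):
        ("Connected", ["send ConnectAccept"]),
    ("Connected", "Vip or TCP disconnect"): ("Disconnected", [CLOSE_TCP]),
}


def run(events, state="Listening"):
    """Feed events through the table; return (final state, actions)."""
    actions = []
    for event in events:
        state, emitted = PASSIVE_TRANSITIONS[(state, event)]
        actions.extend(emitted)
    return state, actions
```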


















5.  Security Considerations

     No special security considerations exist at this time.

6.  Intellectual Property

    The existence of the following US patents is acknowledged:
    5,991,818 and 6,094,712.  The authors offer no opinion regarding
    these patents.

7.  References


     [VIAR]  "Virtual Interface Architecture Specification", Compaq
             Computer Corp., Intel Corporation, Microsoft Corporation,
             1997.

     [VIDG]  "Intel Virtual Interface (VI) Architecture Developer's
             Guide", Intel Corporation, September 1998.

     [PAWS]  Jacobson, Braden, Borman, "TCP Extensions for High
             Performance", RFC 1323, May 1992.


































8.  Authors' Addresses


     Stephen DiCecco
     James Williams
     Giganet, Inc.
     Concord Office Center
     2352 Main Street
     Concord, Massachusetts  01742
     978.461.0402 (tel)
     978.461.0430 (fax)
     www.giganet.com
     Email:
     sdicecco@giganet.com
     jimw@giganet.com

     Bill Terrell
     TROIKA Networks, Inc.
     2829 Townsgate Road, Suite 200
     Westlake Village, CA  91361
     805.370.2612 (tel)
     805.371.1344 (fax)
     www.TroikaNetworks.com
     Email:
     terrell@TroikaNetworks.com

     John A. Scott
     Network Appliance, Inc.
     627 Davis Drive, Suite 200
     Morrisville, NC 27560
     919.993.5626 (tel)
     919.993.5604 (fax)
     www.netapp.com
     Email:
     jscott@netapp.com

     Costa Sapuntzakis
     Cisco Systems, Inc.
     170 W. Tasman Drive
     San Jose, CA 95134, USA
     Phone: +1 408 525 5497
     www.cisco.com
     Email: csapuntz@cisco.com










