INTERNET-DRAFT                                          C. Sapuntzakis
Expires July 2000                                        Cisco Systems
                                                           D. Cheriton
                                                         Cisco Systems
                                                         February 2000

                            TCP RDMA option
                      draft-csapuntz-tcprdma-00.txt


Status of this Memo

     This document is an Internet-Draft and is NOT offered in accordance
     with Section 10 of RFC2026, and the author does not provide the
     IETF with any rights other than to publish as an Internet-Draft.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as Internet-
     Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other docu-
     ments at any time.  It is inappropriate to use Internet-Drafts as
     reference material or to cite them other than as "work in pro-
     gress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


Copyright Notice

     Copyright (C) Cisco Systems (1999-2000). All Rights Reserved.


Abstract


     The TCP option introduced in this draft reduces the overhead of
     receiving data with TCP-based protocols such as NFS and HTTP.  It
     enables the construction of a simple hardware accelerator that
     copies data directly from the incoming packet into application
     buffers, avoiding expensive copies in the protocol stack.  Even
     without hardware acceleration, the option enables the protocol
     stack to decrease the number of copies it must do.



Sapuntzakis, Cheriton                                           [Page 1]

Internet-Draft              TCP RDMA option             22 February 2000


     The TCP RDMA option is an annotation and requires no modifications
     to overlying protocols. It can be used with popular protocols such
     as HTTP, NFS, and CIFS, along with new protocols.

     The TCP option also provides a bit to indicate application-level
     message boundaries. The bit enables out-of-order processing of the
     TCP receive queue, potentially decreasing service times in the
     presence of packet drops and improving performance on parallel sys-
     tems.


Table of Contents


             1.   Glossary
             2.   Introduction
             3.   RDMA option
             3.1.   Usage
             3.1.1.   RID
             3.1.2.   Data Offset
             3.1.3.   Data Length
             3.1.4.   Total RDMA Length
             3.1.5.   Buffer Offset
             3.1.6.   Message Aligned (A) bit
             3.1.7.   Unsolicited (U) bit
             3.1.8.   Other constraints
             3.2.   Negotiating use of the option
             3.3.   Multiple options
             3.4.   Interactions with TCP congestion control
             4.   Examples
             5.   RID Formats
             5.1.   NFS
             5.1.1.   NFS RID Format
             5.1.1.1.   RPC XID
             5.1.1.2.   Operation
             5.1.1.3.   Zeroes
             5.1.2.   READ RPC replies
             5.1.3.   WRITE RPC requests
             5.1.4.   Message Aligned (A) bit
             5.2.   HTTP
             5.2.1.   RID format
             5.2.2.   GET responses
             5.2.3.   POST or PUT requests
             5.3.   Common Internet File System (CIFS)
             5.3.1.   Tag Format
             5.3.1.1.   Pid and Mid
             5.3.1.2.   Operation Index
             5.3.1.3.   Zeros
             5.3.2.   Unsolicited Bit
             5.3.3.   Message Aligned (A) bit
             5.4.   SCSI
             6.   Security considerations
             6.1.   Receiver security considerations
             7.   Authors' Addresses
             8.   References


1.  Glossary

     remote DMA (RDMA) - the transfer of application data from a remote
     buffer into a contiguous, usually aligned, local buffer

     RDMA data - the application data being transferred via RDMA

     unsolicited data - data that a receiver did not request


2.  Introduction


     Currently, doing remote DMA (RDMA) between processors over
     TCP-based protocols such as HTTP and NFS requires considerable
     processing on the client and server machines, especially at speeds
     of a gigabit per second or higher. To see where this overhead
     comes from, it is instructive to look at an example.

     Consider the problem of an 8 kilobyte NFS transfer coming in from
     an Ethernet and eventually ending up in an application's memory.
     The Ethernet MTU is 1500 bytes, so the sender sends at least 6
     packets across the Ethernet.
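
     As a rough check of the packet count above, the following sketch
     divides the transfer size by the TCP payload that fits in one
     Ethernet frame. It is illustrative only: the 20-byte IPv4 and
     20-byte TCP header sizes assume that no IP or TCP options are
     present, and the function name is hypothetical.

```python
import math

ETHERNET_MTU = 1500   # bytes available to IP in one Ethernet frame
IPV4_HEADER = 20      # assumption: IPv4 header without options
TCP_HEADER = 20       # assumption: TCP header without options


def min_packets(transfer_bytes: int, mtu: int = ETHERNET_MTU) -> int:
    """Return the smallest number of TCP segments that can carry the transfer."""
    payload_per_packet = mtu - IPV4_HEADER - TCP_HEADER
    return math.ceil(transfer_bytes / payload_per_packet)


# An 8 kilobyte transfer needs at least six packets; NFS and RPC headers
# only add to the byte count, which is why the text says "at least".
print(min_packets(8 * 1024))
```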

     At the receiver, the six packets arrive at the network interface.
     For each of the six packets, the network interface card on the
     receiver copies the entire packet to the host's memory. The network
     interface notifies the host software of the arrival of the packets.

     The host software then does IP and TCP processing, which eventually
     results in the software copying the TCP payload into a TCP receive
     buffer.

     NFS parses the data in the TCP receive buffer to find the file
     pages and copies them to the buffer cache. Once the pages are in
     the buffer cache, the operating system maps them into the
     application's address space.

     These memory-to-memory copies cost valuable main memory bandwidth
     at clients and servers. To improve performance, it is necessary to
     reduce the number of such copies. One way to do this is to have the
     network interface card write the file data into the final location
     (e.g. the buffer cache) the first time. This requires that the net-
     work interface card recognize file data in incoming packets.

     For NFS and HTTP, the problem of recognizing file data involves
     parsing the protocol headers. This is complex and does not lend
     itself to a simple hardware realization.

     This memo defines a new TCP option, the RDMA option, which circum-
     vents the parsing of complex protocol headers. The sender places
     the option on TCP segments containing RDMA data. The RDMA option
     describes to the receiver the location of the RDMA data in the TCP
     payload.

     An RDMA identifier (RID) in the option allows multiple outstanding
     RDMA transfers on a TCP connection by allowing the sender and
     receiver to uniquely tag the RDMAs. The layout of the RID depends
     on the specific higher layer protocol (e.g. NFS).

     The TCP RDMA option is an annotation and requires no modifications
     to overlying protocols.

     This memo specifies the RDMA option in detail in section 3. The use
     of this option with NFS, HTTP, CIFS, and SCSI is specified in
     section 5.


3.  RDMA option


3.1.  Usage



         Kind: 25 (decimal)
         Length: 2 or 4 or 16 or 20 bytes

     Byte/       0     |       1       |       2       |       3       |
        /              |               |               |               |
       |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
       +---------------+---------------+-+-+---------------------------+
     0 |      25       |    Length     |A|U|       RDMA ID             |
       +---------------+---------------+-+-+---------------------------+
     4 |                         RDMA ID (RID)                         |
       +---------------------------------------------------------------+
     8 |                         Buffer Offset                         |
       +-------------------------------+-------------------------------+
     12|          Data  Offset         |          Data Length          |
       +-------------------------------+-------------------------------+
     16|                       Total RDMA Length                       |
       +---------------------------------------------------------------+


3.1.1.  RID


     All segments in a single RDMA transfer carry the same 46-bit RDMA
     ID (RID).

     The RID is an application-level identifier that the receiver can
     use to map the transfer to an application buffer. The exact value
     of the RID depends on the overlying protocol. RID formats for
     several popular protocols are given in section 5.

     The RDMA ID is stored in network byte order. That is, bits 40-45 of
     the RID get placed in bits 0-5 of byte 2. Bits 0-7 of the RID get
     placed in bits 0-7 of byte 7.
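
     The layout above can be exercised with a short sketch. The Python
     below is illustrative only and is not part of this specification;
     the function names and the dictionary used for the decoded fields
     are assumptions. It packs and unpacks the 16- and 20-byte forms of
     the option, placing the A bit, the U bit, and the upper 14 bits of
     the RID in bytes 2-3 and the lower 32 bits of the RID in bytes 4-7.

```python
import struct

RDMA_OPTION_KIND = 25
RID_MASK = (1 << 46) - 1   # the RID is a 46-bit quantity


def pack_rdma_option(rid, buffer_offset, data_offset, data_length,
                     total_length=None, aligned=False, unsolicited=False):
    """Encode the 16-byte option, or the 20-byte one if total_length is given."""
    if not 0 <= rid <= RID_MASK:
        raise ValueError("RID must fit in 46 bits")
    # Bytes 2-3: A is bit 7 of byte 2, U is bit 6, then RID bits 45-32.
    high = (int(aligned) << 15) | (int(unsolicited) << 14) | (rid >> 32)
    length = 16 if total_length is None else 20
    option = struct.pack("!BBHIIHH", RDMA_OPTION_KIND, length, high,
                         rid & 0xFFFFFFFF, buffer_offset,
                         data_offset, data_length)
    if total_length is not None:
        option += struct.pack("!I", total_length)
    return option


def unpack_rdma_option(option):
    """Decode a 16- or 20-byte RDMA option into a dictionary of fields."""
    if (len(option) not in (16, 20) or option[0] != RDMA_OPTION_KIND
            or option[1] != len(option)):
        raise ValueError("not a 16- or 20-byte RDMA option")
    _, _, high, rid_low, buffer_offset, data_offset, data_length = \
        struct.unpack("!BBHIIHH", option[:16])
    fields = {
        "aligned": bool(high & 0x8000),
        "unsolicited": bool(high & 0x4000),
        "rid": ((high & 0x3FFF) << 32) | rid_low,
        "buffer_offset": buffer_offset,
        "data_offset": data_offset,
        "data_length": data_length,
        "total_length": None,   # absent in the 16-byte form
    }
    if len(option) == 20:
        fields["total_length"] = struct.unpack("!I", option[16:])[0]
    return fields
```

     Network byte order ("!" in the format string) puts the most
     significant RID bits first, which matches the placement of RID
     bits 40-45 in byte 2 and RID bits 0-7 in byte 7.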


3.1.2.  Data Offset


     The data offset specifies the number of bytes from the beginning of
     the TCP payload to the RDMA transfer data.

     The data offset MUST NOT exceed the length of the TCP payload.


3.1.3.  Data Length


     The data length specifies the number of bytes of RDMA transfer data
     in this segment, starting at the data offset.

     The data length MUST NOT cause the option to describe bytes past
     the end of the TCP segment.

     A data length of zero is valid.
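
     The two constraints on the data offset and data length can be
     checked together. The following is a minimal sketch; the function
     name is hypothetical, and payload_length stands for the number of
     payload bytes in the TCP segment carrying the option.

```python
def rdma_range_is_valid(payload_length: int, data_offset: int,
                        data_length: int) -> bool:
    """Check the data offset and data length against the TCP payload size."""
    # The data offset must not exceed the length of the TCP payload.
    if data_offset > payload_length:
        return False
    # The described bytes must not run past the end of the segment.
    return data_offset + data_length <= payload_length
```

     Note that an offset equal to the payload length passes the check
     only when the data length is zero, which the text above permits.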


3.1.4.  Total RDMA Length


     The total RDMA length is the number of bytes that will be
     transferred using this RID. If the sender does not know the length
     of the RDMA when the segment is sent, the sender should send the
     16-byte version of this option, which omits the total RDMA length
     field.

     The total RDMA length, when present, MUST be the same for all seg-
     ments in the RDMA transfer.

     A total RDMA length of zero is valid.


3.1.5.  Buffer Offset


     If this RDMA transfer is going into a separate buffer on the
     receiver, the buffer offset field specifies the offset in that
     buffer. At that offset, the receiver should write the RDMA data
     demarcated by the data offset and data length fields.
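
     Because every segment carries its own buffer offset, a receiver
     can place RDMA data as segments arrive, in any order. The sketch
     below illustrates this with an in-memory bytearray standing in for
     the application buffer; the function name and the choice to raise
     ValueError on a bad range are assumptions.

```python
def place_rdma_data(buffer: bytearray, payload: bytes, buffer_offset: int,
                    data_offset: int, data_length: int) -> None:
    """Copy the RDMA data of one segment to its place in the buffer."""
    data = payload[data_offset:data_offset + data_length]
    if len(data) != data_length:
        raise ValueError("option describes bytes past the end of the payload")
    if buffer_offset + data_length > len(buffer):
        raise ValueError("transfer does not fit in the receive buffer")
    buffer[buffer_offset:buffer_offset + data_length] = data


# The second segment arrives first; the result is the same either way.
target = bytearray(6)
place_rdma_data(target, b"def", buffer_offset=3, data_offset=0, data_length=3)
place_rdma_data(target, b"HDRabc", buffer_offset=0, data_offset=3, data_length=3)
```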


3.1.6.  Message Aligned (A) bit

     The message aligned bit, when 1, indicates that byte 0 of the TCP
     payload corresponds to the start of a new application-layer mes-
     sage.

     The message aligned (A) bit is bit 7 of byte 2.

     The four-byte version of the option may be sent if the sender
     wishes to communicate only the message aligned state.


3.1.7.  Unsolicited (U) bit


     In NFS and other RPC-based protocols, transfers from the server to
     the client (e.g. reads) occur in response to an explicit request
     by the client. The explicit request indicates that the client has
     an allocated buffer waiting for the data from the transfer (or at
     least has had the opportunity to allocate one). The client can use
     the explicit request to communicate an identifier to the server,
     which the server places in the response. In the response, that
     identifier, embedded in the RID, can be used to associate the data
     with a client buffer.

     However, transfers from the client to the server (e.g. writes)
     are often carried in the request itself. There is usually no
     opportunity in these protocols for a client to obtain any kind of
     identifier for the server's application buffer. Indeed, the server
     may not even have an application buffer allocated for the client
     request.

     To indicate this special situation, the unsolicited bit is used.

     The unsolicited bit (U) is bit 6 of byte 2.

     The sender SHOULD set the unsolicited bit (U) to one if the RID is
     not expected by the receiver.


3.1.8.  Other constraints


     The RDMA option MUST appear on every segment containing data that
     is part of the RDMA transfer.

     The sender MUST align the RDMA option on a 4 octet boundary rela-
     tive to the TCP header.


3.2.  Negotiating use of the option


     For the purpose of options negotiation, the length field MAY be set
     to 2 to prevent any accidental RDMA transfers.


3.3.  Multiple options


     A conforming implementation MAY examine only the first RDMA option
     in a segment. The TCP segments MUST conform to the rules laid out
     in section 3 when all RDMA options but the first in the segment
     are stripped. The most important of these requirements is that the
     RDMA option MUST appear on every segment that contains data that
     is part of the RDMA transfer.
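
     A receiver that looks only at the first RDMA option can find it
     with an ordinary walk over the TCP options area. The sketch below
     assumes the standard TCP option encoding (kind 0 ends the list,
     kind 1 is a one-byte no-op, every other option carries a length
     byte); the function name is hypothetical.

```python
from typing import Optional

TCP_OPT_EOL = 0          # end of option list
TCP_OPT_NOP = 1          # one-byte padding option
RDMA_OPTION_KIND = 25


def first_rdma_option(options: bytes) -> Optional[bytes]:
    """Return the first RDMA option in the TCP options area, or None."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break                      # truncated: no length byte
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                      # malformed length
        if kind == RDMA_OPTION_KIND:
            return options[i:i + length]
        i += length
    return None
```

     Any later RDMA option in the same segment is simply never reached,
     which matches the behavior permitted above.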


3.4.  Interactions with TCP congestion control


     The RDMA option may result in segments smaller than the maximum
     segment size (MSS) being sent. This may slow the opening of
     congestion windows on systems that open them based on the number
     of MSS-sized packets received.


4.  Examples


     The figure below is a representation of a TCP stream.  It has a
     single RDMA transfer that occupies two contiguous sections of the
     TCP stream (section 1 and section 2).



                               Sequence number
          +----------------+   0
          |     Header     |
          |                |
          |                |
          +----------------+   100
          | Transfer       |
          | Section 1      |
          /                /
          /                /
          |                |
          +----------------+   2100
          |    Trailer     |
          +----------------+   2200
          |    Header      |
          |                |
          +----------------+   2300
          | Transfer       |
          | Section 2      |
          /                /
          /                /
          |                |
          +----------------+   4300


     The table below illustrates how this section of the TCP stream will
     be turned into 6 TCP segments with the RDMA option. The TCP maximum
     segment size for this stream is 1000 bytes.

     The sequence number comes from the TCP header.


          +------------------------------------------------------+
          | Segment  | Sequence | Buffer   | Data     | Data     |
          | Number   | Number   | Offset   | Offset   | Length   |
          |          |          |          |          |          |
          +------------------------------------------------------+
          | 1        | 0        | 0        | 100      | 900      |
          | 2        | 1000     | 900      | 0        | 1000     |
          | 3        | 2000     | 1900     | 0        | 100      |
          | 4        | 2200     | 2000     | 100      | 900      |
          | 5        | 3200     | 2900     | 0        | 1000     |
          | 6        | 4200     | 3900     | 0        | 100      |
          +------------------------------------------------------+


     Segment #3 is only 200 bytes, part data and part trailer. If the
     next header had been available to the TCP stack at the time, the
     stack could have sent it as part of the segment. Below is such a
     segmentation.


          +------------------------------------------------------+
          | Segment  | Sequence | Buffer   | Data     | Data     |
          | Number   | Number   | Offset   | Offset   | Length   |
          |          |          |          |          |          |
          +------------------------------------------------------+
          | 1        | 0        | 0        | 100      | 900      |
          | 2        | 1000     | 900      | 0        | 1000     |
          | 3        | 2000     | 1900     | 0        | 100      |
          | 4        | 2300     | 2000     | 0        | 1000     |
          | 5        | 3300     | 3000     | 0        | 1000     |
          +------------------------------------------------------+
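
     The second table above can be reproduced with a simple greedy
     rule: fill each segment up to the MSS, but end it before a second
     range of RDMA data would begin. The following sketch is one way to
     express that rule and is not a required sender algorithm; the
     names Segment and segment_stream are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    seq: int            # TCP sequence number of the first payload byte
    buffer_offset: int  # where this data lands in the receiver's buffer
    data_offset: int    # offset of the RDMA data within the TCP payload
    data_length: int    # bytes of RDMA data carried by this segment


def segment_stream(data_ranges: List[Tuple[int, int]], stream_end: int,
                   mss: int) -> List[Segment]:
    """Greedy segmentation with at most one RDMA data range per segment.

    data_ranges holds sorted, non-overlapping (start, end) sequence
    ranges of RDMA data; end is exclusive.
    """
    segments = []
    seq = 0
    buffer_offset = 0
    while seq < stream_end:
        end = min(seq + mss, stream_end)
        first = None
        for start, stop in data_ranges:
            if stop <= seq:
                continue               # range already sent
            if start >= end:
                break                  # range begins after this segment
            if first is None:
                first = (max(start, seq), min(stop, end))
            else:
                end = start            # cut before the second range
                break
        if first is None:
            # No RDMA data here; a real sender would omit the option.
            data_start = data_stop = seq
        else:
            data_start, data_stop = first
        length = data_stop - data_start
        segments.append(Segment(seq, buffer_offset, data_start - seq, length))
        buffer_offset += length
        seq = end
    return segments


# The stream from the figure: RDMA data at 100-2100 and 2300-4300.
for segment in segment_stream([(100, 2100), (2300, 4300)], 4300, 1000):
    print(segment)
```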


     Note: not putting application headers at the front of a TCP segment
     may cause decreased performance with some receivers.

     In either segmentation, segment 3 cannot include any of Transfer
     Section 2, since the RDMA option can describe only one contiguous
     range of RDMA data per segment. Thus, segment 3 will always be
     smaller than the MSS, even if the stack has more to send.


5.  RID Formats



5.1.  NFS



     In NFS, file pages are transferred using the NFS READ and WRITE
     RPCs. When issuing a READ, the NFS client presumably has an
     application buffer (e.g. a block cache buffer) waiting to absorb
     the returned data. When receiving a WRITE, the NFS server may not
     have an application buffer waiting to absorb the written data.


5.1.1.  NFS RID Format


     RID format for NFS protocol:


       4       4 3        3 3
       5       0 9        2 1                        0
      +---------+----------+--------------------------+
      |  Zero   | Operation|        RPC XID           |
      +---------+----------+--------------------------+



5.1.1.1.  RPC XID


     The NFS protocols work on top of ONC RPC, which associates a
     32-bit transaction ID (XID) with each RPC. The XID occupies bits
     0-31 of the RID.


5.1.1.2.  Operation


     NFS version 4 allows multiple read and write "operations" per RPC.
     These operations share the same XID since they are part of the same
     RPC. To disambiguate the RDMAs resulting from these operations, the
     RID contains an Operation Index in bits 32-39 of the RID. The
     operation index is zero for the first operation, one for the
     second, and so on.

     Note that the operation index is independent of whether the
     operation results in an RDMA. If only the third operation in an
     RPC results in an RDMA, then the RID for that RDMA will have a 2
     in the operation index field.

     The operation index MUST be zero for NFS versions 2 and 3.


5.1.1.3.  Zeroes


     Bits 40-45 MUST be set to zero by the sender and received as zeros
     by the receiver.
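
     The NFS RID layout can be expressed as two small helper functions.
     This is a sketch only; the function names are hypothetical, and
     rejecting a RID whose upper bits are non-zero is one possible
     reading of the requirement above.

```python
from typing import Tuple


def nfs_rid(xid: int, operation_index: int = 0) -> int:
    """Build an NFS RID: XID in bits 0-31, operation index in bits 32-39."""
    if not 0 <= xid < (1 << 32):
        raise ValueError("XID must fit in 32 bits")
    if not 0 <= operation_index < (1 << 8):
        raise ValueError("operation index must fit in 8 bits")
    return (operation_index << 32) | xid      # bits 40-45 stay zero


def split_nfs_rid(rid: int) -> Tuple[int, int]:
    """Return (xid, operation_index) for an NFS RID."""
    if rid >> 40:
        raise ValueError("bits 40-45 must be zero")
    return rid & 0xFFFFFFFF, (rid >> 32) & 0xFF
```

     For NFS versions 2 and 3 the default operation index of zero
     applies, so the RID is numerically equal to the XID.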


5.1.2.  READ RPC replies


     For the file pages in NFS READ responses, the server MUST NOT set
     the unsolicited bit to 1.

     If the READ RPC fails and no data is returned, the server SHOULD
     indicate a zero-length RDMA transfer.


5.1.3.  WRITE RPC requests


     For NFS WRITE calls, the client SHOULD set the unsolicited bit to
     one, since the server is not expecting the WRITE.


5.1.4.  Message Aligned (A) bit

     The message aligned bit, when used on an NFS connection, indicates
     the start of an ONC RPC message at byte 0 of a payload. For the
     purposes of this specification, the start of an ONC RPC message is
     the four byte length field that is defined for the tunneling of RPC
     over TCP.


5.2.  HTTP



5.2.1.  RID format


       4                 3 3
       5                 2 1                       0
      +-------------------+-------------------------+
      |      Zero         |      Request idx        |
      +-------------------+-------------------------+


     On an HTTP/1.1 connection, the server sends back responses in the
     order it received requests. Thus, the index of the request, where
     the first request is index 0, is sufficient to disambiguate the
     RDMAs.
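
     Since the HTTP RID is just the index of the request on the
     connection, each end only needs a per-connection counter. The
     sketch below shows such a counter; the class name is hypothetical,
     and raising an error once the 32-bit index is exhausted is an
     assumption, as this memo does not say what happens in that case.

```python
class HttpRidCounter:
    """Hands out HTTP RIDs for the requests on one connection, in order."""

    def __init__(self) -> None:
        self._next_index = 0

    def next_rid(self) -> int:
        """Return the RID for the next request sent on this connection."""
        if self._next_index >= (1 << 32):
            # Assumption: behavior past 2**32 requests is not specified.
            raise OverflowError("request index no longer fits in 32 bits")
        rid = self._next_index    # bits 32-45 of the RID stay zero
        self._next_index += 1
        return rid
```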


5.2.2.  GET responses


     The unsolicited bit SHOULD be set to zero.

     Note that the HTTP server may not know the length of the response
     when it starts sending, so clients should be prepared to receive
     the 16-byte option.

5.2.3.  POST or PUT requests


     In POST or PUT requests, the client sends data to the server. The
     unsolicited bit SHOULD be set to one.


5.3.  Common Internet File System (CIFS)


     The Common Internet File System (CIFS) is built on top of an RPC
     system known as Server Message Block (SMB).


5.3.1.  Tag Format




      4     4 3          3 3         1 1
      5     0 9          2 1         6 5           0
     +-------+------------+-----------+-------------+
     | Zero  | Operation  |   PID     |     MID     |
     +-------+------------+-----------+-------------+


5.3.1.1.  Pid and Mid


     In SMB, a request is uniquely identified by a 64-bit quantity that
     includes four 16-bit fields: Tree Id, User Id, Process Id (PID),
     and Multiplex Id (MID).

     There is insufficient room in the RID to include all four fields.
     However, the PID and MID originate from the client and are not
     interpreted by the server. The client can assign PIDs and MIDs so
     as to disambiguate concurrent requests. Thus, a CIFS client using
     the RDMA option MUST ensure that two concurrent SMB requests do
     not share the same PID and MID fields.


5.3.1.2.  Operation Index


     CIFS supports compound requests that can result in multiple
     transfers per SMB. The operation index, in bits 32-39, corresponds
     to the index of the operation in the SMB that caused the RDMA. The
     first operation is given index zero and so on. Operations are logi-
     cally assigned indexes whether or not they cause an RDMA.

5.3.1.3.  Zeros


     Bits 40-45 MUST be set to zero by the sender and received as zeros
     by the receiver.
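
     The CIFS RID layout and the uniqueness rule for PID and MID can be
     sketched as follows. The names cifs_rid and SmbRequestTracker are
     hypothetical, and tracking outstanding requests in a set is only
     one way a client might enforce the rule.

```python
def cifs_rid(pid: int, mid: int, operation_index: int = 0) -> int:
    """Build a CIFS RID: MID in bits 0-15, PID in 16-31, operation in 32-39."""
    if not 0 <= pid < (1 << 16) or not 0 <= mid < (1 << 16):
        raise ValueError("PID and MID must each fit in 16 bits")
    if not 0 <= operation_index < (1 << 8):
        raise ValueError("operation index must fit in 8 bits")
    return (operation_index << 32) | (pid << 16) | mid   # bits 40-45 zero


class SmbRequestTracker:
    """Refuses to start a request whose (PID, MID) pair is still in flight."""

    def __init__(self) -> None:
        self._in_flight = set()

    def start(self, pid: int, mid: int) -> None:
        if (pid, mid) in self._in_flight:
            raise ValueError("PID/MID pair already used by a concurrent request")
        self._in_flight.add((pid, mid))

    def finish(self, pid: int, mid: int) -> None:
        # Once the response has arrived the pair may be reused.
        self._in_flight.discard((pid, mid))
```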

5.3.2.  Unsolicited Bit


     For CIFS operations that return data from the server, the unsoli-
     cited bit SHOULD be set to zero.

     For CIFS operations that send data from the client, the unsolicited
     bit SHOULD be set to one.


5.3.3.  Message Aligned (A) bit

     The message aligned bit, when used on a CIFS connection, indicates
     the start of a NetBIOS message at byte 0 of a payload. For the
     purposes of this specification, the start of a NetBIOS message is
     the four byte length field that is defined for the tunneling of
     NetBIOS over TCP.


5.4.  SCSI


     The SCSI Architecture model [SAM, SAM2] lays out the requirements
     for SCSI transports. [SCSI/TCP] is just such a transport.  The
     [SCSI/TCP] document defines the RID structure for SCSI.


6.  Security considerations


     The RDMA option potentially leaks information about an encrypted
     TCP stream. The presence or absence of the option, the size and
     position of the RDMA, and the RID may all leak information to a
     passive listener.

     The TCP RDMA option is not protected by SSL or TLS, which only pro-
     tect the TCP payload. It is, however, protected by the IPsec AH and
     ESP headers.


6.1.  Receiver security considerations


     A malicious sender may attempt an RDMA transfer larger than the
     receiving DMA buffer. A secure receiver MUST do bounds checking on
     the offsets to avoid buffer overruns.

     When mapping from RIDs to buffers, a receiver should take into
     account the TCP connection to decrease the opportunity for mali-
     cious senders to interfere with RDMAs taking place on other connec-
     tions.

     Some receivers may set aside buffers for unsolicited transfers. A
     malicious sender can monopolize those buffers, potentially causing
     performance degradation to the rest of the system, by doing a
     series of small, unsolicited transfers. The receiver may wish to
     place quotas on the size and number of outstanding unsolicited
     transfers on a single connection.
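
     The three precautions above (bounds checking, per-connection RID
     lookup, and quotas on unsolicited transfers) can be combined in
     one receiver-side table. The following is a sketch under stated
     assumptions: the class and method names are hypothetical, a
     connection is represented by any hashable key, and the default
     quota values are arbitrary.

```python
class RdmaReceiver:
    """Maps (connection, RID) pairs to buffers and polices their use."""

    def __init__(self, max_unsolicited: int = 4,
                 max_unsolicited_bytes: int = 65536) -> None:
        self.max_unsolicited = max_unsolicited
        self.max_unsolicited_bytes = max_unsolicited_bytes
        self._buffers = {}       # (connection, rid) -> bytearray
        self._unsolicited = {}   # connection -> outstanding unsolicited count

    def expect(self, connection, rid: int, size: int) -> None:
        """Register a buffer for a transfer this receiver asked for."""
        self._buffers[(connection, rid)] = bytearray(size)

    def accept_unsolicited(self, connection, rid: int,
                           total_length: int) -> bool:
        """Allocate a buffer for an unsolicited transfer if quotas allow."""
        if (connection, rid) in self._buffers:
            return False
        if total_length > self.max_unsolicited_bytes:
            return False
        count = self._unsolicited.get(connection, 0)
        if count >= self.max_unsolicited:
            return False
        self._unsolicited[connection] = count + 1
        self._buffers[(connection, rid)] = bytearray(total_length)
        return True

    def write(self, connection, rid: int, buffer_offset: int,
              data: bytes) -> bool:
        """Place data, refusing unknown RIDs and out-of-bounds ranges."""
        buffer = self._buffers.get((connection, rid))
        if buffer is None:
            return False         # RID unknown on this connection
        if buffer_offset + len(data) > len(buffer):
            return False         # would overrun the buffer
        buffer[buffer_offset:buffer_offset + len(data)] = data
        return True
```

     Keying the table on the connection as well as the RID means that a
     sender on one connection cannot write into a buffer registered for
     another connection, even if it guesses the RID.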


7.  Authors' Addresses



     Constantine Sapuntzakis
     Cisco Systems, Inc.
     170 W. Tasman Drive
     San Jose, CA 95134
     USA

     Phone: +1 408 525 5497
     Email: csapuntz@cisco.com

     David Cheriton
     Cisco Systems, Inc.
     170 W. Tasman Drive
     San Jose, CA 95134
     USA

     Phone: +1 408 527 8207
     Email: cheriton@cisco.com



8.  References

     [CIFS] Leach, P., "A Common Internet File System (CIFS/1.0)
     Protocol Preliminary Draft",
     http://www.cifs.com/specs/draft-leach-cifs-v1-spec-01.txt,
     December 1997

     [HTTP] Fielding, R., et al., "Hypertext Transfer Protocol -
     HTTP/1.1", RFC 2616, June 1999

     [NFSv3] Callaghan, B., "NFS Version 3 Protocol Specification", RFC
     1813, June 1995

     [RPC] Srinivasan, R., "RPC: Remote Procedure Call Protocol
     Specification Version 2", RFC 1831, August 1995

     [SCSI/TCP] Satran, J., et al., "SCSI/TCP",
     ftp://ftp.ietf.org/internet-drafts/draft-satran-scot-00.txt

     [SAM] "SCSI-3 Architecture Model", ANSI X3.270:1996,
     http://www.t10.org/

     [SAM2] "SCSI Architecture Model - 2 Draft", ANSI T10/1157-D,
     http://www.t10.org/

     [TCP]  Postel, J., "Transmission Control Protocol - DARPA Internet
     Program Protocol Specification", RFC 793, September 1981