INTERNET-DRAFT C. Sapuntzakis
Expires July 2000 Cisco Systems
D. Cheriton
Cisco Systems
February 2000
TCP RDMA option
draft-csapuntz-tcprdma-00.txt
Status of this Memo
This document is an Internet-Draft and is NOT offered in accordance
with Section 10 of RFC2026, and the author does not provide the
IETF with any rights other than to publish as an Internet-Draft.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other docu-
ments at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in pro-
gress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) Cisco Systems (1999-2000). All Rights Reserved.
Abstract
The TCP option introduced in this draft reduces the overhead of
receiving data with TCP-based protocols such as NFS and HTTP. It
enables the construction of a simple hardware accelerator that
copies data directly from the incoming packet into application
buffers, avoiding expensive copies in the protocol stack. Even
without hardware acceleration, the option enables the protocol
stack to decrease the number of copies it must do.
Sapuntzakis, Cheriton [Page 1]
Internet-Draft TCP RDMA option 22 February 2000
The TCP RDMA option is an annotation and requires no modifications
to overlying protocols. It can be used with popular protocols such
as HTTP, NFS, and CIFS, along with new protocols.
The TCP option also provides a bit to indicate application-level
message boundaries. The bit enables out-of-order processing of the
TCP receive queue, potentially decreasing service times in the
presence of packet drops and improving performance on parallel sys-
tems.
Table of Contents
1. Glossary
2. Introduction
3. RDMA option
3.1. Usage
3.1.1. RID
3.1.2. Data Offset
3.1.3. Data Length
3.1.4. Total RDMA Length
3.1.5. Buffer Offset
3.1.6. Message Aligned (A) bit
3.1.7. Unsolicited (U) bit
3.1.8. Other constraints
3.2. Negotiating use of the option
3.3. Multiple options
3.4. Interactions with TCP congestion control
4. Examples
5. RID Formats
5.1. NFS
5.1.1. NFS RID Format
5.1.1.1. RPC XID
5.1.1.2. Operation
5.1.1.3. Zeroes
5.1.2. READ RPC replies
5.1.3. WRITE RPC requests
5.1.4. Message Aligned (A) bit
5.2. HTTP
5.2.1. RID format
5.2.2. GET responses
5.2.3. POST or PUT requests
5.3. Common Internet File System (CIFS)
5.3.1. Tag Format
5.3.1.1. Pid and Mid
5.3.1.2. Operation Index
5.3.1.3. Zeros
5.3.2. Unsolicited Bit
5.3.3. Message Aligned (A) bit
5.4. SCSI
6. Security considerations
6.1. Receiver security considerations
7. Authors' Addresses
8. References
1. Glossary
remote DMA (RDMA) - the transfer of application data from a remote
buffer into a contiguous, usually aligned, local buffer
RDMA data - the application data being transferred via RDMA
unsolicited data - data that a receiver did not request
2. Introduction
Currently, doing remote DMA (RDMA) between processors over
TCP-based protocols such as HTTP and NFS requires considerable
processing on the client and server machines, especially at speeds
of a gigabit per second or higher.
To see where this overhead comes from, it is instructive to look at
an example.
Consider the problem of an 8 kilobyte NFS transfer coming in from
an Ethernet and eventually ending up in an application's memory.
Ethernet's MTU is around 1500 bytes so the sender sends at least 6
packets across the Ethernet.
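As a rough check of the packet count above, the following sketch
assumes a 1500-byte MTU and 20-byte IPv4 and TCP headers without
options. These header sizes and the function name are assumptions
made for illustration, not figures taken from this memo.

```python
import math

MTU = 1500          # assumed Ethernet MTU in bytes
IP_HEADER = 20      # assumed IPv4 header without options
TCP_HEADER = 20     # assumed TCP header without options

def packets_needed(transfer_bytes, tcp_option_bytes=0):
    """Return the minimum number of segments needed for a transfer."""
    payload = MTU - IP_HEADER - TCP_HEADER - tcp_option_bytes
    return math.ceil(transfer_bytes / payload)

# 8 kilobytes of file data, ignoring NFS and RPC headers.
print(packets_needed(8 * 1024))
# The same transfer when each segment also carries 20 option bytes.
print(packets_needed(8 * 1024, tcp_option_bytes=20))
```

Under these assumptions both cases need six segments, which agrees
with the "at least 6 packets" estimate.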
At the receiver, the six packets arrive at the network interface.
For each of the six packets, the network interface card on the
receiver copies the entire packet to the host's memory. The network
interface notifies the host software of the arrival of the packets.
The host software then does IP and TCP processing, which eventually
results in the software copying the TCP payload into a TCP receive
buffer.
NFS parses the data in the TCP receive buffer to find the file
pages. NFS copies the file pages to the buffer cache. Once in the
buffer cache, the operating system maps the pages into the
application's address space.
These memory-to-memory copies cost valuable main memory bandwidth
at clients and servers. To improve performance, it is necessary to
reduce the number of such copies. One way to do this is to have the
network interface card write the file data into the final location
(e.g. the buffer cache) the first time. This requires that the net-
work interface card recognize file data in incoming packets.
For NFS and HTTP, the problem of recognizing file data involves
parsing the protocol headers. This is complex and does not lend
itself to a simple hardware realization.
This memo defines a new TCP option, the RDMA option, which circum-
vents the parsing of complex protocol headers. The sender places
the option on TCP segments containing RDMA data. The RDMA option
describes to the receiver the location of the RDMA data in the TCP
payload.
An RDMA identifier (RID) in the option allows multiple outstanding
RDMA transfers on a TCP connection by allowing the sender and
receiver to uniquely tag the RDMAs. The layout of the RID depends
on the specific higher layer protocol (e.g. NFS).
The TCP RDMA option is an annotation and requires no modifications
to overlying protocols.
This memo specifies the RDMA option in detail in section 3. The use
of this option with NFS, HTTP, CIFS, and SCSI is specified in
section 5.
3. RDMA option
3.1. Usage
Kind: 25 (decimal)
Length: 2 or 4 or 16 or 20 bytes
Byte/ 0 | 1 | 2 | 3 |
/ | | | |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+-+-+---------------------------+
0 | 25 | Length |A|U| RDMA ID |
+---------------+---------------+-+-+---------------------------+
4 | RDMA ID (RID) |
+---------------------------------------------------------------+
8 | Buffer Offset |
+-------------------------------+-------------------------------+
12| Data Offset | Data Length |
+-------------------------------+-------------------------------+
16| Total RDMA Length |
+---------------------------------------------------------------+
3.1.1. RID
All segments in a single RDMA transfer carry the same 46-bit RDMA
ID (RID).
The RID is an application-level identifier that the receiver can
use to map the transfer to an application buffer. The exact value
of the RID depends on the overlying protocol. RID formats for
several popular protocols are given in section 5.
The RDMA ID is stored in network byte order. That is, bits 40-45 of
the RID get placed in bits 0-5 of byte 2. Bits 0-7 of the RID get
placed in bits 0-7 of byte 7.
3.1.2. Data Offset
The data offset specifies the number of bytes from the beginning of
the TCP payload to the RDMA transfer data.
The data offset MUST NOT exceed the length of the TCP payload.
3.1.3. Data Length
The data length specifies the number of bytes of RDMA transfer data
in this segment, starting at the data offset.
The data length MUST NOT cause the option to describe bytes past
the end of the TCP segment.
A data length of zero is valid.
3.1.4. Total RDMA Length
The total RDMA length is the number of bytes that will be
transferred using this RID. If the sender does not know the length
of the RDMA when the segment is sent, the sender should send the 16
byte version of this option, leaving the total RDMA length field
off.
The total RDMA length, when present, MUST be the same for all seg-
ments in the RDMA transfer.
A total RDMA length of zero is valid.
3.1.5. Buffer Offset
If this RDMA transfer is going into a separate buffer on the
receiver, the buffer offset field specifies the offset in that
buffer. At that offset, the receiver should write the RDMA data
demarcated by the data offset and data length fields.
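The way the data offset, data length and buffer offset fields work
together on the receive side can be sketched as follows. This is a
minimal illustration, not a definitive implementation: the function
name is hypothetical, and the check against the size of the receive
buffer is a precaution added here.

```python
def place_segment(buffer, payload, buffer_offset, data_offset, data_length):
    """Copy one segment's RDMA data into `buffer` at `buffer_offset`."""
    # The option must not describe bytes outside the TCP payload.
    if data_offset > len(payload) or data_offset + data_length > len(payload):
        raise ValueError("option describes bytes outside the TCP payload")
    # Precaution: never write past the end of the receive buffer.
    if buffer_offset + data_length > len(buffer):
        raise ValueError("RDMA data would overrun the receive buffer")
    data = payload[data_offset:data_offset + data_length]
    buffer[buffer_offset:buffer_offset + data_length] = data

# Two segments of one transfer, processed out of order. The first
# payload processed is all data; the second starts with a 3-byte
# application header that the data offset skips.
buf = bytearray(8)
place_segment(buf, b"EFGH", buffer_offset=4, data_offset=0, data_length=4)
place_segment(buf, b"hdrABCD", buffer_offset=0, data_offset=3, data_length=4)
print(buf)
```

Because every segment carries its own buffer offset, the receiver
can place the data without waiting for earlier segments.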
3.1.6. Message Aligned (A) bit
The message aligned bit, when 1, indicates that byte 0 of the TCP
payload corresponds to the start of a new application-layer mes-
sage.
The message aligned (A) bit is bit 7 of byte 2.
The four byte version of the option may be sent if the sender
wishes to communicate only the message aligned state.
3.1.7. Unsolicited (U) bit
In NFS and other RPC-based protocols, transfers from the server to
the client (e.g. reads) occur in response to an explicit request by
the client. The explicit request by the client indicates
that the client has an allocated buffer waiting for the data from
the transfer (or at least has had the opportunity to do so). The
client can use the explicit request to communicate some identifier
to the server that the server places in the response. In the
response, that identifier, embedded in the RID, can be used to
associate the data with a client buffer.
However, transfers from the client to the server (e.g. writes)
are often carried in the request itself. There is usually no
opportunity in
these protocols for a client to obtain any kind of identifier for
the server's application buffer. Indeed, the server may not even
have an application buffer allocated for the client request.
To indicate this special situation, the unsolicited bit is used.
The unsolicited bit (U) is bit 6 of byte 2.
The sender SHOULD set the unsolicited bit (U) to one if the RID is
not expected by the receiver.
3.1.8. Other constraints
The RDMA option MUST appear on every segment containing data that
is part of the RDMA transfer.
The sender MUST align the RDMA option on a 4 octet boundary rela-
tive to the TCP header.
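The option layout in section 3.1 can be sketched as a pair of
encode and decode routines. This is an illustration under stated
assumptions: the function names and the dictionary of decoded fields
are hypothetical, and for the four byte version only the A and U
bits are decoded.

```python
import struct

RDMA_KIND = 25
RID_MASK = (1 << 46) - 1

def build_option(rid, buffer_offset, data_offset, data_length,
                 total_length=None, aligned=False, unsolicited=False):
    """Build the 16-byte option, or the 20-byte one if total_length is given."""
    if not 0 <= rid <= RID_MASK:
        raise ValueError("RID must fit in 46 bits")
    # Bytes 2..7: A bit, U bit, then the 46-bit RID in network byte order.
    flags_and_rid = (int(aligned) << 47) | (int(unsolicited) << 46) | rid
    body = flags_and_rid.to_bytes(6, "big")
    # Bytes 8..15: 32-bit buffer offset, 16-bit data offset and length.
    body += struct.pack("!IHH", buffer_offset, data_offset, data_length)
    if total_length is not None:
        body += struct.pack("!I", total_length)   # bytes 16..19
    return bytes([RDMA_KIND, 2 + len(body)]) + body

def parse_option(option):
    """Decode an RDMA option of length 2, 4, 16 or 20 into a dict."""
    if len(option) < 2 or option[0] != RDMA_KIND:
        raise ValueError("not an RDMA option")
    length = option[1]
    if length not in (2, 4, 16, 20) or len(option) < length:
        raise ValueError("bad RDMA option length")
    fields = {"length": length}
    if length >= 4:
        fields["aligned"] = bool(option[2] & 0x80)      # bit 7 of byte 2
        fields["unsolicited"] = bool(option[2] & 0x40)  # bit 6 of byte 2
    if length >= 16:
        fields["rid"] = int.from_bytes(option[2:8], "big") & RID_MASK
        (fields["buffer_offset"], fields["data_offset"],
         fields["data_length"]) = struct.unpack("!IHH", option[8:16])
    if length == 20:
        (fields["total_length"],) = struct.unpack("!I", option[16:20])
    return fields

option = build_option(rid=0x02DEADBEEF, buffer_offset=900, data_offset=0,
                      data_length=1000, total_length=4000, unsolicited=True)
print(option.hex())
print(parse_option(option))
```

Omitting total_length yields the 16 byte version that a sender uses
when it does not yet know the length of the transfer.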
3.2. Negotiating use of the option
For the purpose of options negotiation, the length field MAY be set
to 2 to prevent any accidental RDMA transfers.
3.3. Multiple options
Correct implementations MAY look at only the first RDMA option in a
segment. The TCP segments MUST conform to the rules laid out in
section 3 when all RDMA options but the first in the segment are
stripped. The most important of these requirements is that the RDMA
option MUST appear on every segment that contains data that is part
of the RDMA transfer.
3.4. Interactions with TCP congestion control
The RDMA option may cause the sender to send segments smaller than
the maximum segment size (MSS). This may slow the opening of
congestion windows on systems that open them based on the number of
MSS-sized segments received.
4. Examples
The figure below is a representation of a TCP stream. It has a
single RDMA transfer that occupies two contiguous sections of the
TCP stream (section 1 and section 2).
Sequence number
+----------------+ 0
| Header |
| |
| |
+----------------+ 100
| Transfer |
| Section 1 |
/ /
/ /
| |
+----------------+ 2100
| Trailer |
+----------------+ 2200
| Header |
| |
+----------------+ 2300
| Transfer |
| Section 2 |
/ /
/ /
| |
+----------------+ 4300
The table below illustrates how this section of the TCP stream will
be turned into 6 TCP segments with the RDMA option. The TCP maximum
segment size for this stream is 1000 bytes.
The sequence number comes from the TCP header.
+------------------------------------------------------+
| Segment | Sequence | Buffer | Data | Data |
| Number | Number | Offset | Offset | Length |
| | | | | |
+------------------------------------------------------+
| 1 | 0 | 0 | 100 | 900 |
| 2 | 1000 | 900 | 0 | 1000 |
| 3 | 2000 | 1900 | 0 | 100 |
| 4 | 2200 | 2000 | 100 | 900 |
| 5 | 3200 | 2900 | 0 | 1000 |
| 6 | 4200 | 3900 | 0 | 100 |
+------------------------------------------------------+
Segment #3 is only 200 bytes, part data and part trailer. If the
next header had been available to the TCP stack at the time, the
stack could have sent it as part of segment #3. Below is such a
segmentation.
+------------------------------------------------------+
| Segment | Sequence | Buffer | Data | Data |
| Number | Number | Offset | Offset | Length |
| | | | | |
+------------------------------------------------------+
| 1 | 0 | 0 | 100 | 900 |
| 2 | 1000 | 900 | 0 | 1000 |
| 3 | 2000 | 1900 | 0 | 100 |
| 4 | 2300 | 2000 | 0 | 1000 |
| 5 | 3300 | 3000 | 0 | 1000 |
+------------------------------------------------------+
Note: not putting application headers at the front of a TCP segment
may cause decreased performance with some receivers.
In either segmentation, segment 3 cannot include any of Transfer
Section 2, since a single RDMA option can describe only one
contiguous range of RDMA data per segment. Thus, segment 3 will
always be less than MSS, even if the stack has more to send.
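The two segmentations in this section can be reproduced with a small
greedy routine. This is a sketch under stated assumptions, not a
description of any particular TCP stack: the function name, the
region list, and the cut_points parameter (used here to model the
second header not yet being available at sequence number 2200) are
all hypothetical.

```python
def segment_stream(regions, mss, cut_points=()):
    """Split a stream into segments that each describe one RDMA range.

    regions    -- list of (kind, length); kind is "rdma" or "other"
    cut_points -- stream positions where a new segment is forced
    Returns rows of (sequence, buffer_offset, data_offset, data_length).
    """
    rows = []
    seg = None      # segment currently being filled
    pos = 0         # position in the TCP stream (sequence number)
    buf_off = 0     # RDMA bytes emitted so far (buffer offset)

    def flush():
        nonlocal seg
        if seg is not None:
            rows.append((seg["seq"], seg["buf"], seg["off"], seg["dlen"]))
            seg = None

    for kind, length in regions:
        while length > 0:
            # Start a new segment when the current one is full, when a
            # cut is forced, or when a second RDMA range would begin.
            if seg is not None and (
                seg["size"] == mss
                or pos in cut_points
                or (kind == "rdma" and seg["closed"])
            ):
                flush()
            if seg is None:
                seg = {"seq": pos, "size": 0, "buf": None, "off": None,
                       "dlen": 0, "closed": False}
            take = min(length, mss - seg["size"])
            later_cuts = [c for c in cut_points if c > pos]
            if later_cuts:
                take = min(take, min(later_cuts) - pos)
            if kind == "rdma":
                if seg["off"] is None:
                    seg["off"], seg["buf"] = seg["size"], buf_off
                seg["dlen"] += take
                buf_off += take
            elif seg["off"] is not None:
                seg["closed"] = True    # the RDMA range has ended
            seg["size"] += take
            pos += take
            length -= take
    flush()
    return rows

# Header, Transfer Section 1, Trailer, Header, Transfer Section 2.
stream = [("other", 100), ("rdma", 2000), ("other", 100),
          ("other", 100), ("rdma", 2000)]
first_table = segment_stream(stream, mss=1000, cut_points=(2200,))
second_table = segment_stream(stream, mss=1000)
for row in first_table + second_table:
    print(row)
```

With the cut at 2200 the routine yields the six segments of the
first table; without it, the five segments of the second.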
5. RID Formats
5.1. NFS
In NFS, file pages are transferred using the NFS READ and WRITE
RPCs. When issuing a READ, the NFS client presumably has an appli-
cation buffer (e.g. block cache buffer) waiting to absorb it. When
receiving a WRITE, the NFS server may not have a waiting applica-
tion buffer to absorb the write.
5.1.1. NFS RID Format
RID format for NFS protocol:
4 4 3 3 3
5 0 9 2 1 0
+---------+----------+--------------------------+
| Zero | Operation| RPC XID |
+---------+----------+--------------------------+
5.1.1.1. RPC XID
The NFS protocols work on top of ONC RPC which associates with each
RPC a 32-bit transaction ID (XID).
5.1.1.2. Operation
NFS version 4 allows multiple read and write "operations" per RPC.
These operations share the same XID since they are part of the same
RPC. To disambiguate the RDMAs resulting from these operations, the
RID contains an Operation Index in bits 32-39 of the RID. The
operation index is zero for the first operation, one for the
second, and so on.
Note that the operation index is independent of whether the opera-
tion results in an RDMA. If only the third operation in an RPC
results in an RDMA, then the RID for that RDMA will have a 2 in the
operation index field.
The operation index MUST be zero for NFS versions 2 and 3.
5.1.1.3. Zeroes
Bits 40-45 MUST be set to zero by the sender and received as zeros
by the receiver.
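A minimal sketch of composing and splitting the NFS RID, assuming
the layout in section 5.1.1. The function names are illustrative,
and rejecting a RID whose bits 40-45 are nonzero is one possible
reading of the requirement above, not behavior this memo spells out.

```python
def nfs_rid(xid, operation_index=0):
    """Combine a 32-bit RPC XID and an 8-bit operation index into a RID."""
    if not 0 <= xid < (1 << 32):
        raise ValueError("XID must fit in 32 bits")
    if not 0 <= operation_index < (1 << 8):
        raise ValueError("operation index must fit in 8 bits")
    # Bits 0-31: XID. Bits 32-39: operation index. Bits 40-45: zero.
    return (operation_index << 32) | xid

def split_nfs_rid(rid):
    """Return (xid, operation_index) for an NFS RID."""
    if rid >> 40:
        raise ValueError("bits 40-45 of an NFS RID must be zero")
    return rid & 0xFFFFFFFF, (rid >> 32) & 0xFF

# Third operation of an NFS version 4 RPC: operation index 2.
rid = nfs_rid(0xDEADBEEF, operation_index=2)
print(hex(rid))
print(split_nfs_rid(rid))
```

For NFS versions 2 and 3 the caller would leave operation_index at
its default of zero.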
5.1.2. READ RPC replies
For the file pages in NFS READ responses, the server MUST NOT set
the unsolicited bit to 1.
If the READ RPC fails and no data is returned, the server SHOULD
indicate a zero-length RDMA transfer.
5.1.3. WRITE RPC requests
For NFS WRITE calls, the client SHOULD set the unsolicited bit to
one, since the server is not expecting the WRITE.
5.1.4. Message Aligned (A) bit
The message aligned bit, when used on an NFS connection, indicates
the start of an ONC RPC message at byte 0 of a payload. For the
purposes of this specification, the start of an ONC RPC message is
the four byte length field that is defined for the tunneling of RPC
over TCP.
5.2. HTTP
5.2.1. RID format
4 3 3
5 2 1 0
+-------------------+-------------------------+
| Zero | Request idx |
+-------------------+-------------------------+
On an HTTP/1.1 connection, the server sends back responses in the
order it received requests. Thus, the index of the request, where
the first request is index 0, is sufficient to disambiguate the
RDMAs.
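One way a client might use the request index is sketched below. The
class and method names are hypothetical, and refusing to allocate an
index that no longer fits in 32 bits is an assumption, since the
text above does not say what happens at that point.

```python
class HttpRdmaClient:
    """Tracks receive buffers by HTTP request index (hypothetical helper)."""

    def __init__(self):
        self._next_index = 0     # the first request gets index 0
        self._buffers = {}       # request index (RID) -> receive buffer

    def register_request(self, buffer):
        """Record the buffer for the next request and return its RID."""
        if self._next_index >= 1 << 32:
            raise OverflowError("request index no longer fits in 32 bits")
        rid = self._next_index
        self._buffers[rid] = buffer
        self._next_index += 1
        return rid

    def buffer_for(self, rid):
        """Return the buffer registered for this RID, or None."""
        return self._buffers.get(rid)

client = HttpRdmaClient()
first = client.register_request(bytearray(4096))
second = client.register_request(bytearray(4096))
print(first, second)
```

Since responses come back in request order, the RID carried in a
response segment selects the buffer registered for that request.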
5.2.2. GET responses
The unsolicited bit SHOULD be set to zero.
Note, the HTTP server may not know the length of the response, so
clients should be prepared to receive the 16 byte option.
5.2.3. POST or PUT requests
In POST or PUT requests, the client sends data to the server. The
unsolicited bit SHOULD be set to one.
5.3. Common Internet File System (CIFS)
The Common Internet File System (CIFS) is built on top of an RPC
system known as Server Message Block (SMB).
5.3.1. Tag Format
4 4 3 3 3 1 1
5 0 9 2 1 6 5 0
+-------+------------+-----------+-------------+
| Zero | Operation | PID | MID |
+-------+------------+-----------+-------------+
5.3.1.1. Pid and Mid
In SMB, a request is uniquely identified by a 64-bit quantity that
includes four 16-bit fields: Tree Id, User Id, Process Id (PID), and
Multiplex Id (MID).
There is insufficient room in the RID to include all four
fields. However, the PID and MID originate from the client and are
uninterpreted by the server. The client can assign PIDs and MIDs so
as to disambiguate concurrent requests. Thus, a CIFS client using
the RDMA option MUST ensure that two concurrent SMB requests do not
share the same PID and MID fields.
5.3.1.2. Operation Index
CIFS supports compound requests that can result in multiple
transfers per SMB. The operation index, in bits 32-39, corresponds
to the index of the operation in the SMB that caused the RDMA. The
first operation is given index zero and so on. Operations are logi-
cally assigned indexes whether or not they cause an RDMA.
5.3.1.3. Zeros
Bits 40-45 MUST be set to zero by the sender and received as zeros
by the receiver.
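A minimal sketch of the CIFS RID layout in section 5.3.1. The
function names are illustrative, and rejecting a RID whose bits
40-45 are nonzero is an assumption about receiver behavior.

```python
def cifs_rid(pid, mid, operation_index=0):
    """Combine SMB PID, MID and an operation index into a RID."""
    if not 0 <= pid < (1 << 16) or not 0 <= mid < (1 << 16):
        raise ValueError("PID and MID must each fit in 16 bits")
    if not 0 <= operation_index < (1 << 8):
        raise ValueError("operation index must fit in 8 bits")
    # Bits 0-15: MID. Bits 16-31: PID. Bits 32-39: operation index.
    return (operation_index << 32) | (pid << 16) | mid

def split_cifs_rid(rid):
    """Return (pid, mid, operation_index) for a CIFS RID."""
    if rid >> 40:
        raise ValueError("bits 40-45 of a CIFS RID must be zero")
    return (rid >> 16) & 0xFFFF, rid & 0xFFFF, (rid >> 32) & 0xFF

# Second operation of a compound request: operation index 1.
rid = cifs_rid(pid=0x1234, mid=0x5678, operation_index=1)
print(hex(rid))
print(split_cifs_rid(rid))
```

Two concurrent requests with the same PID and MID would produce the
same RID for a given operation index, which is why the client must
keep the pair unique.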
5.3.2. Unsolicited Bit
For CIFS operations that return data from the server, the unsoli-
cited bit SHOULD be set to zero.
For CIFS operations that send data from the client, the unsolicited
bit SHOULD be set to one.
5.3.3. Message Aligned (A) bit
The message aligned bit, when used on a CIFS connection, indicates
the start of a NetBIOS message at byte 0 of a payload. For the
purposes of this specification, the start of a NetBIOS message is the
four byte length field that is defined for the tunneling of NetBIOS
over TCP.
5.4. SCSI
The SCSI Architecture model [SAM, SAM2] lays out the requirements
for SCSI transports. [SCSI/TCP] is just such a transport. The
[SCSI/TCP] document defines the RID structure for SCSI.
6. Security considerations
The RDMA option potentially leaks information about an encrypted
TCP stream. The presence or absence of the option, the size and
position of the RDMA, and the RID may all leak information to a
passive listener.
The TCP RDMA option is not protected by SSL or TLS, which only pro-
tect the TCP payload. It is, however, protected by the IPsec AH and
ESP headers.
6.1. Receiver security considerations
A malicious sender may attempt an RDMA transfer larger than the
receiving DMA buffer. A secure receiver MUST do bounds checking on
the offsets to avoid buffer overruns.
When mapping from RIDs to buffers, a receiver should take into
account the TCP connection to decrease the opportunity for mali-
cious senders to interfere with RDMAs taking place on other connec-
tions.
Some receivers may set aside buffers for unsolicited transfers. A
malicious sender can monopolize those buffers, potentially causing
performance degradation to the rest of the system, by doing a
series of small, unsolicited transfers. The receiver may wish to
place quotas on the size and number of outstanding unsolicited
transfers on a single connection.
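The quota suggested above might be kept per connection along the
following lines. This is a sketch only: the class name, the method
names and the specific limits are hypothetical, and a real receiver
may account for unsolicited transfers differently.

```python
class UnsolicitedQuota:
    """Per-connection limits on outstanding unsolicited transfers."""

    def __init__(self, max_transfers, max_bytes):
        self.max_transfers = max_transfers
        self.max_bytes = max_bytes
        self._outstanding = {}   # connection -> {rid: reserved bytes}

    def admit(self, connection, rid, size):
        """Reserve buffer space for a transfer; False means over quota."""
        transfers = self._outstanding.setdefault(connection, {})
        if rid in transfers:
            return True          # already admitted on this connection
        if len(transfers) >= self.max_transfers:
            return False
        if sum(transfers.values()) + size > self.max_bytes:
            return False
        transfers[rid] = size
        return True

    def complete(self, connection, rid):
        """Release the reservation once the transfer is finished."""
        self._outstanding.get(connection, {}).pop(rid, None)

quota = UnsolicitedQuota(max_transfers=2, max_bytes=100)
print(quota.admit("conn-1", rid=1, size=60))   # within both limits
print(quota.admit("conn-1", rid=2, size=60))   # would exceed max_bytes
print(quota.admit("conn-2", rid=1, size=60))   # other connection, own quota
```

Keying the table by connection as well as by RID also keeps a sender
on one connection from consuming the quota of another.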
7. Authors' Addresses
Constantine Sapuntzakis
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA
Phone: +1 408 525 5497
Email: csapuntz@cisco.com
David Cheriton
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA
Phone: +1 408 527 8207
Email: cheriton@cisco.com
8. References
[CIFS] Leach, P., "A Common Internet File System (CIFS/1.0) Proto-
col Preliminary Draft", http://www.cifs.com/specs/draft-leach-
cifs-v1-spec-01.txt, December 1997
[HTTP] Gettys, J., et al., "Hypertext Transfer Protocol -
HTTP/1.1", RFC 2616, June 1999
[NFSv3] Callaghan, B., "NFS Version 3 Protocol Specification", RFC
1813, June 1995
[RPC] Srinivasan, R., "RPC: Remote Procedure Call Protocol Specifi-
cation Version 2", RFC 1831, August 1995
[SCSI/TCP] Satran, J., et al., "SCSI/TCP",
ftp://ftp.ietf.org/internet-drafts/draft-satran-scot-00.txt
[SAM] "SCSI-3 Architecture Model", ANSI X3.270:1996,
http://www.t10.org/
[SAM2] "SCSI Architecture Model - 2 Draft", ANSI T10/1157-D,
http://www.t10.org/
[TCP] Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981