Internet DRAFT - draft-demizu-tcpm-paws-reordering
draft-demizu-tcpm-paws-reordering
Network Working Group Noritoshi Demizu
INTERNET-DRAFT NICT
Expires: April 25, 2005 October 25, 2004
A Modification to Make PAWS Robust to Segment Reordering
<draft-demizu-tcpm-paws-reordering-01.txt>
Status of this Memo
By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed,
or will be disclosed, and any of which I become aware will be
disclosed, in accordance with RFC 3668.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than a "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract
The purpose of PAWS (Protect Against Wrapped Sequence numbers),
defined in RFC1323, is to protect a TCP connection against old
duplicate segments from the same connection. There is a possibility,
however, that PAWS may discard valid reordered segments. This memo
proposes a modification to PAWS to solve this problem.
Demizu Expires April 2005 [Page 1]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
1. Introduction
The purpose of PAWS (Protect Against Wrapped Sequence numbers)
[RFC1323] is to protect a TCP connection against old duplicate
segments from the same connection. The PAWS mechanism uses the TCP
Timestamps option [RFC1323] to detect old duplicate segments.
PAWS could, however, falsely discard valid segments that are delayed
due to reordering. For example, if a retransmitted segment sent by
Fast Retransmit [RFC2581] is reordered and arrives at the receiver
earlier than some delayed segments sent prior to the retransmitted
segment, and if the timestamp on the retransmitted segment is newer
than the timestamps on the delayed segments, the delayed segments
will be discarded by PAWS at the receiver.
In the case where Limited Transmit [RFC3042] is used as well, a
retransmitted segment sent by Fast Retransmit might follow two new
data segments sent by Limited Transmit in a row. In this case, the
problem described in the previous paragraph would be more likely to
occur.
In the case where NewReno [RFC3782] is used, the problem could occur
if a retransmitted segment sent by NewReno is reordered and arrives
at the receiver earlier than some new data segments sent before the
retransmitted segment.
[Pax97][BPS99] and [ZM04] show that segment reordering is not a rare
event. If valid segments were falsely discarded by PAWS due to
reordering, there would be negative effects on TCP performance.
This memo proposes a modification to PAWS to solve this problem.
The memo is organized as follows. Section 2 describes the problem by
showing examples. Section 3 describes basic idea. And section 4
proposes a solution. Appendix A compares possible methods of solving
the problem.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in [RFC2119].
2. Problem Description
There is a possibility that valid segments could be discarded by PAWS
when those segments are delayed because of reordering. This section
shows some examples of this problem, then describes a generic
scenario and some possible negative effects.
Demizu Expires April 2005 [Page 2]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
2.1 Example 1: Reordering and Fast Retransmit with Limited Transmit
In this example, suppose TCP A is sending data to TCP B. Assume that
TCP A supports the TCP Timestamps option [RFC1323], TCP Congestion
Control [RFC2581], and Limited Transmit [RFC3042], and that TCP B
supports the TCP Timestamps option with PAWS [RFC1323].
Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by
TCP A, where the letter indicates the sequence number and the digit
represents the timestamp (TSval). In this data segment sequence,
suppose W.1 and X.2 are sent in the Congestion Avoidance phase, Y.3
and Z.4 are sent by Limited Transmit, and A.5 is sent by Fast
Retransmit.
Figure 1 illustrates the data segment sequence observed at TCP A.
The x-axis represents time, and the y-axis represents the sequence
number. W.1 through Z.4 and A.5 indicate the data segments sent.
Each 'o' mark indicates a received ACK segment. Lines are drawn with
symbol characters between data segments and between ACK segments.
Sequence number
A Z.4
| Y.3~~ \
| X.2~~ \
| W.1~~ \
| ~~ \
| A.5
| o____o____o____o
| o~~~~ 1 2 3!! <-- dup-ACK count
| o~~~~
+--------------------------------> Time
Figure 1: Time vs. sequence number at TCP A
Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent
by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5
are exchanged) on the path to TCP B. Figure 2 illustrates the
resulting data segment sequence observed at TCP B.
What happens at TCP B is described below.
0. Assume TS.Recent is valid and TS.Recent == 0.
Assume RCV.NXT == A.
1. W.1 is received. PAWS accepts it because TS.Recent < 1.
TS.Recent is not updated because RCV.NXT < W.
Demizu Expires April 2005 [Page 3]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
2. X.2 is received. PAWS accepts it because TS.Recent < 2.
TS.Recent is not updated because RCV.NXT < X.
3. Y.3 is received. PAWS accepts it because TS.Recent < 3.
TS.Recent is not updated because RCV.NXT < Y.
4. A.5 is received. PAWS accepts it because TS.Recent < 5.
TS.Recent is updated because RCV.NXT == A and A.5 has data.
Now, TS.Recent == 5 and RCV.NXT >= A + the data length of A.5.
(The actual new value of RCV.NXT depends on the out-of-order
data queue in TCP B.)
5. Z.4 is received. PAWS discards it because TS.Recent > 4.
In this example, the valid segment Z.4 is discarded by PAWS in step
5. Figure 2 illustrates this scenario.
Sequence number
A Z.4
| Y.3 /
| X.2~~ \ /
| W.1~~ \ /
| ~~ \ /
| A.5
|
+--------------------------------> Time
+---------+-------------------------------+
|Segment |(prev) W.1 X.2 Y.3 A.5 Z.4 |
+---------+-------------------------------+
|PAWS | - Pass Pass Pass Pass Fail|
|TS.Recent| 0 0 0 0 5 5 |
|RCV.NXT | A A A A >A >A |
+---------+-------------------------------+
Figure 2: Time vs. sequence number at TCP B
Even in the case where TCP A does not support Limited Transmit (i.e.,
in the case where Y.3 and Z.4 are not sent in the example above), if
the data segment sequence W.1, X.2, A.5 sent by TCP A is reordered as
W.1, A.5, X.2 (i.e., X.2 and A.5 are exchanged) on the path to TCP B,
X.2 could be discarded by PAWS. Since there would be a small gap
between the time when X.2 is sent and the time when A.5 is sent, the
possibility of this problem occurring would be less than in the
example above.
Demizu Expires April 2005 [Page 4]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
2.2 Example 2: Reordering and NewReno
In this example, suppose TCP A is sending data to TCP B. Assume that
TCP A supports the TCP Timestamps option [RFC1323], TCP Congestion
Control [RFC2581], and NewReno [RFC3782], and that TCP B supports the
TCP Timestamps option with PAWS [RFC1323].
Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by
TCP A, where the letter indicates the sequence number and the digit
represents the timestamp (TSval). In the data segment sequence,
suppose W.1 through Z.4 are sent by Fast Recovery at each time when a
duplicate ACK segment is received, and A.5 is sent by NewReno.
Figure 3 illustrates the data segment sequence observed at TCP A.
This figure uses the same notation as Figure 1.
Sequence number
A Z.4
| Y.3~~ \
| X.2~~ \
| W.1~~ \
| ~~ \
| A.5
| o
| /
| /
| /
| /
| ..o____o____o____o
|
+--------------------------------> Time
Figure 3: Time vs. sequence number at TCP A
Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent
by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5
are exchanged) on the path to TCP B.
The resulting data segment sequence observed at TCP B are the same as
Figure 2. And what happens at TCP B are also the same as those in
the example described in section 2.1. Consequently, the valid
segment Z.4 is discarded by PAWS.
2.3 Generic Scenario
In general, this problem occurs in the following scenario. Suppose
TCP A is sending data to TCP B, and consider the following steps.
Demizu Expires April 2005 [Page 5]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
1. Data segment Z.4 is sent by the sender (TCP A).
2. Data segment A.5 is sent by the sender (TCP A).
The sequence number of segment A.5 is lower than that of
segment Z.4. The TSval value on segment A.5 is newer than that
on segment Z.4.
Note: Segment A.5 would be a retransmitted segment sent by Fast
Retransmit, NewReno, SACK [RFC2018][RFC3517], or another
mechanism that infers a segment loss and retransmits the
lost data quickly. The sequence number of segment A.5
would be less than SND.NXT.
3. Segment A.5 arrives at the receiver earlier than segment Z.4.
Suppose that
segment A.5 satisfies SEG.SEQ <= RCV.NXT < SEG.SEQ + SEG.LEN,
and the TSval value on segment A.5 is not older than the
TS.Recent value at the receiver (TCB B).
Segment A.5 is accepted by PAWS at the receiver. TS.Recent at
the receiver is updated with the TSval value on segment A.5.
RCV.NXT is also updated.
4. Segment Z.4 arrives at the receiver (TCP B).
Segment Z.4 is discarded by PAWS because the TSval value on
segment Z.4 is older than the TS.Recent value at the receiver.
In this scenario, the gap between the time when segment Z.4 is sent
and the time when segment A.5 is sent should be small so that
reordering could exchange segments Z.4 and A.5.
2.4 Negative effects
This problem would cause some negative effects on TCP performance.
A sender would spend additional time detecting a loss and recovering
from it. Moreover, the sender would consider the loss to be a
congestion indication, and the congestion window would needlessly be
further reduced.
In addition, discarding valid acceptable segments at a receiver is a
waste of bandwidth.
Demizu Expires April 2005 [Page 6]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
3. Basic Idea
Suppose the following tuple is recorded for every 2^30 bytes of data.
<received time, sequence number, timestamp>
If more than one segment contained a target byte, this tuple was
created only for the first segment containing the target byte.
In this tuple, the received time field keeps the time when the tuple
was recorded. The sequence number field keeps the sequence number of
the target byte. The timestamp field keeps the SEG.TSval value of
the segment containing the target byte.
Any incoming segment should be checked as follows: if the second
latest tuple is not older than 24 days ago, and the SEG.TSval value
on the received segments is older than the value of the timestamp
field in the second latest tuple, the incoming segment would be an
old duplicate segment. If an incoming segment was an old duplicate
segment, it should be discarded and an ACK segment should be sent in
reply.
The algorithm updating TS.Recent should take care to keep TS.Recent
monotonically non-decreasing, because this idea does not ensure it.
Appendix A.3 describes this idea in more depth.
Demizu Expires April 2005 [Page 7]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
4. Specification
This memo proposes the following modification to the processing of
the TCP Timestamps option to solve the problem described in section
2.
- The TCP per-connection state is augmented by an array: TSRec.
- TSRec is an array of a structure which consists of the
following members.
- rcv_time: when the target byte is received first time.
- rcv_seqno: the sequence number of the target byte.
- rcv_tsval: the SEG.TSval value of the segment.
- The number of elements in the TSRec array is 2.
- When an original SYN is received:
/* Record it in the first tuple. */
TSRec[0].rcv_time = current_time;
TSRec[0].rcv_seqno = IRS; /* Initial Receive Seq no */
TSRec[0].rcv_tsval = SEG.TSval;
/* Copy the first tuple to the second tuple. */
TSRec[1] = TSRec[0];
- When other segment is received:
/* Execute the test. */
if (current_time - TSRec[1].rcv_time <= 24days &&
SEG.TSval < TSRec[1].rcv_tsval) {
Send an ACK segment in reply.
Discard the incoming segment.
}
/* Update RCV.NXT */
/* Record it in the first tuple if necessary. */
if (SEG.SEQ > 0 && RCV.NXT - TSRec[0].rcv_seqno > 2^30) {
/* Copy the first tuple to the second tuple. */
TSRec[1] = TSRec[0];
/* Record the received segment in the first tuple. */
TSRec[0].rcv_time = current_time;
TSRec[0].rcv_seqno += 2^30;
TSRec[0].rcv_tsval = SEG.TSval;
}
Demizu Expires April 2005 [Page 8]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
5. Security Considerations
Security issues are not discussed in this memo.
Acknowledgments
Masaki Hirabaru (NICT) invented the idea in section 3.
Normative References
[RFC793] J. Postel, "Transmission Control Protocol", RFC793,
STD 7, September 1981
[RFC1323] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions
for High Performance", RFC1323, May 1992
[RFC2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", RFC2119, BCP14, March 1997
[RFC3668] S. Bradner, "Intellectual Property Rights in IETF
Technology", RFC3668, BCP79, February 2004
Informative References
[RFC2018] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow,
"TCP Selective Acknowledgement Options", RFC2018,
October 1996.
[RFC2581] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion
Control", RFC2581, April 1999
[RFC3042] M. Allman, H. Balakrishnan, and S. Floyd, "Enhancing
TCP's Loss Recovery Using Limited Transmit", RFC3042,
January 2001
[RFC3517] E. Blanton, M. Allman, K. Fall, and L. Wang,
"A Conservative Selective Acknowledgment (SACK)-based
Loss Recovery Algorithm for TCP", RFC3517, April 2003
[RFC3522] R. Ludwig, M. Meyer, "The Eifel Detection Algorithm for
TCP", RFC3522, April 2003
[RFC3782] S. Floyd, T. Henderson, and A. Gurtov, "The NewReno
Modification to TCP's Fast Recovery Algorithm",
RFC3782, April 2004
Demizu Expires April 2005 [Page 9]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
[BPS99] J. Bennett, C. Partridge, and N. Shectman, "Packet
Reordering is Not Pathological Network Behavior",
IEEE/ACM Transactions on Networking, December 1999
[Dem04] N. Demizu, "Protection Against Spoofing Attacks by Using
the TCP Timestamps Option", (work in progress),
Internet-Draft <draft-demizu-tcpm-pasa-issues-00.txt>,
October 2004.
[Pax97] Vern Paxon, "End-to-End Internet Packet Dynamics",
In ACM SIGCOMM'97, September 1997
[ZM04] Zhou, X. and P. Van Mieghem, "Reordering of IP Packets
in Internet", In Proceedings of Passive and Active
Measurement (PAM2004), April 2004,
Appendix A: Comparison of Possible Methods
Appendix A discusses possible methods of solving the problem
described in section 2. Three methods are described and compared:
a receiver-side modification, sender-side modifications, and
a replacement of PAWS.
In the description below, my.TSclock is the "local source of 32-bit
timestamp values." TSval is one of the two timestamp fields of the
TCP Timestamps option. See [RFC1323] for more details.
The variables TS.SndMax and TS.SndMin are borrowed from [Dem04].
Acknowledgments
Kacheong Poon (Sun) gave the author a hint for the idea tweaking
TSval on a local node to control TS.Recent on a peer node.
Appendix A.2 is based on his hint.
Ronald Alleyne (S2IO) gave the author the idea using TS.SndMin.
He also gave the author its limitation, its side effects, and
measures against them. Appendix A.2.2 is based on his proposal.
Masaki Hirabaru (NICT) gave the author an alternative idea to the
current PAWS. Appendix A.3 is based on his idea.
A.1 Receiver-side Modification
A straightforward way to solve the problem would be to modify the
rules of PAWS so that valid delayed segments will be accepted.
Demizu Expires April 2005 [Page 10]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
The new rule would be as the following:
- Change the inequality in R1) in section 4.2.1 of [RFC1323] as
follows:
Current: SEG.TSval < TS.Recent
Proposal: SEG.TSval < TS.Recent - T1, where T1 = RTT.
- In addition, to keep TS.Recent be monotonically non-decreasing,
in R3) in section 4.2.1 of [RFC1323], TS.Recent is updated only
when SEG.TSval >= TS.Recent.
With this new rule, it would be very important to choose the value of
T1 appropriately. If T1 was too large, old duplicate segments would
be accepted, and PAWS would become useless. On the other hand, if T1
was too small, valid segments would still be discarded, and this new
rule would become useless.
It would be difficult, however, for a receiver to determine the value
of T1 appropriately, because it would not have enough information on
the frequency of the values of TSval's on segments sent by a sender.
- [RFC1323] recommends that the range of the frequency is 1
millisecond to 1 second. This is too wide for a receiver to be
able to choose a practicable value for T1 which would work under
any circumstances.
- It might be possible to infer an approximation of the frequency
by observing the TSval values on received segments. The
calculation would be very complex, however.
- It would be possible to introduce a new TCP option to exchange
the frequency information. Unfortunately, this would take
many years to deploy. Furthermore, the TCP option space is
limited.
- It might be possible to specify the frequency, e.g. 1ms, as a
standard. It would take many years to deploy, however.
Therefore, it would be difficult to solve the problem by the method
above.
A.2 Sender-side Modification
Another solution would be to change the rule for the TSval value on a
retransmitted segment sent by Fast Retransmit, NewReno, etc., so that
the TSval value on such a segment will never be newer than the TSval
Demizu Expires April 2005 [Page 11]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
values on segments sent prior to the retransmitted segment.
A.2.1 Sender-side Modification Using the Maximum TSval Sent
A.2.1.1 Proposal
An intuitive way of a sender-side modification would be as follows:
- When a segment is *NOT* sent by Fast Retransmit, NewReno, etc.,
use the current timestamp clock for the TSval value on the
segment, and record the value.
- When a segment *is* sent by Fast Retransmit, NewReno, etc., use
the recorded value for the TSval value on the segment.
Since TSval is monotonically non-decreasing, the recorded value is
equal to the maximum value of TSval that has been sent. The variable
to record the value is called TS.SndMax in this memo.
Now, let us revisit the examples in section 2 with this rule. In
those examples, Z.4 is discarded by PAWS because it arrives at the
receiver later than A.5 while its TSval is older than that on A.5.
If the rule above was introduced, however, the TSval value on A.5
would change from 5 to 4. Therefore, at step 4 in section 2.1,
TS.Recent at the receiver would become 4 instead of 5. Hence, Z.4
would be accepted. Thus, the rule given above solves the problem.
Figure A-1 illustrates the scenario observed at TCB B.
Sequence number
A Z.4
| Y.3 /
| X.2~~ \ /
| W.1~~ \ /
| ~~ \ /
| A.4
|
+--------------------------------> Time
+---------+-------------------------------+
|Segment |(prev) W.1 X.2 Y.3 A.4 Z.4 |
+---------+-------------------------------+
|PAWS | - Pass Pass Pass Pass Pass|
|TS.Recent| 0 0 0 0 4 4 |
|RCV.NXT | A A A A >A >A |
+---------+-------------------------------+
Figure A-1: Time vs. sequence number at TCP B
Demizu Expires April 2005 [Page 12]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
A.2.1.2 Limitation
The rule above has an obvious limitation: If A.4 (which was A.5 in
section 2) arrived at the receiver earlier than both Y.3 and Z.4, Z.4
would be accepted, but Y.3 would be discarded. In other words, the
rule avoids discarding only the last data segment. It is an open
issue whether this limitation is acceptable or not.
One way to avoid discarding the last N segments would be as
follows: Both the TSval values and the sequence numbers of the
last N segments should be recorded. Then, the TSval value on a
segment sent by Fast Retransmit, NewReno, etc. should be that of
the oldest TSval value among the recorded segments whose sequence
number is not less than the sequence number plus the data length
of the segment being sent. This idea would overcome the
limitation, but it would introduce some complexity. Therefore,
this memo does not recommend this idea.
A.2.1.3 Side Effects
With the rule above, the TSval value on a segment sent by Fast
Retransmit, NewReno, etc. may be older than the current timestamp
clock. Hence, the measured RTT calculated using the SEG.TSecr value
on the ACK segment that acknowledges the retransmitted segment may be
longer than the actual RTT.
In many cases, the gap between the time when the new data segment
(e.g., Z.4) was sent and the time when the retransmitted segment
(e.g., A.4) is sent would be small. In such cases, the error of
the measured RTT would be negligible.
One considerable case would be where the congestion window is 2
MSS and Limited Transmit is used. In this case, if only the first
data segment is lost, Limited Transmit would keep sending new data
segments every RTT, and each data segment would update TS.SndMax.
When the third duplicate ACK segment is received, Fast Retransmit
retransmits the lost data with TSval = TS.SndMax, which indicates
one RTT ago. If an ACK segment for the retransmitted segment was
returned, the measured RTT would be one RTT longer than the actual
RTT.
The worst case would be where NewReno is retransmitting data
segments upon each ACK segment which is returned for every RTT.
Since data segments sent by NewReno would not update TS.SndMax
with the rule above, the errors of measured RTTs would be RTT
times the number of successive retransmitted segments.
Although it would be better to mismeasure RTTs in limited cases than
Demizu Expires April 2005 [Page 13]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
to discard valid segments in exceptional cases, mismeasurement should
be avoided. One way would be as follows: if a segment is sent by
Fast Retransmit, NewReno, etc., but the current timestamp clock -
TS.SndMax is longer than a fraction of one RTT (e.g. 1/2 * RTT),
update TS.SndMax with the current timestamp clock. The fraction of
one RTT should be long enough to keep the modification effective.
A.2.1.4 How to Distinguish Segments
To implement the rule above, it is important to precisely distinguish
segments sent by Fast Retransmit, NewReno, etc. from other segments.
It is implementation-dependent.
In some implementations, it would be easy to add a new flag to the
arguments of the TCP output routine to indicate whether the output
request was triggered by Fast Retransmit, NewReno, etc. or not.
In other implementations, segments would be distinguishable by their
sequence numbers. If the sequence number of a data segment being
sent was less than SND.NXT, the data segment would be sent by Fast
Retransmit, NewReno, etc. Note that some implementations, such as
BSD-derived implementations, temporarily lower SND.NXT on sending
such a segment. Also note that the sequence number of a RST segment
sent in reply to an unacceptable segment received in unestablished
states may be less than SND.NXT.
A.2.1.5 Formal Description
The rule above can be formalized as follows:
The TCP per-connection state is augmented by a new timestamp:
TS.SndMax.
Initially:
TS.SndMax = my.TSclock
On sending each segment:
If (the segment is NOT sent by Fast Retransmit, NewReno, etc.
or my.TSclock - TS.SndMax > k * RTT where k=1/2) then
TS.SndMax = my.TSclock.
SEG.TSval = TS.SndMax
The method of distinguishing segments sent by Fast Retransmit,
Demizu Expires April 2005 [Page 14]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
NewReno, etc. from other segments is implementation-dependent.
See appendix A.2.1.4.
A.2.2 Sender-side Modification Using the Maximum TSecr Received
A.2.2.1 Proposal
The rule described in appendix A.2.1 avoids discarding only the last
data segment. If this limitation is unacceptable in real networks,
the TSval value on a retransmitted segment sent by Fast Retransmit,
NewReno, etc. needs to be changed to some older value.
The oldest candidate for such value would be the maximum value of
TSecr that has been received. The variable to record the value is
called TS.SndMin in this memo. Note that the TS.SndMin value is
older than or equal to the TS.Recent value of the peer node.
In the case where TS.SndMin is used for TSval on such a retransmitted
segment, during recover phase, TS.SndMin would be unchanged and equal
to the TS.Recent value at the peer node. Therefore, the TSval value
on any new data segments sent would be newer than or equal to the
TS.Recent value at the peer node. Thus, new data segments would
never be discarded by PAWS.
A.2.2.2 Side Effects
Since the TSval value on a segment sent by Fast Retransmit, NewReno,
etc. would be older than the current timestamp clock, the measured
RTT would be longer than the actual RTT, and errors would be worse
than the rule described in appendix A.2.1.
An error of each measured RTT would be equal to the difference
between the time when the recovery phase starts and the current time,
because the TS.SndMin value would not be changed during recovery
phase. For example, when a retransmitted segment was sent by Fast
Retransmit, the error would be one RTT. When a retransmitted segment
was sent by NewReno, the error would be several RTTs. To avoid
falsely increasing the estimated RTT, measured RTTs need to be
ignored during recovery phase.
In addition, segments retransmitted by SACK to fill holes could be
discarded by PAWS due to reordering if an original segment whose
TSval is newer than the TS.SndMin value is delayed but arrives
earlier than the retransmitted segment, and advances RCV.NXT of the
receiver. Fortunately, this scenario would be unlikely to occur.
Demizu Expires April 2005 [Page 15]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
Furthermore, Eifel [RFC3522] would not work with this rule.
A.2.2.3 Formal Description
The rule described in appendix A.2.2 can be formalized as follows:
The TCP per-connection state is augmented by a new timestamp:
TS.SndMin.
Initially:
TS.SndMin = my.TSclock
On receiving a segment:
/* After the segment passes all TCP tests */
if (TS.SndMin < SEG.TSecr) then
TS.SndMin = SEG.TSecr
On sending a segment:
If (SEG.SEQ < SND.NXT) then
SEG.TSval = TS.SndMin
else
SEG.TSval = my.TSclock
A.2.3 Tradeoffs between Sender-side Modifications
The difference between proposals in appendices A.2.1 and A.2.2 is in
the TSval value on retransmitted segments sent by Fast Retransmit,
NewReno, etc. Appendix A.2.1 proposes to use TS.SndMax, while
appendix A.2.2 proposes to use TS.SndMin. There would be many other
candidates between the TS.SndMin value and the TS.SndMax value, such
as the idea described in A.2.1.2.
In general, sender-side modification can be described as follows:
- When a segment *is* sent by Fast Retransmit, NewReno, etc., use
a specific value between the TS.SndMin value and the TS.SndMax
value for the TSval on the segment.
- Otherwise, use the current timestamp clock for the TSval on the
segment.
The key to sender-side modification is the choise of the TSval value.
There are tradeoffs, though.
Demizu Expires April 2005 [Page 16]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
- If the TSval value was equal to TS.SndMax,
- New data segments except the last one might be discarded by
PAWS.
- RTTM would mismeasure RTTs in some limited cases.
- If the value was near TS.SndMin,
- New data segments would not be discarded by PAWS.
Segments retransmitted by SACK might be discarded by PAWS.
- RTTM would not work. Measured RTTs should be ignored.
- Eifel would not work.
- If some other value between the TS.SndMax value and the
TS.SndMin value is used for TSval, both characteristics above
would be mixed.
As seen above, tweaking TSval on a retransmitted segment sent by Fast
Retransmit, NewReno, etc. would break various assumptions of existing
mechanisms utilizing the TCP Timestamps option, and would cause
number of side effects. Some of them are listed above, however,
there would be unknown side effects.
As a result, sender-side modifications are half measures. It is
impossible to choose a perfect TSval value.
A.3 Replacement of PAWS
A.3.1 Outline of the Idea
The third idea would replace PAWS with a new algorithm that is more
tolerant to reordering. For example, suppose the following tuple is
recorded for every CHUNK_SIZE bytes of data.
<received time, sequence number, timestamp>
If more than one segment contained a target byte, this tuple was
created only for the first segment containing the target byte.
In this tuple, the received time field keeps the time when the tuple
was recorded. If its value becomes older than LIFE_TIME ago, the
tuple should be invalidated. The sequence number field keeps the
sequence number of the target byte. The timestamp field keeps the
SEG.TSval value of the segment containing the target byte.
Demizu Expires April 2005 [Page 17]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
Any incoming segment should be checked as follows: if the second
latest tuple is not older than LIFE_TIME ago, and the SEG.TSval value
on the received segments is older than the value of the timestamp
field in the second latest tuple, the incoming segment would be an
old duplicate segment. If an incoming segment was an old duplicate
segment, it should be discarded and an ACK segment should be sent in
reply.
PAWS forces that incoming SEG.TSval values is monotonically
non-decreasing. If the SEG.TSval value on an incoming segment is
older than TS.Recent, such a segment is discarded by PAWS.
Contrarily, the SEG.TSval values on segments that pass the test above
may not be monotonically non-decreasing. Hence, the algorithm
updating TS.Recent should take care by itself to keep TS.Recent
monotonically non-decreasing.
A.3.2 Formal Description
The idea outlined in appendix A.3.1 can be formalized as follows.
- There are three parameters: NELEM, CHUNK_SIZE, and LIFE_TIME.
- The TCP per-connection state is augmented by an array: TSRec.
- TSRec is an array of a structure which consists of the
following members.
- rcv_time: when the target byte is received first time.
- rcv_seqno: the sequence number of the target byte.
- rcv_tsval: the SEG.TSval value of the segment.
- The number of elements in the TSRec array is NELEM.
- When an original SYN is received:
/* Record it in the first tuple. */
TSRec[0].rcv_time = current_time;
TSRec[0].rcv_seqno = IRS; /* Initial Receive Seq-no */
TSRec[0].rcv_tsval = SEG.TSval;
/* Copy the first tuple to others as their initial values. */
for (i = 1; i < NELEM; i++) {
TSRec[i] = TSRec[0];
}
- When other segment is received:
Demizu Expires April 2005 [Page 18]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
/* Execute the test. */
if (current_time - TSRec[NELEM-1].rcv_time <= LIFE_TIME &&
SEG.TSval < TSRec[NELEM-1].rcv_tsval) {
Send an ACK segment in reply.
Discard the incoming segment.
}
/* Update RCV.NXT */
/* Record it in a new tuple if necessary. */
if (SEG.SEQ > 0 && RCV.NXT - TSRec[0].rcv_seqno > CHUNK_SIZE) {
/* Move elements in the TSRec array backward. */
for (i = NELEM - 1; i > 0; i++) {
TSRec[i] = TSRec[i-1];
}
/* Record the received segment in a new tuple. */
TSRec[0].rcv_time = current_time;
TSRec[0].rcv_seqno += CHUNK_SIZE;
TSRec[0].rcv_tsval = SEG.TSval;
}
A.3.3 Parameters
The values of the parameters would be as follows.
- The idea outlined in appendix A.3.1 assumes NELEM == 2.
- The maximum value of CHUNK_SIZE would be 2^32 / 2 / NELEM. If
CHUNK_SIZE was larger than this value, the rcv_seqno field of
TSRec[NELEM-1] would be wrapped.
The minimum value of CHUNK_SIZE would be the current receive
window size. The maximum size of the current receive window
size is determined by the TCP Window Scale option [RFC1323]:
(2^16 - 1) * 2^14 == 2^30 - 2^14.
Hence, the maximum possible values of NELEM, CHUNK_SIZE, and
receive window size depend on each other. If NELEM is 2, the
maximum possible value of CHUNK_SIZE is 2^30, which is larger
than the maximum size of receive window size.
- The best value for LIFE_TIME would be 24 days, as the same as
that in section 4.2.3 of [RFC1323].
As a result, this memo recommends NELEM = 2, CHUNK_SIZE = 2^30, and
LIFE_TIME = 24 days.
Demizu Expires April 2005 [Page 19]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
A.4 Comparison of Ideas
The ideas described in appendices A.1 and A.2 have problems to be
solved, while the idea described in Appendix A.3 does not have any
known issues. Hence, this memo recommends the modification in
appendix A.3.
Author's Address
Noritoshi Demizu
National Institute of Information and Communications Technology(NICT)
4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan
Phone: +81-42-327-7432 (Ex. 5813)
E-mail: demizu@nict.go.jp
Copyright Statement
Copyright (C) The Internet Society (2004). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology
described in this document or the extent to which any license
under such rights might or might not be available; nor does it
represent that it has made any independent effort to identify any
such rights. Information on the procedures with respect to
rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use
of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr.
Demizu Expires April 2005 [Page 20]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt> October 2004
The IETF invites any interested party to bring to its attention
any copyrights, patents or patent applications, or other
proprietary rights that may cover technology that may be required
to implement this standard. Please address the information to the
IETF at ietf-ipr@ietf.org.
Acknowledgment
Funding for the RFC Editor function is currently provided by the
Internet Society.
Demizu Expires April 2005 [Page 21]