Internet DRAFT - draft-demizu-tcpm-paws-reordering

draft-demizu-tcpm-paws-reordering



Network Working Group                                   Noritoshi Demizu
INTERNET-DRAFT                                                      NICT
Expires: April 25, 2005                                 October 25, 2004



        A Modification to Make PAWS Robust to Segment Reordering
               <draft-demizu-tcpm-paws-reordering-01.txt>



Status of this Memo


   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   or will be disclosed, and any of which I become aware will be
   disclosed, in accordance with RFC 3668.


   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.


   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than a "work in progress."


   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html


   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html



Copyright Notice


   Copyright (C) The Internet Society (2004).  All Rights Reserved.



Abstract


   The purpose of PAWS (Protect Against Wrapped Sequence numbers),
   defined in RFC1323, is to protect a TCP connection against old
   duplicate segments from the same connection.  There is a possibility,
   however, that PAWS may discard valid reordered segments.  This memo
   proposes a modification to PAWS to solve this problem.







Demizu                     Expires April 2005                   [Page 1]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



1. Introduction


   The purpose of PAWS (Protect Against Wrapped Sequence numbers)
   [RFC1323] is to protect a TCP connection against old duplicate
   segments from the same connection.  The PAWS mechanism uses the TCP
   Timestamps option [RFC1323] to detect old duplicate segments.


   PAWS could, however, falsely discard valid segments that are delayed
   due to reordering.  For example, if a retransmitted segment sent by
   Fast Retransmit [RFC2581] is reordered and arrives at the receiver
   earlier than some delayed segments sent prior to the retransmitted
   segment, and if the timestamp on the retransmitted segment is newer
   than the timestamps on the delayed segments, the delayed segments
   will be discarded by PAWS at the receiver.


   In the case where Limited Transmit [RFC3042] is used as well, a
   retransmitted segment sent by Fast Retransmit might follow two new
   data segments sent by Limited Transmit in a row.  In this case, the
   problem described in the previous paragraph would be more likely to
   occur.


   In the case where NewReno [RFC3782] is used, the problem could occur
   if a retransmitted segment sent by NewReno is reordered and arrives
   at the receiver earlier than some new data segments sent before the
   retransmitted segment.


   [Pax97][BPS99] and [ZM04] show that segment reordering is not a rare
   event.  If valid segments were falsely discarded by PAWS due to
   reordering, there would be negative effects on TCP performance.


   This memo proposes a modification to PAWS to solve this problem.


   The memo is organized as follows.  Section 2 describes the problem by
   showing examples.  Section 3 describes basic idea.  And section 4
   proposes a solution.  Appendix A compares possible methods of solving
   the problem.


   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].



2. Problem Description


   There is a possibility that valid segments could be discarded by PAWS
   when those segments are delayed because of reordering.  This section
   shows some examples of this problem, then describes a generic
   scenario and some possible negative effects.




Demizu                     Expires April 2005                   [Page 2]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



2.1 Example 1: Reordering and Fast Retransmit with Limited Transmit


   In this example, suppose TCP A is sending data to TCP B.  Assume that
   TCP A supports the TCP Timestamps option [RFC1323], TCP Congestion
   Control [RFC2581], and Limited Transmit [RFC3042], and that TCP B
   supports the TCP Timestamps option with PAWS [RFC1323].


   Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by
   TCP A, where the letter indicates the sequence number and the digit
   represents the timestamp (TSval).  In this data segment sequence,
   suppose W.1 and X.2 are sent in the Congestion Avoidance phase, Y.3
   and Z.4 are sent by Limited Transmit, and A.5 is sent by Fast
   Retransmit.


   Figure 1 illustrates the data segment sequence observed at TCP A.
   The x-axis represents time, and the y-axis represents the sequence
   number.  W.1 through Z.4 and A.5 indicate the data segments sent.
   Each 'o' mark indicates a received ACK segment.  Lines are drawn with
   symbol characters between data segments and between ACK segments.


            Sequence number
                  A                      Z.4
                  |                 Y.3~~  \
                  |            X.2~~        \
                  |       W.1~~              \
                  |     ~~                    \
                  |                           A.5
                  |             o____o____o____o
                  |        o~~~~     1    2    3!!   <-- dup-ACK count
                  |   o~~~~
                  +--------------------------------> Time


      Figure 1: Time vs. sequence number at TCP A


   Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent
   by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5
   are exchanged) on the path to TCP B.  Figure 2 illustrates the
   resulting data segment sequence observed at TCP B.


   What happens at TCP B is described below.


      0. Assume TS.Recent is valid and TS.Recent == 0.
         Assume RCV.NXT == A.


      1. W.1 is received.  PAWS accepts it because TS.Recent < 1.
         TS.Recent is not updated because RCV.NXT < W.






Demizu                     Expires April 2005                   [Page 3]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



      2. X.2 is received.  PAWS accepts it because TS.Recent < 2.
         TS.Recent is not updated because RCV.NXT < X.


      3. Y.3 is received.  PAWS accepts it because TS.Recent < 3.
         TS.Recent is not updated because RCV.NXT < Y.


      4. A.5 is received.  PAWS accepts it because TS.Recent < 5.
         TS.Recent is updated because RCV.NXT == A and A.5 has data.


         Now, TS.Recent == 5 and RCV.NXT >= A + the data length of A.5.
         (The actual new value of RCV.NXT depends on the out-of-order
          data queue in TCP B.)


      5. Z.4 is received.  PAWS discards it because TS.Recent > 4.


   In this example, the valid segment Z.4 is discarded by PAWS in step
   5.  Figure 2 illustrates this scenario.


            Sequence number
                  A                           Z.4
                  |                 Y.3       /
                  |            X.2~~   \     /
                  |       W.1~~         \   /
                  |     ~~               \ /
                  |                      A.5
                  |
                  +--------------------------------> Time


        +---------+-------------------------------+
        |Segment  |(prev) W.1  X.2  Y.3  A.5  Z.4 |
        +---------+-------------------------------+
        |PAWS     |   -   Pass Pass Pass Pass Fail|
        |TS.Recent|   0    0    0    0    5    5  |
        |RCV.NXT  |   A    A    A    A    >A   >A |
        +---------+-------------------------------+


      Figure 2: Time vs. sequence number at TCP B


   Even in the case where TCP A does not support Limited Transmit (i.e.,
   in the case where Y.3 and Z.4 are not sent in the example above), if
   the data segment sequence W.1, X.2, A.5 sent by TCP A is reordered as
   W.1, A.5, X.2 (i.e., X.2 and A.5 are exchanged) on the path to TCP B,
   X.2 could be discarded by PAWS.  Since there would be a small gap
   between the time when X.2 is sent and the time when A.5 is sent, the
   possibility of this problem occurring would be less than in the
   example above.






Demizu                     Expires April 2005                   [Page 4]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



2.2 Example 2: Reordering and NewReno


   In this example, suppose TCP A is sending data to TCP B.  Assume that
   TCP A supports the TCP Timestamps option [RFC1323], TCP Congestion
   Control [RFC2581], and NewReno [RFC3782], and that TCP B supports the
   TCP Timestamps option with PAWS [RFC1323].


   Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by
   TCP A, where the letter indicates the sequence number and the digit
   represents the timestamp (TSval).  In the data segment sequence,
   suppose W.1 through Z.4 are sent by Fast Recovery at each time when a
   duplicate ACK segment is received, and A.5 is sent by NewReno.


   Figure 3 illustrates the data segment sequence observed at TCP A.
   This figure uses the same notation as Figure 1.


            Sequence number
                  A                    Z.4
                  |               Y.3~~  \
                  |          X.2~~        \
                  |     W.1~~              \
                  |   ~~                    \
                  |                         A.5
                  |                          o
                  |                         /
                  |                        /
                  |                       /
                  |                      /
                  |    ..o____o____o____o
                  |
                  +--------------------------------> Time


      Figure 3: Time vs. sequence number at TCP A


   Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent
   by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5
   are exchanged) on the path to TCP B.


   The resulting data segment sequence observed at TCP B are the same as
   Figure 2.  And what happens at TCP B are also the same as those in
   the example described in section 2.1.  Consequently, the valid
   segment Z.4 is discarded by PAWS.



2.3 Generic Scenario


   In general, this problem occurs in the following scenario.  Suppose
   TCP A is sending data to TCP B, and consider the following steps.




Demizu                     Expires April 2005                   [Page 5]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



      1. Data segment Z.4 is sent by the sender (TCP A).


      2. Data segment A.5 is sent by the sender (TCP A).


         The sequence number of segment A.5 is lower than that of
         segment Z.4.  The TSval value on segment A.5 is newer than that
         on segment Z.4.


         Note: Segment A.5 would be a retransmitted segment sent by Fast
               Retransmit, NewReno, SACK [RFC2018][RFC3517], or another
               mechanism that infers a segment loss and retransmits the
               lost data quickly.  The sequence number of segment A.5
               would be less than SND.NXT.


      3. Segment A.5 arrives at the receiver earlier than segment Z.4.


         Suppose that
           segment A.5 satisfies SEG.SEQ <= RCV.NXT < SEG.SEQ + SEG.LEN,
           and the TSval value on segment A.5 is not older than the
           TS.Recent value at the receiver (TCB B).


         Segment A.5 is accepted by PAWS at the receiver.  TS.Recent at
         the receiver is updated with the TSval value on segment A.5.
         RCV.NXT is also updated.


      4. Segment Z.4 arrives at the receiver (TCP B).


         Segment Z.4 is discarded by PAWS because the TSval value on
         segment Z.4 is older than the TS.Recent value at the receiver.


   In this scenario, the gap between the time when segment Z.4 is sent
   and the time when segment A.5 is sent should be small so that
   reordering could exchange segments Z.4 and A.5.



2.4 Negative effects


   This problem would cause some negative effects on TCP performance.


   A sender would spend additional time detecting a loss and recovering
   from it.  Moreover, the sender would consider the loss to be a
   congestion indication, and the congestion window would needlessly be
   further reduced.


   In addition, discarding valid acceptable segments at a receiver is a
   waste of bandwidth.






Demizu                     Expires April 2005                   [Page 6]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



3. Basic Idea


   Suppose the following tuple is recorded for every 2^30 bytes of data.


      <received time, sequence number, timestamp>


   If more than one segment contained a target byte, this tuple was
   created only for the first segment containing the target byte.


   In this tuple, the received time field keeps the time when the tuple
   was recorded.  The sequence number field keeps the sequence number of
   the target byte.  The timestamp field keeps the SEG.TSval value of
   the segment containing the target byte.


   Any incoming segment should be checked as follows: if the second
   latest tuple is not older than 24 days ago, and the SEG.TSval value
   on the received segments is older than the value of the timestamp
   field in the second latest tuple, the incoming segment would be an
   old duplicate segment.  If an incoming segment was an old duplicate
   segment, it should be discarded and an ACK segment should be sent in
   reply.


   The algorithm updating TS.Recent should take care to keep TS.Recent
   monotonically non-decreasing, because this idea does not ensure it.


   Appendix A.3 describes this idea in more depth.


























Demizu                     Expires April 2005                   [Page 7]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



4. Specification


   This memo proposes the following modification to the processing of
   the TCP Timestamps option to solve the problem described in section
   2.


      - The TCP per-connection state is augmented by an array: TSRec.


         - TSRec is an array of a structure which consists of the
           following members.


            - rcv_time: when the target byte is received first time.
            - rcv_seqno: the sequence number of the target byte.
            - rcv_tsval: the SEG.TSval value of the segment.


         - The number of elements in the TSRec array is 2.


      - When an original SYN is received:


         /* Record it in the first tuple. */
         TSRec[0].rcv_time  = current_time;
         TSRec[0].rcv_seqno = IRS;          /* Initial Receive Seq no */
         TSRec[0].rcv_tsval = SEG.TSval;


         /* Copy the first tuple to the second tuple. */
         TSRec[1] = TSRec[0];


      - When other segment is received:


         /* Execute the test. */
         if (current_time - TSRec[1].rcv_time <= 24days &&
             SEG.TSval < TSRec[1].rcv_tsval) {
            Send an ACK segment in reply.
            Discard the incoming segment.
         }


         /* Update RCV.NXT */


         /* Record it in the first tuple if necessary. */
         if (SEG.SEQ > 0 && RCV.NXT - TSRec[0].rcv_seqno > 2^30) {
            /* Copy the first tuple to the second tuple. */
            TSRec[1] = TSRec[0];
            /* Record the received segment in the first tuple. */
            TSRec[0].rcv_time  = current_time;
            TSRec[0].rcv_seqno += 2^30;
            TSRec[0].rcv_tsval = SEG.TSval;
         }





Demizu                     Expires April 2005                   [Page 8]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



5. Security Considerations


   Security issues are not discussed in this memo.



Acknowledgments


   Masaki Hirabaru (NICT) invented the idea in section 3.



Normative References


   [RFC793]    J. Postel, "Transmission Control Protocol", RFC793,
                STD 7, September 1981


   [RFC1323]   V. Jacobson, R. Braden, and D. Borman, "TCP Extensions
               for High Performance", RFC1323, May 1992


   [RFC2119]   S. Bradner, "Key words for use in RFCs to Indicate
               Requirement Levels", RFC2119, BCP14, March 1997


   [RFC3668]   S. Bradner, "Intellectual Property Rights in IETF
               Technology", RFC3668, BCP79, February 2004


Informative References


   [RFC2018]   M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow,
               "TCP Selective Acknowledgement Options", RFC2018,
               October 1996.


   [RFC2581]   M. Allman, V. Paxson, and W. Stevens, "TCP Congestion
               Control", RFC2581, April 1999


   [RFC3042]   M. Allman, H. Balakrishnan, and S. Floyd, "Enhancing
               TCP's Loss Recovery Using Limited Transmit", RFC3042,
               January 2001


   [RFC3517]   E. Blanton, M. Allman, K. Fall, and L. Wang,
               "A Conservative Selective Acknowledgment (SACK)-based
               Loss Recovery Algorithm for TCP", RFC3517, April 2003


   [RFC3522]   R. Ludwig, M. Meyer, "The Eifel Detection Algorithm for
               TCP", RFC3522, April 2003


   [RFC3782]   S. Floyd, T. Henderson, and A. Gurtov, "The NewReno
               Modification to TCP's Fast Recovery Algorithm",
               RFC3782, April 2004





Demizu                     Expires April 2005                   [Page 9]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   [BPS99]     J. Bennett, C. Partridge, and N. Shectman, "Packet
               Reordering is Not Pathological Network Behavior",
               IEEE/ACM Transactions on Networking, December 1999


   [Dem04]    N. Demizu, "Protection Against Spoofing Attacks by Using
              the TCP Timestamps Option", (work in progress),
              Internet-Draft <draft-demizu-tcpm-pasa-issues-00.txt>,
              October 2004.


   [Pax97]     Vern Paxon, "End-to-End Internet Packet Dynamics",
               In ACM SIGCOMM'97, September 1997


   [ZM04]      Zhou, X. and P. Van Mieghem, "Reordering of IP Packets
               in Internet", In Proceedings of Passive and Active
               Measurement (PAM2004), April 2004,



Appendix A: Comparison of Possible Methods


   Appendix A discusses possible methods of solving the problem
   described in section 2.  Three methods are described and compared:
   a receiver-side modification, sender-side modifications, and
   a replacement of PAWS.


   In the description below, my.TSclock is the "local source of 32-bit
   timestamp values."  TSval is one of the two timestamp fields of the
   TCP Timestamps option.  See [RFC1323] for more details.


   The variables TS.SndMax and TS.SndMin are borrowed from [Dem04].


   Acknowledgments


      Kacheong Poon (Sun) gave the author a hint for the idea tweaking
      TSval on a local node to control TS.Recent on a peer node.
      Appendix A.2 is based on his hint.


      Ronald Alleyne (S2IO) gave the author the idea using TS.SndMin.
      He also gave the author its limitation, its side effects, and
      measures against them.  Appendix A.2.2 is based on his proposal.


      Masaki Hirabaru (NICT) gave the author an alternative idea to the
      current PAWS.  Appendix A.3 is based on his idea.



A.1 Receiver-side Modification


   A straightforward way to solve the problem would be to modify the
   rules of PAWS so that valid delayed segments will be accepted.




Demizu                     Expires April 2005                  [Page 10]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   The new rule would be as the following:


      - Change the inequality in R1) in section 4.2.1 of [RFC1323] as
        follows:


         Current:  SEG.TSval < TS.Recent
         Proposal: SEG.TSval < TS.Recent - T1, where T1 = RTT.


      - In addition, to keep TS.Recent be monotonically non-decreasing,
        in R3) in section 4.2.1 of [RFC1323], TS.Recent is updated only
        when SEG.TSval >= TS.Recent.


   With this new rule, it would be very important to choose the value of
   T1 appropriately.  If T1 was too large, old duplicate segments would
   be accepted, and PAWS would become useless.  On the other hand, if T1
   was too small, valid segments would still be discarded, and this new
   rule would become useless.


   It would be difficult, however, for a receiver to determine the value
   of T1 appropriately, because it would not have enough information on
   the frequency of the values of TSval's on segments sent by a sender.


      - [RFC1323] recommends that the range of the frequency is 1
        millisecond to 1 second.  This is too wide for a receiver to be
        able to choose a practicable value for T1 which would work under
        any circumstances.


      - It might be possible to infer an approximation of the frequency
        by observing the TSval values on received segments.  The
        calculation would be very complex, however.


      - It would be possible to introduce a new TCP option to exchange
        the frequency information.  Unfortunately, this would take
        many years to deploy.  Furthermore, the TCP option space is
        limited.


      - It might be possible to specify the frequency, e.g. 1ms, as a
        standard.  It would take many years to deploy, however.


   Therefore, it would be difficult to solve the problem by the method
   above.



A.2 Sender-side Modification


   Another solution would be to change the rule for the TSval value on a
   retransmitted segment sent by Fast Retransmit, NewReno, etc., so that
   the TSval value on such a segment will never be newer than the TSval




Demizu                     Expires April 2005                  [Page 11]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   values on segments sent prior to the retransmitted segment.



A.2.1 Sender-side Modification Using the Maximum TSval Sent


A.2.1.1 Proposal


   An intuitive way of a sender-side modification would be as follows:


      - When a segment is *NOT* sent by Fast Retransmit, NewReno, etc.,
        use the current timestamp clock for the TSval value on the
        segment, and record the value.


      - When a segment *is* sent by Fast Retransmit, NewReno, etc., use
        the recorded value for the TSval value on the segment.


   Since TSval is monotonically non-decreasing, the recorded value is
   equal to the maximum value of TSval that has been sent.  The variable
   to record the value is called TS.SndMax in this memo.


   Now, let us revisit the examples in section 2 with this rule.  In
   those examples, Z.4 is discarded by PAWS because it arrives at the
   receiver later than A.5 while its TSval is older than that on A.5.
   If the rule above was introduced, however, the TSval value on A.5
   would change from 5 to 4.  Therefore, at step 4 in section 2.1,
   TS.Recent at the receiver would become 4 instead of 5.  Hence, Z.4
   would be accepted.  Thus, the rule given above solves the problem.
   Figure A-1 illustrates the scenario observed at TCB B.


            Sequence number
                  A                           Z.4
                  |                 Y.3       /
                  |            X.2~~   \     /
                  |       W.1~~         \   /
                  |     ~~               \ /
                  |                      A.4
                  |
                  +--------------------------------> Time


        +---------+-------------------------------+
        |Segment  |(prev) W.1  X.2  Y.3  A.4  Z.4 |
        +---------+-------------------------------+
        |PAWS     |   -   Pass Pass Pass Pass Pass|
        |TS.Recent|   0    0    0    0    4    4  |
        |RCV.NXT  |   A    A    A    A    >A   >A |
        +---------+-------------------------------+


      Figure A-1: Time vs. sequence number at TCP B




Demizu                     Expires April 2005                  [Page 12]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



A.2.1.2 Limitation


   The rule above has an obvious limitation: If A.4 (which was A.5 in
   section 2) arrived at the receiver earlier than both Y.3 and Z.4, Z.4
   would be accepted, but Y.3 would be discarded.  In other words, the
   rule avoids discarding only the last data segment.  It is an open
   issue whether this limitation is acceptable or not.


      One way to avoid discarding the last N segments would be as
      follows: Both the TSval values and the sequence numbers of the
      last N segments should be recorded.  Then, the TSval value on a
      segment sent by Fast Retransmit, NewReno, etc. should be that of
      the oldest TSval value among the recorded segments whose sequence
      number is not less than the sequence number plus the data length
      of the segment being sent.  This idea would overcome the
      limitation, but it would introduce some complexity.  Therefore,
      this memo does not recommend this idea.


A.2.1.3 Side Effects


   With the rule above, the TSval value on a segment sent by Fast
   Retransmit, NewReno, etc. may be older than the current timestamp
   clock.  Hence, the measured RTT calculated using the SEG.TSecr value
   on the ACK segment that acknowledges the retransmitted segment may be
   longer than the actual RTT.


      In many cases, the gap between the time when the new data segment
      (e.g., Z.4) was sent and the time when the retransmitted segment
      (e.g., A.4) is sent would be small.  In such cases, the error of
      the measured RTT would be negligible.


      One considerable case would be where the congestion window is 2
      MSS and Limited Transmit is used.  In this case, if only the first
      data segment is lost, Limited Transmit would keep sending new data
      segments every RTT, and each data segment would update TS.SndMax.
      When the third duplicate ACK segment is received, Fast Retransmit
      retransmits the lost data with TSval = TS.SndMax, which indicates
      one RTT ago.  If an ACK segment for the retransmitted segment was
      returned, the measured RTT would be one RTT longer than the actual
      RTT.


      The worst case would be where NewReno is retransmitting data
      segments upon each ACK segment which is returned for every RTT.
      Since data segments sent by NewReno would not update TS.SndMax
      with the rule above, the errors of measured RTTs would be RTT
      times the number of successive retransmitted segments.


   Although it would be better to mismeasure RTTs in limited cases than




Demizu                     Expires April 2005                  [Page 13]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   to discard valid segments in exceptional cases, mismeasurement should
   be avoided.  One way would be as follows: if a segment is sent by
   Fast Retransmit, NewReno, etc., but the current timestamp clock -
   TS.SndMax is longer than a fraction of one RTT (e.g. 1/2 * RTT),
   update TS.SndMax with the current timestamp clock.  The fraction of
   one RTT should be long enough to keep the modification effective.



A.2.1.4 How to Distinguish Segments


   To implement the rule above, it is important to precisely distinguish
   segments sent by Fast Retransmit, NewReno, etc. from other segments.
   It is implementation-dependent.


   In some implementations, it would be easy to add a new flag to the
   arguments of the TCP output routine to indicate whether the output
   request was triggered by Fast Retransmit, NewReno, etc. or not.


   In other implementations, segments would be distinguishable by their
   sequence numbers.  If the sequence number of a data segment being
   sent was less than SND.NXT, the data segment would be sent by Fast
   Retransmit, NewReno, etc.  Note that some implementations, such as
   BSD-derived implementations, temporarily lower SND.NXT on sending
   such a segment.  Also note that the sequence number of a RST segment
   sent in reply to an unacceptable segment received in unestablished
   states may be less than SND.NXT.



A.2.1.5 Formal Description


   The rule above can be formalized as follows:


      The TCP per-connection state is augmented by a new timestamp:
      TS.SndMax.


      Initially:


         TS.SndMax = my.TSclock


      On sending each segment:


         If (the segment is NOT sent by Fast Retransmit, NewReno, etc.
             or my.TSclock - TS.SndMax > k * RTT where k=1/2) then
            TS.SndMax = my.TSclock.


         SEG.TSval = TS.SndMax


   The method of distinguishing segments sent by Fast Retransmit,




Demizu                     Expires April 2005                  [Page 14]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   NewReno, etc. from other segments is implementation-dependent.
   See appendix A.2.1.4.



A.2.2 Sender-side Modification Using the Maximum TSecr Received


A.2.2.1 Proposal


   The rule described in appendix A.2.1 avoids discarding only the last
   data segment.  If this limitation is unacceptable in real networks,
   the TSval value on a retransmitted segment sent by Fast Retransmit,
   NewReno, etc. needs to be changed to some older value.


   The oldest candidate for such value would be the maximum value of
   TSecr that has been received.  The variable to record the value is
   called TS.SndMin in this memo.  Note that the TS.SndMin value is
   older than or equal to the TS.Recent value of the peer node.


   In the case where TS.SndMin is used for TSval on such a retransmitted
   segment, during recover phase, TS.SndMin would be unchanged and equal
   to the TS.Recent value at the peer node.  Therefore, the TSval value
   on any new data segments sent would be newer than or equal to the
   TS.Recent value at the peer node.  Thus, new data segments would
   never be discarded by PAWS.



A.2.2.2 Side Effects


   Since the TSval value on a segment sent by Fast Retransmit, NewReno,
   etc. would be older than the current timestamp clock, the measured
   RTT would be longer than the actual RTT, and errors would be worse
   than the rule described in appendix A.2.1.


   An error of each measured RTT would be equal to the difference
   between the time when the recovery phase starts and the current time,
   because the TS.SndMin value would not be changed during recovery
   phase.  For example, when a retransmitted segment was sent by Fast
   Retransmit, the error would be one RTT.  When a retransmitted segment
   was sent by NewReno, the error would be several RTTs.  To avoid
   falsely increasing the estimated RTT, measured RTTs need to be
   ignored during recovery phase.


   In addition, segments retransmitted by SACK to fill holes could be
   discarded by PAWS due to reordering if an original segment whose
   TSval is newer than the TS.SndMin value is delayed but arrives
   earlier than the retransmitted segment, and advances RCV.NXT of the
   receiver.  Fortunately, this scenario would be unlikely to occur.





Demizu                     Expires April 2005                  [Page 15]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   Furthermore, Eifel [RFC3522] would not work with this rule.



A.2.2.3 Formal Description


   The rule described in appendix A.2.2 can be formalized as follows:


     The TCP per-connection state is augmented by a new timestamp:
     TS.SndMin.


      Initially:


         TS.SndMin = my.TSclock


      On receiving a segment:


         /* After the segment passes all TCP tests */
         if (TS.SndMin < SEG.TSecr) then
            TS.SndMin = SEG.TSecr


      On sending a segment:


         If (SEG.SEQ < SND.NXT) then
            SEG.TSval = TS.SndMin
         else
            SEG.TSval = my.TSclock



A.2.3 Tradeoffs between Sender-side Modifications


   The difference between proposals in appendices A.2.1 and A.2.2 is in
   the TSval value on retransmitted segments sent by Fast Retransmit,
   NewReno, etc.  Appendix A.2.1 proposes to use TS.SndMax, while
   appendix A.2.2 proposes to use TS.SndMin.  There would be many other
   candidates between the TS.SndMin value and the TS.SndMax value, such
   as the idea described in A.2.1.2.


   In general, sender-side modification can be described as follows:


      - When a segment *is* sent by Fast Retransmit, NewReno, etc., use
        a specific value between the TS.SndMin value and the TS.SndMax
        value for the TSval on the segment.


      - Otherwise, use the current timestamp clock for the TSval on the
        segment.


   The key to sender-side modification is the choise of the TSval value.
   There are tradeoffs, though.




Demizu                     Expires April 2005                  [Page 16]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



      - If the TSval value was equal to TS.SndMax,


         - New data segments except the last one might be discarded by
           PAWS.


         - RTTM would mismeasure RTTs in some limited cases.


      - If the value was near TS.SndMin,


         - New data segments would not be discarded by PAWS.
           Segments retransmitted by SACK might be discarded by PAWS.


         - RTTM would not work.  Measured RTTs should be ignored.


         - Eifel would not work.


      - If some other value between the TS.SndMax value and the
        TS.SndMin value is used for TSval, both characteristics above
        would be mixed.


   As seen above, tweaking TSval on a retransmitted segment sent by Fast
   Retransmit, NewReno, etc. would break various assumptions of existing
   mechanisms utilizing the TCP Timestamps option, and would cause
   number of side effects.  Some of them are listed above, however,
   there would be unknown side effects.


   As a result, sender-side modifications are half measures.  It is
   impossible to choose a perfect TSval value.



A.3 Replacement of PAWS


A.3.1 Outline of the Idea


   The third idea would replace PAWS with a new algorithm that is more
   tolerant to reordering.  For example, suppose the following tuple is
   recorded for every CHUNK_SIZE bytes of data.


      <received time, sequence number, timestamp>


   If more than one segment contained a target byte, this tuple was
   created only for the first segment containing the target byte.


   In this tuple, the received time field keeps the time when the tuple
   was recorded.  If its value becomes older than LIFE_TIME ago, the
   tuple should be invalidated.  The sequence number field keeps the
   sequence number of the target byte.  The timestamp field keeps the
   SEG.TSval value of the segment containing the target byte.




Demizu                     Expires April 2005                  [Page 17]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   Any incoming segment should be checked as follows: if the second
   latest tuple is not older than LIFE_TIME ago, and the SEG.TSval value
   on the received segments is older than the value of the timestamp
   field in the second latest tuple, the incoming segment would be an
   old duplicate segment.  If an incoming segment was an old duplicate
   segment, it should be discarded and an ACK segment should be sent in
   reply.


   PAWS forces that incoming SEG.TSval values is monotonically
   non-decreasing.  If the SEG.TSval value on an incoming segment is
   older than TS.Recent, such a segment is discarded by PAWS.
   Contrarily, the SEG.TSval values on segments that pass the test above
   may not be monotonically non-decreasing.  Hence, the algorithm
   updating TS.Recent should take care by itself to keep TS.Recent
   monotonically non-decreasing.



A.3.2 Formal Description


   The idea outlined in appendix A.3.1 can be formalized as follows.


      - There are three parameters: NELEM, CHUNK_SIZE, and LIFE_TIME.


      - The TCP per-connection state is augmented by an array: TSRec.


         - TSRec is an array of a structure which consists of the
           following members.


            - rcv_time: when the target byte is received first time.
            - rcv_seqno: the sequence number of the target byte.
            - rcv_tsval: the SEG.TSval value of the segment.


         - The number of elements in the TSRec array is NELEM.


      - When an original SYN is received:


         /* Record it in the first tuple. */
         TSRec[0].rcv_time  = current_time;
         TSRec[0].rcv_seqno = IRS;          /* Initial Receive Seq-no */
         TSRec[0].rcv_tsval = SEG.TSval;


         /* Copy the first tuple to others as their initial values. */
         for (i = 1; i < NELEM; i++) {
            TSRec[i] = TSRec[0];
         }


      - When other segment is received:





Demizu                     Expires April 2005                  [Page 18]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



         /* Execute the test. */
         if (current_time - TSRec[NELEM-1].rcv_time <= LIFE_TIME &&
             SEG.TSval < TSRec[NELEM-1].rcv_tsval) {
            Send an ACK segment in reply.
            Discard the incoming segment.
         }


         /* Update RCV.NXT */


         /* Record it in a new tuple if necessary. */
         if (SEG.SEQ > 0 && RCV.NXT - TSRec[0].rcv_seqno > CHUNK_SIZE) {
            /* Move elements in the TSRec array backward. */
            for (i = NELEM - 1; i > 0; i++) {
               TSRec[i] = TSRec[i-1];
            }
            /* Record the received segment in a new tuple. */
            TSRec[0].rcv_time  = current_time;
            TSRec[0].rcv_seqno += CHUNK_SIZE;
            TSRec[0].rcv_tsval = SEG.TSval;
         }



A.3.3 Parameters


   The values of the parameters would be as follows.


      - The idea outlined in appendix A.3.1 assumes NELEM == 2.


      - The maximum value of CHUNK_SIZE would be 2^32 / 2 / NELEM.  If
        CHUNK_SIZE was larger than this value, the rcv_seqno field of
        TSRec[NELEM-1] would be wrapped.


        The minimum value of CHUNK_SIZE would be the current receive
        window size.  The maximum size of the current receive window
        size is determined by the TCP Window Scale option [RFC1323]:
        (2^16 - 1) * 2^14 == 2^30 - 2^14.


        Hence, the maximum possible values of NELEM, CHUNK_SIZE, and
        receive window size depend on each other.  If NELEM is 2, the
        maximum possible value of CHUNK_SIZE is 2^30, which is larger
        than the maximum size of receive window size.


      - The best value for LIFE_TIME would be 24 days, as the same as
        that in section 4.2.3 of [RFC1323].


   As a result, this memo recommends NELEM = 2, CHUNK_SIZE = 2^30, and
   LIFE_TIME = 24 days.





Demizu                     Expires April 2005                  [Page 19]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



A.4 Comparison of Ideas


   The ideas described in appendices A.1 and A.2 have problems to be
   solved, while the idea described in Appendix A.3 does not have any
   known issues.  Hence, this memo recommends the modification in
   appendix A.3.



Author's Address


   Noritoshi Demizu
   National Institute of Information and Communications Technology(NICT)
   4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan
   Phone: +81-42-327-7432 (Ex. 5813)
   E-mail: demizu@nict.go.jp



Copyright Statement


   Copyright (C) The Internet Society (2004).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property


   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology
   described in this document or the extent to which any license
   under such rights might or might not be available; nor does it
   represent that it has made any independent effort to identify any
   such rights.  Information on the procedures with respect to
   rights in RFC documents can be found in BCP 78 and BCP 79.


   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.




Demizu                     Expires April 2005                  [Page 20]
Internet-Draft <draft-demizu-tcpm-paws-reordering-01.txt>   October 2004



   The IETF invites any interested party to bring to its attention
   any copyrights, patents or patent applications, or other
   proprietary rights that may cover technology that may be required
   to implement this standard.  Please address the information to the
   IETF at ietf-ipr@ietf.org.


Acknowledgment


   Funding for the RFC Editor function is currently provided by the
   Internet Society.









































Demizu                     Expires April 2005                  [Page 21]