Internet Engineering Task Force                  Urtzi Ayesta
Internet Draft                                   France Telecom R&D
Document: draft-ayesta-to-short-tcp-00.txt       Konstantin Avrachenkov
Expires: October 2002                            INRIA
                                                 October 2002


On reducing the number of TimeOuts for short-lived TCP connections

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026 [1].  Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups.  Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


Abstract

This document shows that short TCP sessions are prone to timeouts. In
particular, a single segment loss will cause TCP to time out if the
document size is below a certain threshold. This document analyzes the
benefit of TCP modifications such as the Limited Transmit Algorithm
[RFC3042] and Increasing the Initial Window [RFC2414] in the context of
short-lived TCP transfers. However, even with these modifications TCP
remains vulnerable to losses at the very end of the transmission.
Therefore we suggest complementary modifications to the Limited
Transmit Algorithm to recover effectively from losses at the end of the
TCP transfer.
 
Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC-2119 [2].

1. Introduction

A TCP sender requires the reception of three duplicate acknowledgements
(ACKs) to recover from a segment loss without timing out. Consequently,
losses at the very end of the transmission will inevitably provoke a
timeout. This can especially degrade the performance of short-lived
TCP sessions. This document analyzes possible modifications to reduce
the timeout probability.

RFC 2988 [RFC2988] defines the standard algorithm to compute the
retransmission timeout (RTO). In particular, it recommends rounding
the RTO up to 1 second whenever the computed value is smaller, to avoid
retransmitting segments that are only delayed and not lost. Because of
this conservative RTO definition, it is important for TCP senders to
detect and recover from as many losses as possible without incurring a
timeout.
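
To illustrate why the RTO is conservative compared with the RTT, the
following Python sketch follows the computation described in RFC 2988
(smoothed RTT, RTT variance, gains of 1/8 and 1/4, a factor of 4, an
initial RTO of 3 seconds and the 1-second minimum). The class name
RtoEstimator, the method name on_rtt_sample and the clock granularity
of 10 ms are assumptions made for this example only.

```python
class RtoEstimator:
    """Sketch of the RFC 2988 retransmission timer (times in seconds)."""

    ALPHA = 1 / 8    # gain applied to the smoothed RTT (SRTT)
    BETA = 1 / 4     # gain applied to the RTT variance (RTTVAR)
    K = 4            # multiplier of RTTVAR in the RTO formula
    MIN_RTO = 1.0    # computed RTO is rounded up to 1 second

    def __init__(self, granularity=0.01):
        # granularity is an assumed clock granularity G for this example
        self.g = granularity
        self.srtt = None
        self.rttvar = None
        self.rto = 3.0  # value used before any RTT measurement is made

    def on_rtt_sample(self, r):
        """Update the estimator with one RTT measurement r, return RTO."""
        if self.srtt is None:
            # first measurement
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - r))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.g, self.K * self.rttvar)
        self.rto = max(rto, self.MIN_RTO)
        return self.rto
```

For example, with a first RTT sample of 100 ms the formula gives 0.3
seconds, which the minimum raises to 1 second, that is, ten round trip
times.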

The TCP loss recovery mechanism has undergone several modifications
over recent years. The fast retransmission algorithm, which was
developed in Tahoe TCP [Jac88], retransmits an unacknowledged segment
upon reception of three duplicate ACKs, sets the congestion window to
one segment, sets the slow-start threshold to half of the current
congestion window and begins slow start. In the fast recovery algorithm
proposed in Reno TCP [FF96], after receiving three duplicate ACKs, the
congestion window is halved and congestion avoidance replaces slow
start. TCP's selective acknowledgement (SACK) option [RFC2018] permits
the receiver to inform the sender about the data blocks that were
successfully received.

Recently two new modifications have been proposed: Increasing the
Initial Window (IW) [RFC2414] and the Limited Transmit Algorithm (LT)
[RFC3042]. According to the IW proposal, the initial size of the
congestion window is increased from one or two segments to roughly 4K
bytes (never more than four segments). This modification benefits the
individual connection in several ways [RFC2414]. In the particular case
of short-lived TCP we note that it reduces the transfer time by
several round trip times (RTT) and it makes TCP more robust against
segment losses at the very beginning of the connection. With LT, the
TCP sender sends a new data segment in response to each of the first
two duplicate ACKs. If these segments are delivered, the sender
eventually receives a third duplicate ACK, which triggers the fast
retransmission and fast recovery phases. Clearly, transmitting these
new data segments increases the probability that TCP can recover from
lost segments without timing out (see [Flo01] for simulation examples
of LT).
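
The sender behavior described above can be sketched as follows. This
is a minimal illustration in Python, not an implementation of RFC 3042:
the function name on_duplicate_ack and its parameters are hypothetical,
and the sketch omits the additional conditions of RFC 3042 (the
receiver's advertised window must allow the transmission and the amount
of outstanding data is bounded).

```python
DUP_ACK_THRESHOLD = 3  # duplicate ACKs needed to trigger fast retransmission


def on_duplicate_ack(dup_acks, unsent_segments, limited_transmit=True):
    """Process one more duplicate ACK.

    dup_acks: duplicate ACKs received before this one.
    unsent_segments: segments of new data still waiting to be sent.
    Returns (action, dup_acks, unsent_segments) where action is
    "fast_retransmit", "send_new" or "wait".
    """
    dup_acks += 1
    if dup_acks >= DUP_ACK_THRESHOLD:
        # third duplicate ACK: retransmit the missing segment
        return "fast_retransmit", dup_acks, unsent_segments
    if limited_transmit and unsent_segments > 0:
        # first or second duplicate ACK with LT: send one new segment
        return "send_new", dup_acks, unsent_segments - 1
    # without LT, or with nothing new to send, the sender can only wait
    return "wait", dup_acks, unsent_segments
```

Note that when unsent_segments is zero the sketch returns "wait" even
with LT enabled, which is the end-of-transfer situation this document
is concerned with.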

It has been reported in the literature that many timeouts occur
because fast recovery is not triggered. In [LK98] the authors analyzed
part of the traces collected by Paxson [Pax97] and found that 85% of
the timeouts were due to this reason. [BPS+97] found that almost 50% of
the losses required a timeout to recover. In addition, only 4% of
them could have been avoided with the TCP selective acknowledgement
(SACK) mechanism and 25% using LT. Unfortunately, to the best of our
knowledge some important questions remain open: Why do TCP senders
not receive enough duplicate ACKs? Is this because of the small size of
the congestion window or because of the burstiness of the segment loss
process?


So far the same TCP algorithm is used regardless of the size of the
file to be transmitted. It is known (see, e.g., [FBP+01]) that a TCP
session typically belongs to one of two kinds: "mice" or "elephants".
Most TCP sessions are small "mice", but a small number of "elephants"
(in terms of flows) is responsible for the largest share of the
transferred data in terms of bytes (approximately 80% according to
[GM01]). Based on backbone measurements, the authors of [TMW97] found
that the average flow size was 10 Kbytes. More recent measurements on
the Sprint IP backbone network [FMD+01] show that around 70% of the
flows carry fewer than 1 Kbyte and 90% of the flows carry fewer than
10 Kbytes.

Note: The values of [FMD+01] do not correspond only to TCP, but to all
transport protocols. Still, the authors report that over 90% of the
traffic is transmitted with TCP, even on links with a significant
percentage of streaming media. The authors of [TMW97] report that
TCP carries 95% of the bytes, 90% of the segments and 80% of the flows
on the link.

Therefore, it seems feasible to modify TCP and improve the performance
of short-lived TCP flows without a significant increase of the overall
network load.

The rest of the document is organized as follows. Section 2 analyzes in
detail the performance of TCP, focusing on short-lived TCP sessions.
Section 3 discusses some possible TCP modifications and presents
simulation results. Section 4 concludes.

2. What causes short-lived TCP to timeout?

When a loss occurs, the congestion window of the sender will continue
sliding forward until the lost segment reaches the leftmost position.
If the value of the congestion window is then less than four segments
(two with LT), the TCP session will time out. There is yet another
situation in which the sender will inevitably time out. Namely, if a
loss occurs when the remaining amount of data is less than three
segments, then no matter what the actual value of the congestion
window is, the sender will not receive three duplicate ACKs and will
have to rely on a timeout to detect the loss.

As a consequence, one can identify three situations where TCP sessions
are prone to timeouts. The first case corresponds to the beginning of
the session, when the congestion window is below 4 (2 with LT) segments.
The second case corresponds to the middle part of the transfer when the
congestion window is small, for example because the receiver's
advertised window is small, because the link has a small
bandwidth-delay product, or because the sender has just gone through a
loss recovery phase. The third case corresponds to the very end of the
transmission: if any of the last three segments is lost, the sender
will not receive three duplicate ACKs and will inevitably time out. IW
helps in the first case, and LT helps in the first and second cases.
However, neither of them helps at the end of the transmission.
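
The three cases can be checked with a small model. The Python sketch
below counts the duplicate ACKs a sender receives after a single loss,
under simplifying assumptions that are ours and not part of any
specification: the lost segment is at the left edge of the window, no
other segment or ACK is lost, and the receiver does not delay ACKs. The
function name recovers_without_timeout and its parameters are
hypothetical.

```python
def recovers_without_timeout(cwnd, segments_after_loss,
                             limited_transmit=False):
    """Return True if a single loss triggers fast retransmission.

    cwnd: congestion window, in segments, when the lost segment is at
          the left edge of the window.
    segments_after_loss: segments of the file that follow the lost one.
    """
    # segments already sent behind the lost one; each arrival at the
    # receiver produces one duplicate ACK
    in_flight = min(cwnd - 1, segments_after_loss)
    unsent = segments_after_loss - in_flight
    pending = in_flight  # duplicate ACKs still on their way to the sender
    dup_acks = 0
    while pending > 0 and dup_acks < 3:
        pending -= 1
        dup_acks += 1
        # LT: each of the first two duplicate ACKs releases a new segment
        if limited_transmit and dup_acks <= 2 and unsent > 0:
            unsent -= 1
            pending += 1
    return dup_acks >= 3
```

Under these assumptions a window of 3 segments without LT, a window of
1 segment with LT, and any window with fewer than three segments left
after the loss all lead to a timeout.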


Note: At the end of the transmission, the use of LT does not make any
difference, since LT only sends new data upon the reception of each of
the first two duplicate ACKs and does not retransmit a segment until
three duplicate ACKs are received. To the best of our knowledge this
case was first observed in [AA02].

The third case might not be of crucial importance for long-lived TCP
flows, but it may have a significant effect on the transfer time of
short-lived TCP.

One can define a threshold on the file size (TO-THRESH) such that, if
the file size is less than this threshold, a single loss will
inevitably lead to a timeout. TO-THRESH is given by the sum of the
number of segments that have to be transmitted to reach a congestion
window of size 4 (2 with LT) and the three segments corresponding to
the end of the file. If the receiver does not employ delayed ACKs we
get the following values in segments (the contribution of the two
intervals is shown in brackets):

   Initial      TCP          TCP
   Window                    Limited Transmit

      1         6 (3+3)      4 (1+3)
      2         5 (2+3)      3 (0+3)
      3         4 (1+3)      3 (0+3)
      4         3 (0+3)      3 (0+3)
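
The values for a receiver without delayed ACKs can be reproduced with
the short Python sketch below. It assumes slow start in which every
ACK acknowledges one segment and increases the congestion window by
one segment; the function name to_thresh is hypothetical. The sketch
does not model delayed ACKs and therefore does not reproduce the values
given for that case.

```python
def to_thresh(initial_window, limited_transmit=False):
    """TO-THRESH in segments when the receiver ACKs every segment."""
    # window needed to obtain three duplicate ACKs after one loss
    target = 2 if limited_transmit else 4
    cwnd = initial_window
    transmitted = 0
    while cwnd < target:
        transmitted += 1  # one more segment is sent and acknowledged
        cwnd += 1         # slow start: one extra segment per ACK
    # add the three segments at the end of the file
    return transmitted + 3


for iw in (1, 2, 3, 4):
    print(iw, to_thresh(iw), to_thresh(iw, limited_transmit=True))
```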

In the case of TCP employing delayed ACK we get the following values:

   Delayed ACK employed

   Initial      TCP          TCP
   Window                    Limited Transmit

      1         7 (4+3)      5 (2+3)
      2         6 (3+3)      4 (1+3)
      3         4 (1+3)      3 (0+3)
      4         3 (0+3)      3 (0+3)

The values presented in the tables above, along with the statistics
reported on the size of TCP flows [GM01,TMW97,FMD+01], suggest that a
major portion of TCP flows have a size of the order of TO-THRESH.
Clearly, this implies that TCP's loss recovery mechanism does not work
well for "mice" type TCP flows. Balakrishnan et al. [BPS+97] concluded
from measurements on an Internet server that TCP's loss recovery
performance is poor for short Web transfers. We presume that the
end-of-file effect might have had an impact on these measurements and
was not identified by the authors.

In [AA02] we look at the expected TCP transfer time conditioned on the
number of losses. Via simulations and a theoretical model we observed a
very interesting phenomenon: the non-monotonicity of the expected
conditional transfer time. That is, given that one or more segments are
lost, it may on average take less time to transmit a larger file. For
instance, in the case of one loss, the plot of transfer time versus
file size shows a unique peak at TO-THRESH. The transfer time first
increases while the file size is smaller than TO-THRESH. Then the
transfer time starts to decrease, and only beyond some file size does
it start to increase again. This behavior is due to the conservative
duration of the retransmission timer, which is typically several times
greater than the average round trip time (RTT).


3. TCP modifications to improve the performance of short-lived TCP
transfers

From the previous section we know that short-lived TCP flows are
particularly vulnerable to segment losses, since in most situations
they have to rely on an RTO to recover from them. The use of the LT
algorithm reduces the value of TO-THRESH and hence alleviates the
outlined problem. However, if there is no new data to send (at the end
of the file) LT does not help. Thus, at the end of the TCP transfer it
might be useful to retransmit early.

Paxson [Pax97] affirms that the TCP fast retransmission threshold
could be safely lowered from 3 duplicate ACKs to 2 by introducing a
20 ms waiting time before retransmitting. This strategy could also be
adopted when one duplicate ACK is received and no further data is
queued to send: the sender waits for some time before deciding to
retransmit the segment.
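
One possible reading of this strategy, combined with LT, is sketched
below in Python. The function name action_on_duplicate_ack, the action
labels and the default waiting time are assumptions made for this
illustration; in particular, the 20 ms default is borrowed from the
value Paxson suggests for a threshold of 2 and is not a value
recommended here for the single duplicate ACK case.

```python
def action_on_duplicate_ack(dup_acks, unsent_segments, wait_s=0.02):
    """Decide what the sender does when a duplicate ACK arrives.

    dup_acks: number of duplicate ACKs received so far, including this
              one (1 or more).
    unsent_segments: segments of new data still queued to send.
    wait_s: assumed waiting time before an early retransmission.
    Returns (action, delay_in_seconds).
    """
    if dup_acks >= 3:
        # standard fast retransmission
        return ("fast_retransmit", 0.0)
    if unsent_segments > 0:
        # Limited Transmit: new data on the first two duplicate ACKs
        return ("send_new_segment", 0.0)
    # End of the transfer: no new data can elicit further duplicate
    # ACKs, so retransmit after wait_s unless an ACK for new data
    # arrives first (the segment may only have been reordered)
    return ("retransmit_unless_acked", wait_s)
```

The waiting time trades a small delay for protection against spurious
retransmissions of segments that were reordered rather than lost.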

With early retransmission only the loss of the last segment will force
the sender to time out. To overcome this, TCP could send an extra
segment at the end of the session (containing no data, of course). This
segment would not be sent reliably and its only goal would be to avoid
a timeout when the last segment is dropped.

On the other hand, this modification may degrade the performance of
the network, because the early retransmission of segments that are only
reordered and not lost may lead to an increase of the loss rate.
Several authors have studied the phenomenon of segment reordering.
Paxson [Pax97] transmitted 100 Kbytes between 35 computers and measured
that 0.1%-2% of all segments (data and ACK) experienced reordering and
12%-35% of the flows (depending on the data set) experienced at least
one reordered segment. Bennett et al. [BPS99] sent ICMP probes to the
MAE-East Internet exchange point and found that the probability of a
session experiencing reordering was over 90%. They conjecture that
reordering is a function of network load and consider it a result of
the use of
parallelism in network devices. In [BS02] the authors develop three
techniques to measure one-way segment reordering and perform a 20-day
test. They establish that over 40% of the paths
tested experience some reordering during the t

Note: ACKs that acknowledge new data are the only ones that make the
sender increase the congestion window. If we consider a TCP receiver
that implements a delayed ACK algorithm with more than 50 ms idle time,
a reordered segment with a segment lag of 1 and a time lag of less than
50 ms would not affect the number or the rate of ACKs acknowledging new
sequence numbers sent by the receiver to the sender.

Clearly, there is variability in the reported values of reordering,
and it is not possible to conclude whether this variability comes from
differences in the procedures used to collect and analyze the data or
from changes in the network (for example, a different degree of
parallelism in the switches). [JID+02] is in our opinion the most
comprehensive study carried out until now, but its results have to be
confirmed by other studies before concluding that reordering is not
significant in today's Internet. To explain the differences among the
cited papers, it is worth noting that [Pax97] focuses on long-lived TCP
flows while in [JID+02] the authors deal with the usual mix of
"elephant" and "mice" type flows. The study of Bennett et al. [BPS99]
is based on measurements taken at a particular switch that is known to
induce a high level of reordering, while [JID+02] is based on flows
from a large and diverse range of sources and destinations.

We have implemented this modification (early retransmission at the end
of the transfer on top of LT) in the ns simulator [NS] to evaluate the
reduction of the transfer time. We have not investigated the impact of
segment reordering because of the lack of appropriate models. We
compute the conditional expected transfer time given that the flow
experiences at least one loss. We compare the values of the conditional
expected transfer time for TCP without LT, TCP with LT, and TCP with LT
and end-of-file early retransmission. In the particular case of
RTT=100ms, we found that LT decreases the conditional expected transfer
time of a 6-segment file by 10% and our proposal by 45%. The reduction
in conditional expected transfer time becomes smaller as the file size
increases. These results indicate that our modification benefits
short-lived TCP transfers.

One may expect that the increase of the network load because of
spurious retransmissions is proportional to the number of spurious
retransmissions induced by the modification. If the measurements of the
flow size distribution [FBP+01,GM01,TMW97,FMD+01] and the segment
reordering rates [Pax97,BPS99,JID+02] are representative of the real
Internet, one expects that our modification will not lead to a
significant increase of the load and loss rates, particularly since
some of the reorderings are invisible to the sender due to their
small time lag [JID+02].

4. Conclusion

This document analyzes the impact of timeouts on the performance of
short-lived TCP flows. The document proposes a modification of TCP on
top of the LT algorithm to avoid timeouts and hence to reduce the
transfer time.

Security Considerations

This document proposes a modification of TCP on top of the Limited
Transmit Algorithm. Security considerations concerning the Limited
Transmit Algorithm are discussed in RFC 3042 and they also apply to
this algorithm. In addition, when duplicate ACKs are received and there
is no more data to send, this document proposes that TCP retransmit
early to avoid timeouts. This modification does not raise any known
security issue.


References



[AA02]  Urtzi Ayesta, Konstantin Avrachenkov, "The Effect of the
Initial Window Size and Limited Transmit Algorithm on the Transient
Behavior of TCP Transfers", In Proc. of the 15th ITC Internet
Specialist Seminar, Wurzburg, July 2002.

[BPS+97] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan
Seshan, Mark Stemm and Randy Katz, "TCP Behavior of a Busy Web Server:
Analysis and Improvements", Proc. IEEE INFOCOM, San Francisco, CA,
March 1998.

[BPS99] J.C.R. Bennett, C. Partridge and N. Shectman, "Packet
Reordering is Not Pathological Network Behavior", IEEE/ACM Transactions
on Networking, Vol. 7, No. 6, December 1999.

[BS02]  John Bellardo, Stefan Savage, "Measuring Packet Reordering",
ACM SIGCOMM Internet Measurement Workshop 2002, Marseille, France,
November 2002.

[CSA00] Neal Cardwell, Stefan Savage, Thomas Anderson, "Modeling TCP
latency", in Proc. IEEE INFOCOM 2000, Tel-Aviv, Israel, March 2000.

[CMT98] K. Claffy, Greg Miller, and Kevin Thompson. "The nature of the
beast. Recent traffic measurements from an Internet backbone". In
Proceedings of INET '98, July 1998.

[FF96]   Kevin Fall, Sally Floyd. "Simulation-based Comparisons of
Tahoe, Reno and SACK TCP," Computer Communication Review, July 1996.

[Flo01] Floyd, S. "A Report on Some Recent Developments in TCP
Congestion Control", IEEE Communications Magazine, April 2001.

[FMD+01] C. Fraleigh, S. Moon, C. Diot, B. Lyles, F. Tobagi,
"Packet-Level Traffic Measurements from a Tier-1 IP Backbone", Sprint
ATL Technical Report TR01-ATL-110101, November 2001.

[FBP+01] S. Ben Fredj, T. Bonald, A. Proutiere, G. Regnie, J. Roberts,
"Statistical Bandwidth Sharing: A Study of Congestion at Flow Level",
SIGCOMM 2001.

[GM01]  Liang Guo, Ibrahim Matta, "The War Between Mice and Elephants",
Proc. 9th IEEE International Conference on Network Protocols (ICNP'01),
Riverside, CA, November 2001.

[Jac88] Jacobson, V., "Congestion Avoidance and Control," SIGCOMM 1988,
Stanford, CA., August 1988.

[JID+02] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose, D. Towsley,
"Measurement and Classification of Out-of-Sequence Packets in a Tier-1
IP Backbone", ACM SIGCOMM Internet Measurement Workshop 2002,
Marseille, France, November 2002. Extended version available as: UMass
CMPSCI Technical Report TR 02-17.

[NS]    Ns network simulator. URL: http://www.isi.edu/nsnam/.

[LK98]  Lin, D., and Kung, H.T., "TCP Fast Recovery Strategies:
Analysis and Improvements", In Proc. of INFOCOM 98, San Francisco, CA,
March 1998.

[Pax97] Vern Paxson, "End-to-End Internet Packet Dynamics", ACM
SIGCOMM, Cannes, France, September 1997.

[RFC1122] Braden, R., "Requirements for Internet Hosts -- Communication
Layers", STD 3, RFC 1122, October 1989.

[RFC2018] Mathis M., Mahdavi J., Floyd S., Romanow A., "TCP Selective
Acknowledgement Options", RFC 2018, October 1996.

[RFC2414] M. Allman, S. Floyd, C. Partridge, "Increasing TCP's Initial
Window", RFC 2414, September 1998. A small modification of RFC 2414 has
been approved by the IESG to go to Proposed Standard on August 28,
2002.

[RFC2581] M. Allman, V. Paxson, W. Stevens, "TCP Congestion Control",
RFC 2581, April 1999.

[RFC2988] Vern Paxson, Mark Allman, "Computing TCP's Retransmission
Timer", RFC 2988, November 2000.

[RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd, "Enhancing
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January 2001.

[TMW97] Kevin Thompson, Gregory J. Miller, and Rick Wilder. "Wide-area
Internet traffic patterns and characteristics". IEEE Network, 11(6),
November 1997.

Acknowledgments




Author's Addresses

Urtzi Ayesta
France Telecom R&D
905 rue Albert Einstein
06921 Sophia Antipolis
France
Email: Urtzi.Ayesta@francetelecom.com
 
Konstantin Avrachenkov
INRIA
2004 route des Lucioles, B.P.93
06902, Sophia Antipolis 
France
Phone: 00 33 492 38 7751
Email: k.avrachenkov@inria.fr
1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9,
RFC 2026, October 1996.

2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997
