Network Working Group                                        V. Jacobson
Request for Comments: 1185                                           LBL
                                                               R. Braden
                                                                     ISI
                                                                L. Zhang
                                                                    PARC
                                                            October 1990


                   TCP Extension for High-Speed Paths

Status of This Memo

   This memo describes an Experimental Protocol extension to TCP for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "IAB
   Official Protocol Standards" for the standardization state and status
   of this protocol.  Distribution of this memo is unlimited.

Summary

   This memo describes a small extension to TCP to support reliable
   operation over very high-speed paths, using sender timestamps
   carried in the TCP Echo option proposed in RFC-1072.

1. INTRODUCTION

   TCP uses positive acknowledgments and retransmissions to provide
   reliable end-to-end delivery over a full-duplex virtual circuit
   called a connection [Postel81].  A connection is defined by its two
   end points; each end point is a "socket", i.e., a (host,port) pair.
   To protect against data corruption, TCP uses an end-to-end checksum.
   Duplication and reordering are handled using a fine-grained sequence
   number space, with each octet receiving a distinct sequence number.

   The TCP protocol [Postel81] was designed to operate reliably over
   almost any transmission medium regardless of transmission rate,
   delay, corruption, duplication, or reordering of segments.  In
   practice, proper TCP implementations have demonstrated remarkable
   robustness in adapting to a wide range of network characteristics.
   For example, TCP implementations currently adapt to transfer rates in
   the range of 100 bps to 10**7 bps and round-trip delays in the range
   1 ms to 100 seconds.

   However, the introduction of fiber optics is resulting in ever-higher
   transmission speeds, and the fastest paths are moving out of the
   domain for which TCP was originally engineered.  This memo and RFC-
   1072 [Jacobson88] propose modest extensions to TCP to extend the



Jacobson, Braden & Zhang                                        [Page 1]

RFC 1185               TCP over High-Speed Paths            October 1990


   domain of its application to higher speeds.

   There is no one-line answer to the question: "How fast can TCP go?".
   The issues are reliability and performance, and these depend upon the
   round-trip delay and the maximum time that segments may be queued in
   the Internet, as well as upon the transmission speed.  We must think
   through these relationships very carefully if we are to successfully
   extend TCP's domain.

   TCP performance depends not upon the transfer rate itself, but rather
   upon the product of the transfer rate and the round-trip delay.  This
   "bandwidth*delay product" measures the amount of data that would
   "fill the pipe"; it is the buffer space required at sender and
   receiver to obtain maximum throughput on the TCP connection over the
   path.  RFC-1072 proposed a set of TCP extensions to improve TCP
   efficiency for "LFNs" (long fat networks), i.e., networks with large
   bandwidth*delay products.
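
   As a rough illustration, the bandwidth*delay product can be computed
   as in the following Python sketch.  The function name and the sample
   path (45 Mbps with a 60 ms round trip) are assumptions chosen for
   illustration only.

```python
def bandwidth_delay_product(bits_per_sec: float, rtt_secs: float) -> float:
    """Bytes needed in flight to keep the path full (hypothetical helper)."""
    return bits_per_sec / 8 * rtt_secs


# Assumed sample path: 45 Mbps transfer rate, 60 ms round-trip delay.
bdp = bandwidth_delay_product(45e6, 0.060)
print(f"buffer space to fill the pipe: {bdp:.0f} bytes")
```

   For this assumed path the result is several times larger than the
   largest window that a 16-bit window field can express.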

   On the other hand, high transfer rate can threaten TCP reliability by
   violating the assumptions behind the TCP mechanism for duplicate
   detection and sequencing.  The present memo specifies a solution for
   this problem, extending TCP reliability to transfer rates well beyond
   the foreseeable upper limit of bandwidth.

   An especially serious kind of error may result from an accidental
   reuse of TCP sequence numbers in data segments.  Suppose that an "old
   duplicate segment", e.g., a duplicate data segment that was delayed
   in Internet queues, was delivered to the receiver at the wrong moment
   so that its sequence numbers fell somewhere within the current
   window.  There would be no checksum failure to warn of the error, and
   the result could be an undetected corruption of the data.  Reception
   of an old duplicate ACK segment at the transmitter could be only
   slightly less serious: it is likely to lock up the connection so that
   no further progress can be made and a RST is required to
   resynchronize the two ends.

   Duplication of sequence numbers might happen in either of two ways:

   (1)  Sequence number wrap-around on the current connection

        A TCP sequence number contains 32 bits.  At a high enough
        transfer rate, the 32-bit sequence space may be "wrapped"
        (cycled) within the time that a segment may be delayed in
        queues.  Section 2 discusses this case and proposes a mechanism
        to reject old duplicates on the current connection.

   (2)  Segment from an earlier connection incarnation




        Suppose a connection terminates, either by a proper close
        sequence or due to a host crash, and the same connection (i.e.,
        using the same pair of sockets) is immediately reopened.  A
        delayed segment from the terminated connection could fall within
        the current window for the new incarnation and be accepted as
        valid.  This case is discussed in Section 3.

   TCP reliability depends upon the existence of a bound on the lifetime
   of a segment: the "Maximum Segment Lifetime" or MSL.  An MSL is
   generally required by any reliable transport protocol, since every
   sequence number field must be finite, and therefore any sequence
   number may eventually be reused.  In the Internet protocol suite, the
   MSL bound is enforced by an IP-layer mechanism, the "Time-to-Live" or
   TTL field.

   Watson's Delta-T protocol [Watson81] includes network-layer
   mechanisms for precise enforcement of an MSL.  In contrast, the IP
   mechanism for MSL enforcement is loosely defined and even more
   loosely implemented in the Internet.  Therefore, it is unwise to
   depend upon active enforcement of MSL for TCP connections, and it is
   unrealistic to imagine setting MSL's smaller than the current values
   (e.g., 120 seconds specified for TCP).  The timestamp algorithm
   described in the following section gives a way out of this dilemma
   for high-speed networks.


2.  SEQUENCE NUMBER WRAP-AROUND

   2.1  Background

      Avoiding reuse of sequence numbers within the same connection is
      simple in principle: enforce a segment lifetime shorter than the
      time it takes to cycle the sequence space, whose size is
      effectively 2**31.

      More specifically, if the maximum effective bandwidth at which TCP
      is able to transmit over a particular path is B bytes per second,
      then the following constraint must be satisfied for error-free
      operation:

          2**31 / B  > MSL (secs)                                    [1]

      The following table shows the value for Twrap = 2**31/B in
      seconds, for some important values of the bandwidth B:







           Network       B*8          B         Twrap
                      bits/sec   bytes/sec      secs
           _______    _______      ______       ______

           ARPANET       56kbps       7KBps    3*10**5 (~3.6 days)

           DS1          1.5Mbps     190KBps    10**4 (~3 hours)

           Ethernet      10Mbps    1.25MBps    1700 (~30 mins)

           DS3           45Mbps     5.6MBps    380

           FDDI         100Mbps    12.5MBps    170

           Gigabit        1Gbps     125MBps    17
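
      The Twrap column can be reproduced with a few lines of Python.
      This is only a sketch: the bandwidths are the rounded B values
      from the table, and the helper name is an assumption.

```python
SEQ_SPACE = 2**31  # effective size of the sequence space, in bytes


def twrap_secs(bytes_per_sec: float) -> float:
    """Seconds needed to cycle the effective sequence space at rate B."""
    return SEQ_SPACE / bytes_per_sec


# B in bytes/sec, as listed in the table.
networks = {
    "ARPANET": 7e3,
    "DS1": 190e3,
    "Ethernet": 1.25e6,
    "DS3": 5.6e6,
    "FDDI": 12.5e6,
    "Gigabit": 125e6,
}

for name, b in networks.items():
    print(f"{name:10s} {twrap_secs(b):10.0f} secs")
```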


      It is clear why wrap-around of the sequence space was not a
      problem for 56kbps packet switching or even 10Mbps Ethernets.  On
      the other hand, at DS3 and FDDI speeds, Twrap is comparable to the
      2 minute MSL assumed by the TCP specification [Postel81].  Moving
      towards gigabit speeds, Twrap becomes too small for reliable
      enforcement by the Internet TTL mechanism.

      The 16-bit window field of TCP limits the effective bandwidth B to
      2**16/RTT, where RTT is the round-trip time in seconds
      [McKenzie89].  If the RTT is large enough, this limits B to a
      value that meets the constraint [1] for a large MSL value.  For
      example, consider a transcontinental backbone with an RTT of 60ms
      (set by the laws of physics).  With the bandwidth*delay product
      limited to 64KB by the TCP window size, B is then limited to
      1.1MBps, no matter how high the theoretical transfer rate of the
      path.  This corresponds to cycling the sequence number space in
      Twrap= 2000 secs, which is safe in today's Internet.
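
      The arithmetic of this example can be checked with a short Python
      sketch.  The names are illustrative assumptions; the 60 ms RTT
      and the 2**16 byte window are the figures used above.

```python
WINDOW_MAX = 2**16   # bytes, limit set by the 16-bit TCP window field
SEQ_SPACE = 2**31    # effective sequence space, in bytes


def window_limited_rate(rtt_secs: float) -> float:
    """Upper bound on B (bytes/sec) imposed by the 16-bit window."""
    return WINDOW_MAX / rtt_secs


rtt = 0.060                      # transcontinental RTT of 60 ms
b = window_limited_rate(rtt)     # roughly 1.1 MBps
twrap = SEQ_SPACE / b            # roughly 2000 secs
print(f"B = {b / 1e6:.2f} MBps, Twrap = {twrap:.0f} secs")
```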

      Based on this reasoning, an earlier RFC [McKenzie89] has cautioned
      that expanding the TCP window space as proposed in RFC-1072 will
      lead to sequence wrap-around and hence to possible data
      corruption.  We believe that this is mis-identifying the culprit,
      which is not the larger window but rather the high bandwidth.

           For example, consider a (very large) FDDI LAN with a diameter
           of 10km.  Using the speed of light, we can compute the RTT
           across the ring as (2*10**4)/(3*10**8) = 67 microseconds, and
           the delay*bandwidth product is then 833 bytes.  A TCP
           connection across this LAN using a window of only 833 bytes
           will run at the full 100Mbps and can wrap the sequence space
           in about 3 minutes, very close to the MSL of TCP. Thus, high
           speed alone can cause a reliability problem with sequence
           number wrap-around, even without extended windows.
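
      The figures in the FDDI example can be reproduced as follows.
      This Python sketch uses the same round value for the speed of
      light as the text; the variable names are assumptions.

```python
SPEED_OF_LIGHT = 3e8   # m/s, the round figure used in the example
diameter_m = 10e3      # ring diameter of 10 km
rate_bps = 100e6       # FDDI transfer rate

rtt = 2 * diameter_m / SPEED_OF_LIGHT    # about 67 microseconds
bdp_bytes = rate_bps / 8 * rtt           # about 833 bytes
twrap = 2**31 / (rate_bps / 8)           # about 172 secs, i.e. ~3 minutes

print(f"RTT = {rtt * 1e6:.0f} us, window = {bdp_bytes:.0f} bytes, "
      f"Twrap = {twrap:.0f} secs")
```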

      An "obvious" fix for the problem of cycling the sequence space is
      to increase the size of the TCP sequence number field.  For
      example, the sequence number field (and also the acknowledgment
      field) could be expanded to 64 bits.  However, the proposals for
      making such a change while maintaining compatibility with current
      TCP have tended towards complexity and ugliness.

      This memo proposes a simple solution to the problem, using the TCP
      echo options defined in RFC-1072.  Section 2.2 which follows
      describes the original use of these options to carry timestamps in
      order to measure RTT accurately.  Section 2.3 proposes a method of
      using these same timestamps to reject old duplicate segments that
      could corrupt an open TCP connection.  Section 3 discusses the
      application of this mechanism to avoiding old duplicates from
      previous incarnations.

   2.2  TCP Timestamps

      RFC-1072 defined two TCP options, Echo and Echo Reply.  Echo
      carries a 32-bit number, and the receiver of the option must
      return this same value to the source host in an Echo Reply option.

      RFC-1072 furthermore describes the use of these options to contain
      32-bit timestamps, for measuring the RTT.  A TCP sending data
      would include Echo options containing the current clock value.
      The receiver would echo these timestamps in returning segments
      (generally, ACK segments).  The difference between a timestamp
      from an Echo Reply option and the current time would then measure
      the RTT at the sender.

      This mechanism was designed to solve the following problem: almost
      all TCP implementations base their RTT measurements on a sample of
      only one packet per window.  If we look at RTT estimation as a
      signal processing problem (which it is), a data signal at some
      frequency (the packet rate) is being sampled at a lower frequency
      (the window rate).  Unfortunately, this lower sampling frequency
      violates Nyquist's criterion and may introduce "aliasing" artifacts
      into the estimated RTT [Hamming77].

      A good RTT estimator with a conservative retransmission timeout
      calculation can tolerate the aliasing when the sampling frequency
      is "close" to the data frequency.   For example, with a window of
      8 packets, the sample rate is 1/8 the data frequency -- less than
      an order of magnitude different.  However, when the window is tens
      or hundreds of packets, the RTT estimator may be seriously in
      error, resulting in spurious retransmissions.

      A solution to the aliasing problem that actually simplifies the
      sender substantially (since the RTT code is typically the single
      biggest protocol cost for TCP) is as follows: the sender will
      place a timestamp in each segment and the receiver will reflect
      these timestamps back in ACK segments.  Then a single subtract
      gives the sender an accurate RTT measurement for every ACK segment
      (which will correspond to every other data segment, with a
      sensible receiver).  RFC-1072 defined a timestamp echo option for
      this purpose.
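
      The "single subtract" might look like the following Python
      sketch.  The function name is hypothetical, and doing the
      subtraction modulo 2**32 (so that a wrap of the 32-bit clock is
      harmless) is an assumption of this sketch rather than something
      this memo spells out.

```python
MASK32 = 0xFFFFFFFF


def rtt_ticks(now: int, echoed: int) -> int:
    """RTT in clock ticks: current clock value minus the echoed timestamp.

    The difference is reduced modulo 2**32 so that it stays correct if
    the sender's clock wraps between sending the segment and receiving
    the echo.
    """
    return (now - echoed) & MASK32


# A segment is stamped at tick 1000; the ACK echoing it arrives at 1042.
print(rtt_ticks(1042, 1000))
# The same interval measured across a wrap of the 32-bit clock.
print(rtt_ticks(21, 0xFFFFFFEB))
```

      Both calls measure an interval of 42 ticks.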

      It is vitally important to use the timestamp echo option with big
      windows; otherwise, the door is opened to some dangerous
      instabilities due to aliasing.  Furthermore, the option is
      probably useful for all TCP's, since it simplifies the sender.

   2.3  Avoiding Old Duplicate Segments

      Timestamps carried from sender to receiver in TCP Echo options can
      also be used to prevent data corruption caused by sequence number
      wrap-around, as this section describes.

      2.3.1  Basic Algorithm

         Assume that every received TCP segment contains a timestamp.
         The basic idea is that a segment received with a timestamp that
         is earlier than the timestamp of the most recently accepted
         segment can be discarded as an old duplicate.  More
         specifically, the following processing is to be performed on
         normal incoming segments:

         R1)  If the timestamp in the arriving segment is less than the
              timestamp of the most recently received in-sequence
              segment, treat the arriving segment as not acceptable:

                   If SEG.LEN > 0, send an acknowledgement in reply as
                   specified in RFC-793 page 69, and drop the segment;
                   otherwise, just silently drop the segment.*

_________________________
*Sending an ACK segment in reply is not strictly necessary, since the
case can only arise when a later in-order segment has already been
received.  However, for consistency and simplicity, we suggest
treating a timestamp failure the same way TCP treats any other
unacceptable segment.




         R2)  If the segment is outside the window, reject it (normal
              TCP processing)

         R3)  If an arriving segment is in-sequence (i.e., at the left
              window edge), accept it normally and record its timestamp.

         R4)  Otherwise, treat the segment as a normal in-window, out-
              of-sequence TCP segment (e.g., queue it for later delivery
              to the user).


         Steps R2-R4 are the normal TCP processing steps specified by
         RFC-793, except that in R3 the latest timestamp is set from
         each in-sequence segment that is accepted.  Thus, the latest
         timestamp recorded at the receiver corresponds to the left edge
         of the window and only advances when the left edge moves
         [Jacobson88].
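
         Rules R1-R4 can be sketched in Python as follows.  This is a
         simplified model, not an implementation: the class and field
         names are hypothetical, sequence number wrap-around is ignored
         in the window test, the initial timestamp of 0 is an
         assumption, and the timestamp comparison uses signed 32-bit
         differences.

```python
from dataclasses import dataclass, field


def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is earlier than b (signed difference)."""
    return ((a - b) & 0xFFFFFFFF) > 0x7FFFFFFF


@dataclass
class Receiver:
    rcv_nxt: int = 0        # left window edge
    rcv_wnd: int = 65535    # window size in bytes
    ts_recent: int = 0      # timestamp of the last in-sequence segment
    queued: dict = field(default_factory=dict)  # seq -> segment length

    def receive(self, seq: int, length: int, ts: int) -> str:
        # R1: timestamp older than the latest in-sequence timestamp.
        if ts_before(ts, self.ts_recent):
            return "drop+ack" if length > 0 else "drop"
        # R2: outside the window (sequence wrap ignored in this sketch).
        if not (self.rcv_nxt <= seq < self.rcv_nxt + self.rcv_wnd):
            return "reject"
        # R3: in sequence; accept it and record its timestamp.
        if seq == self.rcv_nxt:
            self.rcv_nxt += length
            self.ts_recent = ts
            # Queued segments that are now contiguous are delivered;
            # their timestamps were checked on arrival, not again here.
            while self.rcv_nxt in self.queued:
                self.rcv_nxt += self.queued.pop(self.rcv_nxt)
            return "accept"
        # R4: in window but out of sequence; queue for later delivery.
        self.queued[seq] = length
        return "queue"


rx = Receiver()
print(rx.receive(0, 100, 1))     # in sequence
print(rx.receive(200, 100, 1))   # a hole precedes it, so it is queued
print(rx.receive(100, 100, 2))   # fills the hole
print(rx.rcv_nxt, rx.ts_recent)
```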

         It is important to note that the timestamp is checked only when
         a segment first arrives at the receiver, regardless of whether
         it is in-sequence or is queued.  Consider the following
         example.

              Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has
              been sent, where the letter indicates the sequence number
              and the digit represents the timestamp.  Suppose also that
              segment B.1 has been lost.  The highest in-sequence
              timestamp is 1 (from A.1), so C.1, ..., Z.1 are considered
              acceptable and are queued.  When B is retransmitted as
              segment B.2 (using the latest timestamp), it fills the
              hole and causes all the segments through Z to be
              acknowledged and passed to the user.  The timestamps of
              the queued segments are *not* inspected again at this
              time, since they have already been accepted.  When B.2 is
              accepted, the receiver's current timestamp is set to 2.

         This rule is vital to allow reasonable performance under loss.
         A full window of data is in transit at all times, and after a
         loss a full window less one packet will show up out-of-sequence
         to be queued at the receiver (e.g., up to ~2**30 bytes of
         data); the timestamp option must not result in discarding this
         data.

         In certain unlikely circumstances, the algorithm of rules R1-R4
         could lead to discarding some segments unnecessarily, as shown
         in the following example:

              Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have
              been sent in sequence and that segment B.1 has been lost.
              Furthermore, suppose delivery of some of C.1, ... Z.1 is
              delayed until AFTER the retransmission B.2 arrives at the
              receiver.  These delayed segments will be discarded
              unnecessarily when they do arrive, since their timestamps
              are now out of date.

         This case is very unlikely to occur.  If the retransmission was
         triggered by a timeout, some of the segments C.1, ... Z.1 must
         have been delayed longer than the RTO time.  This is presumably
         an unlikely event, or there would be many spurious timeouts and
         retransmissions.  If B's retransmission was triggered by the
         "fast retransmit" algorithm, i.e., by duplicate ACK's, then the
         queued segments that caused these ACK's must have been received
         already.

         Even if a segment was delayed past the RTO, the selective
         acknowledgment (SACK) facility of RFC-1072 will cause the
         delayed packets to be retransmitted at the same time as B.2,
         avoiding an extra RTT and therefore causing a very small
         performance penalty.

         We know of no case with a significant probability of occurrence
         in which timestamps will cause performance degradation by
         unnecessarily discarding segments.

      2.3.2  Header Prediction

         "Header prediction" [Jacobson90] is a high-performance
         transport protocol implementation technique that is most
         important for high-speed links.  This technique optimizes the
         code for the most common case: receiving a segment correctly
         and in order.  Using header prediction, the receiver asks the
         question, "Is this segment the next in sequence?"  This
         question can be answered in fewer machine instructions than the
         question, "Is this segment within the window?"

         Adding header prediction to our timestamp procedure leads to
         the following sequence for processing an arriving TCP segment:

         H1)  Check timestamp (same as step R1 above)

         H2)  Do header prediction: if segment is next in sequence and
              if there are no special conditions requiring additional
              processing, accept the segment, record its timestamp, and
              skip H3.

         H3)  Process the segment normally, as specified in RFC-793.
              This includes dropping segments that are outside the
              window and possibly sending acknowledgments, and queueing
              in-window, out-of-sequence segments.

         However, the timestamp check in step H1 is very unlikely to
         fail, and it is a relatively expensive operation since it
         requires interval arithmetic on a finite field.  To perform
         this check on every single segment seems like poor
         implementation engineering, defeating the purpose of header
         prediction.  Therefore, we suggest that an implementor
         interchange H1 and H2, i.e., perform header prediction FIRST,
         performing H1 and H3 only if header prediction fails.  We
         believe that this change might gain 5-10% in performance on
         high-speed networks.
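
         The suggested ordering, with header prediction performed
         before the timestamp check, can be sketched in Python as
         follows.  All names are hypothetical, the "special conditions"
         are reduced to a single flag, and the normal RFC-793
         processing of step H3 is not modelled.

```python
def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is earlier than b (signed difference)."""
    return ((a - b) & 0xFFFFFFFF) > 0x7FFFFFFF


def process(state: dict, seq: int, length: int, ts: int,
            special: bool = False) -> str:
    # H2 first: next in sequence and nothing special to handle.
    if seq == state["rcv_nxt"] and not special:
        state["rcv_nxt"] += length
        state["ts_recent"] = ts
        return "fast path"
    # H1: timestamp check, reached only when header prediction fails.
    if ts_before(ts, state["ts_recent"]):
        return "drop"
    # H3: normal processing as in RFC-793 (not modelled in this sketch).
    return "slow path"


state = {"rcv_nxt": 100, "ts_recent": 5}
print(process(state, 100, 50, 6))   # next in sequence
print(process(state, 300, 50, 3))   # out of sequence with an old timestamp
print(process(state, 300, 50, 6))   # out of sequence with a current timestamp
```

         In this ordering a segment that matches the left window edge
         is accepted without its timestamp being examined.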

         This reordering does raise a theoretical hazard: a segment from
         2**32 bytes in the past may arrive at exactly the wrong time
         and be accepted mistakenly by the header-prediction step.  We
         make the following argument to show that the probability of
         this failure is negligible.

              If all segments are equally likely to show up as old
              duplicates, then the probability of an old duplicate
              exactly matching the left window edge is the maximum
              segment size (MSS) divided by the size of the sequence
              space.  This ratio must be less than 2**-16, since MSS
              must be < 2**16; for example, it will be (2**12)/(2**32) =
              2**-20 for an FDDI link.  However, the older a segment is,
              the less likely it is to be retained in the Internet, and
              under any reasonable model of segment lifetime the
              probability of an old duplicate exactly at the left window
              edge must be much smaller than 2**-16.

              The 16 bit TCP checksum also allows a basic unreliability
              of one part in 2**16.  A protocol mechanism whose
              reliability exceeds the reliability of the TCP checksum
              should be considered "good enough", i.e., it won't
              contribute significantly to the overall error rate.  We
              therefore believe we can ignore the problem of an old
              duplicate being accepted by doing header prediction before
              checking the timestamp.
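
         The ratios in this argument can be checked with exact
         arithmetic.  The following Python sketch uses hypothetical
         names and takes the sequence space as 2**32, as in the
         argument above.

```python
from fractions import Fraction

SEQ_SPACE = 2**32


def p_match(mss: int) -> Fraction:
    """Bound on the chance that an old duplicate lands on the window edge."""
    return Fraction(mss, SEQ_SPACE)


# Basic unreliability allowed by the 16-bit TCP checksum.
checksum_miss = Fraction(1, 2**16)

# MSS of 2**12 bytes, the FDDI figure used above: exactly 2**-20.
print(p_match(2**12) == Fraction(1, 2**20))
# Any MSS below 2**16 stays under the checksum's one part in 2**16.
print(p_match(2**16 - 1) < checksum_miss)
```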

      2.3.3  Timestamp Frequency

         It is important to understand that the receiver algorithm for
         timestamps does not involve clock synchronization with the
         sender.  The sender's clock is used to stamp the segments, and
         the sender uses this fact to measure RTT's.  However, the
         receiver treats the timestamp as simply a monotone-increasing
         serial number, without any necessary connection to its clock.
         From the receiver's viewpoint, the timestamp is acting as a
         logical extension of the high-order bits of the sequence
         number.

         However, the receiver algorithm does place some requirements on
         the frequency of the timestamp "clock":

         (a)  Timestamp clock must not be "too slow".

              It must tick at least once for each 2**31 bytes sent.  In
              fact, in order to be useful to the sender for round trip
              timing, the clock should tick at least once per window's
              worth of data, and even with the RFC-1072 window
              extension, 2**31 bytes must be at least two windows.

              To make this more quantitative, any clock faster than 1
              tick/sec will reject old duplicate segments for link
              speeds of ~2 Gbps;  a 1ms clock will work up to link
              speeds of 2 Tbps (2*10**12 bps!).

         (b)  Timestamp clock must not be "too fast".

              Its cycling time must be greater than MSL seconds.  Since
              the clock (timestamp) is 32 bits and the worst-case MSL is
              255 seconds, the maximum acceptable clock frequency is one
              tick every 59 ns.

              However, since the sender is using the timestamp for RTT
              calculations, the timestamp doesn't need to have much more
              resolution than the granularity of the retransmit timer,
              e.g., tens or hundreds of milliseconds.

         Thus, both limits are easily satisfied with a reasonable clock
         rate in the range 1-100ms per tick.
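
         The upper limit in (b) follows from a single division, as the
         Python sketch below shows.  The names are assumptions, and
         only limit (b) is computed here.

```python
TS_BITS = 32             # width of the timestamp field
WORST_CASE_MSL = 255.0   # seconds, the worst-case MSL cited in (b)


def cycle_time_secs(tick_secs: float) -> float:
    """Time for the 32-bit timestamp clock to cycle at a tick interval."""
    return 2**TS_BITS * tick_secs


# Limit (b): the fastest tick whose cycle time still reaches the MSL.
min_tick_secs = WORST_CASE_MSL / 2**TS_BITS
print(f"minimum tick interval: {min_tick_secs * 1e9:.0f} ns")

# Tick intervals in the suggested 1-100 ms range cycle far more slowly.
for tick in (0.001, 0.010, 0.100):
    days = cycle_time_secs(tick) / 86400
    print(f"{tick * 1e3:.0f} ms tick cycles in about {days:.0f} days")
```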

         Using the timestamp option relaxes the requirements on MSL for
         avoiding sequence number wrap-around.  For example, with a 1 ms
         timestamp clock, the 32-bit timestamp will wrap its sign bit in
         25 days.  Thus, it will reject old duplicates on the same
         connection as long as MSL is 25 days or less.  This appears to
         be a very safe figure.  If the timestamp has 10 ms resolution,
         the MSL requirement is boosted to 250 days.  An MSL of 25 days
         or longer can probably be assumed by the gateway system without
         requiring precise MSL enforcement by the TTL value in the IP
         layer.
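
         The 25 and 250 day figures are the time for the timestamp to
         advance by 2**31 ticks.  A minimal Python sketch, with an
         assumed helper name:

```python
SECS_PER_DAY = 86400


def sign_wrap_days(tick_secs: float) -> float:
    """Days until a 32-bit timestamp wraps its sign bit (2**31 ticks)."""
    return 2**31 * tick_secs / SECS_PER_DAY


print(f"1 ms clock:  {sign_wrap_days(0.001):.1f} days")
print(f"10 ms clock: {sign_wrap_days(0.010):.1f} days")
```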




3.  DUPLICATES FROM EARLIER INCARNATIONS OF CONNECTION

   We turn now to the second potential cause of old duplicate packet
   errors: packets from an earlier incarnation of the same connection.
   The appendix contains a review of the mechanisms currently included
   in TCP to handle this problem.  These mechanisms depend upon the
   enforcement of a maximum segment lifetime (MSL) by the Internet
   layer.

   The MSL required to prevent failures due to an earlier connection
   incarnation does not depend (directly) upon the transfer rate.
   However, the timestamp option used as described in Section 2 can
   provide additional security against old duplicates from earlier
   connections.  Furthermore, we will see that with the universal use of
   the timestamp option, enforcement of a maximum segment lifetime would
   no longer be required for reliable TCP operation.

   There are two cases to be considered (see the appendix for more
   explanation):  (1) a system crashing (and losing connection state)
   and restarting, and (2) the same connection being closed and reopened
   without a loss of host state.  These will be described in the
   following two sections.

   3.1  System Crash with Loss of State

      TCP's quiet time of one MSL upon system startup handles the loss
      of connection state in a system crash/restart.  For an
      explanation, see for example "When to Keep Quiet" in the TCP
      protocol specification [Postel81].  The MSL that is required here
      does not depend upon the transfer speed.  The current TCP MSL of 2
      minutes seems acceptable as an operational compromise, as many
      host systems take this long to boot after a crash.

      However, the timestamp option may be used to ease the MSL
      requirements (or to provide additional security against data
      corruption).  If timestamps are being used and if the timestamp
      clock can be guaranteed to be monotonic over a system
      crash/restart, i.e., if the first value of the sender's timestamp
      clock after a crash/restart can be guaranteed to be greater than
      the last value before the restart, then a quiet time will be
      unnecessary.

      To dispense totally with the quiet time would seem to require that
      the host clock be synchronized to a time source that is stable
      over the crash/restart period, with an accuracy of one timestamp
      clock tick or better.  Fortunately, we can back off from this
      strict requirement.  Suppose that the clock is always re-
      synchronized to within N timestamp clock ticks and that booting
      (extended with a quiet time, if necessary) takes more than N
      ticks.  This will guarantee monotonicity of the timestamps, which
      can then be used to reject old duplicates even without an enforced
      MSL.
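      The relaxed requirement can be sketched as follows.  This is a
      simplified model, not a specification: times are counted in
      timestamp clock ticks, the clock is assumed to have been exact
      up to the moment of the crash, and the function and parameter
      names are hypothetical.

```python
def extra_quiet_ticks(boot_ticks, resync_error_ticks):
    """Quiet time to add so that boot + quiet exceeds the resync error N."""
    return max(0, resync_error_ticks + 1 - boot_ticks)

def first_timestamp_after_restart(crash_tick, boot_ticks, quiet_ticks,
                                  clock_error_ticks):
    """Clock reading when TCP may first send after a restart.

    clock_error_ticks is the signed resynchronization error; the worst
    case for monotonicity is -N (the clock restarts N ticks behind).
    """
    return crash_tick + boot_ticks + quiet_ticks + clock_error_ticks

if __name__ == "__main__":
    n = 500            # assumed resync accuracy, in ticks
    crash_tick = 1000  # last timestamp value used before the crash
    for boot in (200, 900):
        quiet = extra_quiet_ticks(boot, n)
        first = first_timestamp_after_restart(crash_tick, boot, quiet, -n)
        print(boot, quiet, first > crash_tick)
```

      In this model a fast boot needs some extra quiet time, while a
      boot that already takes longer than N ticks needs none; in both
      cases the first timestamp after restart exceeds the last one
      before the crash.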

   3.2  Closing and Reopening a Connection

      When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT
      state ties up the socket pair for 4 minutes (see Section 3.5 of
      [Postel81]).  Applications built upon TCP that close one connection
      and open a new one (e.g., an FTP data transfer connection using
      Stream mode) must choose a new socket pair each time.  This delay
      serves two different purposes:

      (a)  Implement the full-duplex reliable close handshake of TCP.

           The proper time to delay the final close step is not really
           related to the MSL; it depends instead upon the RTO for the
           FIN segments and therefore upon the RTT of the path.*
           Although there is no formal upper-bound on RTT, common
           network engineering practice makes an RTT greater than 1
           minute very unlikely.  Thus, the 4 minute delay in TIME-WAIT
           state works satisfactorily to provide a reliable full-duplex
           TCP close.  Note again that this is independent of MSL
           enforcement and network speed.

           The TIME-WAIT state could cause an indirect performance
           problem if an application needed to repeatedly close one
           connection and open another at a very high frequency, since
           the number of available TCP ports on a host is less than
           2**16.  However, high network speeds are not the major
           contributor to this problem; the RTT is the limiting factor
           in how quickly connections can be opened and closed.
           Therefore, this problem will be no worse at high transfer
           speeds.

_________________________
*Note: It could be argued that the side that is sending a FIN knows
what degree of reliability it needs, and therefore it should be able
to determine the length of the TIME-WAIT delay for the FIN's
recipient.  This could be accomplished with an appropriate TCP option
in FIN segments.

      (b)  Allow old duplicate segments to expire.

           Suppose that a host keeps a cache of the last timestamp
           received from each remote host.  This can be used to reject
           old duplicate segments from earlier incarnations of the
           connection, if the timestamp clock can be guaranteed to have
           ticked at least once since the old connection was open.
           This requires that the TIME-WAIT delay plus the RTT together
           must be at least one tick of the sender's timestamp clock.

           Note that this is a variant on the mechanism proposed by
           Garlick, Rom, and Postel (see the appendix), which required
           each host to maintain connection records containing the
           highest sequence numbers on every connection.  Using
           timestamps instead, it is only necessary to keep one quantity
           per remote host, regardless of the number of simultaneous
           connections to that host.
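           The per-host cache described in (b) might look like the
           following sketch.  The class, its method names, and the
           use of a dictionary keyed by remote host are assumptions
           made for illustration; the comparison treats timestamps as
           32-bit values that may wrap.

```python
MOD = 2 ** 32

def ts_newer(a, b):
    """True if 32-bit timestamp a is later than b, allowing for wrap-around."""
    diff = (a - b) % MOD
    return 0 < diff < 2 ** 31

class TimestampCache:
    """One cached timestamp per remote host, regardless of connection count."""

    def __init__(self):
        self._last = {}  # remote host -> latest timestamp received

    def record(self, remote_host, ts):
        """Remember ts if it is newer than what is cached for this host."""
        last = self._last.get(remote_host)
        if last is None or ts_newer(ts, last):
            self._last[remote_host] = ts

    def is_old_duplicate(self, remote_host, ts):
        """A segment for a new incarnation must carry a strictly newer stamp."""
        last = self._last.get(remote_host)
        return last is not None and not ts_newer(ts, last)

if __name__ == "__main__":
    cache = TimestampCache()
    cache.record("192.0.2.1", 1000)                   # old incarnation
    print(cache.is_old_duplicate("192.0.2.1", 990))   # True: stale segment
    print(cache.is_old_duplicate("192.0.2.1", 1001))  # False: clock has ticked
```

           The strict comparison reflects the condition stated above:
           a new incarnation is distinguishable only once the sender's
           timestamp clock has ticked at least once.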

      We conclude that if all hosts used the TCP timestamp algorithm
      described in Section 2, enforcement of a maximum segment lifetime
      would be unnecessary and the quiet time at system startup could be
      shortened or removed.  In any case, the timestamp mechanism can
      provide additional security against old duplicates from earlier
      connection incarnations.  However, a 4 minute TIME-WAIT delay
      (unrelated to MSL enforcement or network speed) must be retained
      to provide the reliable close handshake of TCP.

4. CONCLUSIONS

   We have presented a mechanism, based upon the TCP timestamp echo
   option of RFC-1072, that will allow very high TCP transfer rates
   without reliability problems due to old duplicate segments on the
   same connection.  This mechanism also provides additional security
   against intrusion of old duplicates from earlier incarnations of the
   same connection.  If the timestamp mechanism were used by all hosts,
   the quiet time at system startup could be eliminated and enforcement
   of a maximum segment lifetime (MSL) would no longer be necessary.

REFERENCES

   [Cerf76]  Cerf, V., "TCP Resynchronization", Tech Note #79, Digital
   Systems Lab, Stanford, January 1976.

   [Dalal74]  Dalal, Y., "More on Selecting Sequence Numbers", INWG
   Protocol Note #4, October 1974.

   [Garlick77]  Garlick, L., R. Rom, and J. Postel, "Issues in Reliable
   Host-to-Host Protocols", Proc. Second Berkeley Workshop on
   Distributed Data Management and Computer Networks, May 1977.

   [Hamming77]  Hamming, R., "Digital Filters", ISBN 0-13-212571-4,
   Prentice Hall, Englewood Cliffs, N.J., 1977.

   [Jacobson88]  Jacobson, V., and R. Braden, "TCP Extensions for
   Long-Delay Paths", RFC 1072, LBL and USC/Information Sciences
   Institute, October 1988.

   [Jacobson90]  Jacobson, V., "4BSD Header Prediction", ACM Computer
   Communication Review, April 1990.

   [McKenzie89]  McKenzie, A., "A Problem with the TCP Big Window
   Option", RFC 1110, BBN STC, August 1989.

   [Postel81]  Postel, J., "Transmission Control Protocol", RFC 793,
   DARPA, September 1981.

   [Tomlinson74]  Tomlinson, R., "Selecting Sequence Numbers", INWG
   Protocol Note #2, September 1974.

   [Watson81]  Watson, R., "Timer-based Mechanisms in Reliable
   Transport Protocol Connection Management", Computer Networks,
   Vol. 5, 1981.

APPENDIX -- Protection against Old Duplicates in TCP

   During the development of TCP, a great deal of effort was devoted to
   the problem of protecting a TCP connection from segments left from
   earlier incarnations of the same connection.  Several different
   mechanisms were proposed for this purpose [Tomlinson74] [Dalal74]
   [Cerf76] [Garlick77].

   The connection parameters that are required in this discussion are:

           Tc = Connection duration in seconds.

           Nc = Total number of bytes sent on connection.

           B = Effective bandwidth of connection = Nc/Tc.

   Tomlinson proposed a scheme with two parts: a clock-driven selection
   of ISN (Initial Sequence Number) for a connection, and a
   resynchronization procedure [Tomlinson74]. The clock-driven scheme
   chooses:

      ISN = (integer(R*t)) mod 2**32                 [2]

   where t is the current time relative to an arbitrary origin, and R is
   a constant.  R was intended to be chosen so that ISN will advance
   faster than sequence numbers will be used up on the connection.
   However, at high speeds this will not be true; the consequences of
   this will be discussed below.

   The clock-driven choice of ISN in formula [2] guarantees freedom from
   old duplicates matching a reopened connection if the original
   connection was "short-lived" and "slow".  By "short-lived", we mean a
   connection that stayed open for a time Tc less than the time to cycle
   the ISN, i.e., Tc < 2**32/R seconds.  By "slow", we mean that the
   effective transfer rate B is less than R.
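
   Formula [2] and the two conditions can be written out as a short
   sketch.  The value R = 2**18 bytes per second is an assumption
   chosen for illustration because it makes 2**32/R equal to 16384
   seconds, about 4.55 hours, which is the cycle time marked in the
   figures; the function names are likewise hypothetical.

```python
R = 2 ** 18  # assumed ISN clock rate, bytes per second

def clock_driven_isn(t_seconds, r=R):
    """Formula [2]: ISN = (integer(R*t)) mod 2**32."""
    return int(r * t_seconds) % 2 ** 32

def isn_cycle_seconds(r=R):
    """Time for the clock-driven ISN to cycle through all 2**32 values."""
    return 2 ** 32 / r

def is_short_lived(tc_seconds, r=R):
    """Connection duration Tc is less than one ISN cycle."""
    return tc_seconds < isn_cycle_seconds(r)

def is_slow(nc_bytes, tc_seconds, r=R):
    """Effective bandwidth B = Nc/Tc is less than R."""
    return nc_bytes / tc_seconds < r

if __name__ == "__main__":
    print(clock_driven_isn(1.0))                 # ISN one second after origin
    print(round(isn_cycle_seconds() / 3600, 2))  # cycle time in hours
    print(is_short_lived(600), is_slow(10 ** 6, 600))
```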

   This is illustrated in Figure 1, where sequence numbers are plotted
   against time.  The asterisks show the ISN lines from formula [2],
   while the circles represent the trajectories of several short-lived
   incarnations of the same connection, each terminating at the "x".

        Note: allowing rapid reuse of connections was believed to be an
        important goal during the early TCP development.  This
        requirement was driven by the hope that TCP would serve as a
        basis for user-level transaction protocols as well as
        connection-oriented protocols.  The paradigm discussed was the
        "Christmas Tree" or "Kamikaze" segment that contained SYN and
        FIN bits as well as data.  Enthusiasm for this was somewhat
        dampened when it was observed that the 3-way SYN handshake and
        the FIN handshake mean that 5 packets are required for a minimum
        exchange. Furthermore, the TIME-WAIT state delay implies that
        the same connection really cannot be reopened immediately.  No
        further work has been done in this area, although existing
        applications (especially SMTP) often generate very short TCP
        sessions.  The reuse problem is generally avoided by using a
        different port pair for each connection.


        |- 2**32       ISN             ISN
        |              *               *
        |             *               *
        |            *               *
        |           *x              *
        |          o               *
    ^   |         *               *
    |   |        *  x            *
        |       * o             *
    S   |      *o              *
    e   |     o               *
    q   |    *               *
        |   *               *
    #   |  * x             *
        | *o              *
        |o_______________*____________
                         ^         Time -->
                       4.55hrs


     Figure 1.  Clock-Driven ISN  avoiding duplication on
                short-Lived, slow connections.


   However, clock-driven ISN selection does not protect against old
   duplicate packets for a long-lived or fast connection:  the
   connection may close (or crash) just as the ISN has cycled around and
   reached the same value again.  If the connection is then reopened, a
   datagram still in transit from the old connection may fall into the
   current window.  This is illustrated by Figure 2 for a slow, long-
   lived connection, and by Figures 3 and 4 for fast connections.  In
   each case, the point "x" marks the place at which the original
   connection closes or crashes.  The arrow in Figure 2 illustrates an
   old duplicate segment.  Figure 3 shows a connection whose total byte
   count Nc < 2**32, while Figure 4 concerns Nc >= 2**32.

   To prevent the duplication illustrated in Figure 2, Tomlinson
   proposed to "resynchronize" the connection sequence numbers if they
   came within an MSL of the ISN.  Resynchronization might take the form
   of a delay (point "y") or the choice of a new sequence number (point
   "z").

        |- 2**32       ISN               ISN
        |              *                 *
        |             *                 *
        |            *                 *
        |           *                 *
        |          *                 *
    ^   |         *                 *
    |   |        *                 *
        |       *                 *
    S   |      *                 *
    e   |     *                x* y
    q   |    *           o     *
        |   *      o          *z
    #   |  *o                *
        | *                 *
        |*_________________*____________
                           ^         Time -->
                          4.55hrs

        Figure 2.  Resynchronization to Avoid Duplication
                   on Slow, Long-Lived Connection



        |- 2**32       ISN               ISN
        |              *                 *
        |       x   o *                 *
        |            *                 *
        |      o-->o*                 *
        |          *                 *
    ^   |     o   o                 *
    |   |        *                 *
        |    o  *                 *
    S   |      *                 *
    e   |   o *                 *
    q   |    *                 *
        |  o*                 *
    #   |  *                 *
        | o                 *
        |*_________________*____________
                           ^         Time -->
                          4.55hrs

     Figure 3.  Duplication on Fast Connection: Nc < 2**32 bytes


        |- 2**32       ISN               ISN
        |      o       *                 *
        |           x *                 *
        |            *                 *
        |     o     *                 *
        |          o                 *
    ^   |         *                 *
    |   |    o   *                 *
        |       * o               *
    S   |      *                *
    e   |   o *                 *
    q   |    *   o             *
        |   *                 *
    #   |  o                 *
        | *     o           *
        |*_________________*____________
                           ^         Time -->
                          4.55hrs

     Figure 4.  Duplication on Fast Connection: Nc >= 2**32 bytes

   In summary, Figures 1-4 illustrate four possible failure modes for
   old duplicate packets from an earlier incarnation.  We will call
   these four modes F1, F2, F3, and F4:


   F1:  B < R, Tc < 4.55 hrs. (Figure 1)

   F2:  B < R, Tc >= 4.55 hrs. (Figure 2)

   F3:  B >= R, Nc < 2**32 (Figure 3)

   F4:  B >= R, Nc >= 2**32 (Figure 4)
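
   The four modes can be expressed as a small classification
   function.  This is only a sketch: the function name is
   hypothetical, and R = 2**18 bytes per second is an assumed value
   for which the ISN cycle time 2**32/R is about 4.55 hours.

```python
def failure_mode(nc_bytes, tc_seconds, r=2 ** 18):
    """Classify a connection into mode F1..F4 from Nc, Tc, and R."""
    b = nc_bytes / tc_seconds   # effective bandwidth B = Nc/Tc
    cycle = 2 ** 32 / r         # ISN cycle time in seconds
    if b < r:
        return "F1" if tc_seconds < cycle else "F2"
    return "F3" if nc_bytes < 2 ** 32 else "F4"

if __name__ == "__main__":
    print(failure_mode(10 ** 6, 60))      # slow and short-lived
    print(failure_mode(10 ** 8, 20000))   # slow and long-lived
    print(failure_mode(10 ** 9, 100))     # fast, fewer than 2**32 bytes
    print(failure_mode(10 ** 10, 100))    # fast, 2**32 bytes or more
```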


   Another limitation of clock-driven ISN selection should be mentioned.
   Tomlinson assumed that the current time t in formula [2] is obtained
   from a clock that is persistent over a system crash.  For his scheme
   to work correctly, the clock must be restarted with an accuracy of
   1/R seconds (e.g., 4 microseconds in the case of TCP).  While this may
   be possible for some hosts and some crashes, in most cases there will
   be an uncertainty in the clock after a crash that ranges from a
   second to several minutes.

   As a result of this random clock offset after system
   reinitialization, there is a possibility that old segments sent
   before the crash may fall into the window of a new connection
   incarnation.  The solution to this problem that was adopted in the
   final TCP spec is a "quiet time" of MSL seconds when the system is
   initialized [Postel81, p. 28].  No TCP connection can be opened until
   the expiration of this quiet time.

   A different approach was suggested by Garlick, Rom, and Postel
   [Garlick77].  Rather than using clock-driven ISN selection, they
   proposed to maintain connection records containing the last ISN used
   on every connection.  To immediately open a new incarnation of a
   connection, the ISN is taken to be greater than the last sequence
   number of the previous incarnation, so that the new incarnation will
   have unique sequence numbers.  To handle a system crash, they
   proposed a quiet time, i.e., a delay at system startup time to allow
   old duplicates to expire.  Note that the connection records need be
   kept only for MSL seconds; after that, no collision is possible, and
   a new connection can start with sequence number zero.
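
   A connection-record scheme of this kind might be sketched as
   follows.  The class and method names, the use of a socket pair as
   the record key, and the explicit "now" arguments are assumptions
   made for illustration; the quiet time after a crash is not
   modelled.

```python
class ConnectionRecords:
    """Remember the last sequence number of each closed connection for one MSL."""

    def __init__(self, msl_seconds):
        self.msl = msl_seconds
        self._records = {}  # socket pair -> (last sequence number, close time)

    def close(self, pair, last_seq, now):
        """Record the highest sequence number used by the closing incarnation."""
        self._records[pair] = (last_seq, now)

    def choose_isn(self, pair, now):
        """Pick an ISN beyond the previous incarnation, or zero once it is safe."""
        record = self._records.get(pair)
        if record is None:
            return 0
        last_seq, closed_at = record
        if now - closed_at >= self.msl:
            del self._records[pair]      # record has outlived the MSL
            return 0
        return (last_seq + 1) % 2 ** 32

if __name__ == "__main__":
    records = ConnectionRecords(msl_seconds=120)
    pair = (("192.0.2.1", 1025), ("192.0.2.2", 21))
    records.close(pair, last_seq=5000, now=0)
    print(records.choose_isn(pair, now=10))    # reopened within one MSL
    print(records.choose_isn(pair, now=120))   # record has expired
```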

   The scheme finally adopted for TCP combines features of both these
   proposals.  TCP uses three mechanisms:

   (A)  ISN selection is clock-driven to handle short-lived connections.
        The parameter R = 250 KBps, so that the ISN value cycles in
        2**32/R = 4.55 hours.

   (B)  (One end of) a closed connection is left in a "busy" state,
        known as "TIME-WAIT" state, for a time of 2*MSL.  TIME-WAIT
        state handles the proper close of a long-lived connection
        without resynchronization.  It also allows reliable completion
        of the full-duplex close handshake.

   (C)  There is a quiet time of one MSL at system startup.  This
        handles a crash of a long-lived connection and avoids time
        resynchronization problems in (A).

   Notice that (B) and (C) together are logically sufficient to prevent
   accidental reuse of sequence numbers from a different incarnation,
   for any of the failure modes F1-F4.  (A) is not logically necessary
   since the close delay (B) makes it impossible to reopen the same TCP
   connection immediately.  However, the use of (A) does give additional
   assurance in a common case, perhaps compensating for a host that has
   set its TIME-WAIT state delay too short.

   Some TCP implementations have permitted a connection in the TIME-WAIT
   state to be reopened immediately by the other side, thus short-
   circuiting mechanism (B).  Specifically, a new SYN for the same
   socket pair is accepted when the earlier incarnation is still in
   TIME-WAIT state.  Old duplicates in one direction can be avoided by
   choosing the ISN to be the next unused sequence number from the
   preceding connection (i.e., FIN+1); this is essentially an
   application of the scheme of Garlick, Rom, and Postel, using the
   connection block in TIME-WAIT state as the connection record.

   However, the connection is still vulnerable to old duplicates in the
   other direction.  Mechanism (A) prevents trouble in mode F1, but
   failures can arise in F2, F3, or F4; of these, F3, on short, fast
   connections, is the most dangerous.

   Finally, we note that TCP will operate reliably without any MSL-based
   mechanisms in the following restricted domain:

   *    Total data sent is less than 2**32 octets, and

   *    Effective sustained rate is less than 250 KBps, and

   *    Connection duration is less than 4.55 hours.

   At present, the great majority of TCP usage falls into this
   restricted domain.  The third condition, connection duration, is
   the one most commonly violated.

Security Considerations

   Security issues are not discussed in this memo.

Authors' Addresses

   Van Jacobson
   University of California
   Lawrence Berkeley Laboratory
   Mail Stop 46A
   Berkeley, CA 94720

   Phone: (415) 486-6411
   EMail: van@CSAM.LBL.GOV


   Bob Braden
   University of Southern California
   Information Sciences Institute
   4676 Admiralty Way
   Marina del Rey, CA 90292

   Phone: (213) 822-1511
   EMail: Braden@ISI.EDU


   Lixia Zhang
   XEROX Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, CA 94304

   Phone: (415) 494-4415
   EMail: lixia@PARC.XEROX.COM