home *** CD-ROM | disk | FTP | other *** search
-
- Archive-name:tcp-ip/FAQ
- Last-modified: 1996/4/1
-
- Internet Protocol Frequently Asked Questions
-
- Maintained by: George V. Neville-Neil (gnn@wrs.com)
- Contributions from:
- Ran Atkinson
- Mark Bergman
- Stephane Bortzmeyer
- Rodney Brown
- Dr. Charles E. Campbell Jr.
- Phill Conrad
- Alan Cox
- Rick Jones
- Jon Kay
- Jay Kreibrich
- William Manning
- Barry Margolin
- Jim Muchow
- Subu Rama
- W. Richard Stevens
-
- Version 3.3
-
-
- ************************************************************************
-
- The following is a list of Frequently Asked Questions, and
- their answers, for people interested in the Internet Protocols,
- including TCP, UDP, ICMP and others. Please send all additions,
- corrections, complaints and kudos to the above address. This FAQ will
- be posted on or about the first of every month.
-
- This FAQ is available for anonymous ftp from :
- ftp.netcom.com:/pub/gnn/tcp-ip.faq . You may get it from my home page at
- ftp://ftp.netcom.com/pub/gnn/gnn.html
- You can read the FAQ in HTMl format on Netcom or from the mirror
- site http://web.cnam.fr/Network/TCP-IP/tcp-ip.html
-
- ************************************************************************
-
- Table of Contents:
- Glossary
- 1) Are there any good books on IP?
- 2) Where can I find example source code for TCP/UDP/IP?
- 3) Are there any public domain programs to check the performance of an
- IP link?
- 4) Where do I find RFCs?
- 5) How can I detect that the other end of a TCP connection has
- crashed? Can I use "keepalives" for this?
- 6) Can the keepalive timeouts be configured?
- 7) Can I set up a gateway to the Internet that translates IP
- addresses, so that I don't have to change all our internal addresses
- to an official network?
- 8) Are there object-oriented network programming tools?
- 9) What other FAQs are related to this one?
- 10) What newsgroups contain information on networks/protocols?
- 11) Van Jacobson explains TCP congestion avoidance.
- 12) Can I use a single bit subnet?
-
- Glossary:
-
- I felt this should be first given the plethora of acronyms used in the
- rest of this FAQ.
-
- IP: Internet Protocol. The lowest layer protocol defined in TCP/IP.
- This is the base layer on which all other protocols mentioned herein
- are built. IP is often referred to as TCP/IP as well.
-
- UDP: User Datagram Protocol. This is a connectionless protocol built
- on top of IP. It does not provide any guarantees on the ordering or
- delivery of messages. This protocol is layered on top of IP.
-
- TCP: Transmission Control Protocol. TCP is a connection oriented
- protocol that guarantees that messages are delivered in the order in
- which they were sent and that all messages are delivered. If a TCP
- connection cannot deliver a message it closes the connection and
- informs the entity that created it. This protocol is layered on top
- of IP.
-
- ICMP: Internet Control Message Protocol. ICMP is used for
- diagnostics in the network. The Unix program, ping, uses ICMP
- messages to detect the status of other hosts in the net. ICMP
- messages can either be queries (in the case of ping) or error reports,
- such as when a network is unreachable.
-
- RFC: Request For Comment. RFCs are documents that define the
- protocols used in the IP Internet. Some are only suggestions, some
- are even jokes, and others are published standards. Several sites in
- the Internet store RFCs and make them available for anonymous ftp.
-
- SLIP: Serial Line IP. An implementation of IP for use over a serial
- link (modem). CSLIP is an optimized (compressed) version of SLIP that
- gives better throughput.
-
- Bandwidth: The amount of data that can be pushed through a link in
- unit time. Usually measured in bits or bytes per second.
-
- Latency: The amount of time that a message spends in a network going
- from point A to point B.
-
- Jitter: The effect seen when latency is not a constant. That is, if
- messages experience a different latencies between two points in a
- network.
-
- RPC: Remote Procedure Call. RPC is a method of making network access
- to resource transparent to the application programmer by supplying a
- "stub" routine that is called in the same way as a regular procedure
- call. The stub actually performs the call across the network to
- another computer.
-
- Marshalling: The process of taking arbitrary data (characters,
- integers, structures) and packing them up for transmission across a
- network.
-
- MBONE: A virtual network that is a Multicast backBONE. It is still a
- research prototype, but it extends through most of the core of the
- Internet (including North America, Europe, and Australia). It uses IP
- Multicasting which is defined in RFC-1112. An MBONE FAQ is available
- via anonymous ftp from: ftp.isi.edu" There are frequent broadcasts of
- multimedia programs (audio and low bandwidth video) over the MBONE.
- Though the MBONE is used for mutlicasting, the long haul parts of the
- MBONE use point-to-point connections through unicast tunnels to
- connect the various multicast networks worldwide.
-
-
- 1) Are there any good books on IP?
-
- A) Yes. Please see the following:
-
- Internetworking with TCP/IP Volume I
- (Principles, Protocols, and Architecture)
- Douglas E. Comer
- Prentice Hall 1991 ISBN 0-13-468505-9
-
- This volume covers all of the protocols, including IP, UDP, TCP, and
- the gateway protocols. It also includes discussions of higher level
- protocols such as FTP, TELNET, and NFS.
-
- Internetworking with TCP/IP Volume II
- (Design, Implementation, and Internals)
- Douglas E. Comer / David L. Stevens
- Prentice Hall 1991 ISBN 0-13-472242-6
-
- Discusses the implementation of the protocols and gives numerous code
- examples.
-
- Internetworking with TCP/IP Volume III (BSD Socket Version)
- (Client - Server Programming and Applications)
- Douglas E. Comer / David L. Stevens
- Prentice Hall 1993 ISBN 0-13-474222-2
-
- This book discusses programming applications that use the internet
- protocols. It includes examples of telnet, ftp clients and servers.
- Discusses RPC and XDR at length.
-
- TCP/IP Illustrated, Volume 1: The Protocols,
- W. Richard Stevens
- (c) Addison-Wesley, 1994 ISBN 0-201-63346-9
-
- An excellent introduction to the entire TCP/IP protocol suite,
- covering all the major protocols, plus several important applications.
-
- "TCP/IP Illustrated, Volume 2: The Implementation",
- by Gary R. Wright and W. Richard Stevens
- (c) Addison-Wesley, 1995
- ISBN 0-201-63354-X
-
- This is a complete, and lenthy, discussion of the internals of TCP/IP
- based on the Net/2 release of BSD.
-
- Unix Network Programming
- W. Richard Stevens
- Prentice Hall 1990 ISBN 0-13-949876
-
- An excellent introduction to network programming under Unix.
-
- The Design and Implementation of the 4.3 BSD Operating System
- Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, John S.
- Quarterman
- Addison-Wesley 1989 ISBN 0-201-06196-1
-
- Though this book is a reference for the entire operating system, the
- eleventh and twelfth chapters completely explain how the networking
- protocols are implemented in the kernel.
-
- Stevens, W. Richard, Unix Network Programming. 1990, Prentice-Hall.
-
- An excellent introduction to network programming under Unix. Widely
- cited on the Usenet bulliten boards as the "best place to start" if you
- want to actually learn how to write Unix programs that communicate over
- a network.
-
- Rago, Steven A. Unix System V. Network Programming. 1993, Addison-Wesley.
-
- A book that covers the same kinds of topics as W. Richard Stevens Unix
- Network Programming, but is more specific to Unix System V Release 4
- (SVR4), and so perhaps is more useful and up to date if you are
- working specifically with that implementation. (Stevens book covers
- Unix System V release 3.x). There is a much more extensive coverage
- of Streams in Rago's book; 4 chapters, where Stevens only provides a
- couple of subsections. The design project at the end of the book is
- an implementation of SLIP.
-
-
- 2) Where can I find example source code for TCP/UDP/IP?
-
- A) Code from the Internetworking with TCP/IP Volume III is available
- for anonymous ftp from:
-
- arthur.cs.purdue.edu:/pub/dls
-
- Code used in the Net-2 version of Berkeley Unix is available for
- anonymous ftp from:
-
- ftp.uu.net:systems/unix/bsd-sources/sys/netinet
-
- and
-
- gatekeeper.dec.com:/pub/BSD/net2/sys/netinet
-
- Code from Richard Steven's book is available on:
- ftp.uu.net:/published/books/stevens.*
-
- Example source code and libraries to make coding quicker is available
- in the Simple Sockets Library written at NASA. The Simple Sockets
- Library makes sockets easy to use! And, it comes as source code. It
- has been tested on: Unix (SGI, DecStation, AIX, Sun 3, Sparcstation;
- version 2.02+: Solaris 2.1, SCO), VMS, and MSDOS (client only since
- there's no background there). It is provided in source code form, of
- course, and sits atop Berkeley sockets and tcp/ip.
-
- You can order the "Simple Sockets Library" from
-
- Austin Code Works
- 11100 Leafwood Lane
- Austin, TX 78750-3464 USA
- Phone (512) 258-0785
-
- Ask for the "SSL - The Simple Sockets Library". Last I checked, they
- were asking $20 US for it.
-
-
- For DOS there is WATTCP.ZIP (numerous sites):
-
- WATTCP is a DOS TCP/IP stack derived from the NCSA Telnet program and
- much enhanced. It comes with some example programs and complete source
- code. The interface isn't BSD sockets but is well suited to PC type
- work. It is also written so that it can be used and memory
- allocation).
-
- 3) Are there any public domain programs to check the performance of
- an IP link?
-
- A)
-
- TTCP: Available for anonymous ftp from....
-
- wuarchive.wustl.edu:/graphics/graphics/mirrors/sgi.com/sgi/src/ttcp
-
- On ftp.sgi.com are netperf (from Rick Jones at HP) and nettest
- (from Dave Borman at Cray). ttcp is also availabel at ftp.sgi.com.
-
- You can get to the NetPerf home page via:
-
- http://www.cup.hp.com/netperf/NetperfPage.html
-
-
- There is suite of Bandwidth Measuring programs from gnn@netcom.com.
- Available for anonymous ftp from ftp.netcom.com in
- ~ftp/gnn/bwmeas-0.3.tar.Z These are several programs that meausre
- bandwidth and jitter over several kinds of IPC links, including TCP
- and UDP.
-
-
- 4) Where do I find RFCs?
-
- A) This is the latest info on obtaining RFCs:
- Details on obtaining RFCs via FTP or EMAIL may be obtained by sending
- an EMAIL message to rfc-info@ISI.EDU with the message body
- help: ways_to_get_rfcs. For example:
-
- To: rfc-info@ISI.EDU
- Subject: getting rfcs
-
- help: ways_to_get_rfcs
-
- The response to this mail query is quite long and has been omitted.
-
- RFCs can be obtained via FTP from DS.INTERNIC.NET, NIS.NSF.NET,
- NISC.JVNC.NET, FTP.ISI.EDU, WUARCHIVE.WUSTL.EDU, SRC.DOC.IC.AC.UK,
- FTP.CONCERT.NET, or FTP.SESQUI.NET.
-
-
- Using Web, WAIS, and gopher:
-
- Web:
-
- http://web.nexor.co.uk/rfc-index/rfc-index-search-form.html
-
- WAIS access by keyword:
-
- wais://wais.cnam.fr/RFC
-
- Excellent presentation with a full-text search too:
-
- http://www.cis.ohio-state.edu/hypertext/information/rfc.html
-
- With Gopher:
-
- gopher://r2d2.jvnc.net/11/Internet%20Resources/RFC
- gopher://muspin.gsfc.nasa.gov:4320/1g2go4%20ds.internic.net%2070%201%201/.ds/
- .internetdocs
-
-
-
- 5) How can I detect that the other end of a TCP connection has crashed?
- Can I use "keepalives" for this?
-
- A) Detecting crashed systems over TCP/IP is difficult. TCP doesn't require
- any transmission over a connection if the application isn't sending
- anything, and many of the media over which TCP/IP is used (e.g. ethernet)
- don't provide a reliable way to determine whether a particular host is up.
- If a server doesn't hear from a client, it could be because it has nothing
- to say, some network between the server and client may be down, the server
- or client's network interface may be disconnected, or the client may have
- crashed. Network failures are often temporary (a thin ethernet will appear
- down while someone is adding a link to the daisy chain, and it often takes
- a few minutes for new routes to stabilize when a router goes down), and TCP
- connections shouldn't be dropped as a result.
-
- Keepalives are a feature of the sockets API that requests that an empty
- packet be sent periodically over an idle connection; this should evoke an
- acknowledgement from the remote system if it is still up, a reset if it has
- rebooted, and a timeout if it is down. These are not normally sent until
- the connection has been idle for a few hours. The purpose isn't to detect
- a crash immediately, but to keep unnecessary resources from being allocated
- forever.
-
- If more rapid detection of remote failures is required, this should be
- implemented in the application protocol. There is no standard mechanism
- for this, but an example is requiring clients to send a "no-op" message
- every minute or two. An example protocol that uses this is X Display
- Manager Control Protocol (XDMCP), part of the X Window System, Version 11;
- the XDM server managing a session periodically sends a Sync command to the
- display server, which should evoke an application-level response, and
- resets the session if it doesn't get a response (this is actually an
- example of a poor implementation, as a timeout can occur if another client
- "grabs" the server for too long).
-
- 6) Can the keepalive timeouts be configured?
-
- A) This varies by operating system. There is a program that works on
- many Unices (though not Linux or Solaris), called netconfig, that
- allows one to do this and documents many of the variables. It is
- available by anonymous FTP from
-
- cs.ucsd.edu:pub/csl/Netconfig/netconfig2.2.tar.Z
-
- In addition, Richard Stevens' TCP/IP Illustrated, Volume 1 includes a
- good discussion of setting the most useful variables on many
- platforms.
-
-
- 7) Can I set up a gateway to the Internet that translates IP addresses, so
- that I don't have to change all our internal addresses to an official
- network?
-
- A) There's no general solution to this. Many protocols include IP
- addresses in the application-level data (FTP's "PORT" command is the most
- notable), so it isn't simply a matter of translating addresses in the IP
- header. Also, if the network number(s) you're using match those assigned
- to another organization, your gateway won't be able to communicate with
- that organization (RFC 1597 proposes network numbers that are reserved for
- private use, to avoid such conflicts, but if you're already using a
- different network number this won't help you).
-
- However, if you're willing to live with limited access to the Internet from
- internal hosts, the "proxy" servers developed for firewalls can be used as
- a substitute for an address-translating gateway. See the firewall FAQ.
-
- 8) Are there object-oriented network programming tools?
-
- A) Yes, and one such system is called ACE (ADAPTIVE Communication
- Environment). Here is how to get more information and the software:
-
- OBTAINING ACE
-
- An HTML version of this README file is available at URL
- http://www.cs.wustl.edu/~schmidt/ACE.html. All software and
- documentation is available via both anonymous ftp and the Web.
-
- ACE is available for anonymous ftp from the ics.uci.edu (128.195.1.1)
- host in the gnu/C++_wrappers.tar.Z file (approximately .5 meg
- compressed). This release contains contains the source code,
- documentation, and example test drivers for C++ wrapper libras.
-
- 9) What other FAQs might you want to look in?
- comp.protocols.tcp-ip.ibmpc
- Aboba, Bernard D.(1994) "comp.protocols.tcp-ip.ibmpc Frequently
- Asked Questions (FAQ)" Usenet news.answers, available via
- file://ftp.netcom.com/pub/ma/mailcom/IBMTCP/ibmtcp.zip,
- 57 pages.
-
- comp.protocols.ppp
- Archive-name: ppp-faq/part[1-8]
- URL: http://cs.uni-bonn.de/ppp/part[1-8].html
-
- comp.dcom.lans.ethernet
- ftp site: dorm.rutgers.edu, pub/novell/DOCS
- Ethernet Network Questions and Answers
- Summarized from UseNet group comp.dcom.lans.ethernet
-
- 10) What other newsgroups deal with networking?
-
- comp.dcom.cabling Cabling selection, installation and use.
- comp.dcom.isdn The Integrated Services Digital Network
- (ISDN).
- comp.dcom.lans.ethernet Discussions of the Ethernet/IEEE 802.3
- protocols.
- comp.dcom.lans.fddi Discussions of the FDDI protocol suite.
- comp.dcom.lans.misc Local area network hardware and software.
- comp.dcom.lans.token-ring Installing and using token ring
- networks.
- comp.dcom.servers Selecting and operating data communications
- servers.
- comp.dcom.sys.cisco Info on Cisco routers and bridges.
- comp.dcom.sys.wellfleet Wellfleet bridge & router systems hardware &
- software.
- comp.protocols.ibm Networking with IBM mainframes.
- comp.protocols.iso The ISO protocol stack.
- comp.protocols.kerberos The Kerberos authentication server.
- comp.protocols.misc Various forms and types of protocol.
- comp.protocols.nfs Discussion about the Network File System
- protocol.
- comp.protocols.ppp Discussion of the Internet Point to Point
- Protocol.
- comp.protocols.smb SMB file sharing protocol and Samba SMB
- server/client.
- comp.protocols.tcp-ip TCP and IP network protocols.
- comp.protocols.tcp-ip.ibmpc TCP/IP for IBM(-like) personal
- computers.
- comp.security.misc Security isuipment for the PC.
- comp.os.ms-windows.networking.misc Windows and other networks.
- comp.os.ms-windows.networking.tcp-ip Windows and TCP/IP networking.
- comp.os.ms-windows.networking.windows Windows' built-in networking.
- comp.os.os2.networking.misc Miscellaneous networking issues of
- OS/2.
- comp.os.os2.networking.tcp-ip TCP/IP under OS/2.
- comp.sys.novell Discussion of Novell Netware products.
-
- 11) Van Jacobson explains TCP congestion avoidance.
-
- I've attached Van J's original posting on it (I seem to repost this every
- 6 months or so). If you want to see some real examples of this in action,
- take a look at Chapter 21 of my "TCP/IP Illustrated, Volume 1".
-
- Rich Stevens
-
- ---------------------------------------------------------------------------
- >From van@helios.ee.lbl.gov Mon Apr 30 01:44:05 1990
- To: end2end-interest@ISI.EDU
- Subject: modified TCP congestion avoidance algorithm
- Date: Mon, 30 Apr 90 01:40:59 PDT
- From: Van Jacobson <van@helios.ee.lbl.gov>
- Status: RO
-
- This is a description of the modified TCP congestion avoidance
- algorithm that I promised at the teleconference.
-
- BTW, on re-reading, I noticed there were several errors in
- Lixia's note besides the problem I noted at the teleconference.
- I don't know whether that's because I mis-communicated the
- algorithm at dinner (as I recall, I'd had some wine) or because
- she's convinced that TCP is ultimately irrelevant :). Either
- way, you will probably be disappointed if you experiment with
- what's in that note.
-
- First, I should point out once again that there are two
- completely independent window adjustment algorithms running in
- the sender: Slow-start is run when the pipe is empty (i.e.,
- when first starting or re-starting after a timeout). Its goal
- is to get the "ack clock" started so packets will be metered
- into the network at a reasonable rate. The other algorithm,
- congestion avoidance, is run any time *but* when (re-)starting
- and is responsible for estimating the (dynamically varying)
- pipesize. You will cause yourself, or me, no end of confusion
- if you lump these separate algorithms (as Lixia's message did).
-
- The modifications described here are only to the congestion
- avoidance algorithm, not to slow-start, and they are intended to
- apply to large bandwidth-delay product paths (though they don't
- do any harm on other paths). Remember that with regular TCP (or
- with slow-start/c-a TCP), throughput really starts to go to hell
- when the probability of packet loss is on the order of the
- bandwidth-delay product. E.g., you might expect a 1% packet
- loss rate to translate into a 1% lower throughput but for, say,
- a TCP connection with a 100 packet b-d p. (= window), it results
- in a 50-75% throughput loss. To make TCP effective on fat
- pipes, it would be nice if throughput degraded only as function
- of loss probability rather than as the product of the loss
- probabilty and the b-d p. (Assuming, of course, that we can do
- this without sacrificing congestion avoidance.)
-
- These mods do two things: (1) prevent the pipe from going empty
- after a loss (if the pipe doesn't go empty, you won't have to
- waste round-trip times re-filling it) and (2) correctly account
- for the amount of data actually in the pipe (since that's what
- congestion avoidance is supposed to be estimating and adapting to).
-
- For (1), remember that we use a packet loss as a signal that the
- pipe is overfull (congested) and that packet loss can be
- detected one of two different ways: (a) via a retransmit
- timeout or (b) when some small number (3-4) of consecutive
- duplicate acks has been received (the "fast retransmit"
- algorithm). In case (a), the pipe is guaranteed to be empty so
- we must slow-start. In case (b), if the duplicate ack
- threshhold is small compared to the bandwidth-delay product, we
- will detect the loss with the pipe almost full. I.e., given a
- threshhold of 3 packets and an LBL-MIT bandwidth-delay of around
- 24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75%
- full when fast-retransmit detects a loss (actually, until
- gateways start doing some sort of congestion control, the pipe
- is overfull when the loss is detected so *at least* 75% of the
- packets needed for ack clocking are in transit when
- fast-retransmit happens). Since the pipe is full, there's no
- need to slow-start after a fast-retransmit.
-
- For (2), consider what a duplicate ack means: either the
- network duplicated a packet (i.e., the NSFNet braindead IBM
- token ring adapters) or the receiver got an out-of-order packet.
- The usual cause of out-of-order packets at the receiver is a
- missing packet. I.e., if there are W packets in transit and one
- is dropped, the receiver will get W-1 out-of-order and
- (4.3-tahoe TCP will) generate W-1 duplicate acks. If the
- `consecutive duplicates' threshhold is set high enough, we can
- reasonably assume that duplicate acks mean dropped packets.
-
- But there's more information in the ack: The receiver can only
- generate one in response to a packet arrival. I.e., a duplicate
- ack means that a packet has left the network (it is now cached
- at the receiver). If the sender is limitted by the congestion
- window, a packet can now be sent. (The congestion window is a
- count of how many packets will fit in the pipe. The ack says a
- packet has left the pipe so a new one can be added to take its
- place.) To put this another way, say the current congestion
- window is C (i.e, C packets will fit in the pipe) and D
- duplicate acks have been received. Then only C-D packets are
- actually in the pipe and the sender wants to use a window of C+D
- packets to fill the pipe to its estimated capacity (C+D sent -
- D received = C in pipe).
-
- So, conceptually, the slow-start/cong.avoid/fast-rexmit changes
- are:
-
- - The sender's input routine is changed to set `cwnd' to `ssthresh'
- when the dup ack threshhold is reached. [It used to set cwnd to
- mss to force a slow-start.] Everything else stays the same.
-
- - The sender's output routine is changed to use an effective window
- of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition
- of the `dupacks*mss' term.] `Dupacks' is zero until the rexmit
- threshhold is reached and zero except when receiving a sequence
- of duplicate acks.
-
- The actual implementation is slightly different than the above
- because I wanted to avoid the multiply in the output routine
- (multiplies are expensive on some risc machines). A diff of the
- old and new fastrexmit code is attached (your line numbers will
- vary).
-
- Note that we still do congestion avoidance (i.e., the window is
- reduced by 50% when we detect the packet loss). But, as long as
- the receiver's offered window is large enough (it needs to be at
- most twice the bandwidth-delay product), we continue sending
- packets (at exactly half the rate we were sending before the
- loss) even after the loss is detected so the pipe stays full at
- exactly the level we want and a slow-start isn't necessary.
-
- Some algebra might make this last clear: Say U is the sequence
- number of the first un-acked packet and we are using a window
- size of W when packet U is dropped. Packets [U..U+W) are in
- transit. When the loss is detected, we send packet U and pull
- the window back to W/2. But in the round-trip time it takes
- the U retransmit to fill the receiver's hole and an ack to get
- back, W-1 dup acks will arrive (one for each packet in transit).
- The window is effectively inflated by one packet for each of
- these acks so packets [U..U+W/2+W-1) are sent. But we don't
- re-send packets unless we know they've been lost so the amount
- actually sent between the loss detection and the recovery ack is
- U+W/2+W-1 - U+W = W/2-1 which is exactly the amount congestion
- avoidance allows us to send (if we add in the rexmit of U). The
- recovery ack is for packet U+W so when the effective window is
- pulled back from W/2+W-1 to W/2 (which happens because the
- recovery ack is `new' and sets dupack to zero), we are allowed
- to send up to packet U+W+W/2 which is exactly the first packet
- we haven't yet sent. (I.e., there is no sudden burst of packets
- as the `hole' is filled.) Also, when sending packets between
- the loss detection and the recovery ack, we do nothing for the
- first W/2 dup acks (because they only allow us to send packets
- we've already sent) and the bottleneck gateway is given W/2
- packet times to clean out its backlog. Thus when we start
- sending our W/2-1 new packets, the bottleneck queue is as empty
- as it can be.
-
- [I don't know if you can get the flavor of what happens from
- this description -- it's hard to see without a picture. But I
- was delighted by how beautifully it worked -- it was like
- watching the innards of an engine when all the separate motions
- of crank, pistons and valves suddenly fit together and
- everything appears in exactly the right place at just the right
- time.]
-
- Also note that this algorithm interoperates with old tcp's: Most
- pre-tahoe tcp's don't generate the dup acks on out-of-order packets.
- If we don't get the dup acks, fast retransmit never fires and the
- window is never inflated so everything happens in the old way (via
- timeouts). Everything works just as it did without the new algorithm
- (and just as slow).
-
- If you want to simulate this, the intended environment is:
-
- - large bandwidth-delay product (say 20 or more packets)
-
- - receiver advertising window of two b-d p (or, equivalently,
- advertised window of the unloaded b-d p but two or more
- connections simultaneously sharing the path).
-
- - average loss rate (from congestion or other source) less than
- one lost packet per round-trip-time per active connection.
- (The algorithm works at higher loss rate but the TCP selective
- ack option has to be implemented otherwise the pipe will go empty
- waiting to fill the second hole and throughput will once again
- degrade at the product of the loss rate and b-d p. With selective
- ack, throughput is insensitive to b-d p at any loss rate.)
-
- And, of course, we should always remember that good engineering
- practise suggests a b-d p worth of buffer at each bottleneck --
- less buffer and your simulation will exhibit the interesting
- pathologies of a poorly engineered network but will probably
- tell you little about the workings of the algorithm (unless the
- algorithm misbehaves badly under these conditions but my
- simulations and measurements say that it doesn't). In these
- days of $100/megabyte memory, I dearly hope that this particular
- example of bad engineering is of historical interest only.
-
- - Van
-
- -----------------
- *** /tmp/,RCSt1a26717 Mon Apr 30 01:35:17 1990
- --- tcp_input.c Mon Apr 30 01:33:30 1990
- ***************
- *** 834,850 ****
- * Kludge snd_nxt & the congestion
- * window so we send only this one
- ! * packet. If this packet fills the
- ! * only hole in the receiver's seq.
- ! * space, the next real ack will fully
- ! * open our window. This means we
- ! * have to do the usual slow-start to
- ! * not overwhelm an intermediate gateway
- ! * with a burst of packets. Leave
- ! * here with the congestion window set
- ! * to allow 2 packets on the next real
- ! * ack and the exp-to-linear thresh
- ! * set for half the current window
- ! * size (since we know we're losing at
- ! * the current window size).
- */
- if (tp->t_timer[TCPT_REXMT] == 0 ||
- --- 834,850 ----
- * Kludge snd_nxt & the congestion
- * window so we send only this one
- ! * packet.
- ! *
- ! * We know we're losing at the current
- ! * window size so do congestion avoidance
- ! * (set ssthresh to half the current window
- ! * and pull our congestion window back to
- ! * the new ssthresh).
- ! *
- ! * Dup acks mean that packets have left the
- ! * network (they're now cached at the receiver)
- ! * so bump cwnd by the amount in the receiver
- ! * to keep a constant cwnd packets in the
- ! * network.
- */
- if (tp->t_timer[TCPT_REXMT] == 0 ||
- ***************
- *** 853,864 ****
- else if (++tp->t_dupacks == tcprexmtthresh) {
- tcp_seq onxt = tp->snd_nxt;
- ! u_int win =
- ! MIN(tp->snd_wnd, tp->snd_cwnd) / 2 /
- ! tp->t_maxseg;
-
- if (win < 2)
- win = 2;
- tp->snd_ssthresh = win * tp->t_maxseg;
- -
- tp->t_timer[TCPT_REXMT] = 0;
- tp->t_rtt = 0;
- --- 853,864 ----
- else if (++tp->t_dupacks == tcprexmtthresh) {
- tcp_seq onxt = tp->snd_nxt;
- ! u_int win = MIN(tp->snd_wnd,
- ! tp->snd_cwnd);
-
- + win /= tp->t_maxseg;
- + win >>= 1;
- if (win < 2)
- win = 2;
- tp->snd_ssthresh = win * tp->t_maxseg;
- tp->t_timer[TCPT_REXMT] = 0;
- tp->t_rtt = 0;
- ***************
- *** 866,873 ****
- tp->snd_cwnd = tp->t_maxseg;
- (void) tcp_output(tp);
- !
- if (SEQ_GT(onxt, tp->snd_nxt))
- tp->snd_nxt = onxt;
- goto drop;
- }
- } else
- --- 866,879 ----
- tp->snd_cwnd = tp->t_maxseg;
- (void) tcp_output(tp);
- ! tp->snd_cwnd = tp->snd_ssthresh +
- ! tp->t_maxseg *
- ! tp->t_dupacks;
- if (SEQ_GT(onxt, tp->snd_nxt))
- tp->snd_nxt = onxt;
- goto drop;
- + } else if (tp->t_dupacks > tcprexmtthresh) {
- + tp->snd_cwnd += tp->t_maxseg;
- + (void) tcp_output(tp);
- + goto drop;
- }
- } else
- ***************
- *** 874,877 ****
- --- 880,890 ----
- tp->t_dupacks = 0;
- break;
- + }
- + if (tp->t_dupacks) {
- + /*
- + * the congestion window was inflated to account for
- + * the other side's cached packets - retract it.
- + */
- + tp->snd_cwnd = tp->snd_ssthresh;
- }
- tp->t_dupacks = 0;
- *** /tmp/,RCSt1a26725 Mon Apr 30 01:35:23 1990
- --- tcp_timer.c Mon Apr 30 00:36:29 1990
- ***************
- *** 223,226 ****
- --- 223,227 ----
- tp->snd_cwnd = tp->t_maxseg;
- tp->snd_ssthresh = win * tp->t_maxseg;
- + tp->t_dupacks = 0;
- }
- (void) tcp_output(tp);
-
- >From van@helios.ee.lbl.gov Mon Apr 30 10:37:36 1990
- To: end2end-interest@ISI.EDU
- Subject: modified TCP congestion avoidance algorithm (correction)
- Date: Mon, 30 Apr 90 10:36:12 PDT
- From: Van Jacobson <van@helios.ee.lbl.gov>
- Status: RO
-
- I shouldn't make last minute 'fixes'. The code I sent out last
- night had a small error:
-
- *** t.c Mon Apr 30 10:28:52 1990
- --- tcp_input.c Mon Apr 30 10:30:41 1990
- ***************
- *** 885,893 ****
- * the congestion window was inflated to account for
- * the other side's cached packets - retract it.
- */
- ! tp->snd_cwnd = tp->snd_ssthresh;
- }
- - tp->t_dupacks = 0;
- if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
- tcpstat.tcps_rcvacktoomuch++;
- goto dropafterack;
- --- 885,894 ----
- * the congestion window was inflated to account for
- * the other side's cached packets - retract it.
- */
- ! if (tp->snd_cwnd > tp->snd_ssthresh)
- ! tp->snd_cwnd = tp->snd_ssthresh;
- ! tp->t_dupacks = 0;
- }
- if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
- tcpstat.tcps_rcvacktoomuch++;
- goto dropafterack;
-
- 12) Can I use a single bit subnet?
-
- A) It would seem that the consensus is no. The best citable answer
- follows.
-
- >From RFC1122:
- "3.3.6 Broadcasts
- Section 3.2.1.3 defined the four standard IP broadcast address
- forms:
- Limited Broadcast: {-1, -1}
- Directed Broadcast: {<Network-number>,-1}
- Subnet Directed Broadcast:
- {<Network-number>,<Subnet-number>,-1}
- All-Subnets Directed Broadcast: {<Network-number>,-1,-1}"
-
- All-Subnets Directed broadcasts are being deprecated in favor of IP
- multicast, but were very much defined at the time RFC1122 was written.
- Thus a Subnet Directed Broadcast to a subnet of all ones is not
- distinguishable from an All-Subnets Directed Broadcast.
-
- For those old systems that used all zeros for broadcast in IP addresses,
- a similar argument can be made against the subnet of all zeros.
-
- Also, for old routing protocols like RIP, a route to subnet zero
- is not distinguishable from the route to the entire network number
- (except possibly by context).
-
- Most of today's systems don't support variable length subnet masks
- (VLSM), and for such systems the above is true. However, all the major
- router vendors and *some* Unix systems (BSD 4.4 based ones) support
- VLSMs, and in that case the situation is more complicated :-)
-
- With VLSMs (necessary to support CIDR, see RFC 1519), you can utilize the
- address space more efficiently. Routing lookups are based on *longest*
- match, and this means that you can for instance subnet the class C net
- with a mask of 255.255.255.224 (27 bits) in addition to the subnet mask
- of 255.255.255.192 (26 bits) given above. You will then be able to use
- the addresses x.x.x.33 through x.x.x.62 (first three bits 001) and the
- addresses x.x.x.193 through x.x.x.222 (first three bits 110) with this
- new subnet mask. And you can continue with a subnet mask of 28 bits, etc.
- (Note also, by the way, that non-contiguous subnet masks are deprecated.)
-
- This is all very nicely covered in the paper by Havard Eidnes:
-
- Practical Considerations for Network Address using a CIDR Block Allocation
- Proceedings of INET '93
-
- This paper is available with anonymous FTP from
-
- aun.uninett.no:/pub/misc/eidnes-cidr.ps
-
- The same paper, with minor revisions, is one of the articles in the
- special Internetworking issue of Communications of the ACM (last month,
- I believe).
-
- > I have be told that some network equipment (Cisco I think was the vendor
- > named) will not correctly handle subnets that violated that standard.
- As far as I know cisco is one of the router vendors that *do* handle
- VLSMs correctly. Could you substantiate this claim?
-
- Steinar Haug, SINTEF RUNIT, University of Trondheim, NORWAY
- Email: Steinar.Haug@runit.sintef.no
-
- --
- George V. Neville-Neil work: gnn@wrs.com home:gnn@netcom.com
- NIC: GN82
-
- This signature kept blank due to the CDA.
-
-