Henrik Frystyk, July 1994
The Internet Protocol Stack
As mentioned in the Internet Section, the
Internet is an abstraction from the underlying network technologies
and physical address resolution. This section introduces the basic
components of the Internet protocol stack and relates the stack to the
ISO OSI reference protocol stack model. The model of the Internet
protocol stack is illustrated in the figure below.

This document describes the various parts presented in this diagram.
The upper layer protocols, e.g., FTP, Telnet, and TFTP, are described
in the Presentation Layer Protocol
section. This leaves the following topics as sections in this
document:
- Internet Protocol (IP)
- User Datagram Protocol (UDP)
- Transmission Control Protocol (TCP)
- Transactional Transmission Control Protocol (T/TCP)
- TCP/IP and OSI/RM
Internet Protocol (IP)
As seen in the figure above, the Internet protocol stack provides a
connection oriented, reliable branch (TCP) and a connectionless,
unreliable branch (UDP), both built on top of the Internet Protocol.
The Internet Protocol
layer in the TCP/IP protocol stack is the first layer that introduces
the virtual network abstraction that is the basic principle of the
Internet model. Ideally, all physical implementation details are
hidden below the IP layer, even though this is not quite true in
practice. The IP layer provides an unreliable, connectionless delivery
system. It is unreliable because the protocol does not provide any
functionality for recovering from errors where datagrams are
duplicated, lost, or arrive at the remote host in a different order
than they were sent. If no such errors occur in the physical layer,
the IP protocol guarantees that the transmission is terminated
successfully.
The basic unit of data exchange in the IP layer is the Internet
Datagram. The format of an IP datagram and a short description of the
most important fields are included below:
- LEN
- The number of 32-bit words in the IP header. Without any
OPTIONS, this value is 5.
- TYPE OF SERVICE
- Each IP datagram can be given a precedence value ranging from 0 to 7
showing the importance of the datagram. This is to allow
out-of-band data to be routed faster than normal data. This
is very important as Internet Control Message
Protocol (ICMP) messages travel as the data part of an IP
datagram. Even though an ICMP message is encapsulated in an IP
datagram, the ICMP protocol is normally thought of as an integral part
of the IP layer and not the UDP or TCP layer. Furthermore, the TYPE OF
SERVICE field allows a classification of the datagram in order to
specify whether the desired service requires short delay time, high
reliability or high throughput. However, in order for this to have any
effect, the gateways must know more than one route to the remote host
and, as described in the Introduction,
this is not the case.
- IDENT, FLAGS, and FRAGMENT OFFSET
- These fields are used to describe fragmentation of a datagram.
The actual length of an IP datagram is in principle independent of the
maximum length of the physical frames being transferred on the network,
referred to as the network's Maximum Transmission Unit (MTU). If
a datagram is longer than the MTU then it is divided into a set of
fragments having almost the same header as the original datagram but
only the amount of data that fits into a physical frame. The IDENT
field is used to identify fragments belonging to the same datagram, and
the FRAGMENT OFFSET is the relative position of the fragment within
the original datagram. Once a datagram is fragmented it stays like
that until it reaches the final destination. If one or more fragments
are lost or erroneous the whole datagram is discarded.
However, the underlying network technology is not completely hidden
below the IP layer in spite of the fragmentation functionality. The
reason is that the MTU can vary from 128 bytes or less to several
thousand bytes depending on the physical network (Ethernet has an MTU
of 1500 bytes). Choosing the right datagram size so that fragmentation
is minimized is hence a question of efficiency. It is recommended
that gateways are capable of handling datagrams of at least 576 bytes
without having to use fragmentation.
- TIME
- This is the remaining Time To Live (TTL) for a datagram
when it travels on the Internet. The Routing Information Protocol
(RIP) specifies that at most 15 hops are allowed.
- SOURCE IP-ADDRESS and DESTINATION IP-ADDRESS
- Both the source and destination addresses are indicated in the
datagram header so that the recipient can send an answer back to the
transmitting host. However, note that only the host address is
specified - not the port number. This is because the IP protocol is an
IMP-to-IMP protocol - it is not an end-to-end protocol. One more
layer is needed to actually specify which two processes on the
transmitting host and the final destination should exchange the
datagrams.
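The fragmentation rule described under IDENT, FLAGS, and FRAGMENT OFFSET above can be sketched as follows. This is a minimal illustration and not a real IP implementation: the `Fragment` class and the `fragment` and `reassemble` functions are hypothetical names, the 20-byte header follows from LEN = 5 words without OPTIONS, and counting the offset in 8-byte units is a detail taken from the IP specification rather than from this text.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Fragment:
    ident: int            # IDENT: shared by all fragments of one datagram
    offset: int           # FRAGMENT OFFSET, in 8-byte units (per the IP specification)
    more_fragments: bool  # FLAGS: set on every fragment except the last
    data: bytes


def fragment(ident: int, payload: bytes, mtu: int, header_len: int = 20) -> list[Fragment]:
    """Split a datagram payload into pieces that each fit in one physical frame."""
    # Each fragment carries its own header; the data part is rounded down
    # to a multiple of 8 bytes so that offsets stay whole numbers.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for start in range(0, len(payload), max_data):
        is_last = start + max_data >= len(payload)
        chunk = payload[start:start + max_data]
        fragments.append(Fragment(ident, start // 8, not is_last, chunk))
    return fragments


def reassemble(fragments: list[Fragment]) -> bytes | None:
    """Rebuild the payload, or return None if any fragment is missing."""
    ordered = sorted(fragments, key=lambda f: f.offset)
    if not ordered or ordered[-1].more_fragments:
        return None  # the final fragment never arrived
    expected = 0
    parts = []
    for frag in ordered:
        if frag.offset * 8 != expected:
            return None  # a gap: the whole datagram is discarded
        parts.append(frag.data)
        expected += len(frag.data)
    return b"".join(parts)
```

With an Ethernet MTU of 1500 bytes, a 4096-byte payload is split into three fragments of at most 1480 data bytes each. Dropping any one of them makes `reassemble` return `None`, mirroring the rule that the whole datagram is discarded.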
Note that the IP-datagram only leaves space for the original source
IP-address and the original destination IP-address. As mentioned in the
section Gateways and Routing, the
next hop address is specified by encapsulation. The
Internet Layer passes the IP-address of the next hop
to the Network Layer. This IP-address is bound to a
physical address and a new frame is formed with this address. The rest
of the original frame is then encapsulated in the new frame before it is
sent over the communication channel.
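As a summary of the header fields described in this section, the sketch below unpacks the fixed 20-byte part of an IP datagram header. The function name `parse_ip_header` and the dictionary keys are illustrative. The byte layout, including the version, total length, protocol, and header checksum fields not discussed above, is assumed from the IP specification.

```python
import socket
import struct


def parse_ip_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (OPTIONS are ignored)."""
    (ver_len, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_len >> 4,
        "len_words": ver_len & 0x0F,            # LEN: header length in 32-bit words
        "precedence": tos >> 5,                 # top three bits of TYPE OF SERVICE
        "total_length": total_length,
        "ident": ident,                         # IDENT
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset": flags_frag & 0x1FFF, # FRAGMENT OFFSET in 8-byte units
        "ttl": ttl,                             # TIME
        "protocol": protocol,
        "source": socket.inet_ntoa(src),        # SOURCE IP-ADDRESS
        "destination": socket.inet_ntoa(dst),   # DESTINATION IP-ADDRESS
    }
```

Note that no port number appears anywhere in the result, which is the point made above: the IP header identifies hosts only.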
User Datagram Protocol (UDP)
The User Datagram Protocol
(UDP) is a very thin protocol built on top of the Internet Protocol. The basic unit of data is a user
datagram, and the UDP protocol provides the same unreliable,
connectionless service for transferring user datagrams as the IP protocol
does for transferring its datagrams. The main difference is that the UDP
protocol is an end-to-end protocol. That is, it contains
enough information to transfer a user datagram from one process on the
transmitting host to another process on the receiving host. The format
of a user datagram is illustrated below:

The LENGTH field is the length of the user datagram including the
header, so the minimum value of LENGTH is 8 bytes. The SOURCE
PORT and DESTINATION PORT are the connection between an IP-address and
a process running on a host. A network port is normally identified by
an integer. However, the user datagram does not contain any IP-address,
so how does the UDP protocol know when the final destination is
reached?
When calculating the CHECKSUM field, the UDP protocol prepends a
12-byte pseudo header consisting of the SOURCE IP-ADDRESS, the
DESTINATION IP-ADDRESS and some additional fields. When a host
receives a UDP datagram it takes the UDP header and creates a new
pseudo header using its own IP-address as the DESTINATION IP-ADDRESS
and the SOURCE IP-ADDRESS extracted from the IP datagram. Then it
calculates a checksum and if it equals the UDP checksum, then the
datagram has reached the final destination.
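The pseudo header check can be sketched as below. The function names are hypothetical. The "additional fields" are assumed, following the UDP specification, to be a zero byte, the protocol number 17, and the UDP length, and the checksum is assumed to be the usual 16-bit one's complement sum.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Add the data as 16-bit words, folding any carry back into the sum."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp_checksum(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 payload: bytes) -> int:
    length = 8 + len(payload)  # LENGTH covers the 8-byte header plus the data
    # 12-byte pseudo header: source address, destination address,
    # a zero byte, the protocol number (17 for UDP), and the UDP length.
    pseudo = struct.pack("!4s4sBBH", socket.inet_aton(src_ip),
                         socket.inet_aton(dst_ip), 0, 17, length)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    return checksum or 0xFFFF  # a computed zero is transmitted as all ones


def reached_destination(own_ip: str, src_ip: str, src_port: int, dst_port: int,
                        payload: bytes, received_checksum: int) -> bool:
    """Receiver side: rebuild the pseudo header with the host's own address."""
    return udp_checksum(src_ip, own_ip, src_port, dst_port, payload) == received_checksum
```

A host whose own address differs from the one the sender used produces a different pseudo header, so the comparison fails and the datagram is not accepted as having arrived at its destination.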
As indicated in the Internet Protocol Stack
Figure, the UDP protocol is often used as the basic protocol in
client-server application protocols such as TFTP, DNS etc. where the
overhead of making a reliable, connection oriented transmission is
considerable. This problem will be considered further in the next two
sections.
Transmission Control Protocol (TCP)
The Transmission Control
Protocol provides a full duplex, reliable, connection oriented
service to the application layer as indicated in the Internet Protocol Stack Figure. This section
describes the basic principle of the TCP protocol and how it provides
a reliable service to the application layer protocols.
The TCP protocol is a stream oriented protocol. It is designed to
provide the application layer software with a service to transfer
large amounts of data in a reliable way. It establishes a full duplex
virtual circuit between the two transmitting hosts so that both hosts
can simultaneously put data out on the Internet without specifying the
destination host once the connection is established. In the Transactional Transmission Control Protocol (T/TCP)
section a client-server based extension to the TCP protocol is
presented as an alternative to the stream architecture.
TCP Segment Format
A segment is the basic data unit in the TCP protocol. As much of the
following sections are based on this data unit, the format is
presented here:

- SOURCE PORT, DESTINATION PORT
- The TCP protocol uses the same trick as UDP of using a pseudo header
instead of transmitting the source IP-address and the destination
IP-address, as these are already included in the IP-datagram. Therefore
only the port numbers are required in the segment to uniquely define
the communicating hosts.
- CODE
- This field is used to indicate the content of the segment and whether
a specific action has to be taken, for example if the sender has reached
EOF in the stream. The flags currently defined are
- URG Urgent pointer field is valid
- ACK Acknowledgement field is valid
- PSH This segment requests a push
- RST Reset the connection
- SYN Synchronize sequence numbers
- FIN Sender has reached end of its byte stream
- OPTIONS
- The TCP protocol uses the OPTIONS field to exchange information
such as the maximum segment size accepted between the TCP layers on
the two hosts.
- OFFSET
- This integer indicates the offset of the user data within the
segment. This field is only required because the number of bits used
in the OPTIONS field can vary.
- URGENT POINTER
- This field can be initialized to point to a place in the user
data where urgent information, such as escape codes, is placed.
Then the receiving host can process this part immediately when it
receives the segment.
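The six flags listed above share one small bit field in the segment header. The sketch below decodes such a field. The class name `Code` and the helper `describe` are illustrative, and the bit positions are assumed from the TCP specification, since they are not given in this text.

```python
from __future__ import annotations
from enum import IntFlag


class Code(IntFlag):
    # Bit values as laid out in the TCP header (assumed from the TCP specification).
    FIN = 0x01  # sender has reached end of its byte stream
    SYN = 0x02  # synchronize sequence numbers
    RST = 0x04  # reset the connection
    PSH = 0x08  # this segment requests a push
    ACK = 0x10  # acknowledgement field is valid
    URG = 0x20  # urgent pointer field is valid


def describe(code_bits: int) -> list[str]:
    """Return the names of the flags set in a CODE field value."""
    return [flag.name for flag in Code if code_bits & flag]
```

For example, `describe(Code.SYN | Code.ACK)` names the two flags carried by the second segment of a connection establishment.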
Reliable Transmission
At the IP-protocol layer packets can get discarded due to network
congestion, noise, gateway failure etc. In order to provide a reliable
service, the TCP must recover from data that is damaged, lost,
duplicated, or delivered out of order by the Internet communication
system. This is achieved by assigning a SEQUENCE NUMBER to each byte
transmitted, and requiring a positive acknowledgment (ACK)
from the receiving host. If the ACK is not received within a timeout
interval, the data is retransmitted. At the receiver, the sequence
numbers are used to correctly order segments that may be received out
of order and to eliminate duplicates. Damage is handled by adding a
checksum to each segment transmitted, checking it at the receiver, and
discarding damaged segments. The principle is illustrated in the
figure below:
Host A is transmitting a packet of data to Host B, but
the packet gets lost before it reaches its destination. However,
Host A has set a timer for when it expects the ACK from Host
B, so when this timer runs out, the packet is retransmitted. The
difficult part of the method is to find a value for the timeout period,
as a TCP segment can travel through networks with different speeds and
different loads. This means that the Round Trip Time (RTT)
can vary from segment to segment. A simple way of estimating the RTT
is by using a recursive mean value with an exponential window to
decrease the importance of old values.
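The recursive mean with an exponential window can be written in a few lines. This is a sketch only: the function names are hypothetical, and the smoothing factor `alpha = 0.875` and the timeout multiplier `beta = 2.0` are assumed example values, not values prescribed by this text.

```python
def smoothed_rtt(previous: float, sample: float, alpha: float = 0.875) -> float:
    """Blend a new RTT measurement into the running estimate.

    Each update multiplies the weight of all older samples by alpha,
    so their importance decays exponentially.
    """
    return alpha * previous + (1 - alpha) * sample


def retransmission_timeout(srtt: float, beta: float = 2.0) -> float:
    """Derive a timeout as a safety margin above the smoothed RTT."""
    return beta * srtt


# Example: the estimate moves slowly towards a run of slower samples.
estimate = 100.0  # milliseconds
for sample in (200.0, 200.0, 200.0):
    estimate = smoothed_rtt(estimate, sample)
```

A value of `alpha` close to 1 makes the estimate stable but slow to react to a change in network load, while a smaller value follows recent samples more closely.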
As mentioned in the introduction to the TCP
section, the protocol is a stream oriented protocol. It uses
unstructured streams with no method of indexing the user data, e.g. as
records etc. Furthermore, the length of a TCP segment can vary as is
the case for the IP-datagram and the UDP user datagram. Therefore the
acknowledgement can not be based on the segment number but must be
based on bytes successfully transferred.
However, this principle of positive acknowledgement with
retransmission (PAR) is very inefficient as the sending host
must await the acknowledgement before it can send the next segment.
This means that the minimum time between two segments is 1 RTT
plus the time required to serve the segments at both ends. The TCP
protocol solves this by using sliding windows at both ends.
This method permits the transmitting host to send as many bytes as can
be stored in the sending window and then wait for acknowledgements as
the remote host receives the segments and sends data in the other
direction. The acknowledgement sent back is cumulative so that it at
all times shows the next byte that the receiving host expects
to see. An example with a large window size and selective
retransmission is shown in the figure:

Byte number 1 is lost so Host B never sends back a positive
acknowledgement. When Host A times out on byte 1 it retransmits
it. However, as the rest of the bytes from 2-5 are transmitted
successfully the next acknowledgement can immediately jump to 6 which
is the next expected byte. Byte 2 is also retransmitted as Host
A does not know exactly how many bytes are erroneous. Host
B just discards byte 2 as it has already been received.
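The cumulative acknowledgement in this example can be reproduced with a small sketch. The function name `next_expected` and the use of a set to model Host B's receive buffer are illustrative assumptions. Real implementations track byte ranges rather than individual bytes.

```python
from __future__ import annotations


def next_expected(received: set[int], first_byte: int = 1) -> int:
    """Return the lowest byte number that has not yet arrived in order."""
    expected = first_byte
    while expected in received:
        expected += 1
    return expected


# Host A sends bytes 1-5, but byte 1 is lost on the way.
buffer_b = {2, 3, 4, 5}
ack_before = next_expected(buffer_b)  # still 1: nothing can be acknowledged

# Host A times out and retransmits byte 1, and also byte 2.
buffer_b.add(1)
buffer_b.add(2)                       # duplicate: the set is unchanged
ack_after = next_expected(buffer_b)   # jumps straight to 6
```

Because the acknowledgement names the next expected byte, one retransmission is enough to acknowledge all five bytes at once, and the duplicate of byte 2 has no effect.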
The window technique can also be used to provide a congestion control
mechanism. As indicated in the TCP Segment
Format Figure, every segment has a WINDOW field that specifies how
much data a host is willing to receive. If the host is heavily loaded,
it can decrease the WINDOW parameter and hence the transmission speed
drops.
However, as the TCP protocol is an end-to-end protocol it can not see
if a congestion problem has occurred in an intermediate Interface
Message Processor (IMP) (often called a packet switched
node) and hence, it has no means to control it by adjusting the
window size. TCP solves this problem by using the Internet Control Message
Protocol (ICMP) source quench messages.
Connection Establishment
When a TCP connection is to be opened a 3-way handshake (3WHS) is used
in order to establish the virtual circuit that exists until the
connection is closed at the end of the data transfer. The 3WHS is
described in the following as it is an important part of the TCP
protocol but also shows some inefficiencies in the protocol. The
principle of a 3WHS is illustrated in the figure below:
The blocks in the middle symbolize the relevant part of the TCP
segment, that is the SEQUENCE NUMBER, the ACKNOWLEDGEMENT NUMBER and
the CODE. The active Host A sends a segment indicating that it
starts its SEQUENCE NUMBER from x. Host B replies with an ACK
and indicates that it starts with SEQUENCE NUMBER y. On the third
segment both hosts agree on the sequence numbers and that they are
ready to transmit data.
In the figure only Host A does an active open. Actually the two
hosts can do a simultaneous open, in which case both hosts enter the
SYN-RECEIVED state and then synchronize accordingly. The principal reason
for the 3WHS is to prevent old duplicate connection initiations from
causing confusion.
Note that the SEQUENCE NUMBER of segments 3 and 4 is the same because
the ACK does not occupy sequence number space (if it did, the protocol
would wind up ACKing ACKs!).
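The sequence number bookkeeping of the handshake can be sketched as follows. The `Segment` tuple and the function names are hypothetical. The rule that a SYN occupies one sequence number, and that sequence numbers wrap at 2^32, is assumed from the TCP specification.

```python
from __future__ import annotations
from typing import NamedTuple, Optional

MODULUS = 2 ** 32  # sequence numbers are 32-bit and wrap around


class Segment(NamedTuple):
    sender: str
    seq: int             # SEQUENCE NUMBER
    ack: Optional[int]   # ACKNOWLEDGEMENT NUMBER, None when the ACK flag is not set
    code: str            # flags in the CODE field


def three_way_handshake(x: int, y: int) -> list[Segment]:
    """Return the three segments exchanged when Host A opens towards Host B."""
    return [
        Segment("A", x, None, "SYN"),                              # A starts from x
        Segment("B", y, (x + 1) % MODULUS, "SYN+ACK"),             # B starts from y
        Segment("A", (x + 1) % MODULUS, (y + 1) % MODULUS, "ACK"), # both agree
    ]


def first_data_segment(x: int, y: int) -> Segment:
    """Segment 4: the pure ACK before it used no sequence number space."""
    return Segment("A", (x + 1) % MODULUS, (y + 1) % MODULUS, "ACK")
```

Comparing `three_way_handshake(x, y)[2].seq` with `first_data_segment(x, y).seq` shows the point made above: segments 3 and 4 carry the same SEQUENCE NUMBER.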
However, the TCP connection establishment is somewhat long and cumbersome
in many applications, especially in client-server applications
such as the World-Wide Web. In the next section an alternative having
a lighter connection establishment is presented.
Transactional Transmission Control Protocol (T/TCP)
The TCP protocol is a highly symmetric protocol in that both hosts can
transmit and receive data simultaneously. However, not all
applications are symmetrical by nature. A typical example is a
client-server protocol such as the Domain
Name Service. The Transactional Transmission
Control Protocol (T/TCP), which is a very new protocol (July 1994),
offers an alternative to TCP when high performance is required in
client-server applications. Some of the requirements of a high
performance transaction oriented protocol are listed below:
- The interaction between the client and the server is based on a
request followed by a response, that is a stateless approach.
- The protocol must guarantee that a transaction is carried out at
most one time and any duplicate packets received by the server should
be discarded.
- No explicit open or close procedure of the connection. This is
opposite to TCP and the 3WHS as described above.
- The minimum transaction latency for a client should be Round
Trip Time (RTT) + Server Processing Time (SPT). That is
basically the same requirement as no explicit open or close procedure.
- The protocol should be able to handle a reliable minimum
transaction of exactly 1 segment in both directions.
This section describes how the T/TCP protocol deals with these
requirements and how that might affect the World-Wide
Web model with respect to performance.
Implicit Connection Establishment
The T/TCP protocol is, as indicated by the name, based on the TCP
protocol, and T/TCP is backwards compatible with TCP. However, one of
the features of the T/TCP protocol is that it can bypass the 3WHS
described in the previous section but in case of
failure can fall back to the 3WHS procedure.
The 3WHS has been introduced in order to prevent old duplicate
connection initiations from causing confusion. However, T/TCP provides
an alternative to this by introducing three new parameters in the
OPTIONS field in the TCP Segment:
- CONNECTION COUNT (CC)
- This is a 32-bit incarnation number where a distinct value is
assigned to all segments sent from Host A to Host B and
another distinct number the other way. The kernel on both hosts keeps
a cache of all the CC numbers currently used by connections to remote
hosts. On every new connection the client CC number is monotonically
incremented by 1 so that a segment belonging to a new connection can
be separated from old duplicates from previous connections.
- CONNECTION COUNT NEW (CC.NEW)
- In some situations, the principle of a monotonically increasing
value of CC can be violated, either due to a host crash or because the
maximum number, that is 4G, is reached and the counter returns to 0.
This is possible in practice because the same CC counter is global to
all connections. In this situation a CC.NEW is sent and the remote
host resets its cache and returns to a normal 3WHS TCP connection
establishment. This signal will always be sent from the
client to the server.
- CONNECTION COUNT ECHO (CC.ECHO)
- In the server response the CC.ECHO field contains the CC value
sent by the client so that the client can validate the response as
belonging to a specific transaction.
The bypass of the 3WHS is illustrated in the following figure:

In the example, two segments are sent in both directions. The connection is
established when the first segment reaches the server. The client is left in a
TIME-WAIT state which is explained in the next section.
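The server-side decision between bypassing the 3WHS and falling back to it can be sketched with the CC cache described above. This is a simplified model under stated assumptions: the class `ServerCache` and its method names are hypothetical, counter wraparound is ignored, and the cache is assumed to be filled in once a normal 3WHS has completed.

```python
from __future__ import annotations


class ServerCache:
    """Per-client cache of the last CONNECTION COUNT (CC) seen by the server."""

    def __init__(self) -> None:
        self.last_cc: dict[str, int] = {}

    def accept_initial_segment(self, client: str, cc: int, cc_new: bool = False) -> str:
        """Return 'bypass' if the 3WHS can be skipped, otherwise '3WHS'."""
        if cc_new:
            # CC.NEW: the client's counter is no longer monotonic, so the
            # cached value is reset and a normal handshake is used.
            self.last_cc.pop(client, None)
            return "3WHS"
        cached = self.last_cc.get(client)
        if cached is not None and cc > cached:
            # Strictly larger than anything seen before: not an old duplicate.
            self.last_cc[client] = cc
            return "bypass"
        # Unknown client or possible old duplicate: fall back to the 3WHS.
        return "3WHS"

    def handshake_completed(self, client: str, cc: int) -> None:
        """Record the CC value once a normal 3WHS has succeeded."""
        self.last_cc[client] = cc
```

The first contact from a client always goes through the 3WHS. Later connections with a larger CC value are accepted on the first segment, and a repeated CC value is treated as a possible old duplicate.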
Connection Shutdown
Every TCP or UDP connection between two hosts is uniquely identified
by the following 5-tuple:
- Protocol (UDP, TCP)
- IP-address of Host A
- Port number of Host A
- IP-address of Host B
- Port number of Host B
Whenever a TCP connection has been closed, the association described
by the 5-tuple enters a wait state to assure that both hosts have
received the final acknowledgement from the closing procedure. The
time of the wait state is called TIME-WAIT and is by default 2*MSL
(120 seconds) where MSL is the Maximum Segment Lifetime. That is,
two hosts cannot perform a new transaction using the same 5-tuple
until at least 120 seconds after the previous connection has been
terminated. One way to circumvent this problem is to select another
5-tuple but as mentioned in Extending TCP for
Transactions -- Concepts this does not scale due to the excessive
amount of kernel space occupied by terminated TCP connections hanging
around.
However, the T/TCP CC numbers give a unique identification of each
transaction so the T/TCP protocol is capable of truncating the
TIME-WAIT state by comparing the CC numbers. This principle can be looked
at as expanding the state machine of one transaction to also include
information on previous and future transactions using the same 5-tuple.
T/TCP and the World-Wide Web
As will be shown in the description of World-Wide Web of this thesis, the principle
of the World-Wide Web is a transaction oriented exchange of data
objects. This is the reason why the T/TCP protocol is very interesting
from this perspective.
TCP/IP and OSI/RM
The International Organization for Standardization (ISO) has designed
the second dominating protocol layering scheme, called the ISO Open
Systems Interconnection Reference Model (OSI/RM). This section presents
the OSI reference model and compares it to the TCP/IP protocol stack
as illustrated in the figure.

- Physical Layer
- Specifies the physical connection between the host computers and
IMPs and how bits are transferred over a communication channel.
- Data Link Layer
- This layer specifies how data travels between IMPs using
frames. Its main task is to change the service from the
physical layer into a packet oriented error-free transmission.
- Network Layer
- The frames from the Data Link Layer are organized into
packets and directed through the network. The communication
is still between IMPs.
- Transport Layer
- The first layer that provides an end-to-end transport service. It
ensures that the transferred data arrive correctly at the other end.
- Session Layer
- This layer specifies how two hosts can establish sessions where
data can be transferred in both directions on a virtual connection
between the two hosts.
- Presentation Layer
- The Presentation layer introduces a set of syntax and semantics of the
information transmitted through the lower protocol layers.
- Application Layer
- This layer defines a platform independent virtual network
terminal so that application programs can exchange data regardless of
the internal data representation used.
Even though OSI/RM and TCP/IP can be compared like this, there still
exist several significant differences between OSI/RM and the TCP/IP
protocol stack. One of the most fundamental is that OSI/RM is a
standardized model for how the functionality of a protocol
stack can be organized. It doesn't specify the exact services and
protocols to be used in each layer, whereas TCP/IP is a result of
experimental research. In spite of this, the OSI/RM model has been the
basis of several protocol stack implementations such as X.25, discussed
in A Critique of X.25.
Another difference is where the intelligence is placed in the
layering. OSI/RM introduces a reliable service on the Data Link Layer
whereas TCP/IP only has intelligence in the Transport Layer. Both
solutions have advantages and disadvantages. When a reliable data
transfer service is placed in the lower layers the clients using the
network for communication can be kept very simple as they do not have
to handle complicated error situations. The disadvantage is that
performance decreases due to an excessive amount of control information
transferred and processed in every host.
Henrik
Frystyk, frystyk@info.cern.ch, July 1994