Henrik Frystyk, July 1994

The Internet Protocol Stack


As mentioned in the Internet Section, the Internet is an abstraction from the underlying network technologies and physical address resolution. This section introduces the basic components of the Internet protocol stack and relates the stack to the ISO OSI reference model. The model of the Internet protocol stack is illustrated in the figure below.

This document describes the various parts presented in this diagram. The upper-layer protocols, e.g. FTP, Telnet, TFTP, etc., are described in the Presentation Layer Protocol section. This leaves the following topics as sections in this document:

  1. Internet Protocol (IP)
  2. User Datagram Protocol (UDP)
  3. Transmission Control Protocol (TCP)
  4. Transactional Transmission Control Protocol (T/TCP)
  5. TCP/IP and OSI/RM

Internet Protocol (IP)

As seen in the figure above, the Internet protocol stack provides a connection-oriented, reliable branch (TCP) and a connectionless, unreliable branch (UDP), both built on top of the Internet Protocol.

The Internet Protocol layer in the TCP/IP protocol stack is the first layer that introduces the virtual network abstraction that is the basic principle of the Internet model. All physical implementation details are (ideally, even though this is not quite true) hidden below the IP layer. The IP layer provides an unreliable, connectionless delivery system. It is unreliable because the protocol provides no error recovery for datagrams that are duplicated, lost, or arrive at the remote host in a different order than they were sent. If no such errors occur in the physical layer, the IP protocol guarantees that the transmission is completed successfully.

The basic unit of data exchange in the IP layer is the Internet Datagram. The format of an IP datagram and a short description of the most important fields are included below:

LEN
The number of 32-bit words in the IP header. Without any OPTIONS, this value is 5.
TYPE OF SERVICE
Each IP datagram can be given a precedence value ranging from 0-7, indicating the importance of the datagram. This allows out-of-band data to be routed faster than normal data, which matters because Internet Control Message Protocol (ICMP) messages travel as the data part of an IP datagram. Even though an ICMP message is encapsulated in an IP datagram, the ICMP protocol is normally thought of as an integral part of the IP layer and not of the UDP or TCP layer. Furthermore, the TYPE OF SERVICE field allows a classification of the datagram, specifying whether the desired service requires short delay time, high reliability, or high throughput. However, in order for this to have any effect, the gateways must know more than one route to the remote host and, as described in the Introduction, this is not the case.
IDENT, FLAGS, and FRAGMENT OFFSET
These fields are used to describe fragmentation of a datagram. The actual length of an IP datagram is in principle independent of the length of the physical frames being transferred on the network, referred to as the network's Maximum Transfer Unit (MTU). If a datagram is longer than the MTU, it is divided into a set of fragments, each having almost the same header as the original datagram but carrying only the amount of data that fits into a physical frame. The IDENT field is used to identify fragments belonging to the same datagram, and the FRAGMENT OFFSET is the relative position of the fragment within the original datagram. Once a datagram is fragmented it stays like that until it reaches the final destination. If one or more fragments are lost or erroneous the whole datagram is discarded.

However, the underlying network technology is not completely hidden below the IP layer in spite of the fragmentation functionality. The reason is that the MTU can vary from 128 bytes or less to several thousand bytes depending on the physical network (Ethernet has an MTU of 1500 bytes). Choosing the right datagram size so that fragmentation is minimized is hence a question of efficiency. It is recommended that gateways be capable of handling datagrams of at least 576 bytes without having to use fragmentation.

TIME
This is the remaining Time To Live (TTL) for a datagram when it travels on the Internet. The Routing Information Protocol (RIP) specifies that at most 15 hops are allowed.
SOURCE IP-ADDRESS and DESTINATION IP-ADDRESS
Both the source and destination addresses are indicated in the datagram header so that the recipient can send an answer back to the transmitting host. However, note that only the host address is specified - not the port number. This is because the IP protocol is an IMP-to-IMP protocol - it is not an end-to-end protocol. One more layer is needed to specify which process on the transmitting host and which process on the final destination should exchange the datagrams.
Note that the IP datagram only leaves space for the original source IP-address and the original destination IP-address. As mentioned in the section Gateways and Routing, the next hop address is specified by encapsulation. The Internet Layer passes the IP-address of the next hop to the Network Layer. This IP-address is bound to a physical address, and a new frame is formed with this address. The rest of the original frame is then encapsulated in the new frame before it is sent over the communication channel.
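
To make the header layout concrete, the following is a minimal Python sketch (not part of the original specification) that unpacks the fixed 20-byte IPv4 header into the fields described above; the field names follow the diagram, and the byte layout is that of IPv4.

    import struct

    def parse_ip_header(raw):
        """Unpack the fixed 20-byte part of an IPv4 header (no OPTIONS)."""
        # "!" = network (big-endian) byte order; B = 1 byte, H = 2 bytes, I = 4 bytes
        ver_len, tos, total_len, ident, flags_frag, ttl, proto, checksum, src, dst = \
            struct.unpack("!BBHHHBBHII", raw[:20])
        return {
            "LEN": ver_len & 0x0F,                   # header length in 32-bit words (5 without OPTIONS)
            "TYPE OF SERVICE": tos,
            "TOTAL LENGTH": total_len,               # header plus data, in bytes
            "IDENT": ident,
            "FLAGS": flags_frag >> 13,               # top 3 bits of the 16-bit field
            "FRAGMENT OFFSET": flags_frag & 0x1FFF,  # in units of 8 bytes
            "TIME": ttl,                             # remaining Time To Live
            "PROTOCOL": proto,                       # 6 = TCP, 17 = UDP
            "CHECKSUM": checksum,
            "SOURCE IP-ADDRESS": src,
            "DESTINATION IP-ADDRESS": dst,
        }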

User Datagram Protocol (UDP)

The User Datagram Protocol (UDP) is a very thin protocol built on top of the Internet Protocol. The basic unit of data is a user datagram, and the UDP protocol provides the same unreliable, connectionless service transferring user datagrams as the IP protocol does transferring its datagrams. The main difference is that the UDP protocol is an end-to-end protocol. That is, it contains enough information to transfer a user datagram from one process on the transmitting host to another process on the receiving host. The format of a user datagram is illustrated below:

The LENGTH field is the length of the user datagram including the header, so the minimum value of LENGTH is 8 bytes. The SOURCE PORT and DESTINATION PORT are the connection between an IP-address and a process running on a host. A network port is normally identified by an integer. However, the user datagram does not contain any IP-address, so how does the UDP protocol know when the final destination is reached?

When calculating the CHECKSUM field, the UDP protocol includes a 12-byte pseudo header consisting of the SOURCE IP-ADDRESS, the DESTINATION IP-ADDRESS, and some additional fields. When a host receives a UDP datagram, it takes the UDP header and creates a new pseudo header using its own IP-address as the DESTINATION IP-ADDRESS and the SOURCE IP-ADDRESS extracted from the IP datagram. It then calculates a checksum, and if it equals the UDP checksum, the datagram has reached the final destination.
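
A rough sketch of this checksum procedure, written in Python for illustration and assuming IPv4 addresses given as 4-byte strings, could look like the following; the CHECKSUM field of the datagram is assumed to be zero while the sum is computed.

    import struct

    def ones_complement_sum(data):
        """16-bit one's-complement sum used by the IP family of checksums."""
        if len(data) % 2:
            data += b"\x00"                           # pad to an even number of bytes
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)  # fold the carry back into the sum
        return total

    def udp_checksum(src_ip, dst_ip, udp_datagram):
        """CHECKSUM over the 12-byte pseudo header plus the user datagram.

        src_ip and dst_ip are 4-byte IPv4 addresses; udp_datagram must have
        its own CHECKSUM field set to zero while the sum is computed.
        """
        pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 17, len(udp_datagram))
        return (~ones_complement_sum(pseudo + udp_datagram)) & 0xFFFF

The receiving host repeats this calculation with the addresses taken from the IP datagram and compares the result with the transmitted CHECKSUM.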

As indicated in the Internet Protocol Stack Figure, the UDP protocol is often used as the basic protocol in client-server application protocols such as TFTP, DNS, etc., where the overhead of making a reliable, connection-oriented transmission is considerable. This problem is considered further in the next two sections.

Transmission Control Protocol (TCP)

The Transmission Control Protocol provides a full duplex, reliable, connection-oriented service to the application layer, as indicated in the Internet Protocol Stack Figure. This section describes the basic principle of the TCP protocol and how it provides a reliable service to the application layer protocols.

The TCP protocol is a stream-oriented protocol. It is designed to provide the application layer software with a service to transfer large amounts of data in a reliable way. It establishes a full duplex virtual circuit between the two transmitting hosts so that both hosts can simultaneously put data out on the Internet without specifying the destination host once the connection is established. In the Transactional Transmission Control Protocol (T/TCP) section, a client-server based extension to the TCP protocol is presented as an alternative to the stream architecture.

TCP Segment Format

A segment is the basic data unit in the TCP protocol. As much of the following discussion is based on this data unit, its format is presented here:

SOURCE PORT, DESTINATION PORT
Instead of transmitting the source IP-address and the destination IP-address, which are already included in the IP datagram, the TCP protocol uses the same pseudo-header trick as UDP. Therefore only the port numbers are required to uniquely identify the communicating processes.
CODE
This field is used to indicate the content of the segment and whether a specific action has to be taken, such as when the sender has reached EOF in the stream.
OPTIONS
The TCP protocol uses the OPTIONS field to exchange information such as the maximum segment size accepted between the TCP layers on the two hosts. The flags currently defined in the CODE field are
  • URG Urgent pointer field is valid
  • ACK Acknowledgement field is valid
  • PSH This segment requests a push
  • RST Reset the connection
  • SYN Synchronize sequence numbers
  • FIN Sender has reached end of its byte stream
OFFSET
This integer indicates the offset of the user data within the segment. This field is only required because the number of bits used in the OPTIONS field can vary.
URGENT POINTER
This field can be initialized to point to a place in the user data where urgent information, such as escape codes, is placed. The receiving host can then process this part immediately when it receives the segment.

Reliable Transmission

At the IP-protocol layer, packets can get discarded due to network congestion, noise, gateway failure, etc. In order to provide a reliable service, TCP must recover from data that is damaged, lost, duplicated, or delivered out of order by the Internet communication system. This is achieved by assigning a SEQUENCE NUMBER to each byte transmitted and requiring a positive acknowledgement (ACK) from the receiving host. If the ACK is not received within a timeout interval, the data is retransmitted. At the receiver, the sequence numbers are used to correctly order segments that may be received out of order and to eliminate duplicates. Damage is handled by adding a checksum to each segment transmitted, checking it at the receiver, and discarding damaged segments. The principle is illustrated in the figure below:

Host A is transmitting a packet of data to Host B, but the packet gets lost before it reaches its destination. However, Host A has set up a timer indicating when to expect the ACK from Host B, so when this timer runs out, the packet is retransmitted. The difficult part of the method is finding a good value for the timeout period, as a TCP segment can travel through networks of different speeds with different loads. This means that the Round Trip Time (RTT) can vary from segment to segment. A simple way of estimating the RTT is to use a recursive mean value with an exponential window to decrease the importance of old values.
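
The recursive mean with an exponential window mentioned above corresponds to the smoothed round-trip time estimator of the original TCP specification; a minimal Python sketch, with typical (assumed) values for the smoothing and safety constants, is shown below.

    class RttEstimator:
        """Smoothed round-trip time with an exponentially decaying window."""

        def __init__(self, alpha=0.875, beta=2.0):
            self.alpha = alpha   # weight of the old estimate (0.8-0.9 is typical)
            self.beta = beta     # safety factor applied to the smoothed RTT
            self.srtt = None     # smoothed RTT in seconds

        def update(self, measured_rtt):
            """Fold a new RTT measurement into the running estimate."""
            if self.srtt is None:
                self.srtt = measured_rtt
            else:
                self.srtt = self.alpha * self.srtt + (1 - self.alpha) * measured_rtt

        def timeout(self):
            """Retransmit if no ACK has arrived within this interval."""
            return self.beta * self.srtt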

As mentioned in the introduction to the TCP section, the protocol is a stream-oriented protocol. It uses unstructured streams with no method of indexing the user data, e.g. as records. Furthermore, the length of a TCP segment can vary, as is the case for the IP datagram and the UDP user datagram. Therefore the acknowledgement cannot be based on the segment number but must be based on bytes successfully transferred.

However, the PAR (Positive Acknowledgement with Retransmission) principle is very inefficient, as the sending host must await the acknowledgement before it can send the next segment. This means that the minimum time between two segments is 1 RTT plus the time required to serve the segments at both ends. The TCP protocol solves this by using sliding windows at both ends.

This method permits the transmitting host to send as many bytes as can be stored in the sending window and then wait for acknowledgements as the remote host receives the segments and sends data in the other direction. The acknowledgement sent back is cumulative, so that it at all times shows the next byte that the receiving host expects to see. An example with a large window size and selective retransmission is shown in the figure:

Byte number 1 is lost, so Host B never sends back a positive acknowledgement. When Host A times out on byte 1, it retransmits it. However, as the rest of the bytes from 2-5 are transmitted successfully, the next acknowledgement can immediately jump to 6, which is the next expected byte. Byte 2 is also retransmitted, as Host A does not know exactly how many bytes are erroneous. Host B simply discards byte 2 as it has already been received.
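
The cumulative acknowledgement scheme in the figure can be sketched as a toy receiver that buffers out-of-order bytes and always acknowledges the next byte it expects; the byte numbering and data structures below are illustrative assumptions, not part of the protocol.

    class CumulativeAckReceiver:
        """Toy receiver: buffers out-of-order bytes and ACKs the next expected one."""

        def __init__(self):
            self.next_expected = 1   # the figure numbers bytes from 1
            self.buffer = {}         # out-of-order bytes keyed by byte number

        def receive(self, seq, byte):
            if seq < self.next_expected:
                pass                 # duplicate (e.g. the retransmitted byte 2): discard
            else:
                self.buffer[seq] = byte
                # Deliver any contiguous run starting at next_expected.
                while self.next_expected in self.buffer:
                    del self.buffer[self.next_expected]
                    self.next_expected += 1
            return self.next_expected   # cumulative ACK: next byte we expect to see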

The window technique can also be used to provide a congestion control mechanism. As indicated in the TCP Segment Format Figure, every segment has a WINDOW field that specifies how much data a host is willing to receive. If the host is heavily loaded, it can decrease the WINDOW parameter and hence the transmission speed drops.

However, as the TCP protocol is an end-to-end protocol, it cannot see whether a congestion problem has occurred in an intermediate Interface Message Processor (IMP) (often called a packet switched node) and hence it has no means to control it by adjusting the window size. TCP solves this problem by using the Internet Control Message Protocol (ICMP) source quench messages.

Connection Establishment

When a TCP connection is to be opened, a 3-way handshake (3WHS) is used in order to establish the virtual circuit that exists until the connection is closed at the end of the data transfer. The 3WHS is described in the following as it is an important part of the TCP protocol, but it also shows some inefficiencies in the protocol. The principle of a 3WHS is illustrated in the figure below:

The blocks in the middle symbolize the relevant parts of the TCP segment, that is the SEQUENCE NUMBER, the ACKNOWLEDGEMENT NUMBER, and the CODE. The active Host A sends a segment indicating that it starts its SEQUENCE NUMBER from x. Host B replies with an ACK and indicates that it starts with SEQUENCE NUMBER y. In the third segment both hosts agree on the sequence numbers and that they are ready to transmit data.

In the figure only Host A does an active open. The two hosts can actually do a simultaneous open, in which case both hosts perform a SYN-RECEIVED and then synchronize accordingly. The principal reason for the 3WHS is to prevent old duplicate connection initiations from causing confusion.

Note that the SEQUENCE NUMBER of segments 3 and 4 is the same because the ACK does not occupy sequence number space (if it did, the protocol would wind up ACKing ACKs!).
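
The exchange in the figure can be summarised as a small Python sketch of the sequence and acknowledgement numbers involved; the values of x and y are arbitrary here.

    def three_way_handshake(x, y):
        """Return the three segments of the handshake as (CODE, SEQ, ACK) tuples."""
        return [
            ("SYN",     x,     None),   # Host A: "my sequence numbers start at x"
            ("SYN+ACK", y,     x + 1),  # Host B: "mine start at y, I expect x+1 next"
            ("ACK",     x + 1, y + 1),  # Host A: both sides agree; data can flow
        ]

    for code, seq, ack in three_way_handshake(x=100, y=300):
        print(f"{code:8s} SEQ={seq} ACK={ack if ack is not None else '-'}")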

However, the TCP connection establishment is somewhat cumbersome in many applications, especially in client-server applications such as the World-Wide Web. In the next section an alternative with a lighter connection establishment is presented.

Transactional Transmission Control Protocol (T/TCP)

The TCP protocol is a highly symmetric protocol in that both hosts can transmit and receive data simultaneously. However, not all applications are symmetrical by nature. A typical example is a client-server protocol such as the Domain Name Service. The Transactional Transmission Control Protocol (T/TCP), which is a very new protocol (July 1994), offers an alternative to TCP when high performance is required in client-server applications. This section describes how the T/TCP protocol deals with the requirements of a high-performance, transaction-oriented protocol and how this might affect the World-Wide Web model with respect to performance.

Implicit Connection Establishment

The T/TCP protocol is, as indicated by the name, based on the TCP protocol, and T/TCP is backwards compatible with TCP. However, one of the features of the T/TCP protocol is that it can bypass the 3WHS described in the previous section; in case of failure it falls back to the 3WHS procedure.

The 3WHS was introduced in order to prevent old duplicate connection initiations from causing confusion. However, T/TCP provides an alternative by introducing three new parameters in the OPTIONS field of the TCP segment:

CONNECTION COUNT (CC)
This is a 32-bit incarnation number where a distinct value is assigned to all segments sent from Host A to Host B and another distinct value the other way. The kernel on both hosts keeps a cache of the CC numbers currently used by connections to remote hosts. On every new connection the client CC number is monotonically incremented by 1, so that a segment belonging to a new connection can be separated from old duplicates from previous connections.
CONNECTION COUNT NEW (CC.NEW)
In some situations the principle of a monotonically increasing CC value can be violated, either due to a host crash or because the maximum number, that is 4G, is reached and the counter wraps around to 0. This is possible in practice because the same CC number is global to all connections. In this situation a CC.NEW is sent, and the remote host resets its cache and returns to a normal 3WHS TCP connection establishment. This signal is always sent from the client to the server.
CONNECTION COUNT ECHO (CC.ECHO)
In the server response, the CC.ECHO field contains the CC value sent by the client, so that the client can validate the response as belonging to a specific transaction.
The bypass of the 3WHS is illustrated in the following figure:

In the example, two segments are sent in each direction. The connection is established when the first segment reaches the server. The client is left in a TIME-WAIT state, which is explained in the next section.
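
The server-side CC test described above can be sketched roughly as follows; the cache layout and function names are illustrative assumptions, and the real test (including the CC.NEW handling) is performed in the kernel.

    cc_cache = {}   # latest CC value seen from each client host

    def accept_syn(client_host, cc):
        """Decide whether a SYN carrying a CC option can bypass the 3WHS."""
        last = cc_cache.get(client_host)
        if last is None or cc <= last:
            # Unknown host, or a CC that is not larger than the cached value
            # (possibly an old duplicate): fall back to the ordinary 3WHS.
            return "3WHS"
        cc_cache[client_host] = cc
        return "accept immediately"   # the CC identifies this as a new connection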

Connection Shutdown

Every TCP or UDP connection between two hosts is uniquely identified by the following 5-tuple: the protocol, the source IP-address, the source port, the destination IP-address, and the destination port. Whenever a TCP connection has been closed, the association described by the 5-tuple enters a wait state to assure that both hosts have received the final acknowledgement from the closing procedure. The duration of the wait state is called TIME-WAIT and is by default 2*MSL (120 seconds), where MSL is the Maximum Segment Lifetime. That is, two hosts cannot perform a new transaction using the same 5-tuple until at least 120 seconds after the previous connection has been terminated. One way to circumvent this problem is to select another 5-tuple, but as mentioned in Extending TCP for Transactions -- Concepts, this does not scale due to the excessive amount of kernel space occupied by terminated TCP connections hanging around.

However, the T/TCP CC numbers give a unique identification of each transaction, so the T/TCP protocol is capable of truncating the TIME-WAIT state by comparing the CC numbers. This principle can be viewed as expanding the state machine of one transaction to also include information on previous and future transactions using the same 5-tuple.
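
The role of the 5-tuple and the CC-based truncation of TIME-WAIT can be sketched as follows; the data structures and the helper function are illustrative assumptions.

    import time

    MSL = 60.0                 # Maximum Segment Lifetime in seconds
    TIME_WAIT = 2 * MSL        # default wait state: 120 seconds

    # Recently closed connections keyed by the 5-tuple
    # (protocol, source IP, source port, destination IP, destination port).
    closed = {}                # 5-tuple -> (time of close, last CC value)

    def may_reuse(five_tuple, new_cc=None):
        """Can a new transaction reuse this 5-tuple right away?"""
        entry = closed.get(five_tuple)
        if entry is None:
            return True
        close_time, last_cc = entry
        if new_cc is not None and new_cc > last_cc:
            return True        # T/TCP: the CC identifies a new transaction
        return time.time() - close_time >= TIME_WAIT   # plain TCP must wait 2*MSL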

T/TCP and the World-Wide Web

As will be shown in the description of the World-Wide Web in this thesis, the principle of the World-Wide Web is a transaction-oriented exchange of data objects. This is the reason why the T/TCP protocol is very interesting in this perspective.

TCP/IP and OSI/RM

The International Standards Organization (ISO) has designed the second dominating protocol layering scheme, called the ISO Open Systems Interconnection Reference Model (OSI/RM). This section presents the OSI reference model and compares it to the TCP/IP protocol stack as illustrated in the figure.

Physical Layer
Specifies the physical connection between the host computers and IMPs and how bits are transferred over a communication channel.
Data Link Layer
This layer specifies how data travels between IMPs using frames. Its main task is to change the service from the physical layer into a packet-oriented, error-free transmission.
Network Layer
The frames from the Data Link Layer are organized into packets and directed through the network. The communication is still between IMPs.
Transport Layer
The first layer that provides an end-to-end transport service. It ensures that the transferred data arrive correctly at the other end.
Session Layer
This layer specifies how two hosts can establish sessions where data can be transferred in both directions on a virtual connection between the two hosts.
Presentation Layer
The Presentation layer introduces a set of syntax and semantics of the information transmitted through the lower protocol layers.
Application Layer
This layer defines a platform independent virtual network terminal so that application programs can exchange data regardless of the internal data representation used.
Even though OSI/RM and TCP/IP can be compared like this, there still exist several significant differences between OSI/RM and the TCP/IP protocol stack, but one of the most fundamental is that OSI/RM is a standardized model for how the functionality of a protocol stack can be organized: it does not specify the exact services and protocols to be used in each layer, whereas TCP/IP is the result of experimental research. In spite of this, the OSI/RM model has been the basis of several protocol stack implementations such as X.25, discussed in A Critique of X.25.

Another difference is where the intelligence is placed in the layering. OSI/RM introduces a reliable service at the Data Link Layer, whereas TCP/IP only has intelligence in the Transport Layer. Both solutions have advantages and disadvantages. When a reliable data transfer service is placed in the lower layers, the clients using the network for communication can be kept very simple, as they do not have to handle complicated error situations. The disadvantage is that performance decreases due to an excessive amount of control information transferred and processed in every host.


Henrik Frystyk, frystyk@info.cern.ch, July 1994