Thesis on the World-Wide Web


Note: This document is still under construction. Any suggestions or ideas are welcome at frystyk@info.cern.ch.

This paper is an attempt to document my work as a technical student at CERN, the European Laboratory for Particle Physics, from February 1, 1994 to August 1, 1994. It is, however, also an attempt to visualize the potential features of the information exchange system called the World-Wide Web that I have been working on during this period. The document serves as my master's thesis from Aalborg University, Denmark, Institute of Electronic Systems, Department of Communication Technology.

The documentation is organized into several hypertext documents, which gives the reader the possibility of reading it in any number of ways and makes it fundamentally different from what any paper version can provide. All internal references are made as hyperlinks so that any jump back, or forth, in the documentation is immediately accessible using whatever means are available within the World-Wide Web client used. All external references available on the Internet are also accessible through hyperlinks. This means that the reader is not limited to this particular information but at any instant can jump to any other information provider on the Internet using the World-Wide Web. Another characteristic is that, unlike a paper version, this documentation is constantly changing, as the Web itself is constantly changing. This is an inherent consequence of using a global information system with thousands of highly independent users and information providers.

The documentation is organized in a tree structure in which this document serves as the top node or root document for further browsing. The following list is a "list of contents" that indicates the overall structure of the tree and a proposed way of traversing it.

  1. Introduction to the Internet
  2. The Internet Protocol Stack
  3. Presentation Layer Protocols
  4. The World-Wide Web
  5. World-Wide Web Software at CERN
  6. The HTTP Protocol
  7. Multi Threaded Clients
  8. Put and Post
  9. Multi Part HTTP Transactions
  10. Summary
  11. References
However, any other path might be just as good, and I hope that when you are reading it you will find your own way through. All documents related to this work have a "home link" symbolized by the little icon to the left that points back to this page.

Finally, I would like to thank my supervisor at CERN, Tim Berners-Lee, and Ari Luotonen for their great support and inspiration throughout the project.


Henrik Frystyk, libwww@info.cern.ch, July 1994
Introduction to the Internet


This chapter gives an overview of the Internet. It presents the history and the basic model of the Internet, but it is not an attempt to describe the Internet in all its detail, which would be out of scope here. Please see Douglas Comer for an excellent description of the Internet. The structure of the document is as follows:
  1. Architectural Model
  2. Address Scheme
  3. Domain Name Server
  4. Gateways and Routing
In the late 1960s the American Advanced Research Projects Agency, ARPA (later the Defense Advanced Research Projects Agency, DARPA), started a research project on the subject of computer networks. One of the first results of this project was an experimental four-node network starting in 1969. Later the network expanded to include several military installations and scientific research centers. In the mid 1970s work began towards the Internet, with the architecture and protocols taking their current form around 1978-79.

The Internet as we know it today started around 1980 when DARPA started to use the TCP/IP protocol stack on all installations connected to the DARPA Internet. The transition ended in the beginning of 1983 when TCP/IP became the only protocol stack allowed on the Internet. This is still the current situation on the Internet, but now it has grown to several thousands of nodes and millions of users. The countries connected to the Internet are illustrated in the figure below.

In 1988, DARPA decided that the experiment of ARPANET was complete and started to dismantle the ARPANET, which, until then, was the backbone of the Internet. However, at the same time the American National Science Foundation established the NSFNET, which then became the new backbone network with a capacity (1992) of 45 Mbps.

Many Internet organizations other than DARPA have an important influence on the further development of the Internet. A few of them are mentioned below:

The Internet Activities Board (IAB)
This organization was created in 1983 in order to guide the evolution of the Internet. It now has two major components: the Internet Engineering Task Force and the Internet Research Task Force.
Internet Engineering Task Force (IETF)
The IETF is the protocol engineering, development, and standardization branch of the Internet Architecture Board (IAB). The IETF manages the Request for Comments (RFC) documents.
Internet Research Task Force (IRTF)
The IRTF is the research and development branch of the IAB, doing research in new network technologies.
InterNIC Information Services
The InterNIC is a collaborative project of three organizations: General Atomics, AT&T, and Network Solutions, Inc. Their goal is to make networking and network information more easily accessible to researchers, educators, and the general public. They work together with the Network Information Centers (NICs) located throughout the Internet.

Architectural Model

The term "Internet" is a generalization that covers thousands of interconnected networks around the world based on very different technologies. The networks differ in almost any possible network specific parameter such as transmission medium, geographical size, number of nodes, transmission speed, throughput, reliability etc. The only reason why this generalization is possible is because the Internet is based on an abstraction that is independent of the physical hardware. In short, it represents a homogenous interface to its users in spite of the heterogeneous hardware that it is based on.

The diversity among networks connected to the Internet is partly due to an evolution of technology resulting in new networks having higher reliability, better throughput, etc. However, there will (at least for a long time) exist a need for fundamentally different network architectures, as no network technology today can supply a solution that covers all aspects of internetworking.

This section introduces the basic architecture of how the Internet is organized. The description starts at an abstraction level that does not include a description of the underlying physical network technologies such as Ethernet, Token Ring, FDDI, etc. These are all described in Computer Networks. The basic idea of an internet is to provide the possibility of transporting data from one network to another through a connection in a way that both parties agree on and understand. The connection between the two consists of a gateway computer that is physically or logically connected to both networks (logically in the case of a wireless network). The situation between two networks looks like:

Each cloud is a network with an arbitrary number of connected nodes. The gateway between them serves as the only way of exchanging data directly between the two networks. Later in this chapter it is described how two hosts can communicate even though they are not connected directly but must go through intermediate networks.

Address Scheme

In order to reference any node as a unique point on the Internet, a global two-dimensional 32-bit integer address space has been defined, which gives a maximum of 4G connected nodes on the Internet. The first element is a netid and the second is a hostid, that is:
	address = (netid, hostid)
A common notation for specifying an Internet address is by using four fields of decimal integer numbers ranging from 0 to 255 separated by decimal points, e.g.:
	128.141.201.214
which is the IP-address of the World-Wide Web info server at CERN.

Address Classes

In order to provide IP-addresses which suit both large networks with millions of hosts and small networks with a few hundred hosts, the netid part and the hostid part can occupy a varying part of the IP-address. The number of bits assigned to the hostid, and hence the number of possible nodes on a network, divides the address space into 5 classes:

The definition of the classes is as follows:

Class A
This class has a 1 byte netid and a 3 byte hostid. As networks in this category are characterized by having a 0 as the first bit in the address, the maximum number of networks is 128. However, as 24 bits are available for the hostid, each network can contain 16M connections. A network can be categorised by the first fields in the address, and for a Class A network the value of the first field is in the range 0-127.
Class B
Class B networks have 2 bytes for the netid, but as they are required to start with the bit combination 10b, the maximum number of networks is 16K. The number of connected nodes is 64K and the value of the first field ranges from 128-191 and the second from 1-254.
Class C
This class is for small networks with a maximum number of nodes limited to 256. This class is characterized by having the leading bit pattern 110b, which leaves the maximum number of networks at 2M. The value of the first field is from 192-223, the second from 0-255, and the third from 1-254.
Class D
Class D networks are networks without the possibility of addressing any individual node. All 32 bits are used by the netid, and hence any reference to the network is automatically a broadcast message to all the connected hosts. The characteristic leading bit pattern for this class is 1110b.
Class E
This class is currently not in use but is reserved for future use. However, the characteristic leading bit pattern for this class is defined as 11110b.
From this description it can be seen that the IP-address given above is a Class B network with the possibility of 64K nodes.
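
To illustrate how the leading bits determine the class, the following small sketch in C classifies a dotted-decimal address. The helper name is illustrative and not taken from any library:

    #include <stdio.h>
    #include <arpa/inet.h>            /* inet_addr(), ntohl() */

    /* Classify an IP-address by its leading bits (illustrative only) */
    static char ip_class(const char *dotted)
    {
        unsigned long a = ntohl(inet_addr(dotted)); /* host byte order */
        if ((a & 0x80000000UL) == 0)            return 'A'; /* 0...  */
        if ((a & 0xC0000000UL) == 0x80000000UL) return 'B'; /* 10... */
        if ((a & 0xE0000000UL) == 0xC0000000UL) return 'C'; /* 110.. */
        if ((a & 0xF0000000UL) == 0xE0000000UL) return 'D'; /* 1110. */
        return 'E';                                         /* 11110 */
    }

    int main(void)
    {
        printf("128.141.201.214 is a Class %c address\n",
               ip_class("128.141.201.214"));    /* prints: Class B */
        return 0;
    }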

An interesting consequence of having the IP-address contain information about the network is that a gateway, being connected to two networks, must also have two IP-addresses in order to be accessible from both sides. This is the reason for not referring to a number of hosts but to nodes or connections to the network. In Gateways and Routing it is described how the current addressing scheme influences the routing algorithms used on the Internet.

Physical Addresses

It is important to note that Internet addresses are an abstraction from the addresses in a physical network implementation like Ethernet. They assure that the same addressing scheme can be used in every part of the Internet regardless of the implementation of the underlying physical network. In order to do this, a binding must exist between the IP-address and the physical address. Depending on the physical network addressing scheme, this binding can be either static or dynamic. An example of the latter is the Ethernet addressing scheme, which is a 48-bit integer. As it is not possible to map 48 bits into a 32-bit IP-address without losing information, the binding must be determined dynamically. The Address Resolution Protocol (ARP) is specially designed for binding Ethernet addresses dynamically to IP-addresses but can be used for other schemes as well.

Subnetworks

As will be explained in the section Gateways and Routing, Internet routing between gateways is based on the netid part of the IP-address. In the past few years a very large number of small networks with only a few hundred nodes have been connected to the Internet. Having so many netids makes the routing procedure complicated and time-consuming. One solution to this is to introduce a subnet addressing scheme where a single IP-address spans a set of physical networks. This scheme can also be used to divide a large number of nodes into logical groups within the same network.

The scheme is standardized and described in the RFC IP Subnet Extension. The idea is basically to use three coordinates in the IP-address instead of two, that is:

	address = (netid, subnetid, nodeid)
However, the subnetid only has a special meaning "behind" the front subnet gateway. The rest of the Internet can not see it and treats the subnetid and the nodeid as the hostid. Only the gateways indicated in the figure need to know of the subnets and can then route accordingly.

Furthermore, the subnet hierarchy does not have to be symmetric. This is indicated in the figure where subnet 3 and 4 are subnets of subnet 2, whereas subnet 1 does not have any subnets.

A 32-bit subnet mask for each level in the subnet hierarchy is required in order to make the gateway routing possible between the subnets. This mask specifies what part of the IP-address is the subnetid and what part is the nodeid by simple boolean AND'ing.
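
As a small example of the AND'ing, the following sketch separates the subnet part from the node part of an address. The address and mask values are purely illustrative:

    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
        /* Illustrative values: a Class B address where 8 bits of
           the hostid are used as subnetid */
        unsigned long addr = ntohl(inet_addr("128.141.201.214"));
        unsigned long mask = 0xFFFFFF00UL;      /* netid + subnetid */

        unsigned long subnet = addr & mask;     /* subnet part */
        unsigned long node   = addr & ~mask;    /* node part   */

        printf("subnet part: %08lx, node part: %08lx\n", subnet, node);
        return 0;
    }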

Special Addresses

One advantage of having the network encoded as a part of the IP-address is that it is possible to refer to the network as well as individual hosts. Three special cases have been specifically allocated for exploiting this feature:

Broadcast Messages

It is possible to generate a broadcast message to all nodes on a network by specifying the netid and letting the hostid be all 1s. However, there is no guarantee that the physical network actually supports broadcast messages, so the feature is only an indicator. It is not possible to make a broadcast message to the whole Internet in one operation. This is to prevent the network from flooding the Internet with global broadcast messages.

This network

Situations might appear where a host on a network does not know the netid of the network that it is connected to. This happens every time a host without permanent storage (e.g., a diskless host) wants to get onto the net. However, the host does know its physical address, which is sufficient for communicating locally within the network. In this situation it sets the netid to 0 and sends out a broadcast message on the local network. Two Internet protocols are available for doing this:
Reverse Address Resolution Protocol (RARP)
This protocol is adapted from the Address Resolution Protocol and is especially created to resolve 48-bit physical Ethernet addresses into 32-bit IP-addresses. Only a dedicated RARP server on the network will answer the request by filling in the netid and sending it back to the requester. In case the main RARP server is down, a backup RARP server can be chosen to perform the job.
Internet Control Message Protocol (ICMP)
This is a generic low level error and information protocol that can be used for sending error and information messages between any host and gateway (and also from gateway to gateway and host to host). It also has the possibility of sending out a simple information request message, and this can be used to obtain the netid of the network. In this situation, the gateways on the local network will respond to the request with an information message having the right netid.

Local Host

By convention the Class A address 127.0.0.1 is known as a loopback address for the local host. This address provides the possibility of accessing resources local to your own system. On Unix platforms, this is defined in the /etc/hosts system file.

Domain Name Server

This section is an introduction to the Internet Domain Name Service (DNS). See DNS and Bind for a complete description of the service. The DNS is built on top of a distributed database where every data record is indexed by a name that is a part of the Domain Name Space. The index itself is a hierarchically organized tree structure, as illustrated in the following figure:

where the top node is called the root domain with the null label (empty string) but referenced as a single dot. Each node in the tree is labeled with a name consisting of at most 63 characters taken from the set of

  • letters from A-Z (case insensitive)
  • digits from 0-9
  • hyphen
The advantage of having a hierarchical structure of the name space is that administration of the space can be delegated to different organizations without any risk of name collisions. This is very important, as the size of the DNS database is foreseen to be proportional to the number of users on the Internet, because the database can contain information not only about hosts but also about personal mail addresses.

The structure shown above is very similar to the Unix file system. The most important difference is that a record in the DNS database is indexed from the bottom of the tree and up whereas a Unix file is indexed from the top of the tree, e.g.:

info.cern.ch
info is the host name and cern.ch is the domain name.
/usr/local/bin/emacs
emacs is the file name and /usr/local/bin is the path
Another similarity is aliases, which are pointers to the official host name in the DNS database; in the Unix file system the equivalent is (soft) links.

DNS is a client-server based application consisting of the Domain Name Servers and the resolvers. A server contains information about some segment of the DNS database and makes it available to clients or resolvers. Resolvers are often just software libraries that are linked into any Internet program by default.
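
As an illustration, a resolver call from a C program could look like the following sketch using the Berkeley gethostbyname() interface:

    #include <stdio.h>
    #include <netdb.h>                /* gethostbyname() */
    #include <arpa/inet.h>            /* inet_ntoa()     */

    int main(void)
    {
        struct hostent *he = gethostbyname("info.cern.ch");
        if (he == NULL) {
            fprintf(stderr, "lookup failed\n");
            return 1;
        }
        /* The resolver returns the official host name and its
           IP-address(es) from the DNS database */
        printf("official name: %s\n", he->h_name);
        printf("IP-address:    %s\n",
               inet_ntoa(*(struct in_addr *) he->h_addr_list[0]));
        return 0;
    }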

In the next section it is described what happens when a host has more than one physical connection to the Internet and hence more than one host name.

Gateways and Routing

When a message is to be sent from one host to another, some mechanism must provide the functionality of choosing the exact path along which the message is to be transmitted. When routing a message, two distinct situations can occur:

  • The transmitting and receiving host are connected to the same physical network
  • The transmitting and receiving host are separated by one or many networks
In the first case, routing is a question of resolving the IP-address into a physical address as described in Physical Addresses. This section will give an overview of how the latter case is handled using gateways.

  • Coupling to the address scheme
  • Multi-homed hosts
  • Routing in subnets

Problems in the Internet Model

Now that the basic properties of the Internet model have been introduced, some problems or weaknesses in the current model have become clear. This section briefly summarizes the most important limitations in the current Internet architecture.

  • ADDRESSING
  • ROUTING
  • EFFICIENCY OF TCP
Another important aspect not described here is security considerations when using the Internet: what means do others have to gain access to classified information when it is communicated between Internet sites? Today, security precautions on the Internet are often based on the assumption that the transport service provided by the Internet can be considered a trusted carrier. This is equivalent to the generally accepted assumption that letters sent via the public postal system are actually delivered to the addressee without being read by anyone during transportation.

This is, however, not true on the Internet, and many problems have arisen simply from people listening to the net traffic. Especially protocols like FTP and the Telnet protocol (the control connection in the FTP protocol is actually a telnet connection) have proven to be very insecure, as passwords are transmitted unencrypted across the Internet.


Henrik Frystyk, July 1994
The Internet Protocol Stack


As mentioned in the Internet Section the Internet is an abstraction from the underlying network technologies and physical address resolution. This section introduces the basic components of the Internet protocol stack and relates the stack to the ISO OSI reference protocol stack model. The model of the Internet protocol stack is illustrated in the figure below.

This document describes the various parts presented in this diagram. The upper layer protocols, e.g., FTP, Telnet, TFTP, etc., are described in the Presentation Layer Protocols section. This leaves the following topics as sections in this document:

Internet Protocol (IP)

As seen in the figure above, the Internet protocol stack provides a connection-oriented, reliable branch (TCP) and a connectionless, unreliable branch (UDP), both built on top of the Internet Protocol.

The Internet Protocol layer in the TCP/IP protocol stack is the first layer that introduces the virtual network abstraction that is the basic principle of the Internet model. All physical implementation details are (ideally, even though this is not quite true) hidden below the IP layer. The IP layer provides an unreliable, connectionless delivery system. The reason why it is unreliable stems from the fact that the protocol does not provide any error recovery for datagrams that are duplicated, lost, or arrive at the remote host in another order than they were sent. If no such errors occur in the physical layer, the IP protocol guarantees that the transmission is terminated successfully.

The basic unit of data exchange in the IP layer is the Internet Datagram. The format of an IP datagram and a short description of the most important fields are included below:

LEN
The number of 32-bit words in the IP header. Without any OPTIONS, this value is 5.
TYPE OF SERVICE
Each IP datagram can be given a precedence value ranging from 0-7, showing the importance of the datagram. This is to allow out-of-band data to be routed faster than normal data, which is very important, as Internet Control Message Protocol (ICMP) messages travel as the data part of an IP datagram. Even though an ICMP message is encapsulated in an IP datagram, the ICMP protocol is normally thought of as an integral part of the IP layer and not of the UDP or TCP layer. Furthermore, the TYPE OF SERVICE field allows a classification of the datagram in order to specify whether the service desired requires short delay time, high reliability, or high throughput. However, in order for this to have any effect, the gateways must know more than one route to the remote host and, as described in the Introduction, this is not the case.
IDENT, FLAGS, and FRAGMENT OFFSET
These fields are used to describe fragmentation of a datagram. The actual length of an IP datagram is in principle independent of the length of the physical frames being transferred on the network, referred to as the network's Maximum Transfer Unit (MTU). If a datagram is longer than the MTU, it is divided into a set of fragments having almost the same header as the original datagram but only the amount of data that fits into a physical frame. The IDENT flag is used to identify fragments belonging to the same datagram, and the FRAGMENT OFFSET is the relative position of the fragment within the original datagram. Once a datagram is fragmented, it stays like that until it reaches the final destination. If one or more fragments are lost or erroneous, the whole datagram is discarded.

However, the underlying network technology is not completely hidden below the IP layer in spite of the fragmentation functionality. The reason is that the MTU can vary from 128 bytes or less to several thousand bytes depending on the physical network (Ethernet has an MTU of 1500 bytes). It is hence a question of efficiency to choose the right datagram size so that fragmentation is minimized. It is recommended that gateways be capable of handling datagrams of at least 576 bytes without having to use fragmentation.

TIME
This is the remaining Time To Live (TTL) for a datagram when it travels on the Internet. The Routing Information Protocol (RIP) specifies that at most 15 hops are allowed.
SOURCE IP-ADDRESS and DESTINATION IP-ADDRESS
Both the source and destination address are indicated in the datagram header so that the recipient can send an answer back to the transmitting host. However, note that only the host address is specified - not the port number. This is because the IP protocol is an IMP-to-IMP protocol - it is not an end-to-end protocol. One more layer is needed to specify which processes on the transmitting host and the final destination should receive the datagrams.

User Datagram Protocol (UDP)

The User Datagram Protocol (UDP) is a very thin protocol built on top of the Internet Protocol. The basic unit of data is a user datagram, and the UDP protocol provides the same unreliable, connectionless service transferring user datagrams as the IP protocol does transferring its datagrams. The main difference is that the UDP protocol is an end-to-end protocol. That is, it contains enough information to transfer a user datagram from one process on the transmitting host to another process on the receiving host. The format of a user datagram is illustrated below:

The LENGTH field is the length of the user datagram including the header, that is, the minimum value of LENGTH is 8 bytes. The SOURCE PORT and DESTINATION PORT are the connection between an IP-address and a process running on a host. A network port is normally identified by an integer. However, the user datagram does not contain any IP-address, so how does the UDP protocol know when the final destination is reached?

When calculating the CHECKSUM header, the UDP protocol appends a 12-byte pseudo header consisting of the SOURCE IP-ADDRESS, the DESTINATION IP-ADDRESS, and some additional fields. When a host receives a UDP datagram, it takes the UDP header and creates a new pseudo header using its own IP-address as the DESTINATION IP-ADDRESS and the SOURCE IP-ADDRESS extracted from the IP datagram. Then it calculates a checksum, and if it equals the UDP checksum, the datagram has reached its final destination.
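
A sketch of the pseudo header and the 16-bit one's complement checksum used in this calculation is shown below. The field names are illustrative, and the struct is conceptual (it ignores alignment and byte order issues):

    #include <stdint.h>
    #include <stddef.h>

    /* Conceptual layout of the 12-byte pseudo header used only for
       the CHECKSUM calculation; it is never transmitted on the wire */
    struct pseudo_header {
        uint32_t source_ip;        /* SOURCE IP-ADDRESS      */
        uint32_t destination_ip;   /* DESTINATION IP-ADDRESS */
        uint8_t  zero;             /* always 0               */
        uint8_t  protocol;         /* 17 for UDP             */
        uint16_t udp_length;       /* the LENGTH field again */
    };

    /* The Internet checksum: a 16-bit one's complement sum over the
       pseudo header, the UDP header, and the data */
    static uint16_t internet_checksum(const uint16_t *data, size_t words)
    {
        uint32_t sum = 0;
        while (words--)
            sum += *data++;
        while (sum >> 16)          /* fold the carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t) ~sum;
    }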

As indicated in the Internet Protocol Stack Figure, the UDP protocol is often used as the basic protocol in client-server application protocols such as TFTP, DNS, etc., where the overhead of making a reliable, connection-oriented transmission is considerable. This problem will be considered further in the next two sections.

Transmission Control Protocol (TCP)

The Transmission Control Protocol provides a full duplex, reliable, connection-oriented service to the application layer, as indicated in the Internet Protocol Stack Figure. This section describes the basic principle of the TCP protocol and how it provides a reliable service to the application layer protocols.

The TCP protocol is a stream-oriented protocol. It is designed to provide the application layer software with a service to transfer large amounts of data in a reliable way. It establishes a full duplex virtual circuit between the two transmitting hosts so that both hosts can simultaneously put data out on the Internet without specifying the destination host once the connection is established. In the Transactional Transmission Control Protocol (T/TCP) section, a client-server based extension to the TCP protocol is presented as an alternative to the stream architecture.

TCP Segment Format

A segment is the basic data unit in the TCP protocol. As much of the following is based on this data unit, its format is presented here:

SOURCE PORT, DESTINATION PORT
The TCP protocol uses the same trick of using a pseudo header instead of transmitting the source IP-address and the destination IP-address, as these are already included in the IP-datagram. Therefore only the port numbers are required to uniquely identify the communicating processes.
CODE
This field contains a set of flag bits that indicate the content of the segment and whether a specific action has to be taken, for example when the sender has reached EOF in the stream. The flags currently defined are
  • URG Urgent pointer field is valid
  • ACK Acknowledgement field is valid
  • PSH This segment requests a push
  • RST Reset the connection
  • SYN Synchronize sequence numbers
  • FIN Sender has reached end of its byte stream
OPTIONS
The TCP protocol uses the OPTIONS field to exchange information like the maximum segment size accepted between the TCP layers on the two hosts.
OFFSET
This integer indicates the offset of the user data within the segment. The field is only required because the number of bits used in the OPTIONS field can vary.
URGENT POINTER
This field can be initialized to point to a place in the user data where urgent information, such as escape codes, is placed. The receiving host can then process this part immediately when it receives the segment.

Reliable Transmission

At the IP-protocol layer, packets can get discarded due to network congestion, noise, gateway failure, etc. In order to provide a reliable service, TCP must recover from data that is damaged, lost, duplicated, or delivered out of order by the Internet communication system. This is achieved by assigning a SEQUENCE NUMBER to each byte transmitted and requiring a positive acknowledgement (ACK) from the receiving host. If the ACK is not received within a timeout interval, the data is retransmitted. At the receiver, the sequence numbers are used to correctly order segments that may be received out of order and to eliminate duplicates. Damage is handled by adding a checksum to each segment transmitted, checking it at the receiver, and discarding damaged segments. The principle is illustrated in the figure below:

Host A is transmitting a packet of data to Host B, but the packet gets lost before it reaches its destination. However, Host A has set up a timer indicating when to expect the ACK from Host B, so when this timer runs out, the packet is retransmitted. The difficult part of the method is to find a value for the timeout period, as a TCP segment can travel through different speed networks with different loads. This means that the Round Trip Time (RTT) can vary from segment to segment. A simple way of calculating the RTT is by using a recursive mean with an exponential window to decrease the importance of old values.
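
Such an estimator could look like the following sketch, where the smoothing factor ALPHA is illustrative (RFC 793 suggests values between 0.8 and 0.9):

    /* Smoothed RTT: a recursive mean with an exponential window.
       A high ALPHA makes the estimate adapt slowly to new samples */
    #define ALPHA 0.875

    static double srtt = 0.0;      /* smoothed round trip time */

    static void update_rtt(double measured_rtt)
    {
        if (srtt == 0.0)
            srtt = measured_rtt;   /* first measurement */
        else
            srtt = ALPHA * srtt + (1.0 - ALPHA) * measured_rtt;
    }

On each acknowledgement, the newly measured RTT is fed into update_rtt(), after which the retransmission timeout can be set to a small multiple of srtt.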

As mentioned in the introduction to the TCP section, the protocol is a stream oriented protocol. It uses unstructured streams with no method of indexing the user data, e.g. as records etc. Furthermore, the length of a TCP segment can vary as is the case for the IP-datagram and the UDP user datagram. Therefore the acknowledgement can not be based on the segment number but must be based on bytes successfully transferred.

However, this Positive Acknowledgement with Retransmission (PAR) principle is very inefficient, as the sending host must await the acknowledgement before it can send the next segment. This means that the minimum time between two segments is 1 RTT plus the time required to serve the segments at both ends. The TCP protocol solves this by using sliding windows at both ends.

This method permits the transmitting host to send as many bytes as can be stored in the sending window and then wait for acknowledgements as the remote host receives the segments and sends data in the other direction. The acknowledgement sent back is cumulative so that it at all times shows the next byte that the receiving host expects to see. An example with a large window size and selective retransmission is shown in the figure:

Byte number 1 is lost, so Host B never sends back a positive acknowledgement. When Host A times out on byte 1, it retransmits it. However, as the rest of the bytes from 2-5 are transmitted successfully, the next acknowledgement can immediately jump to 6, which is the next expected byte. Byte 2 is also retransmitted, as Host A does not know exactly how many bytes are erroneous. Host B just discards byte 2, as it has already been received.

The window technique can also be used to provide a congestion control mechanism. As indicated in the TCP Segment Format Figure, every segment has a WINDOW field that specifies how much data a host is willing to receive. If the host is heavily loaded, it can decrease the WINDOW parameter and hence the transmission speed drops.

However, as the TCP protocol is an end-to-end protocol, it can not see if a congestion problem has occurred in an intermediate Interface Message Processor (IMP) (often called a packet-switched node), and hence it has no means to control it by adjusting the window size. TCP solves this problem by using the Internet Control Message Protocol (ICMP) source quench messages.

Connection Establishment

When a TCP connection is to be opened a 3-way handshake (3WHS) is used in order to establish the virtual circuit that exists until the connection is closed at the end of the data transfer. The 3WHS is described in the following as it is an important part of the TCP protocol but also shows some inefficiencies in the protocol. The principle of a 3WHS is illustrated in the figure below:

The blocks in the middle symbolize the relevant parts of the TCP segment, that is, the SEQUENCE NUMBER, the ACKNOWLEDGEMENT NUMBER, and the CODE. The active Host A sends a segment indicating that it starts its SEQUENCE NUMBER from x. Host B replies with an ACK and indicates that it starts with SEQUENCE NUMBER y. On the third segment both hosts agree on the sequence numbers and that they are ready to transmit data.

In the figure only Host A does an active open. Actually the two hosts can do a simultaneous open, in which case both hosts perform a SYN-RECEIVED and then synchronize accordingly. The principal reason for the 3WHS is to prevent old duplicate connection initiations from causing confusion.

Note that the SEQUENCE NUMBER of segment 3 and 4 is the same because the ACK does not occupy sequence number space (if it did, the protocol would wind up ACKing ACK's!).

However, the TCP connection establishment is somewhat cumbersome in many applications, especially in client-server applications such as the World-Wide Web. In the next section an alternative with a lighter connection establishment is presented.

Transactional Transmission Control Protocol (T/TCP)

The TCP protocol is a highly symmetric protocol in that both hosts can transmit and receive data simultaneously. However, not all applications are symmetrical by nature. A typical example is a client-server protocol such as the Domain Name Service. The Transactional Transmission Control Protocol (T/TCP), a very new protocol (July 1994), offers an alternative to TCP when high performance is required in client-server applications. Some of the requirements of a high-performance transaction-oriented protocol are listed below:
  • The interaction between the client and the server is based on a request followed by a response, that is a stateless approach.
  • The protocol must guarantee that a transaction is carried out at most once, and any duplicate packets received by the server should be discarded.
  • No explicit open or close procedure of the connection. This is opposite to TCP and the 3WHS as described above.
  • The minimum transaction latency for a client should be Round Trip Time (RTT) + Server Processing Time (SPT). That is basically the same requirement as no explicit open or close procedure.
  • The protocol should be able to handle a reliable minimum transaction of exactly 1 segment in both directions.
This section describes how the T/TCP protocol deals with these requirements and how it might affect the World-Wide Web model with respect to performance.

Implicit Connection Establishment

The T/TCP protocol is, as indicated by the name, based on the TCP protocol, and T/TCP is backwards compatible with TCP. However, one of the features of the T/TCP protocol is that it can bypass the 3WHS described in the previous section, but in case of failure it can fall back to the 3WHS procedure.

The 3WHS has been introduced in order to prevent old duplicate connection initiations from causing confusion. However, T/TCP provides an alternative to this by introducing three new parameters in the OPTIONS field in the TCP segment:

CONNECTION COUNT (CC)
This is a 32-bit incarnation number where a distinct value is assigned to all segments sent from Host A to Host B and another distinct number the other way. The kernel on both hosts keeps a cache of all the CC numbers currently used by connections to remote hosts. On every new connection, the client CC number is monotonically incremented by 1 so that a segment belonging to a new connection can be separated from old duplicates from previous connections.
CONNECTION COUNT NEW (CC.NEW)
In some situations the principle of a monotonically increasing value of CC can be violated, either due to a host crash or because the maximum number, that is 4G, is reached and the counter returns to 0. This is possible in practice because the same CC number is global to all connections. In this situation a CC.NEW is sent, and the remote host resets its cache and returns to a normal 3WHS TCP connection establishment. This signal will always be sent from the client to the server.
CONNECTION COUNT ECHO (CC.ECHO)
In the server response the CC.ECHO field contains the CC value sent by the client so that the client can validate the response as belonging to a specific transaction.
The bypass of the 3WHS is illustrated in the following figure:

Connection Shutdown

Every TCP or UDP connection between two hosts is uniquely identified by the following 5-tuple:
  • Protocol (UDP, TCP)
  • IP-address of Host A
  • Port number of Host A
  • IP-address of Host B
  • Port number of Host B
Whenever a TCP connection has been closed, the association described by the 5-tuple enters a wait state to assure that both hosts have received the final acknowledgement from the closing procedure. The duration of the wait state is called TIME-WAIT and is by default 2*MSL (120 seconds), where MSL is the Maximum Segment Lifetime. That is, two hosts can not perform a new transaction using the same 5-tuple until at least 120 seconds after the previous connection has been terminated. One way to circumvent this problem is to select another 5-tuple, but as mentioned in Extending TCP for Transactions -- Concepts, this does not scale due to the excessive amount of kernel space occupied by terminated TCP connections hanging around.
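
Expressed as a data structure, the 5-tuple could look like this (the field names are illustrative):

    #include <stdint.h>

    /* The 5-tuple that uniquely identifies an association; a closed
       association stays in TIME-WAIT keyed on exactly these values */
    struct association {
        uint8_t  protocol;    /* UDP or TCP            */
        uint32_t ip_a;        /* IP-address of Host A  */
        uint16_t port_a;      /* port number of Host A */
        uint32_t ip_b;        /* IP-address of Host B  */
        uint16_t port_b;      /* port number of Host B */
    };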

However, the T/TCP CC numbers give a unique identification of each transaction, so the T/TCP protocol is capable of truncating the TIME-WAIT state by comparing the CC numbers. This principle can be looked at as expanding the state machine of one transaction to also include information on previous and future transactions using the same 5-tuple.

T/TCP and the World-Wide Web

As will be shown in the description of the World-Wide Web in this thesis, the principle of the World-Wide Web is a transaction-oriented exchange of data objects. This is the reason why the T/TCP protocol is very interesting in this perspective.

TCP/IP and OSI/RM

  • OSI Model
  • X.25
  • Intelligence in network

Henrik Frystyk, July 1994
Presentation Layer Protocols


This section introduces some of the presentation layer protocols on the Internet that are related to the World-Wide Web project. The main WWW protocol, the Hypertext Transfer Protocol, is described in The HTTP Protocol. The protocols presented are:

Multipurpose Internet Mail Extensions

The Multipurpose Internet Mail Extensions (MIME) protocol is an extension to the Standard for the Format of ARPA Internet Text Messages. This protocol has defined the standard format of textual mail messages on the Internet since it came out in 1982. It describes the format of message headers, but it tells little about the content of the body of the message, which is limited to 7-bit ASCII characters. The MIME protocol provides the necessary extension to the MAIL protocol in order to transfer possibly multi-part textual and non-textual data objects in the body of a MAIL message.

The protocol basically specifies a set of header lines together with a set of name-value pairs. The reason for describing the MIME protocol in this document is that it is an important part of the Hypertext Transfer Protocol (HTTP) described later. In the following, the most important header lines are introduced:

Content Type

The content type header field is a set of types and subtypes that specifies the content of the body part of a MAIL message. The protocol specifies 7 content types and a large set of subtypes, specified in the header as
	Content-Type: <type> / <subtype> *(";" parameter)
The idea behind this format is to let MIME compliant software know the maintype of a data object even though it might not be able to handle the specific subtype. The main types are:
Text
Used to represent textual data objects. The charset used can be specified as a parameter and can be any ASCII or ISO charset.
Multipart
This type does not describe the specific content of the body but allows the body to consist of several body parts within the same message. Each body part can then have its own content-type that again might be a multi-part message. The multi-part content type provides the possibility of a hierarchical data object structure in MIME conformant messages.
Application
This content type describes binary data in some form. In practice it is often used to describe data objects that a MIME frontend does not know how to handle otherwise. The default action taken is to dump the content to, e.g., a local file to allow further processing by other processes.
Message
It is often desirable to encapsulate a MAIL message into another, e.g. when forwarding a message to a new recipient. The Message content type has been defined for this purpose.
Image
This content type describes still images and pictures
Audio
Used for audio or voice data objects
Video
Used for transmitting video frames or moving image data. The current subtype is mpeg
The last three content-types all have a vast number of subtypes, as the number of graphical and audio data formats is very large. However, the content type allows every MIME conformant frontend to handle it in an intelligent manner.
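
As an illustration, a multipart message carrying a text part and an image part could look like the following, where the boundary string is arbitrary and chosen by the mailer:

    Content-Type: multipart/mixed; boundary="simple-boundary"

    --simple-boundary
    Content-Type: text/plain; charset=us-ascii

    This is the readable text part of the message.
    --simple-boundary
    Content-Type: image/gif
    Content-Transfer-Encoding: base64

    (base64 encoded image data)
    --simple-boundary--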

Content Transfer Encoding

The content transfer encoding allows 8-bit data objects to be encoded so that they can be transferred over 7-bit ASCII lines without loss of data.
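
One such encoding defined by MIME is base64, where every 3 input bytes (24 bits) are mapped onto 4 characters from a 64-character subset of ASCII, and trailing bytes are handled by '=' padding. A minimal sketch of an encoder:

    #include <stdio.h>
    #include <stddef.h>

    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Encode len bytes of arbitrary 8-bit data as 7-bit safe ASCII */
    static void base64_encode(const unsigned char *in, size_t len, FILE *out)
    {
        size_t i;
        for (i = 0; i + 2 < len; i += 3) {      /* full 3-byte groups */
            fputc(b64[in[i] >> 2], out);
            fputc(b64[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)], out);
            fputc(b64[((in[i + 1] & 0x0F) << 2) | (in[i + 2] >> 6)], out);
            fputc(b64[in[i + 2] & 0x3F], out);
        }
        if (len - i == 1) {                     /* one byte left  */
            fputc(b64[in[i] >> 2], out);
            fputc(b64[(in[i] & 0x03) << 4], out);
            fputs("==", out);
        } else if (len - i == 2) {              /* two bytes left */
            fputc(b64[in[i] >> 2], out);
            fputc(b64[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)], out);
            fputc(b64[(in[i + 1] & 0x0F) << 2], out);
            fputc('=', out);
        }
    }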

File Transfer Protocol

Network News Transfer Protocol

WHOIS++


Henrik Frystyk, July 1994
The World-Wide Web


The official description of the World-Wide Web (WWW, W3) is a "wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents". It is a way of viewing all the on-line information available on the Internet as a seamless, browsable continuum.

This section introduces the current model of the World-Wide Web idea and explains the basic element in this model. The content of the section is as follows:

Basic World-Wide Web Model

For the first time on the Internet, a generic information exchange tool is made available that not only accesses information within its own model but also incorporates all information accessible through other information systems such as Gopher and WAIS.

The basic idea behind the World-Wide Web is based on a client server application and hypertext documents as illustrated in the following figure:

The Client
The client is the user's interface to the Internet. Whatever type of service is requested, this interface stays the same, so users do not need to understand the differences between the many protocols in common use on the Internet.
Uniform Resource Identifier URI
The user initiates a request by specifying a Uniform Resource Identifier or a "hyperlink". This link can specify any accessible information on the Internet as long as it can be identified as a data object. In the model shown, the request is an HTTP request, but it can in principle be any Internet protocol supported by the WWW.
Hypertext Transfer Protocol HTTP
The client sends off the user request to a WWW server using the stateless HTTP protocol.
The Server
The HTTP server gets the document requested and returns it to the client. In the figure shown it is an HTML hypertext document, but it can in principle be an arbitrary data object.
Hypertext Markup Language HTML
A data object is returned to the client. The object is written in HTML, a hypertext language, and possibly contains new links that the user can follow.
The model basically reflects the first version of the World-Wide Web as it is described in the HTTP Protocol version 0.9 and HTML version 1. However, the WWW specifications have been rapidly changing during the last 3-4 years, so even though the model is still useful to show the basic way of using the WWW, the current model is somewhat more advanced. However, before explaining what new features are either being discussed or already implemented, it is necessary to get an overview of the basic elements in the WWW model mentioned above.

Universal Resource Identifiers

In order to address a data object or, more generally, a resource in the model above, it is necessary to define a name space that not only contains information about hosts but also about all resources available on each host. The World-Wide Web defines Uniform Resource Identifiers, or URIs, which define a syntax for encoding the names and addresses of data objects on the Internet and how they can be accessed through an Internet protocol:

Universal Resource Identifier (URI)
A generic set of all addresses in the address space of all resources on the Internet. They describe a hierarchical naming scheme that, together with the HTTP protocol, makes a significant difference between the World-Wide Web model and other Internet access schemes such as FTP, which has a flat address space.
Uniform Resource Locator (URL)
The term "URI" has been introduced by the IETF and is a a general description of all URI that are not persistent. In practice the URLs consist of the current set of Internet protocols supported by the WWW, i.e, HTTP, FTP, Gopher, WAIS, etc., followed by a directory path and a file name.
Uniform Resource Name (URN)
The ultimate goal for URIs is to be a persistent naming scheme independent of the means of access, i.e., the protocol used. However, the only way they can also be independent of the physical structure of resources on the specific host is to have a naming scheme like the Internet Domain Name Service. URNs are currently under consideration in the IETF, but little is known about the status of the research.
Uniform Resource Citation (URC)
This is meta information about a URI. They consist of attribute/value pairs that might contain information on the author, publisher, etc. Not currently used.

Hypertext Transfer Protocol

The Hypertext Transfer Protocol (HTTP) is a generic stateless presentation layer protocol with many elements from other Internet protocols. The HTTP protocol is built on a client-server model where the client initiates a request and the server replies with a response. The protocol is, in spite of its name, not limited to transferring hypertext objects but can transfer any 8-bit data object.
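
As an illustration of such a transaction, the client could send a request like the following (the document name and header values are only examples):

    GET /hypertext/WWW/TheProject.html HTTP/1.0
    Accept: text/html
    Accept: text/plain
    User-Agent: CERN-LineMode/2.15 libwww/2.16

and the server replies with a status line, a set of MIME-like header lines, and the data object itself:

    HTTP/1.0 200 OK
    Content-Type: text/html

    <HTML> ...the requested data object... </HTML>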

MORE MORE MORE MORE

Hypertext Markup Language

The Hypertext Markup Language (HTML) is the user's interface for creating the World-Wide Web. It is important to note that the description of the World-Wide Web until now has been an introduction to the technology that, due to specifications and conventions, provides the functionality necessary to request and serve information on the Internet. However, the information is provided by human beings, and the HTML language provides the functionality of making hyperlinks across the Internet, as is demonstrated in this document. HTML can be used to represent:
  • Hypertext news, mail, online documentation, and collaborative hypermedia
  • Menus and options
  • Database query results
  • Simple structured documents with inlined multi media elements like images, audio and movie
  • URI-Links to other resources on the Internet.
HTML is built on top of the International Standard ISO 8879 Standard Generalized Markup Language (SGML). SGML is a system for defining structured document types and markup languages to represent instances of these document types. That is, HTML is a Document Type Definition (DTD) used on top of an SGML parser. Every SGML based document contains three elements as illustrated in the figure:

HTML is now superseded by HTML+, which is a much enriched DTD with possibilities of handling tables, math, images, etc. Currently many browsers support a subset of the HTML+ specification in addition to the basic HTML features.
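
As a small illustration, a minimal HTML document containing a single hyperlink could look like this:

    <HTML>
    <HEAD>
    <TITLE>A Minimal Document</TITLE>
    </HEAD>
    <BODY>
    <H1>A Minimal Document</H1>
    This document contains a hyperlink to the
    <A HREF="http://info.cern.ch/hypertext/WWW/TheProject.html">
    World-Wide Web project</A> at CERN.
    </BODY>
    </HTML>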

Interactive World-Wide Web Model

MORE MORE MORE MORE

Other Information Systems


Henrik Frystyk, July 1994
World-Wide Web Software at CERN


This document is an overview of the World-Wide Web software developed at CERN. The development of the World-Wide Web code was started by Tim Berners-Lee in 1990. Ever since, the code has been subject to changes due to modifications in the architectural model, additions of new features, etc. The code is freely available as public domain software with a very mild copyright specification.

During the last two years more and more World-Wide Web applications have become available from a large number of software providers on almost every platform connected to the Internet. Many of them are based on the same architectural model as the CERN software but with additional functionality and increased performance. Most of the software is characterised by being freely available as public domain for educational institutions and other non-profit organisations, whereas commercial companies must pay a fee for using the software.

The CERN World-Wide Web software is especially designed to be used on a large set of different platforms. A newly started collaboration between WWW software providers will expand this portability to also include MS-DOS, so that the most popular platforms, from large computers down to PCs, are covered. This document describes the WWW software products that are developed and maintained at CERN:

The description is meant as an introduction, as the current specifications, and hence the documentation, are rapidly changing. However, the document can be thought of as the top node of the current documentation for the World-Wide Web software currently available at CERN.

The Library of Common Code

The CERN World-Wide Web Library of Common Code is a general code base that can be used to build clients and servers. It contains code for accessing HTTP, FTP, Gopher, News, WAIS, and Telnet servers, and the local file system. Furthermore, it provides modules for parsing, managing, and presenting hypertext objects to the user, and a wide spectrum of generic programming utilities. The Library is the basis of many World-Wide Web applications, and all the CERN WWW software is built on top of it. The following figure is an overview of the current architecture of the Library. The view shown is for the client side of the Library; the proxy and the HTTP server have different views of the architecture.

The Line Mode Browser

The CERN Line Mode Browser is a character based World-Wide Web browser. It is developed for use on dumb terminals and as a test tool for the CERN Common Code Library. It can be run in interactive mode, non-interactive mode, as a proxy client, and in a set of other run modes that are all explained in Command Line Options. Even though it is not often used as a World-Wide Web browser, the possibility of executing it in the background or from a batch job makes it a useful tool. Furthermore, it gives a variety of possibilities for data format conversion, filtering, etc.

The HTTP Server

CERN httpd is a generic hypertext server which can be used as a regular HTTP server, typically running on port 80 to serve hypertext and other documents, and also as a proxy -- a server on a firewall machine -- that provides access for people inside a firewall to the outside world. When running as a proxy, httpd may be configured to do caching of documents, resulting in faster response times.

The Proxy Server

The CERN World-Wide Web server can also be run as a proxy server. A WWW proxy server, proxy for short, provides access to the Web for people on closed subnets who can only access the Internet through a firewall machine. The hypertext server developed at CERN, cern_httpd, is capable of running as a proxy, providing seamless external access to HTTP, Gopher, WAIS, and FTP. cern_httpd has had gateway features for a long time, but only this spring were they extended to support all the methods in the HTTP protocol used by WWW clients. Clients don't lose any functionality by going through a proxy, except special processing they may have done for non-native Web protocols such as Gopher and FTP.

Henrik Frystyk, July 1994
Current Implementation of the HTTP Client in the CERN Common Code Library


This document describes the current implementation of the HTTP protocol as of the latest released version of the CERN Common Code Library, see Current Version Number. Most of what follows is from HTTP version 1.0, but version 0.9 is also supported.

HTTP Request

The following features are supported in the client's HTTP request:

Methods

Currently only GET is supported as an HTTP method, but POST is on the Library Working List.

Request Headers

The normal set of HTTP headers sent in an HTTP request is:

Accept:
The current implementation uses one Accept line for each format. This should be changed so that it is a comma-separated list instead.
Referer:
If any parent anchor is known for the requested URL, then it is sent in the Referer field.
From:
The full email-address is sent along with the request. It can be set to a value other than the physical location.
User-Agent:
The user agent is now sent in a somewhat verbose format, but that will be changed to follow the current spec.
Authorization:
The Authorization header is sent if present.
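
Put together, a request generated by the client could look like the following (the URLs and addresses are only examples):

    GET /hypertext/WWW/Library/Status.html HTTP/1.0
    Accept: text/html
    Accept: text/plain
    Accept: application/postscript
    Referer: http://info.cern.ch/hypertext/WWW/TheProject.html
    From: frystyk@info.cern.ch
    User-Agent: CERN-LineMode/2.15 libwww/2.16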

HTTP Response

These features are supported when an HTTP response returns from the server:

Success Codes

The following success codes are supported:
200 OK
The request was fulfilled
203 Partial Information
The body of the response is passed to the user and the client is informed that it is a partial response
204 No Response
No body is passed to the client

Automatic Redirection

The following two types of redirection are currently implemented:
301 Moved
The load procedure is recursively called when a 301 redirection is returned from the server. The new URL is passed back to the user as information via the error stack, and a new request is generated. The new request can be of any access scheme in a URL. An upper limit on redirections has been defined (defaulting to 10).
302 Found
The functionality is the same as for a 301 Moved return status. A clever client can use the returned URL to change the document on the fly so that the old URL gets overwritten.

Access Authentication

If a 401 Unauthorized status code is returned the client asks the user for a user id and a password, see also the Access Authorization Scheme.

Error Handling

If a 500 Error code is returned from the server, the result is handed back to the client via the Error Stack.


Henrik Frystyk, July 1994
Specification of Multiple Threads


This is the specification of the multi-threaded, interruptible I/O HTTP client as it is implemented in the World-Wide Web Library of Common Code. See also the specification of the current HTTP Client.

This document is divided into the following sections:

  1. The Principle of Threads
  2. Platform Independent Implementation
  3. Modes of Operation
  4. Data Structures and Control Flow
  5. Discussion

Introduction

In a single-process, single-threaded environment, all requests to, e.g., the I/O interface block any further processing in the process. Any combination of a multi-process or multi-threaded implementation of the Library of Common Code makes provision for the user of the client application to request several independent documents at the same time without getting blocked by slow I/O operations. As a World-Wide Web client is expected to use much of the execution time doing I/O operations such as "connect" and "read", a high degree of optimization can be obtained if multiple threads can run at the same time.

A multi-process environment requires extensive support from the underlying operating system. Unix is the classic platform that provides this functionality. The Unix system call "fork" generates a new child process that is an exact copy of the parent but in another address space. Only static parts like the text segment can be shared between the two processes and don't require a copy of the memory. All open file descriptors and socket descriptors are copied in the same manner so that the child can continue any operation, including I/O, from the same state as the parent.

Forking a child process is not unique to Unix, but the exact behavior is often quite platform dependent. Under VMS, "fork" is such a resource-expensive procedure that in practice it is unusable for fast program execution. Due to extensive security regulations in VMS, every process has a large set of environment variables that have to be initialized on creation of the process. Furthermore, a process is created in an initial state independent of the parent process, so the states of the parent and child process have to be synchronized before the child is ready to execute the request.

Threads provide a more general and lightweight solution than process forking, and this is the reason for their implementation in the Library of Common Code.

Platform Independent Implementation

The major concern in the design was to make the implementation as platform independent as possible. This means that it has not been possible to use traditional thread packages such as DECthreads.

Instead, the multi-threaded functionality of the HTTP client was designed to be usable in a single-process, single-threaded environment. In order to do this, the following rules must be kept:

  • Global variables can be used only if they are at all times independent of the current state of the active thread.
  • Automatic variables can be used only if they are initialized on every entry to the function and stay independent of the state of the current thread throughout their lifetime.
  • All information necessary for completing a thread must be kept in an autonomous data object that is passed around the control flow via the stack.
These rules make it possible to emulate a multi-threaded environment using one stack, and furthermore the implementation is very portable, as only plain C is used on top of the I/O interface.
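
Under these rules, a "thread" reduces to a state machine driven by a self-contained data object. A minimal sketch of such an object follows; the names and fields are illustrative, not the Library's actual declarations:

    /* All state needed to resume a "thread" lives in this one object;
       no global or automatic variable carries state between calls. */
    typedef enum { TH_CONNECT, TH_WRITE, TH_READ, TH_DONE } ThreadState;

    typedef struct {
        ThreadState state;          /* where to resume when the socket is ready */
        int         sockfd;         /* the non-blocking socket for the request  */
        char       *url;            /* the document being requested             */
        char        buffer[4096];   /* partial I/O kept across invocations      */
        int         bytes_done;     /* progress within the current operation    */
    } Thread;

    /* Each call advances the state machine one step and returns to the
       event loop, which re-invokes it when the socket is ready again. */
    extern int thread_step(Thread *thread);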

Modes of Operation

In order to keep the functionality of the HTTP Client as general as possible, three different modes of operation are implemented:

Base Mode
This mode is strictly single-threaded and corresponds to what the Library is today, that is, version 2.16pre2 (and also 2.17). The difference between this mode and the other two is basically that all sockets are made blocking instead of non-blocking. The HTTP client itself is the same as for the other modes and is basically a state machine.

This mode is implemented so that clients using the CERN Library can work with the new version without any changes at all (or very few). This is also the mode used for the CERN Proxy server, which uses forking. Currently this mode doesn't provide interruptible I/O, as this is an integral part of the event loop.

Active Mode
In this mode the event loop (the select function) is placed in the Library. This mode is for dumb terminal clients that can only be interrupted through stdin using the keyboard. The client can, however, still be multi-threaded in the sense that it can activate pre-fetching of documents not yet requested by the user. If a key is hit, the Library invokes a callback function so that the client can decide whether the current operation should be interrupted or not. If so, the Library stops all I/O activity and hands execution back to the client.

The Active Mode should require only minor changes to the client in order to obtain a simple form of multiple threads.

Passive mode
This is the mode that requires the most advanced client, e.g., a GUI client. On every HTTP request from the client, the Library initiates the connection and, as soon as it is ready for reading or writing, returns to the client an updated list of the active socket descriptors used in the Library.

When the client sees that a socket is ready for action, or the request has been interrupted, it calls a Library socket handler, passing the socket number and the operation to be performed. The socket handler then finds the corresponding request and executes the read, write, or interrupt.

As soon as the thread has to access the net again, the socket handler stops and returns execution to the client.
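
A passive-mode client event loop might look roughly as follows. The socket-handler name, the descriptor sets, and the operation codes are assumptions made for the sake of the sketch:

    #include <sys/select.h>

    /* Assumed to be maintained by the Library as requests come and go */
    extern fd_set library_read_set, library_write_set;
    extern int    library_max_sockfd;

    /* Hypothetical Library socket handler: finds the request that owns
       the socket and executes the read, write, or interrupt for it. */
    #define OP_READ  1
    #define OP_WRITE 2
    extern void library_socket_handler(int sockfd, int operation);

    /* Wait until one of the Library's sockets is ready for action,
       then hand it back to the Library's socket handler. */
    void client_event_loop(void)
    {
        int s;
        for (;;) {
            fd_set rfds = library_read_set;
            fd_set wfds = library_write_set;
            if (select(library_max_sockfd + 1, &rfds, &wfds, NULL, NULL) <= 0)
                continue;
            for (s = 0; s <= library_max_sockfd; s++) {
                if (FD_ISSET(s, &rfds)) library_socket_handler(s, OP_READ);
                if (FD_ISSET(s, &wfds)) library_socket_handler(s, OP_WRITE);
            }
        }
    }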

Data Structures

The basic data structure for all requests to the Library, regardless of the access scheme used, is the HTRequest structure. This structure was introduced in the 2.15 release of the Library but was a completely flat data model in the first version. In version 2.16 and later, the request structure has turned into a hierarchical data model in order to establish cleaner interfaces between the data structures in the Library.

As no automatic or global variables are available in this implementation model, every thread has to be state based and must contain all necessary information in a separate data object. In order to make a homogeneous interface to the HTRequest structure, the new protocol-specific data structure HTNetInfo has been defined.

The definition of this data object is highly object oriented, as every protocol module can in practice define a subclass of the HTNetInfo structure in order to add the information necessary for completing a thread. Again, this is all done in plain C in order to maintain a high degree of portability.
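
Subclassing in plain C can be done by making the base structure the first member of the derived structure, so that a pointer to the derived object can also be treated as a pointer to the base object. The field names below are assumptions for illustration, not the Library's actual declarations:

    /* Illustrative base object shared by all protocol modules */
    typedef struct _HTNetInfo {
        int   sockfd;        /* socket carrying this request          */
        int   state;         /* current state of the state machine    */
        void *request;       /* back-pointer to the HTRequest object  */
    } HTNetInfo;

    /* An HTTP-specific "subclass": because the base structure is the
       first member, an HTTPInfo pointer can safely be cast to HTNetInfo */
    typedef struct _HTTPInfo {
        HTNetInfo net;             /* MUST be the first member          */
        char     *redirect_url;    /* new URL from a 301/302 response   */
        int       redirections;    /* count against the upper limit     */
    } HTTPInfo;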

Control Flow

As the current implementation of multiple threads is valid for HTTP access only, the data flow of the Library has basically been preserved; see the general control flow diagram.

All access schemes other than HTTP still use blocking I/O, and the user will not notice any difference from the current implementation. The result is that full multi-threaded functionality is only enabled if the client uses consecutive HTTP requests. When a request is initiated with another access scheme than HTTP, e.g. FTP, the multi-threaded functionality partly stops, as the new request gets served using blocking I/O. It is currently up to the client to decide whether a new non-HTTP request can be activated when one or more HTTP requests are already active. For active mode it is strongly recommended that the client awaits the return from the HTTP event loop, i.e., that no more HTTP requests are active or pending.

Even though the FTP and Gopher clients are now implemented as state machines, only the HTTP client has been implemented with multiple threads.

For HTTP access, however, a socket event loop has been introduced. As indicated in the Introduction, this can be implemented either by the client or by the Library. When other protocol modules than the HTTP client are fully implemented as multi-threaded clients, they can be moved down under the event loop just like the HTTP client.

The event loop is designed around event-driven callback functions. When used in active mode, the only events recognized are those from a given set of file descriptors, including standard input (often specified as file descriptor 0). As indicated in the figure, the event loop handles two kinds of callback functions: those that are internal Library functions, such as the loading functions in the protocol modules, and those that require an action taken by the client application.
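
The two kinds of callback functions might be declared along these lines; the signatures and registration functions here are hypothetical, for illustration only:

    /* Internal Library callbacks, e.g. the loading functions in the
       protocol modules, invoked when their socket is ready */
    typedef int (*LibraryCallback)(int sockfd, int operation);

    /* Client callbacks, e.g. for standard input (file descriptor 0),
       representing actions taken by the client application */
    typedef int (*ClientCallback)(int fd);

    extern void register_library_callback(int sockfd, LibraryCallback cbf);
    extern void register_client_callback(int fd, ClientCallback cbf);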

Interrupting a HTTP Request

The current interrupt handler, meant for Active Mode, is quite lazy: it only looks for interrupts when a blocking I/O operation is about to be executed and program execution returns to the event loop. The reason is that the user is not blocked even though the interrupt doesn't get caught right away, so this is not as critical as in a single-threaded environment.

Discussion

The following list covers limitations and problems in the current implementation:
  • Interruptible multi-threading is implemented only for the HTTP protocol module. The structural changes in the FTP module, the Gopher module, and local file access in CERN Library 2.16 make provision for interruptible multiple threads, but they are not yet implemented as such.
  • From Library version 2.16, a new error/information module has been added. It is based on the assumption that every request has its own information stack independent of other requests. However, how should error information coming from the multi-thread module be passed back to the user, if not via stderr?
  • The interrupt handling in Active Mode is quite lazy, as it only checks for an interrupt when an I/O operation would block and program execution returns to the event loop.

Henrik Frystyk, July 1994
Client Interface for Posting in the CERN Common Code Library
Henrik Frystyk, July 1994

WWW Icon Client Interface for Posting


This document describes the posting interface between a World-Wide Web client and the Library of Common Code. The design has been made so that the interface is flexible enough to support not only the POST method in the HTTP Protocol but at the same time provide functionality for posting to mail addresses using SMTP and to NNTP news groups. Posting to other Internet servers, such as FTP servers, is not considered in this document, but the concept is designed so that additional Internet protocols can be included.

The document describes only the client interface to the Library, not how the actual posting should take place using the HTTP protocol. A draft proposal for the HTTP PUT and POST methods is available but still needs a lot of discussion.

Building a POST Web

The basic idea of the post interface to the Library is that, when using POST, the user is often interested in posting the data object to more than one recipient; e.g., the user can send the same data object to two mailing lists, a news group, and at the same time store the data object as a file on a remote HTTP server. The term "data object" is used instead of "document" because the object to be posted can be in any MIME format and, furthermore, can be compressed or encoded before transmission.

As indicated in the figure, the user builds a "POST Web" around an anchor element before the POST is committed and passed to the Library. The rounded boxes are remote logical data objects that are still to be created, and the lines are logical links between the post anchor and the remote data objects. The two links to the NNTP box represent that the data object is to be posted to two different news groups. In practice, only one physical link to the NNTP server would exist, as all posting to an NNTP server is done in a single transaction.

The actual user interface is for the client to implement, but typically a GUI client could use drag-and-drop icons for building the Web. The POST Web could be visualized using a user-implemented menu of icons of the most used recipients, letting the user drag lines between the data object to be posted and the recipients. The CERN Library helps the user build the Web by providing the functionality for linking the anchors together.
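
As a sketch, the client might build the POST Web by obtaining an anchor for each recipient and linking it to the post anchor. HTAnchor_findAddress is the Library's usual way of obtaining an anchor from a URL; the linking call shown is a hypothetical stand-in for the linking functionality mentioned above:

    #include "HTAnchor.h"

    /* Hypothetical stand-in for the Library's anchor-linking functionality */
    extern void link_anchors(HTAnchor *post_anchor, HTAnchor *recipient);

    /* Illustrative recipients: an HTTP server, a news group, and a mail
       address, matching the result syntax described later in this document */
    void build_post_web(HTAnchor *post_anchor)
    {
        link_anchors(post_anchor,
            HTAnchor_findAddress("http://remote.server/local/location/foo.bar"));
        link_anchors(post_anchor,
            HTAnchor_findAddress("news:some.news.group"));
        link_anchors(post_anchor,
            HTAnchor_findAddress("mailto:recipient@host.name"));
    }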

Posting a Document

When the user has decided to commit the POST, the post anchor is passed from the client to the Library together with a request structure. This is very similar to the current implementation of the HTTP GET method.

An important consideration in the design has been to include the multi-threaded functionality in the posting module so that both PUT and POST in principle behave exactly like GET or any other method using non-blocking I/O; see also the multi-threaded control flow.

However, the PUT and POST methods are inherently different from the GET method, as they require a data flow in both directions. This is especially the case for PUT, as a data object will often be returned from the remote server; see also Put and Post in HTTP.

The routine handling the POST in the Library parses the POST Web and groups the individual requests into protocol categories. In the case of an NNTP request, all POST requests are handled at the same time and sent to the actual NNTP server. When the parsing of the Web is done, the post module recursively calls the protocol modules to execute the POST.

The posting mechanism is designed to be compatible with the multi-threaded structure of the Library. This means that the client is asked for data based on an event-driven action taken by the event loop. The client is then handed a stream so that the document to be posted can be pumped down the stream and out onto the net.
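
The stream handed to the client follows the general shape of the Library's stream concept, in which a stream object carries a pointer to a class of methods. A sketch of pumping a data object down such a stream; the member names are assumptions, not the Library's actual declarations:

    typedef struct _Stream Stream;

    /* Assumed method table, modeled loosely on the Library's stream class */
    typedef struct {
        int (*put_block)(Stream *me, const char *data, int length);
        int (*flush)(Stream *me);
        int (*free_stream)(Stream *me);
    } StreamClass;

    struct _Stream { const StreamClass *isa; };

    /* Write the data object into the stream and release it when done */
    int pump_document(Stream *target, const char *data, int length)
    {
        if ((*target->isa->put_block)(target, data, length) < 0)
            return -1;                       /* net error: caller may abort */
        (*target->isa->flush)(target);
        return (*target->isa->free_stream)(target);
    }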

Handling the POST Result

The result of the posting varies as a function of the protocol used. It is a general rule throughout the design of the Library of Common Code that protocols other than HTTP should be supported but not extended beyond their individual limitations. This means that the Library has to be flexible enough to handle more than one kind of result from a posting transaction, depending on the protocol used.

An immediate result from a post transaction is available using NNTP or HTTP, but when using SMTP the result might be delayed several days. In practice there is no way that the client can await a response for that amount of time. Therefore it is proposed to introduce "sentto" as a new URI scheme specifically indicating the state of a posting to a mail address. This scheme means that the data object has been posted, but no guarantee is given whether the recipient has actually received the data object or whether the data object has been modified during the posting transaction. The returned URLs from the posting will therefore have the following syntax:

http://remote.server/local/location/foo.bar
The syntax of the HTTP URL generated as the result of a successful posting is identical to any other HTTP URL.
news:messageid@news.host
The message ID is a unique string generated by the Library NNTP Module.
sentto:recipient@host.name
As the success of a mail posting cannot be guaranteed, the access scheme is changed from "mailto" to "sentto". It is then for the user or client to decide what to do with the URL. One solution could be to keep a copy of the posted document in a local cache for a limited period of time. It is not foreseen, though, that the World-Wide Web client can actually handle delayed return codes from the SMTP protocol, as the client can in fact be an HTTP proxy server.
In the figure, postings using SMTP are still drawn rounded because it is not possible for the user to actually access the document using the URL returned from the Library. As the NNTP posting failed in the figure, that document is still non-existent. Only the HTTP posting has turned into a physical document reference on the remote server.

When all postings have terminated, a new anchor is generated containing the allocated links on the remote servers for future reference (except in the SMTP case). The number of links returned might be a subset of the requested recipients, as only postings that terminated successfully are registered. Using this set, the client must determine what to do with the failed postings. One possibility would be to try again using the failed part of the POST Web, or simply to discard it.

Summary

This document describes the posting interface between the client application and the World-Wide Web Library of Common Code. The user first builds a "POST Web" in which logical links are created between the document to be posted and the set of desired recipients. On commit of the posting, the POST Web and a request structure are passed to the general posting routine within the Library. The posting routine groups the postings and recursively calls the NNTP, SMTP, and/or HTTP modules of the Library until the POST Web is traversed. The NNTP module is only called once, due to the way posting is implemented in the NNTP protocol.

The "sentto" access scheme has been proposed as a new URI access scheme to compensate for the fact that when posting to a SMTP server, no guarantee is given that the data object has actually reached the recipient.

As mentioned in the introduction, this document doesn't describe how the HTTP protocol should handle POST, as this is all handled internally in the Library. However, the interface between the Library and the client stays the same regardless of the HTTP POST implementation.


Henrik Frystyk, July 1994
HTTP Multi Part Transactions
Henrik Frystyk, July 1994

WWW Icon HTTP Multi Part Transactions



Henrik Frystyk, July 1994
Summary
Henrik Frystyk, July 1994

WWW Icon Summary



Henrik Frystyk, July 1994
References
Henrik Frystyk, July 1994

WWW Icon References


The following are strictly references to information that I haven't been able to find on the Internet.
Henrik Frystyk, July 1994