W3C libwww Architecture

Protocol Modules as State Machines

Part of the libwww thread model is keeping track of the current state of the communication interface to the network. As an example, this section describes how the current HTTP module is implemented as a state machine. The HTTP module is based on the HTTP 1.0 specification but is backwards compatible with version 0.9. The major difference from the implementation before version 3.0 of the Library is that this version is a state machine based on the state diagram illustrated below. This implementation has several advantages even though the HTTP protocol is stateless by nature.

The individual states and the transitions between them are explained in the following sections.

This state is the idle state or initial state where the HTTP module awaits a new request passed from the application.
The HTTP module is now ready to set up a connection to the remote host. The connection is always initiated by a connect system call. To minimize accesses to the Domain Name Server, the host names of all previously visited hosts are stored in a local host cache, as explained in section "DNS Cache and Host Name Canonicalization". The cache handles multi-homed hosts specially: it measures the time it takes to actually make a connection to one of the IP addresses. This time is stored in the cache together with the specific IP address and the host name, and on the next connection to the same host the IP address with the fastest connect time is chosen.
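The host cache behaviour described above can be sketched as follows. This is a minimal illustration, not libwww code: the names HostCache, addresses, record and connect are hypothetical, and treating unmeasured addresses as fastest (so that each one gets timed once) is an assumption of this sketch.

```python
import socket
import time


class HostCache:
    """Hypothetical cache: host name -> {IP address: last connect time}."""

    def __init__(self):
        self._entries = {}

    def addresses(self, host, resolve=socket.gethostbyname_ex):
        # Resolve only on a cache miss, so repeat visits skip the DNS lookup.
        if host not in self._entries:
            _name, _aliases, ips = resolve(host)
            # Assumption: unmeasured addresses start at 0.0 so each is tried once.
            self._entries[host] = {ip: 0.0 for ip in ips}
        timings = self._entries[host]
        # Fastest known connect time first.
        return sorted(timings, key=timings.get)

    def record(self, host, ip, seconds):
        self._entries[host][ip] = seconds


def connect(cache, host, port, timeout=10.0):
    """Connect to the fastest known address and store the measured time."""
    last_error = None
    for ip in cache.addresses(host):
        start = time.monotonic()
        try:
            sock = socket.create_connection((ip, port), timeout=timeout)
        except OSError as exc:
            # Push unreachable addresses to the back of the list.
            cache.record(host, ip, float("inf"))
            last_error = exc
            continue
        cache.record(host, ip, time.monotonic() - start)
        return sock
    raise last_error or OSError("no addresses for " + host)
```

For a multi-homed host, the second call to addresses() returns the same IP addresses reordered by measured connect time, without consulting the name server again.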
The HTTP request is what the application sends to the remote HTTP server just after the connection is established. The request consists of an HTTP header line, a set of HTTP headers, and possibly a data object to be posted to the server. The header line has the following format, as defined by the HTTP 1.0 specification:
    Method SP Request-URI SP HTTP-Version CRLF
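As a rough sketch, a request of the shape described above (header line, headers, optional data object) could be assembled like this. The function name build_request and its parameters are hypothetical and do not come from libwww; adding a Content-Length header for a posted object is an assumption of the sketch.

```python
def build_request(method, uri, headers=None, body=None, version="HTTP/1.0"):
    """Assemble header line, headers, and optional body into request bytes."""
    # Header line: method, URI, and protocol version separated by spaces.
    lines = ["%s %s %s" % (method, uri, version)]
    headers = dict(headers or {})
    if body is not None:
        # Assumption: a posted data object is announced with Content-Length.
        headers["Content-Length"] = str(len(body))
    lines += ["%s: %s" % (name, value) for name, value in headers.items()]
    # Every line ends in CRLF; an empty line separates headers from the body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + (body or b"")
```

Calling build_request("GET", "/index.html", {"Accept": "text/html"}) yields the header line "GET /index.html HTTP/1.0", followed by the Accept header and the terminating empty line.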
When the request has been sent, the module waits until the server responds or, in case of an error, the connection times out. Because the module does not know whether the remote server is an HTTP 0.9 or an HTTP 1.0 server, it must look at the first part of the response to determine which version of HTTP is returned. The reason is that an HTTP 0.9 response does not contain an HTTP header line; the server simply starts sending the requested data object as soon as the GET request is handled.
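The version check described above can be illustrated with a small sketch. It assumes the first chunk read from the network already contains the complete first line; the function name parse_response_start is hypothetical.

```python
def parse_response_start(first_bytes):
    """Return (version, status, rest) for the first chunk of a response."""
    # An HTTP 1.0 response starts with a header line such as "HTTP/1.0 200 OK".
    if not first_bytes.startswith(b"HTTP/"):
        # HTTP 0.9: no header line, the data object starts immediately.
        return ("0.9", None, first_bytes)
    line, _sep, rest = first_bytes.partition(b"\r\n")
    parts = line.split(None, 2)
    if len(parts) < 2:
        raise ValueError("malformed header line: %r" % line)
    version = parts[0][len(b"HTTP/"):].decode("ascii")
    return (version, int(parts[1]), rest)
```

For an HTTP 0.9 reply the whole chunk is returned unchanged as the start of the data object, so nothing read from the network is lost.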
If a 401 Unauthorized status code is returned, the module asks the user for a user ID and a password; see also the "HTTP Basic Access Authorization Scheme". The connection is closed before the user is asked for the user ID and password, so any new request initiated upon a 401 status code causes a new connection to be established. This avoids keeping the connection open while the application waits for user input.
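A minimal sketch of the 401 handling follows. In the Basic scheme the credentials are sent as the base64 encoding of "user-id:password". The three callbacks passed to retry_after_401 are hypothetical stand-ins for the connection, the user prompt, and the request logic; they are not libwww functions.

```python
import base64


def basic_authorization(user_id, password):
    """Build the Authorization header used by the Basic scheme."""
    credentials = ("%s:%s" % (user_id, password)).encode("utf-8")
    return "Authorization: Basic " + base64.b64encode(credentials).decode("ascii")


def retry_after_401(close_connection, ask_user, send_request):
    """Close first, then prompt, then retry on a fresh connection."""
    # Release the connection before blocking on user input.
    close_connection()
    user_id, password = ask_user()
    # The retry is a new request, so it sets up a new connection.
    return send_request(basic_authorization(user_id, password))
```

Closing before prompting means the server is never left holding an idle connection while the user types.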
The remote server returns a redirection status code if the URI has been moved, either temporarily or permanently, to another location, possibly on another HTTP server or on another service, for example FTP or Gopher. The HTTP module supports both a temporary and a permanent redirection code returned from the server:
301 Moved
The load procedure is called recursively on a 301 redirection code. The new URI is passed back to the user as information via the Error and Information module, and a new request is generated. The new request can use any access scheme accepted in a URI. An upper limit on the number of redirections has been defined (10 by default) in order to avoid infinite loops.
302 Found
The functionality is the same as for a 301 Moved return status. A clever application can use the returned URI to change the document in which the URI originates so that the URI points to the new location.
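The recursive redirection handling with an upper limit can be sketched as below. The fetch callback, which returns a (status, location, body) tuple, and the notify callback standing in for the Error and Information module are both assumptions made for illustration.

```python
MAX_REDIRECTS = 10  # default upper limit mentioned in the text


class TooManyRedirects(Exception):
    pass


def load(uri, fetch, notify=print, redirects=0):
    """Load a URI, following 301 and 302 codes up to MAX_REDIRECTS times."""
    status, location, body = fetch(uri)
    if status in (301, 302):
        if redirects >= MAX_REDIRECTS:
            # Stop here instead of looping forever between two servers.
            raise TooManyRedirects(uri)
        # Tell the user where the document has gone, then start a new request.
        notify("redirected to " + location)
        return load(location, fetch, notify, redirects + 1)
    return status, body
```

Because each recursive call starts a fresh request, the new location may just as well be served by a different access scheme, provided fetch knows how to handle it.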
When a return code indicates that no data object or resource follows the HTTP headers the HTTP module can terminate the request and pass control back to the application.
If a body is included in the response from the server, the module must prepare to read the data from the network and direct it to the destination set up by the application. This is done by setting up a stream stack with the required conversions.
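The idea of a stream stack can be illustrated with a simplified sketch in which each stream converts the data it receives and passes the result on to the next stream. The classes Sink and Converter and the function build_stack are hypothetical and much simpler than the streams in the Library.

```python
class Sink:
    """Final destination set up by the application; collects the data."""

    def __init__(self):
        self.chunks = []

    def put(self, data):
        self.chunks.append(data)


class Converter:
    """A stream that converts each chunk and hands it to the next stream."""

    def __init__(self, convert, target):
        self.convert = convert
        self.target = target

    def put(self, data):
        self.target.put(self.convert(data))


def build_stack(conversions, destination):
    """Chain the conversions so that the first one listed is applied first."""
    stream = destination
    for convert in reversed(conversions):
        stream = Converter(convert, stream)
    return stream
```

The protocol module only needs to write to the top of the stack; which conversions happen underneath is decided when the stack is built.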
When the data object has been passed through the stream stack, the HTTP module terminates the request and hands control back to the application.
If a fatal error occurs at any point in the request handling, the request is aborted and the connection is closed. All information about the error is passed back to the application via the Error and Information module. As the HTTP protocol is stateless, all errors between the client and the server are fatal. If the erroneous request is to be repeated, it starts again in the initial state.
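Taken together, the states described above can be summarized as a small table-driven state machine. This is only a sketch: the state names, the event names, and the transitions taken after an authorization prompt or a redirection are assumptions made for illustration and are not taken from the libwww source.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical names for the states described in this section.
    BEGIN = auto()
    NEED_CONNECTION = auto()
    NEED_REQUEST = auto()
    SENT_REQUEST = auto()
    NEED_AUTHORIZATION = auto()
    REDIRECTION = auto()
    NO_DATA = auto()
    NEED_BODY = auto()
    GOT_DATA = auto()
    ERROR = auto()


TRANSITIONS = {
    (State.BEGIN, "request"): State.NEED_CONNECTION,
    (State.NEED_CONNECTION, "connected"): State.NEED_REQUEST,
    (State.NEED_REQUEST, "sent"): State.SENT_REQUEST,
    (State.SENT_REQUEST, "401"): State.NEED_AUTHORIZATION,
    (State.SENT_REQUEST, "redirect"): State.REDIRECTION,
    (State.SENT_REQUEST, "no_body"): State.NO_DATA,
    (State.SENT_REQUEST, "body"): State.NEED_BODY,
    (State.NEED_BODY, "body_read"): State.GOT_DATA,
    # Assumption: a retry after 401 needs a new connection.
    (State.NEED_AUTHORIZATION, "credentials"): State.NEED_CONNECTION,
    # Assumption: a redirection starts over with a new request.
    (State.REDIRECTION, "new_uri"): State.BEGIN,
}

TERMINAL = {State.NO_DATA, State.GOT_DATA, State.ERROR}


def step(state, event):
    # Any event not listed for the current state is treated as a fatal error.
    return TRANSITIONS.get((state, event), State.ERROR)


def run(events):
    """Feed events to the machine until a terminal state is reached."""
    state = State.BEGIN
    for event in events:
        state = step(state, event)
        if state in TERMINAL:
            break
    return state
```

Keeping the transitions in a table means the current state is the only thing a request has to remember between network events, so a thread can pick the request up again wherever it left off.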

Henrik Frystyk Nielsen, libwww@w3.org,
@(#) $Id: HTTPFeatures.html,v 1.16 1996/12/09 03:20:54 jigsaw Exp $