Henrik
Frystyk, July 1994
Implementation of the HTTP Client
This document describes the current implementation of the HTTP
Protocol as from version 3.0 (unreleased, August 1994) of the Library
of Common Code (see
current version number). The HTTP client is based on the HTTP 1.0
specification but is backwards compatible with the 0.9 version. The
major difference between the implementation before version 3.0 is that
this version is a state machine based on the state diagram illustrated
below. The advantage of this approach will become obvious in the
section on Multi Threaded Clients even
though the HTTP protocol is stateless by nature.
The individual states and the transitions between them are explained
in the following sections. Note the difference in notation between the
client (client side application built un top of the library)
and HTTP Client that is the protocol module in the library.
- BEGIN State
- This state is the idle state or initial state where the HTTP
client awaits a new request passed from the client. If the user at
this point has typed in a userid and a passwd for
access authorization the HTTP client also prepares the
Authorization header.
- NEED_CONNECTION State
- The HTTP client is now ready for setting up a connection to the
remote host. The connection is always initiated by a connect
system call. In order to minimize the access to the Domain Name Server, all host names to
previous visited hosts no matter of the
access schemes used) are stored in a local host cache. The cache
handles multi-homed hosts in a special way in that it measures the
time it takes to actually make a connection to one of the
IP-addresses. This time is stored together with the specific
IP-address and the hostname in the cache and on the next connection to
the same host the IP-address with the fastest connect time is chosen.
- NEED_REQUEST State
- The HTTP
Request is what the client sends to the remote HTTP server just after the
establishment of the connection. The request consists of a HTTP header line, a
set of MIME Headers, and possibly a data
object to be posted to the server. The header line has the following format:
<METHOD> <URI> <HTTP-VERSION> CRLF
Current methods supported in the clients are
- GET
- This is for requesting a URI or for specifying a
text search. Text searches are initiated by placing a "?" in the URI.
- HEAD
- The HEAD method is equivalent to GET in the sense that it is
requesting a URI at the remote server. The difference is that the
server only returns the HTTP headers but no data object. This is used
for updating cache information, getting information on the size of the
data object (or body) etc.
In the section on Put and Post the
implementation of the client interface to the library is described.
However, the actual implementation of
PUT and POST
in the HTTP protocol is yet to be specified. The reason is that the
current specification is limited and does not allow the HTTP protocol
to be a superset of existent Internet
Presentation Protocols.
The HTTP headers are as mentioned a set of MIME headers even though they are
not all officially accepted by the MIME specifications. The HTTP client
supports the following headers:
- Accept:
- The current implementation uses one accept line for each
MIME-type supported by the client. For advanced clients this means
that the "Accept: " sequence is repeated 20-30 times which gives a
overhead of 200-300 bytes per request (including the CRLF telnet
EOL-sequence). This should be changed so that either a comma separated
list is transmitted instead or only the MIME content types without any
subtypes.
-
Referer:
- If any parent anchor is known to the requested URI this is send
in the referer field. This is to let the server know what link has led
to the current request. Nothing is sent if the parent anchor is
unknown or does not exist.
- From:
- The full email address is sent along the request. It is meant as
an informative service to the recipient and can be changed to any
value the user wishes to sign the request with. As it is possible to
manipulate the email address, this field can not be used for any
security verification or precaution.
- User-Agent:
- The user agent is by many clients currently generated in a
somewhat verbose format. The goal is to make this field machine
readable so it can be used on the server side to perform individual
actions as a function of the client version. As a side effect it can
also be used for statistics etc.
- Authorization:
- The authorization header as introduced in the NEED_CONNECTION state.
- SENT_REQUEST State
- When the request is sent the client waits until a response is given from
the server or the connection is timed out in case or an error situation. As
the client does not know whereas the remote server is a HTTP 0.9 server or a
HTTP 1.0 it must look at the first part of the response to figure out what
version of HTTP is returned. The reason is that the HTTP protocol 0.9 does not
contain a HTTP header line in the response. It simply starts to send the
requested data object as soon as the GET request is handled. Future versions
of the HTTP protocol will all contain a header line with the protocol version
like the MIME protocol.
- NEED_ACCESS_AUTHORIZATION State
- If a 401
Unauthorized status code is returned the client asks the user for
a user id and a password, see also the
Access Authorization Scheme. The connection is closed before the
user is asked for the userid and password so any new request initiated
upon a 401
status code causes a new connection to be established.
- REDIRECTION State
- The remote server returns a redirection status
code if the URI has either been moved temporaryly or permanent to another
location, possibly on another HTTP server or any other server supported by the
WWW-model. The HTTP Client supports both a temporaryly and a permanent redirection code returned from the server:
-
301 Moved
- The load procedure is recursively called on a 301 redirection
code. The new URI is parsed back to the user as information via the
Error and Information module, and a new request generated. The new
request can be of any access
scheme accepted in a URI. An upper limit of redirections has been
defined (default to 10) in order to avoid infinite loops.
- 302
Found
- The functionality is the same as for a 301 Moved return status. A
clever client can use the returned URI to change the document in which
the URI originates so that the URI points to the new location.
-
- NO_DATA State
- When a return code indicates that no data object or resource
follows the HTTP headers the HTTP client can terminate the request and
pass control back to the client.
- NEED_BODY State
- If a body is included in the response from the server, the client
must prepare to read the data from the network and direct it to the
destination set up by the client. This is done by setting up a stream stack with the required
conversions.
- GOT_DATA State
- When the data object has been parsed through the stream stack,
the HTTP client terminates the request and handles control back to the
client.
- ERROR or FAILURE State
- If at any point in the request handling a fatal error occurs the
request is aborted and the connection closed. All information about
the error is parsed back to the client via the
Error and Information Module. As the HTTP protocol is stateless,
all errors are fatal between the server and the client. If the
erroneous request is to be repeated, the request starts in the initial state.
Henrik
Frystyk, frystyk@info.cern.ch, July 1994