1991
HyperText Transfer Protocol Design Issues
See also: Why a new protocol? and the document
on penalties
Here are some design decisions to be made for protocols for information retrieval
for hypertext.
Underlying protocol
There are various distinct possible bases for the protocol - we can choose
-
Something based on, and looking like, an Internet protocol. This has the
advantage of being well understood and of having existing implementations
widely available. It also leaves open the possibility of a universal FTP/HTTP
or NNTP/HTTP server. This is the case for the current HTTP.
-
Something based on an RPC standard. This has the advantage that the code is
easy to generate, that the parsing of the messages is done automatically,
and that the transfer of binary data is efficient. It has the disadvantage
that one needs the RPC code to be available on all platforms. One would have
to choose one (or more) styles of RPC. Another disadvantage may be that existing
RPC systems are not efficient at transferring large quantities of text over
a stream protocol unless (like DD-OC-RPC) one has a let-out and can access
the socket directly.
-
Something based on the OSI stack, as is Z39.50. This would have to be run
over TCP in the internet world.
Current HTTP uses the first alternative, to make it simple to program, so
that it will catch on: conversion to run over an OSI stack will be simple
as the structure of the messages is well defined.
Another choice is whether to make the protocol idempotent or not. That is,
does the server need to keep any state information about the client? (For
example, the NFS protocol is idempotent, but the FTP and NNTP protocols are
not.) In the case of FTP the state information consists of authorisation,
which is not trivial to establish every time but could be, and current directory
and transfer mode which are basically trivial. The proposed protocol IS
idempotent.
This causes, in principle, a problem when trying to map a non-idempotent
system (such as library search systems which store "result sets" on behalf
of the client) into the web. The problem is that to use them in an idempotent
way requires the re-evaluation of the intermediate result sets at each query.
This can be solved by the gateway intelligently caching result sets for a
reasonable time.
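As a minimal sketch of such a gateway cache, assuming result sets can be keyed
by the query text and that a fixed lifetime counts as "a reasonable time": the
names ResultSetCache, run_query and ttl_seconds are hypothetical and are not
part of the protocol.

```python
import time
from typing import Callable, Dict, List, Tuple


class ResultSetCache:
    """Keeps result sets in the gateway so a repeated query is not re-evaluated."""

    def __init__(self, run_query: Callable[[str], List[str]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run_query = run_query   # calls the stateful back-end system
        self._ttl = ttl_seconds       # assumed lifetime of a cached result set
        self._clock = clock           # injectable so the expiry can be tested
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def search(self, query: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(query)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]          # still fresh: reuse the stored result set
        results = self._run_query(query)
        self._entries[query] = (now, results)
        return results
```

The client stays unaware of the cache: it simply repeats the full query, and
the gateway decides whether the back-end has to be consulted again. Stale
entries are only replaced when the same query recurs; a real gateway would
also want to evict them.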
Request: Information transferred from client
Commands below, however represented on the network, are given in upper
case, with parameter names in lower case. This set assumes a model of format
negotiation in which the client says what he can take, and the server
decides what to give him. One imagines that each function would return a
status, as well as information specified below.
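A rough sketch of this negotiation model, assuming the client attaches a single
numeric penalty to each format it can take (lower is better) and the server
knows which formats it can produce. The function name choose_format and the
format names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def choose_format(accepted: Dict[str, float],
                  available: Iterable[str]) -> Optional[str]:
    """Return the available format with the lowest client penalty, or None."""
    candidates = [fmt for fmt in available if fmt in accepted]
    if not candidates:
        return None   # nothing the server can produce is acceptable
    return min(candidates, key=lambda fmt: accepted[fmt])


# The client states what it can take; the server decides what to give.
client_accepts = {"plaintext": 0.5, "hypertext": 0.0}
server_can_send = ["postscript", "plaintext", "hypertext"]
chosen = choose_format(client_accepts, server_can_send)
```

Collapsing the penalty to one number is a simplification made for the sketch;
a server could equally weigh data degradation and elapsed time separately.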
When running over a byte stream protocol, SGML would be an encoding possibility
(as well as ASN.1 etc).
Here are some possible commands and parameters:
-
GET document name
-
Please transfer a named document back. Transfer the results back in a standard
format or one which I have said I can accept. The reply includes the format.
In practice, one may want to transfer the document over the same link (a
la NNTP) or a different one (a la FTP). There are advantages in each technique.
The use of the same link is standard, with moving to a different link by
negotiation (see PORT).
-
SEARCH keywords
-
Please search the given index document for all items with the given word
combination, and transfer the results back as marked up hypertext. This could
elaborate to an SQL query. There are many advantages in making the search
criterion just a subset of the document name space.
-
SINCE datetime
-
For a search, refer only to documents dated on or after this date. Used typically
for building a journal, or for incremental update of indexes and maps of
the web.
-
BEFORE datetime
-
For a search, refer to documents before this date only.
-
ACCEPT format penalty
-
I can accept the given formats. The penalty is a set of numbers giving an
estimate of the data degradation and elapsed time penalty which would be
suffered at the CLIENT end by data being received in this way. Gateways may
add or modify these fields.
-
PORT
-
See the RFC959 PORT command. We could change
the default so that if the port command is NOT specified, then data must
be sent back down the same link. In an idempotent world, this information
would be included in the GET command.
-
HEAD doc
-
Like GET, but get only header information. One would have to decide whether
the header should be in SGML or in protocol format (e.g. RPC parameters or
internet mail header format). The function of this would be to allow overviews
and simple indexes to be built without having to retrieve the whole document.
See the RFC977 HEAD command.
The process of generation of the header of a document from the source (if
that is how it is derived) is subject to the same possibilities (caching,
etc) as a format conversion from the source.
-
USER id
-
The user name for logging purposes, preferably a mail address. Not for
authentication unless no other authentication is given.
-
AUTHORITY authentication
-
A string to be passed across transparently. The protocol is open to the
authentication system used.
-
HOST
-
The calling host name - useful when the calling host is not properly registered
with a name server.
-
Client Software
-
For interest only, the application name and version number of the client
software. These values should be preserved by gateways.
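To make the command list concrete, here is a small sketch of one possible
encoding. It assumes a line-oriented text representation, one "COMMAND
parameters" line per field with a blank line ending the request. That
representation, the helper names encode_request and decode_request, and the
sample document name and mail address are assumptions made for illustration;
the text above leaves the network representation open.

```python
from typing import List, Tuple

Field = Tuple[str, str]   # (COMMAND, parameter text)


def encode_request(fields: List[Field]) -> str:
    """Write each field as one 'COMMAND parameters' line, then a blank line."""
    lines = [f"{command.upper()} {params}".rstrip() for command, params in fields]
    return "\r\n".join(lines) + "\r\n\r\n"


def decode_request(text: str) -> List[Field]:
    """Inverse of encode_request: split each line at the first space."""
    fields: List[Field] = []
    for line in text.splitlines():
        if not line:
            break         # the blank line marks the end of the request
        command, _, params = line.partition(" ")
        fields.append((command.upper(), params))
    return fields


request = encode_request([
    ("GET", "/hypertext/WWW/TheProject"),   # hypothetical document name
    ("ACCEPT", "plaintext 0.5"),            # hypothetical format and penalty
    ("USER", "reader@example.org"),         # hypothetical mail address
])
```

Since the server keeps no state about the client, every field the server needs
travels in the same request as the GET.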
Response
Suppose the response is an SGML document, with the document type a function
of the status. ( Example )
-
Status
-
A status is required in machine-readable format. See the 3-figure status
codes of FTP for example. Bad status codes should be accompanied by an
explanatory document, possibly containing links to further information. A
possibility would be to make an error response a special SGML document type.
Some special status codes are mentioned below.
-
Format
-
The format selected by the server
-
Document
-
The document in that format
-
Success
-
Accompanied by format and document.
-
Forward
-
Accompanied by new address. The server indicates a new address to be used
by the client for finding the document. The document may have moved, or the
server may be a name server.
-
Need Authorisation
-
The authorisation is not sufficient. Accompanied by the address prefix for
which authorisation is required. The browser should obtain authorisation,
and use it every time a request is made for a document name matching that
prefix.
-
Refused
-
Access has been refused. Sending (more) authorisation won't help.
-
Bad document name
-
The document name did not refer to a valid document.
-
Server failure
-
Not the client's fault. Accompanied by a natural language explanation.
-
Not available now
-
Temporary problem - trying at a later time might help. This does not imply
anything about the document name and authorisation being valid. Accompanied
by a natural language explanation.
-
Search fail
-
Accompanied by an HTML hit-list without any hits, but possibly containing
a natural language explanation.
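The special statuses above suggest how a client might react to each one. The
following is only a sketch: the Status names mirror the list, but the action
strings, the AuthorisationStore class and the longest-prefix rule are
assumptions, not something the protocol fixes.

```python
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    SUCCESS = "success"
    FORWARD = "forward"
    NEED_AUTHORISATION = "need authorisation"
    REFUSED = "refused"
    BAD_DOCUMENT_NAME = "bad document name"
    SERVER_FAILURE = "server failure"
    NOT_AVAILABLE_NOW = "not available now"
    SEARCH_FAIL = "search fail"


class AuthorisationStore:
    """Remembers authorisation strings by document-name prefix."""

    def __init__(self) -> None:
        self._by_prefix: Dict[str, str] = {}

    def remember(self, prefix: str, authorisation: str) -> None:
        self._by_prefix[prefix] = authorisation

    def lookup(self, document_name: str) -> Optional[str]:
        """Return the authorisation for the longest matching prefix, if any."""
        matches = [p for p in self._by_prefix if document_name.startswith(p)]
        if not matches:
            return None
        return self._by_prefix[max(matches, key=len)]


def next_action(status: Status) -> str:
    """Map a response status to what a client might do next."""
    if status is Status.SUCCESS:
        return "display"
    if status is Status.FORWARD:
        return "retry-at-new-address"
    if status is Status.NEED_AUTHORISATION:
        return "obtain-authorisation-and-retry"
    if status is Status.NOT_AVAILABLE_NOW:
        return "retry-later"
    # Refused, bad document name, server failure and search fail all come
    # with an explanatory document, so the client just presents it.
    return "show-explanation"
```

A browser would call lookup before each request and send the stored string
whenever the document name falls under a prefix it has already been asked to
authorise.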
Tim BL