An Alternative Architecture for Distributed Hypertext

T. Berners-Lee, R. Cailliau, N. Pellow, B. Pollermann

CERN, 1211 Geneva 23 Switzerland.

To be presented at the Hypertext '91 conference, San Antonio, Texas, December 1991. See also Outline.

Contact: Tim Berners-Lee, timbl@nxoc01.cern.ch, tel. +41 (22) 767 3755, fax +41 (22) 767 7155. Paper mail as above.

Abstract

The project ("WorldWideWeb", or W3) at the European Particle Physics Laboratory (CERN) to build a hypertext information system across diverse platforms was faced with a multitude of different environments. In an attempt to allow interworkingof many vendor's products, an architecture was developed which does not assume a common file system nor a common link repository. Format negotiation between information supplier and requester, and equivalence of hypertext and index searches provided a architecture flexible enough to allow a high level of interworking without constraining future developments.

Context and constraints of a heterogeneous environment

Work in High Energy Physics goes on in a small community of computer-literate people. They are located in national and international laboratories and universities spread all over the world. They rely on rapid dissemination of information, today mainly achieved by electronic mail and fax, but increasingly by direct file transfer between computers. The community uses computing equipment of all kinds without constraints.

In this context, the problem we address is the design of a flexible hypertext system to allow both centralized distribution of data and collaborative work. The specific goal of W3 is to share information with ease between different systems, on which different standards are current. We wished to provide access to existing data without affecting the way that data is sourced and managed. Hypertext data must exist alongside normal text and graphic data, and indexes.

Our survey of the existing documentation formats, and the historical impossibility of enforcing a given standard throughout the High Energy Physics community, indicated that we would have to deal with a large variety of formats.

These include SGML* of various colors, LaTeX*, Microsoft's Rich Text Format*, nroff, PostScript* and "DVI" files, several graphics formats, and plain ASCII text. As hypermedia development is progressing so rapidly, we would be foolish not to leave the list open-ended.

User requirements

Whereas within a single working group, a high functionality is required of an authoring system, distant groups are often prepared to accept a reduced service for the sake of good connectivity. For example, the members of a design team will tend to use the same types of machine, will run the same application, and may use sophisticated co-authoring facilities, while users at a distance are often happy to have read-only copies of the same documents in plain text. With the aid of existing conversion tools, it is in fact often possible for both parties to be satisfied.

Therefore, the system must adapt, giving a high functionality when there is a high overlap between environments, and falling back on lower functionality when there is not. This "L-shaped" strategy [pic] makes the best use of existing facilities.

[Picture: Some conversion paths between a few of the more well-known formats for textual data]

Format negotiation

The WWW model [fig] is a client-server model, in which the information is stored by hypertext database servers and accessed by browser clients (henceforth called browsers). When requesting a document by name or by keyword search, the browser informs the server of the set of formats which it understands, with a rough estimate of the quality degradation and the time penalty which accepting data in each format would impose.


(?? do you propose a numerical value for quality degradation, or a set of classes?)

The server knows the format or formats in which it has the document. It can make its own estimate of the desirability of converting the document into other formats for transmission. It will transmit the document to the browser in the format it considers optimal.

Note that on the browser side, various conversions might then still occur, and a particular application may be run to present the incoming data.

In the special case in which the browser and server share a file system, the server is an access library linked to the browser code, so explicit transmission of the data between server and browser is not necessary.

Any browser must be prepared to handle a basic subset of data formats. This includes a simple marked-up hypertext format, and plain ASCII text.
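
To make the negotiation concrete, the following sketch in C shows how a server might choose, from the formats a browser has declared acceptable, the one with the lowest combined quality degradation, time penalty and conversion cost. The structures, names and numerical weights are illustrative assumptions of this sketch; no particular representation or weighting scheme is prescribed.

    #include <stdio.h>
    #include <string.h>

    /* A format the browser has declared it can accept, with its rough
       estimate of the quality degradation and time penalty involved.
       (Illustrative structure only; no particular representation or
       weighting is prescribed by the architecture.) */
    struct accept_format {
        const char *name;        /* e.g. "sgml", "ascii", "rtf" */
        double degradation;      /* 0.0 = perfect, 1.0 = unusable */
        double time_penalty;     /* relative cost of handling this format */
    };

    /* A format the server can supply, with its own conversion cost. */
    struct server_format {
        const char *name;
        double conversion_cost;
    };

    /* Choose the acceptable format with the lowest total cost. */
    const char *negotiate(const struct accept_format *accepts, int na,
                          const struct server_format *supplies, int ns)
    {
        const char *best = NULL;
        double best_cost = 1e9;

        for (int i = 0; i < na; i++)
            for (int j = 0; j < ns; j++)
                if (strcmp(accepts[i].name, supplies[j].name) == 0) {
                    double cost = accepts[i].degradation
                                + accepts[i].time_penalty
                                + supplies[j].conversion_cost;
                    if (cost < best_cost) {
                        best_cost = cost;
                        best = accepts[i].name;
                    }
                }
        return best;    /* NULL: no overlap, fall back to plain ASCII */
    }

    int main(void)
    {
        struct accept_format browser[] = {
            { "sgml",  0.0, 0.2 },
            { "ascii", 0.5, 0.0 },
        };
        struct server_format server[] = {
            { "latex", 0.0 },
            { "sgml",  0.3 },    /* cost of converting LaTeX to SGML */
            { "ascii", 0.1 },
        };
        const char *chosen = negotiate(browser, 2, server, 3);
        printf("chosen format: %s\n", chosen ? chosen : "ascii (fallback)");
        return 0;
    }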

Uniform view of information

Much discussion [...] has arisen over the relationship between hypertext and conventional index-based information retrieval methods. The WWW architecture treats an index as a hypertext document which accepts an index query operation. The result of the index query is a "virtual hypertext node" containing links to all the information found [pic: example]. In this model, hypertext documents and indexes may both lead to other documents or indexes. The reader does not perceive any difference in behaviour, and this equivalence gives the system simplicity of use and makes browsing more intuitive.

Certain other existing data may be converted by the server into hypertext on the fly. In this way, a simple browser program gives the reader a uniform view of many sources of data. One example is a usenet news article [RFC], whose header fields contain links to other articles and newsgroups; another is information derived from a database, in which fields may be linked to the results of further database queries.
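
As an illustration of both points, the following C sketch shows how a server might build the virtual hypertext node returned for an index query: a document whose only content is links to the items found. The hit structure and the simplified mark-up are inventions of this example and not the actual hypertext format used.

    #include <stdio.h>

    /* One hit returned by whatever index engine the server uses.
       (Invented structure: the real index interface is server-specific.) */
    struct hit {
        const char *address;    /* W3 address of the document found */
        const char *title;
    };

    /* Emit a virtual hypertext node whose only content is links to the
       items found.  The mark-up shown is a simplified illustration, not
       the actual hypertext format used by the prototype. */
    void emit_virtual_node(const char *query, const struct hit *hits, int n)
    {
        printf("<title>Search results for \"%s\"</title>\n", query);
        for (int i = 0; i < n; i++)
            printf("<link address=\"%s\">%s</link>\n",
                   hits[i].address, hits[i].title);
    }

    int main(void)
    {
        struct hit hits[] = {
            { "http://node/doc/manual",     "Accelerator manual" },
            { "http://node/doc/newsletter", "Internal newsletter" },
        };
        emit_virtual_node("accelerator", hits, 2);
        return 0;
    }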

Uniform addressing

There are several reasons for having a unique name for any document, including the virtual hypertext produced by an index search.

One reason is that this aids the caching of data. Any format conversion done by the server, the transmission of the data across a wide area network, and any format conversion done by the browser, are all operations which could take time. Server and browser programs will very possibly wish to cache the results of some of these steps for some minutes or days.

A second reason is that it is useful to make hypertext links to the result of an index search.

In our arbitrary convention, the name of the result of a search is the WWW address of the index, followed by a question mark and a list of keywords:

http://node/path/path/path?keyword+keyword+keyword

The format of such an address must be the same throughout the system. Whilst a browser does not need to understand the details, it must be able to determine whether two addresses are equivalent. An obvious extension to this format would be to allow a sophisticated database query to be included: in its general form the address may be regarded as a functional expression for the document. This suggests the use of a generalised query language such as SQL*. For interchange, one must avoid a plethora of proprietary scripting languages.
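
The comparison of addresses can itself be kept very simple. The following C sketch tests whether two addresses of the form given above refer to the same document. The single normalisation applied, ignoring the case of the node name, is an assumption of this example rather than part of the convention.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Decide whether two addresses of the form
           http://node/path/path/path?keyword+keyword+keyword
       refer to the same document.  The only normalisation applied here,
       comparing the node name without regard to case, is an assumption
       of this sketch; path and keywords are compared exactly as given. */
    int same_address(const char *a, const char *b)
    {
        const char *pa = strstr(a, "//");
        const char *pb = strstr(b, "//");
        if (pa == NULL || pb == NULL)
            return strcmp(a, b) == 0;

        /* The part before "//" (the access scheme) must match exactly. */
        if (pa - a != pb - b || strncmp(a, b, (size_t)(pa - a)) != 0)
            return 0;

        /* Compare the node names case-insensitively. */
        pa += 2; pb += 2;
        while (*pa != '\0' && *pa != '/' && *pa != '?'
               && *pb != '\0' && *pb != '/' && *pb != '?') {
            if (tolower((unsigned char)*pa) != tolower((unsigned char)*pb))
                return 0;
            pa++; pb++;
        }

        /* The remainder (path and keywords) must match exactly. */
        return strcmp(pa, pb) == 0;
    }

    int main(void)
    {
        printf("%d\n", same_address("http://Node/doc?muon+decay",
                                    "http://node/doc?muon+decay"));  /* 1 */
        printf("%d\n", same_address("http://node/doc?muon",
                                    "http://node/doc?decay"));       /* 0 */
        return 0;
    }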

This address is treated as a document address. The virtual document to which it refers may be treated like any other document: for instance, one can make a link to it. In this case, any time a user follows the link, the search will be re-evaluated (subject to caching). In general, this is a function of the server. The index document itself contains text which instructs a human reader how a search may be made. To help the casual user, different index documents may exist which in fact cover the same data but use different search algorithms, such as full-text word indexing or regular expression matching with Boolean operators.
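
Caching, mentioned above as one motivation for unique names, can be keyed directly on the document address, whether the document is real or virtual. A minimal sketch in C follows; the table size, expiry policy and function names are inventions of this example rather than part of the architecture.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define CACHE_SLOTS 64

    /* One cached result, keyed by the document's unique address.
       (Illustrative only: a real browser or server would also record
       which format the cached copy is held in.) */
    struct cache_entry {
        char    address[256];
        char   *contents;       /* NULL while the slot is unused */
        time_t  fetched;
    };

    static struct cache_entry cache[CACHE_SLOTS];
    static int next_slot;

    /* Keep a copy of freshly fetched (or converted) data. */
    void cache_store(const char *address, const char *contents)
    {
        struct cache_entry *e = &cache[next_slot++ % CACHE_SLOTS];
        free(e->contents);
        strncpy(e->address, address, sizeof e->address - 1);
        e->address[sizeof e->address - 1] = '\0';
        e->contents = malloc(strlen(contents) + 1);
        if (e->contents != NULL)
            strcpy(e->contents, contents);
        e->fetched = time(NULL);
    }

    /* Return cached data if present and younger than max_age seconds. */
    const char *cache_lookup(const char *address, long max_age)
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            if (cache[i].contents != NULL
                && strcmp(cache[i].address, address) == 0
                && time(NULL) - cache[i].fetched <= max_age)
                return cache[i].contents;
        return NULL;            /* not cached: fetch, convert, search again */
    }

    int main(void)
    {
        cache_store("http://node/index?muon", "virtual node contents ...");
        printf("%s\n", cache_lookup("http://node/index?muon", 3600)
                       ? "cache hit" : "cache miss");
        return 0;
    }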

@@problem of identifying equivalent addresses.

Protocol

A proper definition of the interface between browser and server requires a protocol to be defined. The functionality required is similar to that of the FTP protocol*, but with the added format negotiation and search features. In practice we used a trivial protocol on top of TCP, with a separate connection being made for each (atomic) request.

The remote search facility, in the case in which there is a wide area network between browser and server, obviously saves the time which would otherwise be required for the browser itself to access a large remote index. Furthermore, it allows indexes to be maintained and used in different ways on different servers, perhaps using proprietary products most suited to the server's environment.

When the hypertext node has been located by the server, its contents are transferred to the browser's machine.
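
For illustration, the following C sketch issues one such atomic request over a fresh TCP connection and copies the reply to standard output. The request line "GET <document>" and the port number are assumptions of this sketch rather than a defined standard; the essential points are that a separate connection is made for each request and that the end of the document is signalled by the server closing the connection.

    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Issue one atomic request over a fresh TCP connection and copy the
       reply to standard output.  The request line "GET <document>" and
       the port number are assumptions of this sketch, not a defined
       standard. */
    int main(int argc, char **argv)
    {
        const char *host = argc > 1 ? argv[1] : "node";           /* placeholder */
        const char *doc  = argc > 2 ? argv[2] : "/path/document"; /* placeholder */

        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, "80", &hints, &res) != 0) {
            fprintf(stderr, "cannot resolve %s\n", host);
            return 1;
        }

        int s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (s < 0 || connect(s, res->ai_addr, res->ai_addrlen) < 0) {
            perror("socket/connect");
            return 1;
        }
        freeaddrinfo(res);

        /* One request per connection: send it, then just read. */
        char request[512];
        snprintf(request, sizeof request, "GET %s\r\n", doc);
        if (write(s, request, strlen(request)) < 0) {
            perror("write");
            return 1;
        }

        char buf[4096];
        ssize_t n;
        while ((n = read(s, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(s);    /* the connection is not reused for further requests */
        return 0;
    }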

Lack of central database

The browser must have access to all the links emanating from a node. Since the node contents are transferred to the browser's machine, the links must come along with these contents or be part of them: when a document is transferred, so is the information about any links from that document. The link information must therefore be available to the server directly.

No third party is involved in the protocol. In the case of files marked up in SGML, the link data is stored in the same file as the document. The advantages of this approach are

The disadvantage of this approach is that if a document is moved, the link information stored in remote nodes must be updated. This problem, which remains with a link repository model once the database of links itself becomes distributed, may be solved using directory systems such as [NCS location broker, X500] to convert between names and addresses as required, as is already done for electronic mail*. The WWW open naming convention allows logical name spaces served by various name servers to be included.

Experience to date

At the time of writing, our prototype system includes a hypertext browser/authoring system written on top of the NeXTStep* human interface tools, and a dumb-terminal-oriented browser. A prototype server on a large mainframe provides access to indexes and documents for High Energy Physics, including internal newsletters, manuals, news and help; internet news articles using the NNTP protocol*; and local and remote files in a number of formats. These indexes are maintained using existing FIND* software. Hypertext documents may contain pointers to other documents, to indexes, and also to the results of keyword searches. Non-hypertext documents are included in the scheme, to a total of around @@@@ documents, of which @@@@ are in hypertext. We have written personal notes and project documentation in hypertext, linking them to external information where relevant. The policy of providing access to existing data has allowed the system to cross the threshold of usefulness early on. [picture of existing software]

Conclusion

The chief advantage of the architecture described has been the great independence of browser and server technologies. The various parts use very different technologies [see fig] and have been enhanced without any global changes being made. As new products arise, provided conversion facilities exist, we feel confident that we can continue to offer wide-area hypertext connectivity, perhaps at relatively low functionality, but without impairing the high functionality available to those who share documents in a single data format within a local area. We hope that this situation will allow freer interchange of information in the High Energy Physics community, and allow de facto standards for interchange formats to arise naturally.

References

(!! I need to scan the book by Nelson)
