An Alternative Architecture for Distributed
Hypertext
T. Berners-Lee, R. Cailliau, N. Pellow, B. Pollermann
CERN, 1211 Geneva 23, Switzerland.
Contact: Tim Berners-Lee, timbl@nxoc01.cern.ch.
Tel: +41 (22) 767 3755, Fax: +41 (22) 767 7155. Paper mail as above.
The project ("WorldWideWeb", or W3)
at the European Particle Physics
Laboratory (CERN) to build a wide-area
hypertext information system was
faced with a multitude of different
environments on diverse platforms.
In an attempt to allow interworking of many vendors' products, an architecture was developed which assumes neither a common file system nor a common link repository. Format negotiation between information supplier and requester, and the equivalence of hypertext and index searches, provided an architecture flexible enough to allow a high level of interworking without constraining future developments.
Work in High Energy Physics goes
on in a small community of computer-literate
people. They are located in national
and international laboratories and
universities spread all over the
world. They rely on rapid dissemination
of information, today mainly achieved
by electronic mail and fax, but increasingly
by direct file transfer between computers.
The community uses computing equipment
of all kinds without constraints.
In this context, the problem we address
is the design of a flexible hypertext
system to allow both centralized
distribution of data and collaborative
work. The specific goal of W3 is
to share information with ease between
different systems on which different
standards are current. We wished
to provide access to existing data
without affecting the way that data
is sourced and managed. Hypertext
data must exist alongside normal
text and graphic data, and indexes.
Our survey of the existing documentation
formats, and the historical impossibility
of enforcing a given standard throughout
the High Energy Physics community,
indicated that we would have to deal
with a large variety of formats.
(Footnote: these include SGML* of various flavours, LaTeX*, Microsoft's Rich Text Format*, nroff, PostScript* and "DVI" files, several graphics formats, and of course plain ASCII text. As hypermedia development is progressing so rapidly, we would be foolish not to leave the list open-ended.)
In a single small working group, an authoring system must offer high functionality. However, in large
groups spread over a large geographical
area, members are often prepared
to accept a reduced service for the
sake of good connectivity. For example,
a physics detector design team in
Geneva will tend to use the same
type of machine, will run the same
application, and may use sophisticated
co-authoring facilities. But the
other teams of the collaboration,
perhaps in Sweden, Portugal, Chicago
etc., may be happy to have read-only
copies of the same documents in plain
text. With the aid of existing conversion
tools, it is in fact often possible
for both parties to be satisfied.
Therefore, the system must adapt,
giving a high functionality when
there is a high overlap between environments,
and falling back on lower functionality
when there is not. This "L-shaped"
strategy [ pic ] makes the best use
of existing facilities.
The WWW model [fig] is a client-server model, in which the information is stored by hypertext database servers and accessed by browser clients (henceforth called browsers). The contents of a hypertext node are sent to the browser by the server. When requesting a document, the browser informs the server of the set of formats which it understands, with a rough estimate of the degradation of quality and the time penalty it would accept for data in each format.
The server knows the format or formats in which it possesses the document. It can make its own estimate of the desirability of converting the document into other formats for transmission. It will transmit the document to the browser in the format it considers optimal.
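As an illustration only, the following Python sketch shows one way such a choice could be computed. The format names, degradation and penalty figures, and the conversion cost table are all invented for the example; the system described here exchanges only the preferences themselves.

    def choose_format(accepts, stored, costs):
        """Pick the transmission format the server considers optimal:
        the understood format with the lowest combined quality degradation
        and conversion cost, subject to the browser's time tolerance."""
        best, best_score = None, None
        for fmt, (degradation, max_penalty) in accepts.items():
            if fmt in stored:
                cost = 0.0                  # already held in this format
            else:
                # cheapest conversion from any stored format, if one exists
                paths = [costs[(s, fmt)] for s in stored if (s, fmt) in costs]
                if not paths:
                    continue
                cost = min(paths)
            if cost > max_penalty:
                continue                    # browser would not wait this long
            score = degradation + cost
            if best_score is None or score < best_score:
                best, best_score = fmt, score
        return best

    # What a browser might declare: format -> (quality degradation it
    # assigns, maximum time penalty it will accept for that format).
    browser_accepts = {"hypertext": (0.0, 5.0), "ascii": (0.5, 1.0)}

    # What a server might know: stored formats and conversion cost estimates.
    stored_formats = {"latex"}
    conversion_cost = {("latex", "hypertext"): 2.0, ("latex", "ascii"): 0.5}

    print(choose_format(browser_accepts, stored_formats, conversion_cost))
    # -> 'ascii': here the cheap conversion beats the costlier one to hypertext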
[Picture: Some conversion paths between a few of the better-known formats for textual data]
Note that on the browser side various
conversions might then still occur,
and a particular application may
be run to present the incoming data.
In the special case in which the
browser and server share a file system,
the server can be an access library
linked to the browser code, so explicit
transmission of the data between
server and browser is not necessary.
Any browser must be prepared to handle
a basic subset of data formats. This
includes a simple marked-up hypertext
format, and plain ASCII text.
Uniform view of information
There has been much discussion [CROFT90] of the relationship between hypertext and conventional index-based information retrieval methods. Because the WWW servers are decoupled from the browsers, they can provide essentially any function: a server can be built to treat an index as a hypertext document which accepts an index query operation.
The result of the index query is
a "virtual hypertext node" containing
links to all the information found.
[Picture: example]. In this model, hypertext
documents and indexes may both lead
to other documents or indexes. The
reader does not perceive any difference
in behaviour, and this equivalence
gives the system simplicity of use,
and makes browsing more intuitive.
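A small sketch may make the idea concrete. The index representation and the generated node below are invented for illustration; the point is only that the result of a query is itself a document full of links.

    def query_index(index, keywords):
        """Generate a virtual hypertext node: a document whose links
        point at every entry matching all the keywords."""
        hits = [addr for addr, words in index.items()
                if all(k in words for k in keywords)]
        lines = ["Result of searching for: " + " ".join(keywords), ""]
        lines += ["-> " + addr for addr in hits]   # each hit becomes a link
        return "\n".join(lines)

    index = {
        "http://node/phys/detector":   {"detector", "design"},
        "http://node/phys/newsletter": {"news", "detector"},
    }
    print(query_index(index, ["detector"]))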
Certain other existing data may be
converted by the server into hypertext
on the fly. In this way, even a simple
browser program gives the reader
a uniform view of many sources of
data. An example is a usenet news article [RFC], whose header fields contain links to other articles and newsgroups; another is information derived from a database, for which fields may be linked to the results of further database queries.
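For instance, a server-side conversion of a news article's References header might look like the following sketch; the header contents and the "news:" address form are assumptions for the example.

    def news_to_hypertext(header):
        """Turn each message-id in the References header into a link,
        so the article browses like any other hypertext node."""
        refs = header.get("References", "").split()
        return ["news:" + msg_id.strip("<>") for msg_id in refs]

    header = {"References": "<123@cern.ch> <456@cern.ch>"}
    print(news_to_hypertext(header))
    # ['news:123@cern.ch', 'news:456@cern.ch']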
Uniform addressing
There are several reasons for having
a unique name for any document, including
the virtual hypertext produced by
a server after an index search.
One reason is that this aids the
caching of data. Any format conversion
done by the server, the transmission
of the data across a wide area network,
and any format conversion done by
the browser, are all operations which
could take time. Server and browser
programs will very possibly wish
to cache the results of some of these
steps for some minutes or days.
A second reason is that it is useful
to make hypertext links to the result
of an index search.
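As a sketch of the caching argument, a browser or server might keep the results of these expensive steps keyed by the document's unique name and the delivered format; the lifetime and policy below are illustrative assumptions.

    import time

    cache = {}      # (document address, format) -> (expiry time, data)
    TTL = 3600      # keep entries for an hour, say

    def cached_fetch(address, fmt, fetch):
        """Reuse a recent result of the fetch-and-convert pipeline,
        keyed by the document's unique name and the delivered format."""
        key = (address, fmt)
        entry = cache.get(key)
        if entry and entry[0] > time.time():
            return entry[1]             # still fresh: skip the expensive steps
        data = fetch(address, fmt)      # network transfer and/or conversion
        cache[key] = (time.time() + TTL, data)
        return data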
In our arbitrary convention, the
name of the result of a search is
the WWW address of the index, followed
by a question mark and a list of
keywords:
http://node/path/path/path?keyword+keyword+keyword
The format of such an address must
be the same throughout the system.
Whilst a browser does not need to
understand the details, it must be
able to determine whether two addresses
are equivalent. An obvious extension
to this format would be to allow
a sophisticated database query to
be included: in its general form
the address may be regarded as a
functional expression for the document.
This suggests the use of a generalised
query language such as SQL*. However,
for interchange, one must avoid a
plethora of proprietary scripting
languages.
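The following sketch shows one way addresses of this form could be composed and compared. The rule that keyword order is insignificant is an assumption for illustration; the text requires only that equivalence be decidable.

    def make_address(index_address, keywords):
        """Compose a search address: index address, '?', keywords with '+'."""
        return index_address + "?" + "+".join(keywords)

    def equivalent(a, b):
        """Two addresses are taken as equivalent if they name the same
        index and the same set of keywords, regardless of keyword order."""
        base_a, _, query_a = a.partition("?")
        base_b, _, query_b = b.partition("?")
        return (base_a == base_b and
                set(query_a.split("+")) == set(query_b.split("+")))

    addr = make_address("http://node/path/index", ["particle", "decay"])
    print(addr)                                                  # http://node/path/index?particle+decay
    print(equivalent(addr, "http://node/path/index?decay+particle"))  # True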
This address is treated as a document
address. The virtual document to
which it refers may be treated as
any other document. For instance,
one can make a link to it. In this
case, any time a user follows a link,
the search will be reevaluated (subject
to caching). In general, this is
a function of the server. The index
document itself contains text which
instructs a human reader how a search
may be made. To help the casual user,
different index documents may exist
which in fact cover the same data,
but use different search algorithms,
such as full text word indexing,
or regular expression matching with
boolean operators.
The problem of identifying equivalent addresses in the general case remains to be addressed.
Protocol
The interface between browser and server requires a properly defined protocol. The functionality required is similar to that of the FTP protocol*, with the addition of format negotiation and search features.
In practice we used a trivial protocol
on top of TCP, with a separate connection
being made for each (atomic) request.
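A minimal sketch of such a one-connection-per-request exchange follows; the "GET" request syntax shown is an assumption for illustration, not the actual wire format, and the server is taken to close the connection once the document has been sent.

    import socket

    def fetch(host, port, document_address):
        """Fetch one document over one TCP connection (one atomic request)."""
        with socket.create_connection((host, port)) as s:
            s.sendall(("GET " + document_address + "\r\n").encode("ascii"))
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:            # server closes the connection at the end
                    break
                chunks.append(data)
        return b"".join(chunks)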
The remote search facility, in the case in which a wide area network separates browser and server, obviously saves the time the browser would otherwise need to access a large index remotely. Furthermore,
it allows indexes to be maintained
and used in different ways on different
servers, perhaps using proprietary
products most suited to the server's
environment.
When the hypertext node has been
located by the server, its contents
are transferred to the browser's
machine.
Lack of central database
There are two schemes for storing links:
- as part of the contents of the hypertext node to which they belong,
- in a separate database of links outside the nodes.
The browser must have access to all
the links emanating from a node.
Therefore, since in WWW the node
contents are transferred to the browser's
machine, the links must be shipped
along with these contents or be part
of them. This implies that the server must know how to retrieve the relevant links if they are stored in a separate database. WWW therefore keeps the links as part of the node contents.
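As an illustration, suppose a node's contents carry their links in a simple hypothetical anchor markup; then every link leaves the server along with the text, and nothing else need be consulted. The markup shown is invented, standing in for whichever simple marked-up hypertext format is used.

    import re

    node = """The <a=http://node/phys/detector>detector design</a> notes
    are kept with the <a=http://node/phys/index?detector>index</a>."""

    def links_of(contents):
        """All links emanating from a node travel with its contents."""
        return re.findall(r"<a=([^>]+)>", contents)

    print(links_of(node))
    # ['http://node/phys/detector', 'http://node/phys/index?detector']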
No third party is involved in the protocol. In the case of SGML marked-up files, the link data is stored in the same file as the document. The advantages of this approach are:
- Coordination is not needed between
a link database and an editor when
a new document is made.
- If a link database is needed, for
example for performing statistics,
management, or for the generation
of overview representations, then
it can be generated by a background
process, in much the same way as
a full text index will typically
be updated from the source documents
overnight. Such databases, like indexes, become value-added services performed by a "knowbot*"; a sketch follows this list. Indeed, the indexes and the links become much more similar when viewed in this way, enhancing the homogeneous nature of the combined system. Given the "generic link" concept [microcosm, perseus90], in which the distinction between an explicit link and an index search becomes blurred, this homogeneity is all the more important.
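Such a background sweep might look like the following sketch, reusing the same hypothetical anchor markup as above; the document store is invented for the example.

    import re

    def build_link_database(documents):
        """Invert the embedded links: map each address to the addresses
        of the documents that point at it (run, say, overnight)."""
        incoming = {}
        for addr, contents in documents.items():
            for target in re.findall(r"<a=([^>]+)>", contents):
                incoming.setdefault(target, []).append(addr)
        return incoming

    documents = {
        "http://node/notes": "See the <a=http://node/phys/detector>design</a>.",
        "http://node/news":  "More on the <a=http://node/phys/detector>detector</a>.",
    }
    print(build_link_database(documents))
    # {'http://node/phys/detector': ['http://node/notes', 'http://node/news']}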
The disadvantage of this approach
is that if a document is moved, the
link information stored in remote
nodes must be updated. This problem,
which is still present with a link
repository model once the database
of links becomes distributed, may
be solved using directory systems
such as [NCS location broker, X500]
to convert between names and addresses
as is already done for electronic mail*. The WWW open naming convention
allows logical name spaces served
by various name servers to be included.
Experience to date
At the time of writing, our prototype
system includes a hypertext browser/authoring
system written on top of the NeXTStep*
human interface tools, and a dumb-terminal
oriented browser. A prototype server
on a large mainframe provides access
to indexes and documents for High
Energy Physics including internal
newsletters, manuals, news and help;
internet news articles using the NNTP protocol*, and local and remote
files in a number of formats. These
indexes are maintained using existing
FIND* software. Hypertext documents
may contain pointers to other documents,
to indexes, and also to the results
of keyword searches. Non-hypertext
documents are included in the scheme,
to a total of around @@@@ documents
of which @@@@ are in hypertext. We
have written personal notes and project documentation in hypertext, linking them to external information where relevant. The policy of providing access to existing data has allowed the system to cross the threshold of usefulness early on. [Picture: existing software]
Conclusion
The chief advantage of the architecture
described has been the great independence
of browser and server technologies.
The various parts use very different
technologies [see fig] and have been
enhanced without any global changes
being made. As new products arise, provided conversion facilities exist, we feel confident that we can continue to offer wide-area hypertext connectivity, perhaps at relatively low functionality, but without impairing the high functionality available to those who share documents in a single format within a local area.
We hope that this situation will
allow freer interchange of information
in the High Energy Physics community,
and allow de facto standards for
interchange formats to arise naturally.