Tim Berners-Lee

Date: September 1998. Last modified: $Date: 1998/12/19 18:35:03 $

Status: An attempt to give a high-level overview of the architecture of the WWW. Editing status: Draft. Comments welcome. Comments have been explicitly solicited from all W3C chairs, W3C Advisory Committee members, and the IETF Apps and Transport Areas.

Up to Design Issues


Web Architecture from 50,000 feet

This document attempts to be a high-level view of the architecture of the World Wide Web. It is not a definitive complete explanation, but it tries to enumerate the architectural decisions which have been made, show how they are related, and give references to more detailed material for those interested. Necessarily, from 50,000 feet, large things seem to get a small mention. It is architecture, then, in the sense of how things hopefully will fit together. I have resisted the urge, and requests, to write an architecture document for a long time because, in a sense, any attempt to select which of the living ideas seem most stable, logically connected, and essential enough to include produces a dead document. So we should recognise that while it might be slowly changing, this is also a living document.

The document is written for those who are technically aware or intend soon to be, so it is sparse on explanation and heavy on terminology.

Goal

The W3C's broadly stated mission is to lead the Web to its "full potential", whatever that means. My definition of the Web is a universe of network-accessible information, and I break the "full potential" into two by looking at it first as a means of human-to-human communication, and then as a space in which software agents can, through access to a vast amount of everything which is society, science and its problems, become tools to work with us.

(See keynote speeches such as "Hopes for the future" at the Quebec Internet Forum, written up in outline, for example, in the short essay "Realizing the full potential of the Web")

In this overview I will deal first with the properties of the space itself, then look at its use as a human medium, and then as a medium for machine reasoning.

This article takes the goals of interoperability and of creating an evolvable technology for granted throughout. The principles of universality of access irrespective of hardware or software platform, network infrastructure, language, culture, geographical location, or physical or mental impairment are core values in Web design: they so permeate the work described that they cannot be mentioned in any one place but will likewise be assumed throughout. (See Internationalization and the Web Accessibility Initiative)

Principles of Design

Similarly, we assume throughout the design process certain general notions of what makes good design. Brian Carpenter has enumerated some of these [carpenter]. Principles such as simplicity and modularity are the stuff of software engineering; decentralization and tolerance are the life and breath of the Internet. To these we might add the principle of least powerful language, and the test of independent invention, when considering evolvable Web technology. I do not elaborate on these here.

The fundamentals: The Universal Web

The most fundamental specification of Web architecture, while one of the simpler ones, is that of the Universal Resource Identifier, or URI. The principle that anything, absolutely anything, "on the Web" should be identified distinctly by an otherwise opaque string of characters is core to the universality.

The URI specification in fact had a rocky history and has spent a lot of time on the back burner, but has recently been making progress.

(See the URI specification)

There are many design decisions about the properties of URIs which are fundamental in that they determine the properties of the Web, but which I will not go into here. They include the rules for the parsing and use of relative URI syntax, and the relationship of view identifiers (fragment ids) to URIs. It is important that these are respected in the design of new URI schemes.

(See the first few Design Issues articles for detailed discussions of these)
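To make those parsing rules concrete, here is a minimal sketch using Python's standard library showing relative reference resolution and the separation of a fragment identifier from the URI proper; the URIs themselves are invented for illustration.

```python
# A minimal sketch of relative URI resolution and fragment handling,
# using Python's standard library; the URIs are made up.
from urllib.parse import urljoin, urldefrag

base = "http://example.org/docs/architecture/overview.html"

# A relative reference is interpreted against the URI of the document
# in which it appears.
print(urljoin(base, "../images/diagram.png"))
# -> http://example.org/docs/images/diagram.png

# The fragment identifier ("view identifier") is not sent to the server;
# it is interpreted by the client against the retrieved representation.
uri, fragment = urldefrag("http://example.org/docs/overview.html#section2")
print(uri)       # http://example.org/docs/overview.html
print(fragment)  # section2
```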

URI schemes

The Web is by design and philosophy a decentralized system, and its vulnerabilities lie wherever a central facility exists. The URI specification raises one such general vulnerability, in that the introduction of a new URI scheme is a potential disaster, immediately breaking interoperability.

Guidelines for new Web developments are that they should respect the generic definition and syntax of URIs; not introduce new URI schemes without due cause; and not introduce any different scheme which puts itself forward as universal, as a superset of URIs, which would effectively require information apart from a URI to be used as a reference. Also, in new developments, all significant objects with any form of persistent identity should be "first class objects" for which a URI exists. New systems should use URIs wherever a reference exists, without constraining which scheme (old or new) is chosen.

The principle of minimalist design requires that the URI super-space itself places the minimum constraint upon any particular URI scheme space in terms of properties such as identity, persistence and dereferenceability. In fact, the distinction between names and addresses blurs and becomes dangerously confusing in this context. (See Name myths). To discuss the architecture of that part of the Web which is served using HTTP we have to become more specific.

Specific schemes

A few spaces in which identity is fairly well defined, but which have no defined dereferencing protocol, are worthy of note: the message identifier (mid) and content identifier (cid) spaces adopted from the MIME world, the md5: hash code with verifiable pure identity, and the pseudo-random Universally Unique Identifier (uuid) from the Apollo domain system and its followers. These may be underused as URIs.

It is also worth pointing out the usefulness of URIs which define communication endpoints that do have a persistent identity, even for connection-oriented technologies for which there is no other addressable content. For example, the "mailto" scheme (which should have been called "mailbox") represents conceptually a mailbox, the most fundamental and very widely used object in the email world. It is a mistake to take the URI as a verb. Typical browsers represent a "mailto:" URI as a window for sending a new message to the address, but opening an address book entry and a list of messages previously received from or sent to that mailbox would also be useful representations.

There is an open question as to what the process should be for formulating new URI schemes, but it is clear that to allow unfettered proliferation would be a serious mistake. In almost all other areas, proliferation of new designs is welcomed and the Web can be used as a distributed registry of them, but not for the case of URI schemes.

It is reasonable to consider URI spaces which are designed to have greater persistence than most URIs have today, but not technical solutions with no social foundation.

The HTTP space

The most well-known URI space is the HTTP space, characterized by a flexible notion of identity (see Generic URIs), a richness of information about and relating resources, and a dereferencing algorithm which is currently defined for reference by the HTTP 1.1 wire protocol. In practice, caching, proxying and mirroring schemes augment HTTP, so dereferencing may take place even without HTTP being invoked directly at all.

(See the HTTP 1.1 protocol specification.)

The HTTP space consists of two parts, one hierarchically delegated, for which the Domain Name System is used, and the second an opaque string whose significance is locally defined by the authority owning the domain name.

(See the DNS specification)
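As a rough illustration of these two parts, the following sketch splits a hypothetical HTTP URI into the DNS-delegated authority and the locally defined remainder; it is an example, not a normative parsing algorithm.

```python
# A sketch of the two parts of an HTTP URI: the hierarchically delegated
# authority (resolved through DNS) and the opaque, locally defined path.
# The host name here is hypothetical.
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.org/1998/12/architecture?view=summary")

print(parts.netloc)  # www.example.org -- delegated through the DNS hierarchy
print(parts.path)    # /1998/12/architecture -- meaning defined by the owner
print(parts.query)   # view=summary
```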

The Achilles' heel of the HTTP space is the only centralized part, the ownership and government of the root of the DNS tree. As a feature common and mandatory to the entire HTTP Web, the DNS root is a critical resource whose governance by and for the world as a whole in a fair way is essential. This concern is not currently addressed by the W3C.

The question of improving the persistence of URIs in the HTTP space involves issues of tool maturity, user education, and maturity of the Web society.

Research work elsewhere has included many "naming" schemes variously similar or dissimilar to HTTP, the phrase "URN" being used either for any such scheme or for one particular scheme. The existence of such projects should not be taken to indicate that persistence of HTTP URIs should not also be pursued, or that URIs in general should be partitioned into "names" and "addresses". It is extremely important that if a new space is created, it be available as a sub-space of the universal URI space, so that the universality of the Web is preserved, and so that the power of the new space be usable for all resources.

One can expect HTTP to mature to provide alternate more modern standard ways of dereferencing HTTP addresses, whilst keeping the same (hierarchy+opaque string) address space.

State distribution protocols

Currently on the Internet, HTTP is used for Web pages, SMTP for email messages, and NNTP for network news. The curious thing about this is that the objects transferred are basically all MIME objects, and that the choice of protocol is an optimization made by the user, often erroneously. An ideal situation is one in which the "system" (machines, networks and software) decides adaptively which sorts of protocols to use to efficiently distribute information, dynamically as a function of readership. This question of an efficient flexible protocol blending fetching on demand with preemptive transmission is currently seen as too much of a research area for W3C involvement.

Content and Remote Operations

The URI specification effectively defines a space, that is, a mapping between identifiers (URIs) and resources. In theory this is all that is needed to define the space, but in order to make the content of the space available, the operation of dereferencing an identifier is a fundamental necessity. In HTTP this is the "GET" operation. In the Web architecture, GET therefore has a special status. It is idempotent, and HTTP has many mechanisms for refining concepts of idempotency and identity. While other remote operations on resources (objects) in the Web are quite valid, and some are indeed included in HTTP, the properties of GET are an important principle. The use of GET for any operation which has side-effects is incorrect.

The introduction of any other idempotent method apart from GET is also incorrect, because the results of such an operation effectively form a separate address space, which violates the universality. A pragmatic symptom would be that hypertext links would have to contain the method as well as the URI in order to be able to address the new space, which people would soon want to do.
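The toy server below, using Python's standard library, is one way to illustrate the rule: the GET handler only reads application state, while the operation with a side effect is carried by another method. It is a sketch of the principle, not a recommended server design.

```python
# A minimal sketch of the rule that GET must be free of side effects:
# GET only reads state; operations with side effects use another method.
from http.server import BaseHTTPRequestHandler, HTTPServer

COUNTER = {"value": 0}  # toy application state

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Safe: dereferencing the URI does not change the resource.
        body = str(COUNTER["value"]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Side effect: the state changes, so GET must not be used for this.
        COUNTER["value"] += 1
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()
```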

The extension of HTTP to include an adaptive system for the proactive distribution of information as a function of real or anticipated need, and for the location of close copies, is a natural optimization of the current muddle of push and pull protocols (SMTP, NNTP, HTTP, and HTTP augmented by "channel" techniques). This is an area in which the answers are not trivial and research is quite appropriate. However, it is in the interests of everything which will be built on the Web to make the form of distribution protocols invisible wherever possible.

HTTP in fact combines a basic transport protocol with formats for a limited variety of "metadata", information about the payload of information. This is a historical inheritance from the SMTP world, and is an architectural feature which should be replaced by a clearer distinction between the basic HTTP functionality and a dramatically richer world of metadata.

(See old propagation activity statement)

Remote Operations

HTTP was originally designed as a protocol for remote operations on objects, with a flexible set of methods. The situation in which distributed object-oriented systems such as CORBA, DCOM and RMI exist with distinct functionality, and distinct from the Web address space, causes a certain tension, counter to the concept of a single space. The HTTP-NG activity investigates many aspects of the future development of the protocol, including a possible unification of the world of Remote Procedure Call (RPC) with existing Web protocols.

It is interesting to note that both HTTP and XML have come upon the problem of extensibility. The XML/RDF model for extensibility is general enough for what RPC needs, in my opinion, and I note that an RPC message is a special case of a structured document. To take the whole RPC system and represent it in the RDF model would be quite reasonable. Of course, a binary format (even if just compression) for XML would be required for efficient transmission. But the evolvability characteristics of RDF are just what RPC needs.
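To illustrate the observation that an RPC message is just a structured document, the following sketch serializes a hypothetical method call as XML; the element names and namespace are invented, and are not a proposed HTTP-NG or RDF vocabulary.

```python
# A hypothetical sketch: an RPC call expressed as an XML document.
# The vocabulary (element names, namespace) is invented for illustration.
import xml.etree.ElementTree as ET

NS = "http://example.org/1998/rpc-sketch"

call = ET.Element("{%s}call" % NS, {"method": "getQuote"})
arg = ET.SubElement(call, "{%s}argument" % NS, {"name": "symbol"})
arg.text = "W3C"

# The same document could be signed, stored, linked to, or transformed
# like any other document, which is the point of the generalization.
print(ET.tostring(call, encoding="unicode"))
```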

Level breaking: Messages and Documents.

There has been to date an artificial distinction between the transmission standards for "protocols" and "content". In the ever continuing quest for generalization and simplification, this is a distinction which cannot last. Therefore, new protocols should be defined in terms of the exchange of messages, where messages are XML, and indeed RDF, documents. The distinction has been partly historical, and partly useful, in that with protocols defined on top of "messages", which are in turn defined in order to transport "documents" (or whatever vocabulary), one avoids the confusing but illuminating recursion of protocols being defined in terms of messages exchanged by protocols defined in terms of other messages, and so on. In fact this recursion happens all the time and is important. Email messages contain email messages. Business protocols are enacted using documents which are put on the Web or sent by SMTP or HTTP using Internet messages. The observation that these are in fact the same (historically this almost led to HTTP messages being defined in SGML) leads to a need for generalization and a gain from the multiplicative power of combining the ideas. For example, regarding documents and messages as identical gives you the ability to sign messages where you could previously only sign documents, and to binary encode documents where you could only binary encode messages, and so on. What was level breaking becomes an architectural reorganization and generalization.

The ideal goal, then, for the HTTP-NG project would include:

(See the HTTP-NG activity statement, the HTTP-NG architecture note)

Where new protocols address ground which is covered by HTTP-NG, awareness and lack of duplication is obviously desirable.

Extension of access protocols

The ubiquity of HTTP, while not a design feature of the Web, which could exist with several schemes in concurrence, has proved a great boon. This sunny situation is clouded a little by the existence of the "https" space, which implies the use of HTTP through a Secure Socket Layer (SSL) tunnel. By making this distinction evident in the URI, users have to be aware of the secure and insecure forms of a document as separate things, rather than this being a case of negotiation in the process of dereferencing the same document. Whilst the community can suffer the occasional surfacing of that which should be hidden, it is not desirable as a precedent, as there are many other dimensions of negotiation (including language, privacy level, etc.) for which proliferation of access schemes is inappropriate.

Work at W3C on extension schemes for protocols has been undertaken for a long time and while not adopted in a wide-scale way in HTTP 1.1, currently takes the form of the Mandatory specification. Many features such as PICS or RTSP could have benefitted from this had it been defined in early HTTP versions.

(See the Mandatory Specification)

Extension of future protocols such as HTTP-NG is clearly an important issue, but hopefully the experience from the extensibility of data formats will provide tools powerful enough to be picked up directly and used by the HTTP-NG community in due course.

Data Formats

Format Negotiation

When the URI architecture is defined, and when one has the use of at least one dereferencing protocol, then all one needs for an interoperable global hypertext system is at least one common format for the content of a resource, or Web object.

The initial design of the Web assumed that there would continue to be a wild proliferation of proprietary data formats, and so HTTP was designed with a feature for negotiating common formats between client and server. Historically this was not used due to, on the one hand, the prevalence of HTML as a common format, and, on the other hand, the size of the list of known formats which a typical client had to send with each transaction.

As an architectural feature, this is still desirable. The Web is currently full of user awareness of data formats, and explicit user selection of data formats, which complicates it and hides the essential nature of the information.

The discussion of data formats should be seen in this light.
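As a reminder of what the negotiation feature looks like in practice, here is a rough sketch of choosing a response format from the q-values of an HTTP Accept header; the header value and the list of available formats are made up, and wildcard matching is ignored.

```python
# A rough sketch of server-side format negotiation: pick the best format
# the client says it accepts, using the q-values of an HTTP Accept header.
def parse_accept(header):
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        media_type = parts[0].strip()
        q = 1.0
        for p in parts[1:]:
            name, _, value = p.strip().partition("=")
            if name == "q":
                q = float(value)
        prefs.append((media_type, q))
    return prefs

def negotiate(accept_header, available):
    prefs = sorted(parse_accept(accept_header), key=lambda t: t[1], reverse=True)
    for media_type, q in prefs:
        if q > 0 and media_type in available:
            return media_type
    return None

print(negotiate("image/png;q=0.9, image/svg+xml, */*;q=0.1",
                ["image/svg+xml", "image/png"]))
# -> image/svg+xml
```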

MIME types

In HTTP, the format of data is defined by a "MIME type". This formally refers to a central registry kept by IANA. However, architecturally this is an unnecessary central point of control, and there is no reason why the Web itself should not be used as a repository for new types. Indeed, a transition plan, in which unqualified MIME types are taken as relative URIs within a standard reference URI in an online MIME registry, would allow migration of MIME types to become first class objects.

The adoption by the community of a tested common recommended data format would then be a question not of (central) registry but of (possibly subjective) endorsement.
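A minimal sketch of the transition idea follows; the registry base URI is hypothetical, since no such online registry address has been agreed.

```python
# A sketch of the transition idea: treat an unqualified MIME type as a
# relative URI within some agreed registry base URI.  The base URI here
# is hypothetical, not an actual registry address.
from urllib.parse import urljoin

REGISTRY_BASE = "http://example.org/mime-registry/"

def mime_type_as_uri(mime_type):
    # "text/html" becomes a first-class object with its own URI,
    # about which further (RDF) statements could then be made.
    return urljoin(REGISTRY_BASE, mime_type)

print(mime_type_as_uri("text/html"))
# -> http://example.org/mime-registry/text/html
```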

Currently the web architecture requires the syntax and semantics of the URI fragment identifier (the bit after the "#") to be a function of MIME type. This requires it to be defined with every MIME registration. This poses an unsolved problem when combined with format negotiation.

Common Syntax for Structured documents: XML

While HTML was, partly for political reasons, based upon the generic SGML language, the community has been quite aware that, although sharing a common syntax for structured documents was a good idea, something simpler than SGML was required. XML was the result.

(See the XML Activity Statement)

While in principle anyone is free to use any syntax in a new language, the evident advantages of sharing the syntax are so great that new languages should, where it is not overly damaging in other ways, be written in XML. Apart from the efficiency of sharing tools, parsers, and understanding, this also leverages the work which has been put into XML in the way of internationalization and extensibility.

Namespaces

The extensibility in XML is essential in order to crack a well-known tension in the software world between free but undefined extension, and the well-defined but awkward extension of the RPC world. An examination of the needs for evolution of technology in a distributed community of developers shows that the language must have certain features:

It must be possible to make documents in a mixture of languages (language mixing)

It must be possible to process a document understanding a subset of the vocabularies (partial understanding).

(See Evolvability Talk at WWW7, and design issues: Evolvability)
(See Note "Web architecture: extensible languages, )

These needs lie behind the evolution of data formats whether of essentially human-targeted or of machine-understandable (semantic) data.

When a new language is defined, XML should in general be used. When it is, the new language, or the new features extending an existing language, must be defined as a new namespace. (That is, new non-XML syntaxes, processing instructions, or tunnelling of functionality within other XML entities, etc., are inappropriate.) Even while schema languages are not yet available to describe the language, a URI must still be used to identify the language. Where the functionality being introduced maps onto a logical assertion model, the mapping onto the RDF model below should be defined, and, normally, RDF used.
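The following toy sketch illustrates language mixing and partial understanding: a document mixes two invented vocabularies, and a processor handles the one it knows while skipping the other.

```python
# A sketch of language mixing and partial understanding.  Both namespace
# URIs are invented for illustration.
import xml.etree.ElementTree as ET

DOC = """
<report xmlns:inv="http://example.org/ns/inventory"
        xmlns:rev="http://example.org/ns/review">
  <inv:item sku="1234">widget</inv:item>
  <rev:comment>looks fine</rev:comment>
</report>
"""

KNOWN = "{http://example.org/ns/inventory}"

root = ET.fromstring(DOC)
for element in root:
    if element.tag.startswith(KNOWN):
        print("understood:", element.tag, element.text)
    else:
        # Partial understanding: unknown vocabulary is skipped, not an error.
        print("skipped:", element.tag)
```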

Human Readable Information

By human readable information I mean documents in the traditional sense which are intended for human consumption. While these may be transformed, rendered, analysed and indexed by machine, the idea of them being understood is an artificial-intelligence-complete problem which I do not address as part of the Web architecture. When I talk about machine-understandable documents, therefore, I mean data which has explicitly been prepared for machine reasoning: part of a semantic web. (The use of the term "semantics" by the SGML community to distinguish content from form is unfortunately confusing and is not the sense used here.)

Separation of Form and Content

An architectural rule which the SGML community embraced is the separation of form and content. It is an essential part of Web architecture, making possible the independence of device mentioned above, and greatly aiding the processing and analysis. The addition of presentation information to HTML when it could be put into a style sheet breaks this rule. The rule applies to many specifications apart from HTML: in the Math Markup Language (MathML) two levels of the language exist, one having some connection with mathematical meaning, and the other simply indicating physical layout.

Graphics

The development of different languages for human readable documents can be relatively independent. So 2D graphic languages such as PNG and SVG are developed essentially independently of 3D languages such as VRML (handled not by W3C but by the VRMLC) and text languages such as HTML and MathML. Coordination is needed when aspects of style, fonts, color and internationalization are considered, where there should be a single common model for all languages.

PNG was introduced as a compact encoding which improved on GIF both technically (color, flexibility and transparency) and politically (lack of encumbrance). SVG is required as a common format in response to the large number of suggestions for an object oriented drawing XML language.

HTML

The value of a common document language has been so enormous that HTML has gained dominance on the Web, but it does not play a fundamental architectural role. Web applications are required to be able to process HTML, as it is the connective tissue of the Web, but it has no special place architecturally.

HTML has benefitted and suffered from the "ignore what you don't understand" rule of free extension. In future, the plan is to migrate HTML from being an SGML application to being defined as an XML namespace, making future extension a controlled matter of namespace definition. The first step is a specification for documents with precisely HTML 4.0 features but which are XML documents.

(See W3C Data Formats note)

Hypertext Link topology

A fundamental compromise which allows the Web to scale (but created the dangling link problem) was the architectural decision that links should be fundamentally mono-directional. Links initially had three parameters: the source (implicit when in the source document), destination and type. The third parameter, intended to add semantics, has not been heavily used, but one goal of the XLINK activity is to reintroduce it to add richness, especially to large sets of Web pages. Note however that the Resource Description Framework, introduced below, is a model based on an equivalent three-component assertion onto which a link maps directly, and so link relationships, like any other relation in future Web architecture, must be expressible in RDF. In this way, link relationships in HTML, and in future XML hypertext languages, should migrate to becoming first class objects.
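As a small illustration of that mapping, the sketch below restates typed links as three-part assertions and queries them like any other relation; the URIs and the link-type vocabulary are invented.

```python
# A sketch of mapping typed hypertext links onto three-part assertions
# (source, link type, destination), so link relationships become ordinary
# statements in the same model as other metadata.  URIs are hypothetical.
links = [
    ("http://example.org/overview.html",
     "http://example.org/terms#next",
     "http://example.org/details.html"),
    ("http://example.org/overview.html",
     "http://example.org/terms#author",
     "mailto:someone@example.org"),
]

# Once links are assertions, they can be queried like any other relation.
def objects(subject, link_type, triples):
    return [o for s, p, o in triples if s == subject and p == link_type]

print(objects("http://example.org/overview.html",
              "http://example.org/terms#author", links))
```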

XLINK will also define more sophisticated link topologies, and address the human interface questions related to them, still using the same URI space and using RDF as the defining language for relationship specification. (It may be appropriate for information based on the RDF model to be actually transferred in a different syntax for some reason, but the specification must define the mapping, so that common processing engines can be used to process combinations of such information with other information in the RDF model.)

Style Sheets

The principle of modular design implies that when form and content are separated the choice of languages for each, if possible, be made an independent choice. HTML has dominated the text markup (content) languages, but the introduction of XML opens the door for the use of new XML markup languages between parties which share them. (See the Style activity at W3C)

Collaboration

The original idea of the Web being a creative space for people to work together in ("intercreative") seems to be making very slow progress.

See W3C Collaboration Workshop

This field is very broad and can be divided into areas:

  1. Asynchronous collaboration tools
  2. Integration of real-time audio/video collaboration and the Web (integration of video in HTML, co-presence)
  3. Group editors (synchronous hypertext editors, whiteboards etc)
  4. Asynchronous distributed editing. (Amaya, Jigsaw, Jigedit, WebDAV)

A precursor to much collaborative work is the establishment of an environment with sufficient confidentiality to allow trust among its members. Therefore the Consortium's work on a semantic web of trust, addressed below, may be a gating factor for much of the above.

Many of the above areas are research areas, and some are areas in which products exist. It is not clear that there is a demand among W3C members to address common specifications in this area right now, but suggestions are welcome. The Consortium tries to use whatever web-based collaborative techniques are available, including distributed editing of documents in the web, and automatic change tracking. The Live Early Adoption and Demonstration (LEAD) philosophy of W3C was introduced specifically for areas like this, where many small pieces need to be put together to make it happen, but one will never know how large any remaining problems are until one tries. Still, largely, this section in the architecture is left as a place-holder for later expansion. It may not be the time yet, but collaborative tools are a requirement for the Web, and the work is not done until a framework for them exists.

Machine-Understandable information: Semantic Web

The Semantic Web is a web of data, in some ways like a global database. The rationale for creating such an infrastructure is given elsewhere [Web future talks &c]; here I only outline the architecture as I see it.

See:

When looking at a possible formulation of a universal Web of semantic assertions, the principle of minimalist design requires that it be based on a common model of great generality. Only when the common model is general can any prospective application be mapped onto the model. The general model is the Resource Description Framework.

Semantic Web: the pieces.

The architecture of RDF and the semantic web built on it is a plan but not yet all a reality. There are various pieces of the puzzle which seem to fall into a chronological order, although the turn of events may change that. (Links below are into the Semantic Web roadmap)

  1. The basic assertion model provides the concepts of assertion (property) and quotation. (This is provided by the RDF Model and Syntax Specification)
  2. The schema language provides data typing and allows document structure to be constrained to allow predictable computable processing.
  3. A conversion language allows the expression of inference rules allowing information in one schema to be inferred from a document in another.
  4. The logical layer turns a limited declarative language into a Turing-complete logical language, with inference and functions. This is powerful enough to be able to define all the rest, and allow any two RDF applications to be connected together. However, without being profiled for use, it does not address specific applications.
  5. A proof language is a form of RDF which allows one agent to send to another an assertion, together with the inference path to that assertion from assumptions acceptable to the receiver. This allows applications such as access control to use a generic validation engine as the kernel, with very case-specific tools for producing proofs of access according to whatever social rules have been devised for the case.
  6. An evolution rules language allows inference rules to be given which allow a machine with a certain algorithm to convert documents from one RDF schema into another (a minimal sketch follows this list). This is a fundamental key to the evolution of the technology.
  7. Query languages assume different forms of query engine. One can imagine standardizing both certain query engines and a language for defining query engines.
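The following toy sketch, using invented schema URIs, shows the flavour of such a conversion rule: data published with one schema's property is re-expressed with another's, so a processor written for the second schema can use it.

```python
# A toy sketch of an evolution/conversion rule: triples using one
# (hypothetical) schema's "fullName" property are re-expressed using
# another schema's "name" property.
SCHEMA_A_NAME = "http://example.org/schemaA#fullName"
SCHEMA_B_NAME = "http://example.org/schemaB#name"

def convert(triples):
    converted = []
    for s, p, o in triples:
        if p == SCHEMA_A_NAME:
            converted.append((s, SCHEMA_B_NAME, o))  # apply the rule
        else:
            converted.append((s, p, o))              # pass everything else through
    return converted

data = [("http://example.org/people/42", SCHEMA_A_NAME, "Ada Lovelace")]
print(convert(data))
```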

Once one has a proof language, then the introduction of digital signature turns what was a web of reason into a web of trust. The development of digital signature functionality in the RDF world can in principle happen in parallel with the stages above, as more expressive logical languages become available, but it requires that the logical layer be defined as a basis for defining the new primitives which describe signature and inference in a world which includes digital signature.

A single digital signature format for XML documents is important. The power of the RDF logical layers will allow existing certificate schemes to be converted into RDF, and a trust path to be verified by a generic RDF engine.
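As a much-reduced sketch of the idea, the following computes and checks a keyed digest over a document's bytes; a real XML signature format would also have to define canonicalization, key handling and how the signature is carried, none of which is shown here.

```python
# A very reduced sketch of signing an XML document: a keyed digest is
# computed over the document's bytes and checked by the receiver.
import hashlib
import hmac

document = b'<statement xmlns="http://example.org/ns/trust">W3C hosts www.w3.org</statement>'
secret_key = b"demo-key-not-for-real-use"

signature = hmac.new(secret_key, document, hashlib.sha256).hexdigest()
print(signature)

# The receiver recomputes the digest over the received bytes and compares.
assert hmac.compare_digest(
    signature, hmac.new(secret_key, document, hashlib.sha256).hexdigest())
```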

Metadata applications

The driver for the semantic web at level 1 above is information about information, normally known as metadata. The following areas are examples of technologies which should use RDF, and which are, or which we expect will be, developed within the W3C.

This is by no means an exclusive list. Any technology which involves information about web resources should express it according to the RDF model. The plan is that HTML LINK relationships be transitioned into RDF properties. We can continue with examples for which RDF is clearly appropriate.
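A small sketch of that transition, with an invented document URI and markup: LINK elements are extracted and restated as (document, relationship, target) assertions.

```python
# A sketch of restating HTML LINK relationships as three-part assertions.
# The document URI and markup are made up.
from html.parser import HTMLParser

DOC_URI = "http://example.org/overview.html"
HTML = '<html><head><link rel="stylesheet" href="style.css">' \
       '<link rel="author" href="mailto:someone@example.org"></head></html>'

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            a = dict(attrs)
            if "rel" in a and "href" in a:
                self.triples.append((DOC_URI, a["rel"], a["href"]))

parser = LinkExtractor()
parser.feed(HTML)
print(parser.triples)
```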

Indexes of terms

Given a worldwide semantic web of assertions, the search engine technology currently (1998) applied to HTML pages will presumably translate directly into indexes not of words, but of RDF objects. This itself will allow much more efficient searching of the Web as though it were one giant database, rather than one giant book.

The Version A to Version B translation requirement has now been met, and so when two databases exist, for example as large arrays of (probably virtual) RDF files, then even though the initial schemas may not have been the same, a retrospective documentation of their equivalence would allow a search engine to satisfy queries by searching across both databases.

Engines of the Future

While search engines which index HTML pages find many answers to searches and cover a huge part of the Web, they return many inappropriate answers. There is no notion of "correctness" to such searches. By contrast, logical engines have typically been able to restrict their output to provably correct answers, but have suffered from the inability to rummage through the mass of intertwined data to construct valid answers. The combinatorial explosion of possibilities to be traced has been quite intractable. However, the scale upon which search engines have been successful may force us to reexamine our assumptions here. If an engine of the future combines a reasoning engine with a search engine, it may be able to get the best of both worlds, and actually be able to construct proofs in a certain number of cases of very real impact. It will be able to reach out to indexes which contain very complete lists of all occurrences of a given term, and then use logic to weed out all but those which can be of use in solving the given problem. So while nothing will make the combinatorial explosion go away, many real life problems can be solved using just a few (say two) steps of inference out on the wild Web, the rest of the reasoning being in a realm in which proofs are given, or in which there are constraints and well-understood computable algorithms. I also expect a strong commercial incentive to develop engines and algorithms which will efficiently tackle specific types of problem. This may involve making caches of intermediate results, much like the search engines' indexes of today.
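A toy sketch of that combination, on invented data: an index narrows the candidates for a term, and a single lookup standing in for an inference step keeps only those for which the required assertion holds.

```python
# A toy sketch of combining a search-engine index with a reasoning step.
# All data here is invented.
INDEX = {  # term -> resources mentioning it
    "editor": ["http://example.org/a", "http://example.org/b"],
}
TRIPLES = [
    ("http://example.org/a", "http://example.org/terms#role", "editor"),
    ("http://example.org/b", "http://example.org/terms#role", "reviewer"),
]

def provable(subject, prop, value):
    # One step of "inference": a plain lookup standing in for a proof.
    return (subject, prop, value) in TRIPLES

candidates = INDEX["editor"]                       # broad but cheap
answers = [c for c in candidates
           if provable(c, "http://example.org/terms#role", "editor")]
print(answers)  # ['http://example.org/a']
```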

Though there will still not be a machine which can guarantee to answer arbitrary questions, the power to answer real questions which are the stuff of our daily lives and especially of commerce may be quite remarkable.


References

B. Carpenter, Editor: "Architectural Principles of the Internet", Internet Architecture Board, June 1996, RFC 1958


Notes: uncompleted bits

Suggestions from DC: Recent history .... what would happen. Is there an architecture .. only tradition ... what happens if you write it down. Modularity, Minimal constraint, Downhill step principle, Tolerance, ....

Information loss: we have people putting information into the computer and the computer losing it for them. Diskettes ....



Up to Design Issues