DRAFT: Web Architecture Document

8 May 2002

This version: http://www.w3.org/2001/tag/2002/0508-intro
Previous version: http://www.w3.org/2001/tag/2002/0330-intro
Editor: Ian Jacobs, W3C

Abstract

This document presents a view of World Wide Web Architecture, from the perspective of W3C's Technical Architecture Group (TAG). This introduction has two purposes: to give the reader a general sense of what the TAG means by World Wide Web Architecture, and to call out some of the principles regarded as fundamental to the success of the Web.

Status of this document

This document has been developed for discussion by the W3C Technical Architecture Group.

This document is the work of the editor. It incorporates the work of the other TAG participants by sewing together written pieces and discussions. It is a draft with no official standing. It does not necessarily represent the consensus opinion of the TAG.

Comments may be directed to the W3C TAG mailing list www-tag@w3.org (archive).

Publication of this document by W3C indicates no endorsement by W3C.

Introduction
Chapter 1: What does a URI mean?
Chapter 2: What does a representation mean?
Chapter 3: What goes on under the hood?
Chapter 4: What does a message mean?
Glossary
References

Introduction

On its surface, the Web can be viewed as a simple abstraction, consisting of resources and "Uniform Resource Identifiers" (URIs). A resource can be anything -- a book, a dream, a car, a family photo album -- the Web puts no limits on what we can share. URIs allow us to refer to resources in a way that machines can process reliably. Chapter one discusses what can be known from a URI, and some important properties of URIs generally. Note: In this document, when we talk about what a "machine can know," we are not talking about artificial intelligence, but rather encoding information according to technical specifications, and writing programs that conform to those specifications.

Some resources are documents. A document is an abstract thing, but can have a meaning. Not all resources are documents. Some, like telnet addresses or email addresses, are important network resources but do not have meaning. Other resources, like cars, are concrete, and others, like the kilogram, are abstract, but in neither case can they be called documents. However, documents may be about any resource: a web page can be about a car, for example.

A document itself may need to identify many things. A graphic document has many objects, a hypertext document many link end points, a semantic web document may be about many abstract things. These things can be identified by combining the URI of the document with the identifier within the document, and the result is called a URI reference.
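The combination described above can be sketched with Python's standard urllib.parse module. The document URI and the in-document identifier below are hypothetical values chosen only for illustration; the sketch shows one way to build a URI reference and to split it back into its two parts.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical document URI and in-document identifier (illustration only).
document_uri = "http://example.org/report"
fragment_id = "section2"

# Combining the document URI with the identifier yields a URI reference.
uri_reference = urljoin(document_uri, "#" + fragment_id)

# Splitting the reference recovers the document URI and the identifier.
base, fragment = urldefrag(uri_reference)
```

Here uri_reference names a thing within the document, while base still names the document as a whole.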

While chapter one discusses URIs in general, chapter two is primarily about documents, and how they are interpreted.

Just as we do in the real world, on the Web we distinguish a document (which is abstract) from a representation of the document. A movie is not the same as the set of bits on a DVD, a song is not the same as the bits of a particular recording, the Ford Model-T profile is not the same as a particular engineering drawing. We populate the Web with abstract documents, but we represent them on the Web in a manner that machines can process, using data formats such as XML, HTML, PNG, style sheets, etc. Chapter two discusses what can be known about resources through representations of documents.

Under the hood, the Web teems with activity in the form of messages exchanged over a global network. Part of the World Wide Web Architecture ("Web Architecture" from here on) is intended to maintain the illusion of a stable Web, hiding from the user the noise of these low-level messages. For instance, the owner of a document can set expectations about the persistence of any given representation of it, ranging from "will never change" to "will not change for a week" to "will change one minute from now". Web protocols have been optimized to make use of this and other information about the document so that caching is possible. Caching not only promotes stability in the face of network failures, it improves performance generally and thus promotes confidence in the Web. Chapter three discusses some aspects of Web protocols that take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.
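As a rough illustration of how such an expectation can be made machine-processable, the sketch below reads the max-age directive of an HTTP/1.1 Cache-Control header value and decides whether a stored representation may still be reused. The function names are hypothetical, and the logic is deliberately simplified: a real cache also has to consider other directives and headers defined in [RFC2616].

```python
from typing import Optional


def parse_max_age(cache_control: str) -> Optional[int]:
    """Return the max-age value (in seconds) of a Cache-Control header value, or None."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def is_fresh(age_seconds: int, cache_control: str) -> bool:
    """Simplified freshness check: fresh while the stored copy is younger than max-age."""
    max_age = parse_max_age(cache_control)
    return max_age is not None and age_seconds < max_age


# "Will not change for a week" could be expressed as max-age=604800 (7 * 24 * 3600 seconds).
one_week = "public, max-age=604800"
```

With this sketch, a copy stored one hour ago under one_week is still fresh, while a copy with no max-age directive is never treated as fresh.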

While chapter three is about messages intentionally hidden from the user's view, in chapter four we consider those messages that we wish to rise to the surface, to be part of the user's Web experience, and how this affects the Web Architecture. While in chapter two we describe the interpretation of document representations, in chapter four we describe the interpretation of messages.

Web Architecture alone does not guarantee that the Web will work or be useful: social behavior is as important. For instance, URIs that are stable (i.e., that continue to designate the same resource) add value to the Web. Resource owners should take pains to ensure that links don't break when it can be avoided, but the Web Architecture itself must allow links to break, otherwise the Web would not be able to scale. To keep the Web flexible, there should be as few architectural rules as possible that must be obeyed in order for it to work. However, there have to be some rules to ensure that people and computers can communicate at all (interoperate). And there have to be some expectations about good social behavior. We discuss both in the remainder of this document.

Chapter 1: What does a URI mean?

In this chapter, we discuss URIs and their relation to the resources they identify.

RFC 2396 [RFC2396] defines the generic syntax of URIs, what absolute and relative URIs are, the meaning of fragment identifiers (the string that follows the "#" character), what characters can appear in a URI, and more. We do not delve into these subjects in the current document. We do define the essential concepts, and the way that the URI specification delegates to other specifications the task of defining further what a URI stands for.

RFC 2396 introduces the terms URL and URN, subclasses of URIs, and attempts to explain a (somewhat confusing and vague) distinction between them. These distinctions are not made in this document, and the term URI is used throughout.

The scheme determines everything about a resource

The URI specification's primary job is to point to the specification which defines what the URI points to for a given scheme. This is done through a registry of URI schemes. Here we do not go into detail for every scheme, but we point out some commonality, and look at the HTTP specification, as one of the architecturally richest ones, in more detail.

A great power of the URI architecture is the diversity of properties of URIs in different schemes. This allows systems which use URIs to identify things great flexibility to operate under different and new assumptions and conditions.

Owners control resources

Within many schemes, each URI on the Web has an owner. The owner of a document is not necessarily the owner of any resource the document may be about; I can create a web page that is about "Dan's car" even if I don't own Dan's car.

For example, DNS-based schemes use the DNS delegation of ownership; for a UUID, anyone who creates one owns it. By contrast, other URI schemes (e.g., md5) have no concept of resource owner. In this case, the URI scheme alone determines the semantics of the URI's relationship to the resource.

The owner of an HTTP document determines everything about the resource. The following are some properties of resources or URIs determined by the owner of an HTTP scheme document:

How a document is represented on the Web. This, through the specifications of standard languages, determines what it can be taken to mean.
Whether the document will change over time. Dereferencing a URI yields a representation of the current value of the identified document. Since some documents can change over time, representations may vary over time as well. In the language of [Fielding], a resource is a "time-varying conceptual mapping to a (possibly empty) set of entities or values which are equivalent." Resources that do not change over time form an interesting class, whose properties may be exploited by protocols described in chapter 3.
Which URIs refer to identical resources. The HTTP specification itself implies that certain pairs of resources will always be identical, such as those whose URIs differ only in the case of the domain name. A web site owner can allocate many URIs to the same thing. On the web, these resources will be identical. However, in general, it is an error to presume, just by looking at two URIs, that they do or do not refer to the same resource, unless the specifications or metadata received give one reason to.

Information about URI canonicalization is given in section 6 of RFC 2396. Even so, systems should use the same stringwise URI to refer to the same resource. Don't rely on others doing URI canonicalization.
Whether a URI will continue to refer to a resource over time. Nothing prevents a resource owner from changing which URI refers to a resource, or from reusing a URI to refer to a different resource. However, it is good social behavior for resource owners to continue to use the same URIs to refer to the same resources.
Dereferenceability, reusability of URIs (e.g., message identifiers), etc.
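The point about identical resources can be sketched as follows. The function names are hypothetical, and the sketch covers only the one equivalence mentioned above (URIs that differ only in the case of the domain name); it assumes URIs without user information in the authority part. A result of False means only that equivalence is not known from the spelling, not that the resources differ.

```python
from urllib.parse import urlsplit


def lowercase_host(uri: str) -> str:
    """Lower-case the authority part of a URI, leaving path, query and fragment untouched."""
    parts = urlsplit(uri)
    # Assumption: no userinfo in the authority, so lower-casing it whole is safe.
    return parts._replace(netloc=parts.netloc.lower()).geturl()


def known_equivalent(uri_a: str, uri_b: str) -> bool:
    """True only when the two URIs differ at most in the case of the domain name."""
    return lowercase_host(uri_a) == lowercase_host(uri_b)
```

For example, known_equivalent("http://WWW.Example.ORG/Path", "http://www.example.org/Path") is True, whereas two URIs whose paths differ in case are not reported as equivalent, since nothing in the sketch gives a reason to presume so.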

Here is an example of a URI/resource policy at W3C. W3C technical reports undergo frequent revision, so it is useful to be able to consult a specific version in the revision history and also to be able to find quickly the latest revision. W3C assigns a unique URI to each publication (for archival references), and also employs a "latest version URI" to designate the latest published draft. As a result of this policy, two URIs ("this version" and "the latest version") refer to the same resource at any point in time. With each publication, the latest URI designates the new resource, and is paired with a new archival URI.

In many cases, it is important for users to understand the policies associated with a particular resource, including the authority's commitment to ensuring that a given URI will continue to refer to that resource over time. Policies can be expressed in a variety of human-readable and machine-readable ways. W3C's Platform for Privacy Preferences (P3P) [P3P10] is an example of a specification that allows resource owners to describe privacy practices associated with their Web sites in a standard way that can be retrieved automatically and interpreted easily by user agents.

Communication between Web agent and resource owner about the meaning of a resource and associated policies takes place via the protocols described in chapter 3. Since meaning comes from this exchange, not from how the URI is spelled, the resource may evolve without requiring changes to deployed URIs. This makes the Web more robust.

What can you know about a resource by looking at a URI?

URIs identify. When the same URI is used in two places, it refers to the same thing. That is the most fundamental property. Any other properties depend on the scheme specification.

Each absolute URI starts with a scheme name (e.g., http, fax, ftp, gopher, and mailto; see Dan Connolly's list of URI schemes [URI Schemes]). The URI scheme determines, among other things, how to go about finding a representation of the resource (how to "dereference" the URI). URI scheme semantics are defined in specifications, such as the HTTP/1.1 specification [RFC2616].
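A minimal sketch of this first step, using Python's standard urllib.parse module: extract the scheme name, then look up the specification that defines it. The lookup table is a hypothetical stand-in for the registry of URI schemes and lists only two entries for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical, partial stand-in for the registry of URI schemes.
SCHEME_SPECIFICATIONS = {
    "http": "RFC 2616",
    "mailto": "RFC 2368",
}


def scheme_of(uri: str) -> str:
    """Return the scheme name of an absolute URI, in lower case."""
    return urlsplit(uri).scheme


def defining_specification(uri: str) -> Optional[str]:
    """Return the specification listed for the URI's scheme, or None if it is not in the table."""
    return SCHEME_SPECIFICATIONS.get(scheme_of(uri))
```

Everything beyond the scheme name is then interpreted according to the specification found, not according to the table itself.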

There are times when a URI is useful even without dereferencing it to learn more about a resource (e.g., namespace URIs defined in "Namespaces in XML" [XMLNS]). In general, however, when a resource owner assigns a URI to a resource and the URI scheme allows dereferencing, the resource owner should make available useful representations (subject to confidentiality requirements, etc.).

A general property of URIs, therefore, is that they are "opaque", which means that it is an error to make any assumptions about a resource based on the spelling of a URI that refers to it, other than what is defined in RFC 2396 or through the scheme specification. In particular, it is an error to assume that a URI that happens to end with the string ".html" refers to a resource that has an HTML representation.
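In code, respecting opacity means taking the format of a representation from what the server declares rather than from the spelling of the URI. The sketch below assumes an HTTP response whose Content-Type header value is already available as a string; the function name and the example URIs are hypothetical.

```python
def media_type(uri: str, content_type_header: str) -> str:
    """Return the media type declared for a representation.

    The uri argument is deliberately unused: the spelling of a URI,
    including a trailing ".html", says nothing definitive about the format.
    """
    return content_type_header.split(";", 1)[0].strip().lower()


# A URI ending in ".html" may perfectly well yield a PNG representation.
declared = media_type("http://example.org/picture.html", "image/png")
```

The design choice is that only the declared Content-Type is consulted, so a URI can keep its spelling even if the owner later changes the formats on offer.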

At times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case, good social behavior requires that the URI be easy to use. But in general, URIs should be used but not seen: they should be hidden from view, since they are ugly to look at and they tend to lure us into thinking they hold definitive meaning about a resource. URIs merely identify.

Properties common to all URIs

All URIs share some common properties, such as being opaque. A particular URI scheme may add additional properties to that class of URIs. For instance, the part that follows "http://" in an http URI is information about the authority governing the resource referenced by the URI. Some URI schemes do not include information about an authority.

The following are some properties common to all URIs.

URIs are Uniform

In any context that allows a URI, any URI may be used. It is an error to say that only URIs of a specific URI scheme are allowed in a certain context.

Uniformity "...allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ; it allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers; it allows introduction of new types of resource identifiers without interfering with the way that existing identifiers are used; and, it allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a pre-existing, large and widely-used set of resource identifiers." [RFC2396]

URIs are Universal

URIs may be used to identify any identifiable thing, anywhere. See [Axioms] for more information.

@@Resources not referenced by URIs@@

In general, we consider that something is "on the Web" if it can be referenced by a URI. @@However@@...

Searching
RDF

@@[RFC2396] is clear that "A resource can be anything that has identity". This is not a closed definition. Are there more things that can be regarded as resources than just those with assigned URIs (or URI references)? RDF provides the ability to describe resources by their relationship to one another, which leads to the notion of existentially quantified identifiers. For example, there exists a person whose Internet mailbox is identified by the URI mailto:timbl@w3.org. This identifies the person of Tim Berners-Lee by reference to the URI of his Internet mailbox without it being necessary to assign a URI to identify the concept of the person Tim Berners-Lee. When the RDF information is drawn as a circles and arrows diagram, these nodes do not have URIs. They are referred to as blank nodes in the RDF specification.

When a resource is identified indirectly in such a way, there is a loss in that there is no identifier which can be referenced to find out some information about it. The web works best when identifiers can be dereferenced.
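The mailbox example can be sketched with plain Python objects rather than an RDF library. The BlankNode class and the property URI http://example.org/terms#mailbox are hypothetical names introduced only for illustration; the point is that the person is reachable solely through a relationship to a URI, never through a URI of their own.

```python
class BlankNode:
    """A node that has identity within one graph but no URI of its own."""


# "There exists a person whose Internet mailbox is mailto:timbl@w3.org."
person = BlankNode()
MAILBOX = "http://example.org/terms#mailbox"  # hypothetical property URI
triples = [
    (person, MAILBOX, "mailto:timbl@w3.org"),
]


def find_subjects(graph, predicate, obj):
    """Return every subject related to obj by predicate."""
    return [s for (s, p, o) in graph if p == predicate and o == obj]


found = find_subjects(triples, MAILBOX, "mailto:timbl@w3.org")
```

The node in found can be described further inside this graph, but there is no identifier that another document could dereference to learn about it, which is the loss noted above.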

Chapter 2: What does a document mean?

People communicate ideas using words, facial expressions, gestures, pictures, sounds, and more. Understanding what another person means requires substantial shared context, linguistic and cultural. On the Web, people share knowledge by understanding the same thing from web documents. Computers represent documents to people through data formats such as XML-based formats, HTML, PNG, etc. In this chapter, we are not concerned with how people understand ideas (whether expressed via the Web, by telephone, in person, etc.). Rather, we are interested in a much simpler problem: how can we build useful data formats, and what can a machine do with these formats. In this chapter, "meaning" refers to what can be known by a machine through protocols and formats. The meaning of a human-readable document is transferred through the correct interpretation of the representation by a computer for presentation to a person. The meaning of semantic web data is transferred by the ability of a computer to manipulate the information consistently with the specification of the data format.

The only guaranteed way to interpret a given representation is to follow the specifications that define that representation. The meaning of an HTML document is exactly what the HTML specification says it is. Any other interpretation, in general, may lead to error.

How to interpret a representation

To interpret a representation properly, note that it consists of an Internet Media Type (defined in RFC2046 [RFC2046]) and a set of bits. The Internet Media Type identifies a specification of the data format, just as the URI prefix identifies the specification of the scheme. This specification may include references to one or more additional specifications that define further interpretation. In general, the process is a recursive one: consume the representation according to one or more specifications, which themselves may refer normatively to other specifications.

Here is an example of interpreting an XHTML 1.0 [XHTML1] document by starting with Media Type 'application/xhtml+xml' (defined in [RFC3236]):

RFC 3236 explains how to start interpreting the representation:

...it should suffice for now for the purposes of interoperability that user agents accepting 'application/xhtml+xml' content use the user agent conformance rules in [XHTML1].

This means go read the XHTML 1.0 specification!
XHTML 1.0 is (1) an XML application and (2) inherits the semantics of its elements and attributes from HTML 4.0 [HTML40]. Therefore, the meaning of an XHTML 1.0 document depends on the XHTML 1.0 specification and, in turn, on the XML 1.0 and HTML 4.0 specifications.
These specifications refer normatively to the character set defined by Unicode, and so forth.

In general, the initial Media Type is delivered along with the string of bits for the requested resource to form a complete representation.
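The recursive process in the example above can be sketched as a walk over a table of normative references, starting from the Media Type. The table below is a simplified, hypothetical encoding of the XHTML 1.0 example only; the real specifications refer to more documents than are listed here.

```python
# Simplified encoding of the example: each name maps to what it refers to normatively.
NORMATIVE_REFERENCES = {
    "application/xhtml+xml": ["RFC3236"],
    "RFC3236": ["XHTML1"],
    "XHTML1": ["XML10", "HTML40"],
    "XML10": ["Unicode"],
    "HTML40": ["Unicode"],
    "Unicode": [],
}


def specification_chain(start: str) -> list:
    """Return every specification reachable from start, in depth-first order."""
    seen = []

    def visit(name: str) -> None:
        if name in seen:
            return  # already collected; also guards against circular references
        seen.append(name)
        for reference in NORMATIVE_REFERENCES.get(name, []):
            visit(reference)

    visit(start)
    return seen


chain = specification_chain("application/xhtml+xml")
```

The resulting list is the set of specifications an agent would need to honor in order to interpret the representation under these assumptions.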

Breaking the chain

When a document is published on the web or sent by email, a statement is made by the author to the reader. The specifications chain together as defined above to determine the interpretation of the statement. By using the Internet, both author and reader consent to be ruled by these conventions of communication, so that there cannot be argument that the meaning of a document is arbitrary or undefined.

Important exceptions allow breaks in the chain of specifications. When a document is sent as an email attachment, it has no direct meaning except any attributed to it by the covering letter. In this case MIME multipart, and in other cases other specifications, allow documents to be referred to without being implied as a part of the communication itself.

XML and the Interpretation of mixed namespaces

Looking specifically at XML documents, those with MIME types containing /xml or +xml, the meaning of the document is defined by the specification of the outermost element of the document. The interpretation of the inner elements can only be done if authorized by the specification of the outermost element.
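A minimal sketch of the first step, using Python's standard xml.etree.ElementTree module: find the namespace of the outermost element, since that is what identifies the specification governing the whole document. The function name and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def document_element_namespace(xml_text: str) -> Optional[str]:
    """Return the namespace URI of the outermost element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a qualified name as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


# An XHTML document that embeds an element from another namespace.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<svg xmlns="http://www.w3.org/2000/svg"/>'
    '</body></html>'
)
outermost = document_element_namespace(sample)
```

Only the outermost namespace is reported; whether and how the inner svg element is interpreted is left to the specification that namespace identifies.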

When an XML document has many namespaces, the interpretation of inner elements

@@@ Add: useful to create a MIME type for a new XML namespace to be used for a document element, as allows visibility. @@

Glossary

MIME
Resource
Uniform Resource Identifier (URI)

References

Normative References

RFC2396: IETF "RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax", T. Berners-Lee, R. Fielding, L. Masinter, August 1998. Available at http://www.ietf.org/rfc/rfc2396.
RFC2616: IETF "RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1", R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, June 1999. Available at http://www.ietf.org/rfc/rfc2616.
RFC2046: IETF "RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", N. Freed, N. Borenstein, November 1996. Available at http://www.ietf.org/rfc/rfc2046.

Non-Normative References

Axioms: "Universal Resource Identifiers - Axioms of Web Architecture", T. Berners-Lee, living document dated December 1996. Available at http://www.w3.org/DesignIssues/Axioms.
Cool: "Cool URIs don't change", T. Berners-Lee, W3C, 1998. Available at http://www.w3.org/Provider/Style/URI.
Fielding: "Principled Design of the Modern Web Architecture", R.T. Fielding and R.N. Taylor, UC Irvine. Available in PDF at http://www.cs.virginia.edu/~cs650/assignments/papers/p407-fielding.pdf.
Fragments: "Fragment Identifiers on URIs", T. Berners-Lee, living document dated April 1997. Available at http://www.w3.org/DesignIssues/Fragment.
HTML40: "HTML 4.01 Specification", D. Raggett, A. Le Hors, I. Jacobs, 24 December 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-html401-19991224.
P3P10: "The Platform for Privacy Preferences 1.0 (P3P1.0) Specification", M. Marchiori, ed., 16 April 2002. This W3C Recommendation is available at http://www.w3.org/TR/2002/REC-P3P-20020416/.
RFC3236: IETF "RFC 3236: The 'application/xhtml+xml' Media Type", M. Baker, P. Stark, January 2002. Available at: http://www.rfc-editor.org/rfc/rfc3236.
URI Schemes: Dan Connolly's list of URI schemes is a useful resource for finding out which references define various URI schemes.
XHTML1: "XHTML 1.0: The Extensible HyperText Markup Language: A Reformulation of HTML 4 in XML 1.0", S. Pemberton et al., 26 January 2000. The latest version of this W3C Recommendation is available at http://www.w3.org/TR/xhtml1.
XML10: "Extensible Markup Language (XML) 1.0 (Second Edition)", T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, 6 October 2000. This W3C Recommendation is available at http://www.w3.org/TR/2000/REC-xml-20001006.
XMLNS: "Namespaces in XML", T. Bray, D. Hollander, A. Layman, 14 Jan 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-xml-names-19990114.

To Do

Include examples
Include tips (e.g., how to avoid breaking a link; reference to HTTP).
Include links to issues

Ian Jacobs
Last modified $Date: 2002/05/16 00:11:15 $