DRAFT: Introduction to Web Architecture

This document presents a view of World Wide Web Architecture, from the perspective of W3C's Technical Architecture Group (TAG). This introduction has two purposes: to give the reader a general sense of what the TAG means by World Wide Web Architecture, and to call out some of the principles regarded as fundamental to the success of the Web.

Status of this document

This document has been developed for discussion by the W3C Technical Architecture Group.

This document is the work of the editor. It incorporates the work of the other TAG participants by sewing together written pieces and discussions. It is a draft with no official standing. It does not necessarily represent the consensus opinion of the TAG. The "@@" symbols are to warn the reader that the content is unstable or incomplete or may be incorrect.

Once this document has undergone substantial revision, the TAG expects to develop it on the W3C Recommendation track.

Introduction

On its surface, the Web can be viewed as a simple abstraction, consisting of resources and "Uniform Resource Identifiers" (URIs). A resource can be anything -- a book, a dream, a car, a family photo album -- the Web puts no limits on what we can share. URIs allow us to refer to resources in a way that machines can process reliably. Chapter one discusses what can be known from a URI, and some important properties of URIs generally.

Some resources are documents. A document is an abstract thing, but can have a meaning. Some resources are not documents (e.g., cars, and other concrete resources; the kilogram, and other abstract resources). Some other network resources are important but don't have meaning, such as telnet addresses or email addresses. A document may have any resource as its subject: a Web page can be dedicated to a car, for example.

A document may need to identify many things. A graphic document consists of many objects, a hypertext document many link end points, a Semantic Web document may be about many abstract things. These things can be identified by combining the URI of the document with the identifier within the document, and the result is called a URI reference.
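The combination described above can be sketched in a few lines of Python. This is only an illustration using the standard library's urllib.parse module; the document URI under example.org and the identifier "section2" are hypothetical.

```python
from urllib.parse import urldefrag


def make_reference(document_uri, local_id):
    """Combine a document URI with an identifier within the document."""
    return document_uri + "#" + local_id


def split_reference(uri_reference):
    """Split a URI reference into (document URI, identifier within the document)."""
    document_uri, local_id = urldefrag(uri_reference)
    return document_uri, local_id


# Hypothetical document and hypothetical identifier inside it.
reference = make_reference("http://example.org/report", "section2")
document_uri, local_id = split_reference(reference)
```

Splitting the reference again recovers the two parts, which is what lets software fetch the document first and then locate the identified thing inside it.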

While chapter one discusses URIs in general, chapter two is primarily about documents, and how they are interpreted.

On the Web as in the real world, we distinguish a resource from a representation of the resource. A movie is not the same as a copy of the movie on a DVD, a song is not the same as a particular artist's recording, the Ford Model-T is not the same as the first one off the assembly line. We populate the Web with resources, but we represent them on the Web in a manner that machines can process, using data formats such as XML, HTML, PNG, style sheets, etc. Chapter two discusses what can be known about a resource through a representation.

Under the hood, the Web teems with activity in the form of messages exchanged over a global network. Part of the World Wide Web Architecture ("Web Architecture" from here on) is intended to maintain the illusion of a stable Web, hiding from the user the noise of these low-level messages. For instance, the owner of a resource can set expectations about the persistence of that resource, ranging from "will never change" to "will not change for a week" to "will change one minute from now". Web protocols have been optimized to make use of this and other information about the resource so that caching is possible. Caching not only promotes stability in the face of network failures, it improves performance generally and thus promotes confidence in the Web. Chapter three discusses some aspects of Web protocols that take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.
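As a rough sketch of how such expectations make caching possible, the toy cache below reuses a stored representation until the lifetime stated by the resource owner runs out. It is not the HTTP caching model; the class name, the fetch callable, and the lifetime in seconds are all assumptions made for illustration.

```python
import time


class FreshnessCache:
    """Toy cache: reuse a representation until its stated lifetime expires."""

    def __init__(self, fetch, clock=time.monotonic):
        # fetch is a hypothetical callable: uri -> (representation, lifetime_seconds)
        self._fetch = fetch
        self._clock = clock
        self._entries = {}  # uri -> (representation, expires_at)

    def get(self, uri):
        now = self._clock()
        entry = self._entries.get(uri)
        if entry is not None and now < entry[1]:
            # Still fresh: answer locally, no network message is exchanged.
            return entry[0]
        # Missing or stale: ask the origin again and remember its stated lifetime.
        representation, lifetime = self._fetch(uri)
        self._entries[uri] = (representation, now + lifetime)
        return representation
```

A resource whose owner promises "will not change for a week" would be fetched once and then served locally for that week, including during a network failure.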

While chapter three is about messages intentionally hidden from the user's view, in chapter four, we consider those messages that we wish to rise to the surface, to be part of the user's Web experience, and how this affects the Web Architecture. While in chapter two we describe the interpretation of resource representations, in chapter four we describe the interpretation of messages.

Web Architecture consists of both technology and social behavior. For instance, technical specifications define URI syntax, but it is up to people to ensure that resources evolve in a predictable manner, that links don't break, etc. To keep the Web flexible, there should be as few architectural rules as possible that must be obeyed in order for it to work. However, there have to be some rules to ensure that people and computers can communicate at all (interoperate). And there have to be some expectations about good social behavior. We discuss both in the remainder of this document.

Chapter 1: What does a URI mean?

The primary component of the Web Architecture is the Uniform Resource Identifier (URI), a compact string of characters for identifying an abstract or physical resource. The common symbol set and the semantic interpretation of these symbols are defined in RFC 2396 [RFC2396]. How well a URI supports human communication depends on how the URI is used. A URI that refers to a stable resource or one that evolves predictably will be much more useful (and thus "meaningful") than one that refers to wildly changing information.

RFC 2396 defines the generic syntax of URIs, what absolute and relative URIs are, the meaning of fragment identifiers (the string that follows the "#" character), what characters can appear in a URI, and more. We do not delve into these subjects in the current document.

The URI scheme determines everything about a resource

RFC 2396 allows for a variety of URI schemes to identify resources: http, fax, ftp, gopher, md5, uuid, mailto, and others; see Dan Connolly's list of URI schemes [URI Schemes]. URI scheme semantics are defined in specifications, such as the HTTP/1.1 specification [RFC2616].

A great power of the URI architecture is the diversity of properties of URIs in different schemes. This gives systems that use URIs to identify things great flexibility to operate under different and new assumptions and conditions. In this document, we do not go into detail for every scheme, but we point out some commonality, and look in more detail at the HTTP specification, as one of the architecturally richest ones.

General properties of URIs

Specific properties of URI schemes

For instance, the http scheme represents an agreement that anyone who owns a piece of the DNS space can create new URIs. Generally, the entity that has authority over a URI establishes policies for it and the resource it references (such as whether and how frequently a resource changes over time, whether two URIs designate the same resource, etc.).

In many cases, it is important for users to understand the policies associated with a particular resource, including the authority's commitment to the stable evolution of a resource. W3C's Platform for Privacy Preferences (P3P) [P3P10] is an example of a specification that allows resource owners to describe privacy practices associated with their Web sites in a standard way that can be retrieved automatically and interpreted easily by user agents.

FAQ: What can you know about a resource by looking at a URI?

The short answer is: almost nothing, except the "URI scheme".

Since URIs are opaque, it is an error to assume, for example, that a URI that happens to end with the string ".html" refers to a resource that has an HTML representation. Though people must not infer anything about the nature of a resource representation from a URI ending in ".html", resource owners must not create confusion by purposely misassigning suffixes and representation types.
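A minimal sketch of this rule in Python: the only thing read from the URI itself is its scheme, while the representation type is taken from what the protocol delivered. The headers dictionary is a hypothetical stand-in for the header fields of a response, and the example.org URI is invented.

```python
from email.message import Message
from urllib.parse import urlsplit


def scheme_of(uri):
    """The one thing that can be read off a URI: its scheme."""
    return urlsplit(uri).scheme


def media_type_of(headers):
    """Return the media type declared by the protocol, or None if none was given.

    The URI suffix is deliberately never consulted.
    """
    value = headers.get("Content-Type")
    if value is None:
        return None
    message = Message()
    message["Content-Type"] = value
    return message.get_content_type()


# Hypothetical URI ending in ".html" whose representation is in fact a PNG image.
uri = "http://example.org/logo.html"
declared = media_type_of({"Content-Type": "image/png"})
```

Here the suffix suggests HTML, yet the declared type is image/png, and the declared type is the one to follow.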

At times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case, good social behavior requires that the URI be easy to use. But in general, just as "children should be seen but not heard", URIs should be used but not seen. In general, URIs should be hidden from view since they are ugly to look at and they tend to lure us into thinking they hold definitive meaning about a resource.

Note on canonical form of URIs: Although section 6 of RFC 2396 describes URI canonicalization, those using URIs should use the same string-wise URI consistently to refer to the same resource. Don't rely on others to canonicalize a URI.
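The following sketch shows why consistency matters. It assumes software that compares URIs as plain strings, and the four spellings (all under the hypothetical host example.org) are ones that HTTP software would commonly treat as equivalent after canonicalization.

```python
# Four spellings that many HTTP implementations would treat as equivalent
# after canonicalization (an assumption; example.org is hypothetical).
SPELLINGS = [
    "http://example.org/a",
    "HTTP://EXAMPLE.ORG/a",     # different case in scheme and host
    "http://example.org:80/a",  # explicit default port
    "http://example.org/%61",   # "a" written as a percent-escape
]


def same_identifier(a, b):
    """Compare two URIs string-wise, with no canonicalization."""
    return a == b


def distinct_identifiers(uris):
    """Return the set of URIs as seen by software that compares strings."""
    return set(uris)
```

Software that compares strings sees four different identifiers here, so links, caches, and statements made with one spelling do not automatically apply to the others.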

@@Resources not referenced by URIs@@

@@[RFC2396] is clear that "A resource can be anything that has identity". This is not a closed definition. Are there more things that can be regarded as resources than just those with assigned URIs (or URI references)? RDF provides the ability to describe resources by their relationship to one another, which leads to the notion of existentially quantified identifiers. For example, there exists a person whose Internet mailbox is identified by the URI mailto:timbl@w3.org. This identifies the person of Tim Berners-Lee by reference to the URI of his Internet mailbox without it being necessary to assign a URI to identify the concept of the person Tim Berners-Lee. When the RDF information is drawn as a circles and arrows diagram, these nodes do not have URIs. They are referred to as blank nodes in the RDF specification.

When a resource is identified indirectly in such a way, there is a loss in that there is no identifier which can be referenced to find out some information about it. The Web works best when identifiers can be dereferenced.
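The mailbox example can be sketched with plain Python tuples standing in for RDF statements. This is not an RDF library; the BlankNode class and the property URI under example.org are hypothetical names chosen for illustration.

```python
import itertools

_labels = itertools.count(1)

# Hypothetical property URI meaning "has Internet mailbox".
MAILBOX = "http://example.org/terms#mailbox"


class BlankNode:
    """A node with no URI; its label is local to this graph only."""

    def __init__(self):
        self.label = "_:b%d" % next(_labels)

    def __repr__(self):
        return self.label


def who_has_mailbox(triples, mailbox_uri):
    """Find the subjects described as having the given mailbox."""
    return [s for (s, p, o) in triples if p == MAILBOX and o == mailbox_uri]


# "There exists a person whose mailbox is mailto:timbl@w3.org."
person = BlankNode()
triples = [(person, MAILBOX, "mailto:timbl@w3.org")]
found = who_has_mailbox(triples, "mailto:timbl@w3.org")
```

The person can be found through the mailbox, but the node carries no URI of its own, so there is nothing to dereference in order to learn more about it.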

Chapter 2: What does a resource mean?

The Web is an agreement that resources are interpreted according to specifications, for:

The guarantee that the Web offers is that it is correct to interpret a resource representation by following the specifications that define it. Any other interpretation is, in general, incorrect.

There may be special cases when one can "scrape" useful information from a piece of content even without following the relevant specifications. @@Talk about under what conditions it is ok to interpret content "out of context"?@@.

In practice, a resource representation is interpreted by starting with an Internet Media Type (defined in RFC2046 [RFC2046]); the initial Media Type is generally delivered by protocol with the representation. Generally, interpretation proceeds recursively. A piece of content may create a new context that requires a different specification for proper interpretation.
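The recursive process described above can be sketched as a dispatch on media type. The handler table, the function names, and the "application/x-example-envelope" type are hypothetical; a real system would follow the specification registered for each media type.

```python
import json


def interpret(media_type, body, handlers):
    """Interpret content by the specification registered for its media type."""
    handler = handlers.get(media_type)
    if handler is None:
        raise ValueError("no specification known for " + media_type)
    # Handlers receive a callback so that nested content can establish a new
    # context and be interpreted under a different specification.
    return handler(body, lambda inner_type, inner: interpret(inner_type, inner, handlers))


def plain_text(body, recurse):
    return body


def envelope(body, recurse):
    # Hypothetical wrapper format: a JSON object naming the type of its content.
    inner = json.loads(body)
    return recurse(inner["type"], inner["content"])


HANDLERS = {
    "text/plain": plain_text,
    "application/x-example-envelope": envelope,
}
```

Interpretation starts from the media type delivered with the representation and follows whatever new contexts the content itself creates; content of an unknown type is rejected rather than guessed at.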

Breaking the chain

When a document is published on the Web or sent by email, a statement is made by the author to the reader. The specifications chain together as defined above to determine the interpretation of the statement. By using the Internet, both author and reader consent to be ruled by these conventions of communication, so that there cannot be argument that the meaning of a document is arbitrary or undefined.

Important exceptions allow breaks in the chain of specifications. When a document is sent as an email attachment, it has no direct meaning except any attributed to it by the covering letter. In this case MIME multipart, and in other cases other specifications, allow documents to be referred to without being implied to be part of the communication itself.
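The email case can be illustrated with Python's standard email package. The subject, the wording of the covering letter, and the file name are hypothetical.

```python
from email.message import EmailMessage


def build_message():
    """Build a covering letter with a document attached to it."""
    message = EmailMessage()
    message["Subject"] = "Draft for review"
    # The covering letter is the statement actually made by the sender.
    message.set_content("Please review the attached draft. I do not endorse it yet.")
    # The attachment is carried along and referred to, not asserted.
    message.add_attachment("<p>draft text</p>", subtype="html", filename="draft.html")
    return message


message = build_message()
covering_letter = message.get_body(preferencelist=("plain",))
attachments = list(message.iter_attachments())
```

The multipart structure keeps the covering letter and the attached document as separate parts, so a reader (or program) can tell what is being said from what is merely being passed along.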

XML and the Interpretation of mixed namespaces

Looking specifically at XML documents (i.e., those with MIME types containing /xml or +xml), the meaning of the document is defined by the specification of the outermost element of the document. The interpretation of the inner elements can only be done if authorized by the specification of the outermost element.
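As a small sketch, the outermost element and its namespace can be found with Python's standard xml.etree.ElementTree module. The sample document, an XHTML page containing an SVG element, is invented for illustration.

```python
import xml.etree.ElementTree as ET


def document_element(xml_text):
    """Return (namespace, local name) of the outermost element."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        # ElementTree writes namespaced tags as "{namespace}localname".
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = None, root.tag
    return namespace, local_name


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<svg xmlns="http://www.w3.org/2000/svg"/>'
    "</body></html>"
)
```

For this sample the outermost element is html in the XHTML namespace, so it is the XHTML specification that decides whether and how the inner svg element is to be interpreted.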

@@When an XML document has many namespaces, the interpretation of inner elements ...@@

@@@ Add: useful to create a MIME type for a new XML namespace to be used for a document element, as allows visibility. @@

Chapter 3: What goes on under the hood?

don't know yet

@@Here is an example of a URI/resource policy at W3C. W3C technical reports undergo frequent revision, so it is useful to be able to consult a specific version in the revision history and also to be able to find quickly the latest revision. W3C assigns a unique URI to each publication (for archival references), and also employs a "latest version URI" to designate the latest published draft. As a result of this policy, two URIs ("this version" and "the latest version") refer to the same resource at any point in time. With each publication, the latest URI designates the new resource, and is paired with a new archival URI.@@
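The policy in the note above can be modelled roughly as follows. The class and the URIs under example.org are hypothetical and do not reflect how W3C actually implements its publication system.

```python
class ReportSeries:
    """Toy model: one archival URI per publication plus a 'latest version' URI."""

    def __init__(self, latest_uri):
        self.latest_uri = latest_uri
        self._versions = []  # list of (archival_uri, content), oldest first

    def publish(self, archival_uri, content):
        """Record a new publication; the latest URI now designates it."""
        self._versions.append((archival_uri, content))

    def resolve(self, uri):
        """Return the content designated by an archival or latest URI."""
        if uri == self.latest_uri and self._versions:
            return self._versions[-1][1]
        for archival_uri, content in self._versions:
            if archival_uri == uri:
                return content
        raise KeyError(uri)


series = ReportSeries("http://example.org/TR/webarch")
series.publish("http://example.org/TR/2002/webarch-20020508", "first draft")
series.publish("http://example.org/TR/2002/webarch-20020601", "second draft")
```

After each publication the latest URI and the newest archival URI resolve to the same content, while older archival URIs keep designating the drafts they were assigned to.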

DRAFT: Web Architecture Document

8 May 2002

Abstract

Status of this document

Table of Contents

Introduction

Chapter 1: What does a URI mean?

The URI scheme determines everything about a resource

General properties of URIs

Specific properties of URI schemes

FAQ: What can you know about a resource by looking at a URI?

@@Resources not referenced by URIs@@

Chapter 2: What does a resource mean?

Breaking the chain

XML and the Interpretation of mixed namespaces

Chapter 3: What goes on under the hood?

don't know yet

Chapter 4: What does a message mean?

Glossary

References

Normative References

Non-Normative References

To Do