Copyright © 2002 W3C ® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
This document presents a view of World Wide Web Architecture, from the perspective of W3C's Technical Architecture Group (TAG). This introduction has two purposes: to give the reader a general sense of what the TAG means by World Wide Web Architecture, and to call out some of the principles regarded as fundamental to the success of the Web.
This document has been superseded. See next version.
This document has been developed for discussion by the W3C Technical Architecture Group.
This document is the work of the editor. It incorporates the work of the other TAG participants by sewing together written pieces and discussions. It is a draft with no official standing. It does not necessarily represent the consensus opinion of the TAG. The "@@" symbols are to warn the reader that the content is unstable or incomplete or may be incorrect.
Once this document has undergone substantial revision, the TAG expects to develop it on the W3C Recommendation track.
Comments may be directed to the W3C TAG mailing list www-tag@w3.org (archive).
Publication of this document by W3C indicates no endorsement by W3C.
The World Wide Web ("Web" from here on) is a networked information system consisting of clients, servers and other agents that interchange information. Web Architecture is the set of rules that all agents in the system follow that result in the large-scale effect of a shared information space.
This architecture consists of:
The rules are kept to a minimum, leaving the functionality the Web can deliver open to the imagination of its developers.
Identification is an important aspect of communication. On the Web, we identify resources with URI references. A resource can be anything.
1. All important resources SHOULD be identifiable by URI reference.
The URI specification [RFC2396] represents a worldwide agreement on who can create identifiers and how they take on meaning in protocols and formats.
The syntax of a URI reference consists of:
A relative URI is a syntactic abbreviation for an absolute URI.
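As a minimal sketch of this abbreviation, the following Python fragment expands several relative references against a base URI using the standard library's urljoin. The base URI and the helper name resolve are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base URI, chosen only for illustration.
BASE = "http://www.example.org/TR/2002/draft/intro.html"


def resolve(reference: str, base: str = BASE) -> str:
    """Expand a relative URI reference into the absolute URI it abbreviates."""
    return urljoin(base, reference)


if __name__ == "__main__":
    for reference in ["section2.html", "../final/", "/Consortium/", "#abstract"]:
        print(f"{reference:15} -> {resolve(reference)}")
```

The same relative reference expands to a different absolute URI under a different base, which is why a relative reference has no meaning without its base.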
A number of identification mechanisms pre-date the Web, such as those for electronic mailboxes and ftp documents. URIs were designed to incorporate these existing naming schemes ('ftp', 'mailto', etc.) and a new scheme designed specially for the Web: 'http'.
A URI scheme defines the properties of URIs in that scheme. The IANA registry [IANASchemes] lists URI schemes and the specifications that define them. For instance, the "http" URI scheme is defined in section 3.2.2 of the HTTP specification [RFC2616].
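The scheme is the first component of an absolute URI, and it determines which specification governs the rest of the string. A small sketch, using Python's standard urlsplit and made-up example URIs (the helper name scheme_of is an assumption of this example), shows how an agent might read it:

```python
from urllib.parse import urlsplit


def scheme_of(uri: str) -> str:
    """Return the scheme component of an absolute URI, in lower case."""
    return urlsplit(uri).scheme


# Illustrative URIs only; none is taken from the specifications cited above.
EXAMPLES = [
    "http://www.example.org/TR/",
    "mailto:someone@example.org",
    "ftp://ftp.example.org/pub/file.txt",
    "urn:isbn:0451450523",
]

if __name__ == "__main__":
    for uri in EXAMPLES:
        print(f"{scheme_of(uri):7} {uri}")
```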
Some important properties vary by URI scheme, including the following:
Editor's Note: Roy Fielding version of earlier sentence: "An HTTP URI identifies an abstraction for which there is a time-varying conceptual mapping to a (possibly empty) set of representations that are equivalent."
As mentioned above, URI schemes may have different persistence properties. There are strong social expectations that once a URI reference is assigned to identify a particular resource, it should continue indefinitely to refer to that same resource. Persistence is usually a matter of policy and commitment on the part of authorities assigning URI references rather than a constraint imposed by technological means.
For example, W3C assigns a URI reference for each W3C technical report and "[makes] every effort to make archival documents indefinitely available at their original address in their original form." ([W3CPROCESS], chapter 5). W3C also assigns a URI reference to the "latest" publication in a series of related publications (e.g., all versions of the SVG 1.0 specification). These are two resources: a particular specification and the latest version of a specification. For the former, W3C's persistence policy is that representation(s) will not change over time. For the latter, W3C's persistence policy is that representations will change over time, with each new publication in the series; the changes are predictable.
For more ideas on persistence policies, see "Cool URIs Don't Change" [Cool].
URI schemes, Media Types, and the DNS piece of the http URI scheme illustrate some of the costs and benefits of central registries.
Similarly, standardized names:
In general, to promote scalability, Web architecture should avoid centralized registries. There are exceptions (e.g., DNS may be acceptable). On the other hand, the TAG finding "Mapping between URIs and Internet Media Types" promotes the idea of using the Web as a repository for new Media Types. [TAG issue uriMediaType-9]
To "dereference a URI" means to request a representation of the resource designated by the URI. The dereference mechanism varies according to URI scheme and should be defined by each scheme (see "Guidelines for new URL Schemes" [RFC2718]). The dereference mechanism for the "http" scheme is GET [TAG issue whenToUseGet-7].
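For the "http" scheme, dereferencing can be sketched as building a GET request for the URI. The fragment below only constructs the request with Python's standard urllib.request and does not send it; the URI, the Accept value and the helper name dereference_request are assumptions made for illustration.

```python
from urllib.request import Request


def dereference_request(uri: str, accept: str = "text/html") -> Request:
    """Build (but do not send) the GET request that asks for a representation."""
    return Request(uri, headers={"Accept": accept}, method="GET")


if __name__ == "__main__":
    # Hypothetical URI; no network traffic occurs until the request is opened.
    request = dereference_request("http://www.example.org/TR/")
    print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the retrieval and return one representation of the resource.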
2. Agents SHOULD be able to dereference URI references for important resources. [TAG issue namespaceDocument-8]
3. Dereferencing a URI for an important abstract concept (for example, Internet protocol parameters) SHOULD return human and/or machine readable representations that describe the nature and purpose of those resources. [TAG issue namespaceDocument-8]
4. Dereferencing URIs is safe; i.e. agents do not incur obligations by following links. [TAG finding "URIs, Addressability, and the use of HTTP GET"]
Please refer to the TAG finding "URIs, Addressability, and the use of HTTP GET" for information about safe operations and using HTTP GET for addressability.
URIs that can be dereferenced can end with a fragment identifier (to form a URI reference). Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation. For instance, if the representation is an HTML document, the fragment identifier designates a hypertext anchor. In the case of a graphics format, a URI reference might designate a circle or spline. In the case of an RDF document, a URI reference can designate anything, be it abstract (e.g., a dream) or concrete (e.g., my car). The plain text media type does not define semantics for fragment identifiers.
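The dependence on media type can be sketched as follows. The agent separates the fragment identifier from the URI, retrieves a representation of the URI, and only then interprets the fragment according to the media type it received. The lookup table paraphrases the examples in the paragraph above and is not a registry; the media type strings, the example URI and the helper name interpret are assumptions of this sketch.

```python
from urllib.parse import urldefrag

# Illustrative table paraphrasing the text above; None means "no semantics defined".
FRAGMENT_SEMANTICS = {
    "text/html": "hypertext anchor",
    "application/rdf+xml": "anything, abstract or concrete",
    "text/plain": None,
}


def interpret(uri_reference: str, media_type: str):
    """Split a URI reference and look up what its fragment means for a media type."""
    uri, fragment = urldefrag(uri_reference)
    return uri, fragment, FRAGMENT_SEMANTICS.get(media_type)


if __name__ == "__main__":
    reference = "http://www.example.org/doc#section2"  # hypothetical
    for media_type in FRAGMENT_SEMANTICS:
        print(media_type, interpret(reference, media_type))
```

The same URI reference yields a different interpretation for each media type, which illustrates why content negotiation across types with incompatible fragment semantics is hazardous.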
Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.
New access protocols should provide a means to convert fragment identifiers according to media type.
Section 1.2 of [RFC2396] explains that URIs can be further classified:
A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable.
RFC 2141 [RFC2141] defines the "urn" URI scheme. In practice, URNs cannot be dereferenced. URI persistence (the goal of the urn scheme) is primarily a social issue. URIs of other schemes (including http) can also be managed to meet the goal of persistence, and can be dereferenced.
@@Ideas:@@
@@Note: See proposed Text from David Orchard.@@
Under the hood, the Web teems with activity in the form of messages exchanged over a global network. Part of the World Wide Web Architecture ("Web Architecture" from here on) is intended to maintain the illusion of a stable Web, hiding from the user the noise of these low-level messages. This chapter discusses some aspects of Web protocols that take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.
Relevant issues, findings:
Do not make assumptions about a resource based on the spelling of a URI that refers to it (other than what is defined in specifications for the URI scheme). Since URIs are opaque, it is an error to assume, for example, that a URI that happens to end with the string ".html" refers to a resource that has an HTML representation. Though people must not infer anything about the nature of a resource representation from a URI ending in ".html", resource owners must not create confusion by purposely misassigning suffixes and representation types.
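A short sketch of the difference, using only Python's standard library: the first function guesses a media type from the spelling of the URI (the practice warned against above), while the second reads the media type actually declared with the representation. The URI, the header value and both function names are hypothetical.

```python
import mimetypes
from email.message import Message


def guess_from_spelling(uri: str):
    """Infer a type from the URI suffix; shown only to illustrate the error."""
    guessed, _encoding = mimetypes.guess_type(uri)
    return guessed


def declared_media_type(content_type_header: str) -> str:
    """Read the declared media type, ignoring parameters such as charset."""
    message = Message()
    message["Content-Type"] = content_type_header
    return message.get_content_type()


if __name__ == "__main__":
    # Hypothetical resource whose URI ends in ".html" but is served as plain text.
    uri = "http://www.example.org/reports/2002.html"
    header = "text/plain; charset=utf-8"
    print("guessed from spelling:", guess_from_spelling(uri))
    print("declared by server:   ", declared_media_type(header))
```

An agent that trusts the suffix would treat this representation as HTML; an agent that respects opacity relies on the declared type.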
At times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case good social behavior requires that the URI be easy to use. But in general, just as "children should be seen but not heard", URIs should be used but not seen: they are ugly to look at, and they tend to lure us into thinking they hold definitive meaning about a resource.
Note on canonical form of URIs: Although section 6 of RFC 2396 describes URI canonicalization, those using URIs should use the same string-wise URI consistently to refer to the same resource. Don't rely on others to canonicalize a URI.
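To illustrate why consistency matters, the sketch below compares URIs the way a cautious consumer might: character by character, with no canonicalization. The four spellings are hypothetical; a canonicalizing agent might treat some or all of them as equivalent, but a consumer that compares strings sees four different identifiers.

```python
def same_identifier(a: str, b: str) -> bool:
    """Compare two URIs string-wise, performing no canonicalization."""
    return a == b


# Hypothetical spellings that differ in host case, default port, and escaping.
SPELLINGS = [
    "http://www.example.org/~user/index",
    "http://WWW.EXAMPLE.ORG/~user/index",
    "http://www.example.org:80/~user/index",
    "http://www.example.org/%7Euser/index",
]

if __name__ == "__main__":
    print(len(set(SPELLINGS)), "distinct identifiers under string comparison")
```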
Authors should not use a URI reference to identify more than one resource. In particular, authors should not use a URI reference to identify both a document and what the document is about, or both a person and that person's mailbox.