DRAFT: Web Architecture Document

This version: http://www.w3.org/2001/tag/2002/0607-intro
Superseded by: http://www.w3.org/2001/tag/2002/0701-intro
Previous version: http://www.w3.org/2001/tag/2002/0508-intro
Editor: Ian Jacobs, W3C

Abstract

This document presents a view of World Wide Web Architecture, from the perspective of W3C's Technical Architecture Group (TAG). This introduction has two purposes: to give the reader a general sense of what the TAG means by World Wide Web Architecture, and to call out some of the principles regarded as fundamental to the success of the Web.

Status of this document

This document has been superseded. See next version.

This document has been developed for discussion by the W3C Technical Architecture Group.

This document is the work of the editor. It incorporates the work of the other TAG participants by sewing together written pieces and discussions. It is a draft with no official standing. It does not necessarily represent the consensus opinion of the TAG. The "@@" symbols are to warn the reader that the content is unstable or incomplete or may be incorrect.

Once this document has undergone substantial revision, the TAG expects to develop it on the W3C Recommendation track.

Comments may be directed to the W3C TAG mailing list www-tag@w3.org (archive).

Publication of this document by W3C indicates no endorsement by W3C.

Introduction
Chapter 1: Identifiers
Chapter 2: Formats
Chapter 3: Protocols
Chapter 4: What does a message mean?
Appendix 1: Tips on URIs
Glossary
References

Introduction

The World Wide Web ("Web" from here on) is a networked information system consisting of clients, servers and other agents that interchange information. Web Architecture is the set of rules that all agents in the system follow, resulting in the large-scale effect of a shared information space.

This architecture consists of:

Identifiers. A single specification of the way in which objects in the system are identified: the Uniform Resource Identifier (URI) [RFC2396].
Formats. Specifications of a nonexclusive set of data formats designed for interchange between agents in the system. This includes several formats used in isolation or in combinations (e.g., XHTML, PNG, XLink, RDF, SMIL animation, Ruby), as well as technologies for designing new formats (XML, XML namespaces, DOM).
Protocols. Specifications of a small and nonexclusive set of protocols for interchanging information between agents, including HTTP [RFC2616], SMTP and others. Several of these protocols share a reliance on the Internet Media Type (or, "MIME") metadata/packaging system [RFC2046].

The rules are kept to a minimum, leaving the functionality the Web can deliver open to the imagination of its developers.

Chapter 1: Identifiers

Identification is an important aspect of communication. On the Web, we identify resources with URI references. A resource can be anything.

1. All important resources SHOULD be identifiable by URI reference.

The URI specification [RFC2396] represents a worldwide agreement on who can create identifiers and how they take on meaning in protocols and formats.

1.1 Generic URI reference syntax

The syntax of a URI reference consists of:

a unique URI (relative or absolute), optionally followed by
a fragment identifier ("#" and what follows).

A relative URI is a syntactic abbreviation for an absolute URI.
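
The two-part syntax above can be illustrated with a minimal sketch. It assumes Python's standard urllib.parse module; the base URI and the relative reference are hypothetical values chosen only for illustration.

```python
from typing import Tuple
from urllib.parse import urldefrag, urljoin

# Hypothetical base URI and URI reference, not taken from this document.
BASE = "http://www.example.org/docs/intro.html"
REFERENCE = "../specs/uri.html#syntax"


def resolve_reference(base: str, reference: str) -> Tuple[str, str]:
    """Return (absolute URI, fragment identifier) for a URI reference."""
    # A relative URI is an abbreviation: expand it against the base URI.
    absolute = urljoin(base, reference)
    # Split off the optional fragment identifier ("#" and what follows).
    uri, fragment = urldefrag(absolute)
    return uri, fragment


if __name__ == "__main__":
    print(resolve_reference(BASE, REFERENCE))
```

A reference that consists only of a fragment identifier (such as "#top") resolves to the base URI itself plus that fragment.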

1.2 URI Schemes

A number of identification mechanisms pre-date the Web, such as those for electronic mailboxes and ftp documents. URIs were designed to incorporate these existing naming schemes ('ftp', 'mailto', etc.) and a new scheme designed specially for the Web: 'http'.

A URI scheme defines the properties of URIs in that scheme. The IANA registry [IANASchemes] lists URI schemes and the specifications that define them. For instance, the "http" URI scheme is defined in section 3.2.2 of the HTTP specification [RFC2616].

Some important properties vary by URI scheme, including the following:

The sort of resource identified. For instance, a "telnet" URI identifies a telnet service and a "mailto" URI identifies a mailbox. An HTTP URI identifies a document -- something that can be entirely conveyed in bits -- for which there is a mapping to a set of equivalent representations (see the equivalence class property at the end of this list). The mapping from HTTP URI to document may vary over time, and the set of (equivalent) representations may be empty.
Editor's Note: Roy Fielding version of earlier sentence: "An HTTP URI identifies an abstraction for which there is a time-varying conceptual mapping to a (possibly empty) set of representations that are equivalent."
Whether representations of the resource can change over time (i.e., whether resources can be "living resources"). [TAG issue httpRange-14]
Deployed dereference mechanisms, if any. [TAG issue whenToUseGet-7]
The persistence of the identity relationship. Persistence involves two properties:
1. Can the same URI, over time, be used to identify a different resource?
2. Will the dereference mechanism continue to work over time?
The social governance of the identity relationship. For instance, in the "http" scheme, ICANN has authority over DNS.
The equivalence class of representations for a given URI. For instance, for the "md5" scheme, the equivalence class is bitwise equivalence. For "http" URIs, the equivalence class is defined by the naming authority. [TAG issue URIEquivalence-15]
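
As a rough illustration of the first property in this list, the sketch below looks up the sort of resource from the scheme of a URI. The table holds only the three examples given above, as this draft describes them; the function name and the fallback string are assumptions made for the example, and a real agent would consult the specification registered for each scheme.

```python
from urllib.parse import urlsplit

# Only the examples named in this section; not a complete registry.
RESOURCE_KIND = {
    "telnet": "telnet service",
    "mailto": "mailbox",
    "http": "document",
}


def kind_of_resource(uri: str) -> str:
    """Return the sort of resource the URI's scheme is said to identify."""
    # urlsplit lowercases the scheme, so "HTTP:" and "http:" are treated alike.
    scheme = urlsplit(uri).scheme
    return RESOURCE_KIND.get(scheme, "see the specification for this scheme")


if __name__ == "__main__":
    for example in ("mailto:www-tag@w3.org", "http://www.w3.org/2001/tag/"):
        print(example, "->", kind_of_resource(example))
```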

1.2.1 Cool URIs don't change

As mentioned above, URI schemes may have different persistence properties. There are strong social expectations that once a URI reference is assigned to identify a particular resource, it should continue indefinitely to refer to that same resource. Persistence is usually a matter of policy and commitment on the part of authorities assigning URI references rather than a constraint imposed by technological means.

For example, W3C assigns a URI reference for each W3C technical report and "[makes] every effort to make archival documents indefinitely available at their original address in their original form." ([W3CPROCESS], chapter 5). W3C also assigns a URI reference to the "latest" publication in a series of related publications (e.g., all versions of the SVG 1.0 specification). These are two resources: a particular specification and the latest version of a specification. For the former, W3C's persistence policy is that representation(s) will not change over time. For the latter, W3C's persistence policy is that representations will change over time, with each new publication in the series; the changes are predictable.

For more ideas on persistence policies, see "Cool URIs Don't Change" [Cool].

1.2.2 The economics of names

URI schemes, Media Types, and the DNS piece of the http URI scheme illustrate some of the costs and benefits of central registries.

Registries concentrate power, increase the need for trust and fairness, and raise the stakes when that trust is abused. Who should be the authority that governs a world registry? Central registries can also be single points of failure.
Administrative hierarchies (such as DNS) do scale well for lookup requests.

Similarly, standardized names:

Can make life easier;
May require more time to deploy (which some might consider problematic and others advantageous);
Can lead to battles for control of scarce resources. Short, easy-to-remember names are more valuable than random numbers.

In general, to promote scalability, Web architecture should avoid centralized registries. There are exceptions (e.g., DNS may be acceptable). On the other hand, the TAG finding "Mapping between URIs and Internet Media Types" promotes the idea of using the Web as a repository for new Media Types. [TAG issue uriMediaType-9]

1.3 Dereferencing a URI

To "dereference a URI" means to request a representation of the resource designated by the URI. The dereference mechanism varies according to URI scheme and should be defined by each scheme (see "Guidelines for new URL Schemes" [RFC2718]). The dereference mechanism for the "http" scheme is GET [TAG issue whenToUseGet-7].
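
For the "http" scheme, a dereference can be sketched as preparing a GET request. The example below assumes Python's standard urllib.request module and a hypothetical URI; it only builds the request and does not send it. The fragment identifier, if any, is set aside because it is interpreted against the retrieved representation rather than by the server.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def build_dereference_request(uri_reference: str, accept: str = "*/*") -> Request:
    """Prepare (but do not send) a GET request for an http URI reference."""
    # Drop the fragment identifier; it applies to the representation
    # that comes back, not to the request itself.
    uri, _fragment = urldefrag(uri_reference)
    # GET is the dereference mechanism for the "http" scheme.
    return Request(uri, headers={"Accept": accept}, method="GET")


if __name__ == "__main__":
    # Hypothetical URI reference, used only for illustration.
    request = build_dereference_request(
        "http://www.example.org/report#summary", accept="text/html"
    )
    print(request.get_method(), request.full_url)
```

Sending the prepared request (for instance with urllib.request.urlopen) requires network access and is left out of the sketch.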

2. Agents SHOULD be able to dereference URI references for important resources. [TAG issue namespaceDocument-8]

3. Dereferencing a URI for an important abstract concept (for example, Internet protocol parameters) SHOULD return human and/or machine readable representations that describe the nature and purpose of those resources. [TAG issue namespaceDocument-8]

4. Dereferencing URIs is safe; i.e. agents do not incur obligations by following links. [TAG finding "URIs, Addressability, and the use of HTTP GET"]

Please refer to the TAG finding "URIs, Addressability, and the use of HTTP GET" for information about safe operations and using HTTP GET for addressability.

1.4 Fragment identifiers

URIs that can be dereferenced can end with a fragment identifier (to form a URI reference). Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation. For instance, if the representation is an HTML document, the fragment identifier designates a hypertext anchor. In the case of a graphics format, a URI reference might designate a circle or spline. In the case of an RDF document, a URI reference can designate anything, be it abstract (e.g., a dream) or concrete (e.g., my car). The plain text media type does not define semantics for fragment identifiers.
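
The dependence of fragment interpretation on media type can be sketched as follows. This is a simplified illustration, not a conforming implementation: it assumes Python's standard html.parser module, treats an HTML anchor as either an "a" element with a matching name attribute or any element with a matching id attribute, and uses hypothetical class and function names.

```python
from html.parser import HTMLParser
from typing import Optional


class AnchorFinder(HTMLParser):
    """Record the first element that the fragment identifier designates."""

    def __init__(self, fragment: str) -> None:
        super().__init__()
        self.fragment = fragment
        self.found: Optional[str] = None  # tag name of the matching element

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attributes = dict(attrs)
        if attributes.get("id") == self.fragment or (
            tag == "a" and attributes.get("name") == self.fragment
        ):
            self.found = tag


def resolve_fragment(media_type: str, body: str, fragment: str) -> Optional[str]:
    """Interpret a fragment identifier according to the media type."""
    if media_type == "text/html":
        finder = AnchorFinder(fragment)
        finder.feed(body)
        return finder.found
    if media_type == "text/plain":
        # Plain text defines no semantics for fragment identifiers.
        return None
    raise NotImplementedError("no fragment rules sketched for " + media_type)


if __name__ == "__main__":
    html = '<h1 id="intro">Intro</h1><p><a name="refs">References</a></p>'
    print(resolve_fragment("text/html", html, "refs"))
```

The same fragment identifier applied to a plain text representation designates nothing, which is why the media type has to be known before the fragment is interpreted.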

1.4.1 Design weakness: HTTP content negotiation and fragment identifiers

Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.

New access protocols should provide a means to convert fragment identifiers according to media type.
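
The guideline for authors can be expressed as a simple check over the media types offered for one URI. The sketch below is an assumption-laden illustration: the table of fragment identifier semantics and its labels are invented for the example, and a real table would have to be derived from each media type's registration.

```python
from typing import Dict, Iterable, Optional

# Illustrative labels only; None marks a media type that defines
# no fragment identifier semantics.
FRAGMENT_SEMANTICS: Dict[str, Optional[str]] = {
    "text/html": "html-anchor",
    "image/svg+xml": "graphic-element",
    "text/plain": None,
}


def safe_to_negotiate(
    variants: Iterable[str],
    table: Dict[str, Optional[str]] = FRAGMENT_SEMANTICS,
) -> bool:
    """True when all variants share one known fragment identifier semantics."""
    # Media types missing from the table are treated as unknown (None).
    labels = {table.get(media_type) for media_type in variants}
    return len(labels) == 1 and None not in labels


if __name__ == "__main__":
    print(safe_to_negotiate(["text/html", "image/svg+xml"]))
```

A check of this kind treats unknown media types conservatively: if the semantics cannot be established, the variants are not considered safe to negotiate under one URI.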

1.5 URIs, URLs, URNs, ...

Section 1.2 of [RFC2396] explains that URIs can be further classified:

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable.

RFC 2141 [RFC2141] defines the "urn" URI scheme. In practice, URNs cannot be dereferenced. URI persistence (the goal of the urn scheme) is primarily a social issue. URIs of other schemes (including http) can also be managed to meet the goal of persistence, and can be dereferenced.
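
As a small illustration of the "urn" scheme, the sketch below splits a URN into its namespace identifier and namespace-specific string, following the general "urn:NID:NSS" shape of [RFC2141]. It does not validate the characters allowed in each part, and the sample URN and function name are chosen only for illustration.

```python
from typing import Optional, Tuple


def parse_urn(uri: str) -> Optional[Tuple[str, str]]:
    """Return (namespace identifier, namespace-specific string), or None."""
    parts = uri.split(":", 2)
    # The leading "urn" is matched without regard to case.
    if len(parts) != 3 or parts[0].lower() != "urn":
        return None
    nid, nss = parts[1], parts[2]
    if not nid or not nss:
        return None
    # The namespace identifier is case-insensitive, so normalize it.
    return nid.lower(), nss


if __name__ == "__main__":
    print(parse_urn("urn:isbn:0451450523"))
    print(parse_urn("http://www.w3.org/"))
```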

Chapter 2: Formats

@@Ideas:@@

For new format specifications, use XML unless there's a good reason not to.
Format designers should use URI references without constraining content providers to particular URI schemes.
Namespaces. Issues namespaceDocument-8, mixedNamespaceMeaning-13
Qnames: Issues rdfmsQnameUriMapping-6, qnameAsId-18 and finding "Using QNames as Identifiers in Content"
Formatting properties: Issue formattingProperties-19
Error handling: Issue errorHandling-20
MIME type registration: Issue RFC3023Charset-21, finding "Internet Media Type registration, consistency of use". Also, make sure to define fragment identifier semantics.

Chapter 3: Protocols

@@Note: See proposed Text from David Orchard.@@

Under the hood, the Web teems with activity in the form of messages exchanged over a global network. Part of the World Wide Web Architecture ("Web Architecture" from here on) is intended to maintain the illusion of a stable Web, hiding from the user the noise of these low-level messages. This chapter discusses some aspects of Web protocols that take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.

Relevant issues, findings:

Consistency of media types and message contents (from "TAG Finding: Internet Media Type registration, consistency of use").
Consistency of communicating character encoding (same source).
HTTP as a substrate protocol [TAG issue HTTPSubstrate-16]

Chapter 4: What does a message mean?

Appendix 1: Tips on URIs

A1.1 Spelling of URI references

Do not make assumptions about a resource based on the spelling of a URI that refers to it (other than what is defined in specifications for the URI scheme). Since URIs are opaque, it is an error to assume, for example, that a URI that happens to end with the string ".html" refers to a resource that has an HTML representation. Though people must not infer anything about the nature of a resource representation from a URI ending in ".html", resource owners must not create confusion by purposely misassigning suffixes and representation types.
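
One way to respect URI opacity is to take the media type from the metadata that accompanies a representation rather than from the spelling of the URI. The sketch below assumes Python's standard email.message module for parsing a Content-Type value; the function name, URI, and header value are hypothetical.

```python
from email.message import Message


def representation_type(uri: str, content_type_header: str) -> str:
    """Return the media type declared for a representation.

    The URI is accepted but deliberately never inspected: a suffix such
    as ".html" says nothing reliable about the representation.
    """
    message = Message()
    message["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and
    # lowercases the type/subtype pair.
    return message.get_content_type()


if __name__ == "__main__":
    # Hypothetical case: the URI ends in ".html" but the server
    # declares a PNG image.
    print(representation_type("http://www.example.org/chart.html", "image/png"))
```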

At times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case good social behavior requires that the URI be easy to use. But in general, just as "children should be seen but not heard", URIs should be used but not seen: they are ugly to look at, and they tend to lure us into thinking they hold definitive meaning about a resource.

Note on canonical form of URIs: Although section 6 of RFC 2396 describes URI canonicalization, those using URIs should use the same string-wise URI consistently to refer to the same resource. Don't rely on others to canonicalize a URI.
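
The advice on consistent spelling can be checked mechanically. The sketch below groups URIs that differ only in the case of the host name and reports any group with more than one spelling. It is a narrow illustration under stated assumptions: only host case is folded, URIs with a userinfo part are not handled, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def inconsistent_spellings(uris: Iterable[str]) -> Dict[str, List[str]]:
    """Map a host-lowercased URI to its spellings when more than one is used."""
    groups = defaultdict(set)
    for uri in uris:
        parts = urlsplit(uri)
        # Fold only the host name; paths stay case-sensitive.
        key = parts._replace(netloc=parts.netloc.lower()).geturl()
        groups[key].add(uri)
    return {key: sorted(spellings)
            for key, spellings in groups.items() if len(spellings) > 1}


if __name__ == "__main__":
    links = [
        "http://www.w3.org/TR/",
        "http://WWW.W3.ORG/TR/",
        "http://www.w3.org/Provider/Style/URI",
    ]
    print(inconsistent_spellings(links))
```

Two strings reported together by this check would compare as different under a plain string comparison, which is the situation the note above asks authors to avoid.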

A1.2 Unique URI references

Authors should not use a URI reference to identify more than one resource. In particular, authors should not use a URI reference to identify both a document and what the document is about, or both a person and that person's mailbox.

Glossary

Media Type/Content Type/MIME Type
Resource
Uniform Resource Identifier (URI)
Dereference (a URI)
URI Reference
URI Scheme
Absolute/Relative URI
Persistence

References

Normative References

RFC2396: IETF "RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax", T. Berners-Lee, R. Fielding, L. Masinter, August 1998. Available at http://www.ietf.org/rfc/rfc2396.
RFC2616: IETF "RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1", R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, June 1999. Available at http://www.ietf.org/rfc/rfc2616.
RFC2046: IETF "RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", N. Freed, N. Borenstein, November 1996. Available at http://www.ietf.org/rfc/rfc2046.
IANASchemes: IANA's online registry of URI Schemes is available at http://www.iana.org/assignments/uri-schemes. Dan Connolly's list of URI schemes is a useful resource for finding out which references define various URI schemes.

Non-Normative References

Axioms: "Universal Resource Identifiers - Axioms of Web Architecture", T. Berners-Lee, living document dated December 1996. Available at http://www.w3.org/DesignIssues/Axioms
Cool: "Cool URIs don't change", T. Berners-Lee, W3C, 1998. Available at http://www.w3.org/Provider/Style/URI
Fielding: "Principled Design of the Modern Web Architecture", R.T. Fielding and R.N. Taylor, UC Irvine. Available in PDF at http://www.cs.virginia.edu/~cs650/assignments/papers/p407-fielding.pdf
Fragments: "Fragment Identifiers on URIs", T. Berners-Lee, living document dated April 1997. Available at http://www.w3.org/DesignIssues/Fragment
HTML40: "HTML 4.01 Specification", D. Raggett, A. Le Hors, I. Jacobs, 24 December 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-html401-19991224.
P3P10: "The Platform for Privacy Preferences 1.0 (P3P1.0) Specification", M. Marchiori, ed., 16 April 2002. This W3C Recommendation is available at http://www.w3.org/TR/2002/REC-P3P-20020416/.
W3CPROCESS: "W3C Process Document", 19 July 2001 Version.
RFC2718: "Guidelines for new URL Schemes", L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, November 1999. Available at: http://www.ietf.org/rfc/rfc2718.txt.
RFC3236: IETF "RFC 3236: The 'application/xhtml+xml' Media Type", M. Baker, P. Stark, January 2002. Available at: http://www.rfc-editor.org/rfc/rfc3236.
RFC2141: IETF "RFC 2141: URN Syntax", R. Moats, May 1997. Available at http://www.ietf.org/rfc/rfc2141.txt.
UniqueDNS: "IAB Technical Comment on the Unique DNS Root", B. Carpenter, 27 Sep 1999.
XHTML1: "XHTML 1.0: The Extensible HyperText Markup Language: A Reformulation of HTML 4 in XML 1.0", S. Pemberton et al., 26 January 2000. The latest version of this W3C Recommendation is available at http://www.w3.org/TR/xhtml1.
XML10: "Extensible Markup Language (XML) 1.0 (Second Edition)", T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, 6 October 2000. This W3C Recommendation is available at http://www.w3.org/TR/2000/REC-xml-20001006.
XMLNS: "Namespaces in XML", T. Bray, D. Hollander, A. Layman, 14 Jan 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-xml-names-19990114.

To Do

Include examples
Include tips (e.g., how to avoid breaking a link; reference to HTTP).

Ian Jacobs
Last modified $Date: 2002/07/01 17:46:47 $ by $Author: ijacobs $