W3C

DRAFT: Architectural Principles of the World Wide Web

This version:
http://www.w3.org/2001/tag/2002/0701-intro
Superseded by:
http://www.w3.org/2001/tag/2002/0805-archdoc
Previous version:
http://www.w3.org/2001/tag/2002/0607-intro
Editor:
Ian Jacobs, W3C

Abstract

The World Wide Web is a networked information system. Web Architecture is the set of rules that all agents in the system follow that result in the large-scale effect of a shared information space. Identification, data formats, and protocols are the main technical components of Web Architecture, but the large-scale effect depends on social behaviours as well.

This document is a reference set of rules for Web Architecture.

Status of this document

This document has been superseded. See next version.

This document has been developed for discussion by the W3C Technical Architecture Group.

This draft is highly unstable. This draft represents substantial input from TAG participants, but does not yet represent consensus. It is a draft with no official standing. Once this document has undergone substantial revision, the TAG expects to develop it on the W3C Recommendation track.

Please send comments on this document to the public W3C TAG mailing list www-tag@w3.org (archive).

Publication of this document by W3C indicates no endorsement by W3C.

Introduction

The World Wide Web ("Web" from here on) is a networked information system consisting of agents (clients, servers, and other programs) that exchange information. Open: Web Architecture is the set of rules that all agents in the system follow that result in the large-scale effect of a shared information space that scales well and behaves predictably.

This architecture consists of:

  1. Identifiers. A single specification of the way in which objects in the system are identified: the Uniform Resource Identifier (URI) [RFC2396].
  2. Formats. Specifications of a nonexclusive set of data formats designed for interchange between agents in the system. This includes several formats used in isolation or in combinations (e.g., XHTML, PNG, XLink, RDF, SMIL animation), as well as technologies for designing new formats (XML, XML namespaces).
  3. Protocols. Specifications of a small and nonexclusive set of protocols for interchanging information between agents, including HTTP [RFC2616], SMTP and others. Several of these protocols share a reliance on the Internet Media Type (or, "MIME") metadata/packaging system [RFC2046].

Limits of this document

This document focuses on architectural principles specific to or fundamental to the Web. It does not address general principles of design, which are also important to the success of the Web. Indeed, behind many of the principles of Web Architecture lie these and other principles: minimal constraint (fewer rules make the system more flexible), modularity, minimum redundancy, extensibility, simplicity, robustness, etc.

This document does not address design goals covered by targeted W3C specifications:

  1. Internationalization; see W3C's Internationalization Activity.
  2. Accessibility; see W3C's Web Accessibility Initiative.
  3. Device independence; see W3C's Device Independence Activity.

Chapter 1: Identifiers

According to [RFC2396], a resource is "anything that has identity." A resource is part of the Web when there is a URI that identifies it. (Open: issue httpRange-14: What is the range of the HTTP dereference function?)

UseURI: All important resources SHOULD be part of the Web, i.e., identified by a URI.

Open: The URI specification [RFC2396] represents a worldwide agreement on who can create identifiers and how they take on meaning in protocols and formats.

1.1 URI Schemes

A number of identification mechanisms pre-date the Web, such as those for electronic mailboxes and ftp documents. URIs were designed to incorporate these existing naming schemes ('ftp', 'mailto', etc.) and a new scheme designed specially for the Web: 'http'.

A URI scheme defines the properties of URIs in that scheme. The IANA registry [IANASchemes] lists URI schemes and the specifications that define them. For instance, the HTTP URI scheme is defined in section 3.2.2 of the HTTP specification [RFC2616]. In a URI, the scheme name appears before the colon (":"), as in ftp://www.ietf.org/rfc/rfc2396.txt.
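As a minimal sketch of the syntax described above, the following Python fragment uses the standard library function urllib.parse.urlsplit to extract the scheme name from a URI. The helper name scheme_of is an illustration only and is not defined by any specification.

```python
from urllib.parse import urlsplit


def scheme_of(uri: str) -> str:
    """Return the scheme name, i.e. the part of the URI before the first colon."""
    return urlsplit(uri).scheme


# URIs taken from this document; each one belongs to a different scheme.
for uri in (
    "ftp://www.ietf.org/rfc/rfc2396.txt",
    "mailto:www-tag@w3.org",
    "http://www.w3.org/TR/SVG",
):
    print(scheme_of(uri), "<-", uri)
```

The scheme name alone says which specification governs the rest of the URI; it says nothing about the nature of the resource beyond what that scheme defines.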

Open: Some important properties vary by URI scheme, including the following:

  1. The sort of resource identified. For instance, a TELNET URI identifies a telnet service and a MAILTO URI identifies a mailbox. Open:
    1. From TimBL: An HTTP URI identifies a document -- something that can be entirely conveyed in bits -- for which there is a mapping to a set of equivalent representations (see scheme property 6). The mapping from HTTP URI to document may vary over time, and the set of (equivalent) representations may be empty.
    2. From Roy: An HTTP URI identifies an abstraction for which there is a time-varying conceptual mapping to a (possibly empty) set of representations that are equivalent.
  2. Whether representations of the resource can change over time (i.e., whether resources can be "living resources").
  3. Deployed dereference mechanisms, if any.
  4. The persistence of the identity relationship. Persistence involves two properties:
    1. The degree to which the same URI identifies the same or different resources over time.
    2. The degree to which the means to dereference a URI remains available over time.
  5. The social governance of the identity relationship. For instance, in the HTTP URI scheme, the ICANN has authority over DNS.
  6. Whether two URIs refer to the same resource. For HTTP URIs, the naming authority determines this. For instance, W3C determines whether two URIs refer to the same W3C Recommendation.
  7. Whether two strings are different spellings for the same URI. [TAG issue URIEquivalence-15]
  8. Whether it's possible to construct relative URI references.
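Property 7 can be illustrated for the HTTP scheme. Section 3.2.3 of [RFC2616] gives comparison rules for HTTP URIs: scheme and host names compare case-insensitively, a missing or empty port is equivalent to the default port 80, and an empty path is equivalent to "/". The following Python fragment is a partial sketch of those rules only. It deliberately omits the rule on percent-encoded characters, and the function names are illustrative assumptions. It does not resolve TAG issue URIEquivalence-15.

```python
from urllib.parse import urlsplit


def _comparison_key(uri: str) -> tuple:
    """Reduce an HTTP URI to the parts that matter for comparison."""
    parts = urlsplit(uri)
    return (
        parts.scheme.lower(),            # scheme names compare case-insensitively
        (parts.hostname or "").lower(),  # host names compare case-insensitively
        parts.port or 80,                # a missing or empty port means the default, 80
        parts.path or "/",               # an empty path is equivalent to "/"
        parts.query,                     # everything else compares case-sensitively
    )


def same_http_uri(a: str, b: str) -> bool:
    """Partial check: True if the two strings spell the same HTTP URI."""
    return _comparison_key(a) == _comparison_key(b)


print(same_http_uri("http://WWW.W3.ORG:80/TR/SVG", "http://www.w3.org/TR/SVG"))
print(same_http_uri("http://www.w3.org/TR/SVG", "http://www.w3.org/TR/svg"))
```

The second comparison differs only in the case of the path, which the HTTP rules treat as significant.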

1.1.1 Social expectations for URI persistence

As mentioned above, URI schemes may have different persistence properties. There are strong social expectations that once a URI identifies a particular resource, it should continue indefinitely to refer to that resource. Persistence is always a matter of policy and commitment on the part of authorities assigning URIs rather than a constraint imposed by technological means.

For example, each W3C technical report (e.g., "the SVG specification") is in fact a series of documents that represent the maturation of the technical report (Working Drafts, Candidate Recommendations, Proposed Recommendations, and a Recommendation). W3C assigns a URI to the "latest version" in the specification series (e.g., http://www.w3.org/TR/SVG). W3C also assigns a URI for each specification in the series (called the "this version" URI, as in http://www.w3.org/TR/2001/PR-SVG-20010719/). W3C policy is that representations of the "latest version" resource will change over time (with each new publication of an SVG specification). W3C policy is also that representations of a specification designated by a "this version" URI will not change over time (to the best of W3C's ability to maintain its archives intact).

RFC 2141 [RFC2141] defines the "Uniform Resource Name" (URN) URI scheme. URNs form a subset of URIs that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable (per section 1.2 of [RFC2396]). In practice, URNs cannot be dereferenced. URIs of other schemes (including HTTP) can also be managed to meet the goal of persistence, and can be dereferenced.

For more ideas on persistence policies, see "Cool URIs Don't Change" [Cool].

1.1.2 Centralized registries

In general, to promote scalability, Web architecture should avoid centralized registries. There are exceptions (e.g., DNS may be acceptable). On the other hand, the TAG finding "Mapping between URIs and Internet Media Types" promotes the idea of using the Web as a repository for new Media Types. [TAG issue uriMediaType-9]

AvoidRegistries: Designers SHOULD avoid centralized registries but MAY rely on the continued existence and utility of the DNS.

1.2 Dereferencing a URI

To dereference a URI means to request a representation of the resource designated by the URI. The dereference mechanism varies according to URI scheme and must be defined by each scheme where dereferencing is a goal. See "Guidelines for new URL Schemes" [RFC2718]. The dereference mechanism for the HTTP URI scheme is GET [TAG issue whenToUseGet-7].
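As a minimal sketch of the HTTP case, the following Python fragment builds (but does not send) the GET request that dereferences an HTTP URI, using urllib.request.Request from the standard library. The function name build_dereference_request is an illustrative assumption.

```python
from urllib.request import Request


def build_dereference_request(uri: str) -> Request:
    """Build the GET request that asks for a representation of the resource."""
    # The method is stated explicitly: for the HTTP scheme, dereferencing is GET.
    return Request(uri, method="GET")


request = build_dereference_request("http://www.w3.org/TR/SVG")
print(request.get_method(), request.full_url)

# Sending the request requires network access and is left out of this sketch:
#     with urllib.request.urlopen(request) as response:
#         representation = response.read()
```

The response body is one representation of the resource; the URI itself continues to identify the resource, not that particular representation.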

UseGET: Agents SHOULD be able to dereference URIs for important resources.

Open: "Since HTTP GET is defined and widely deployed, agents SHOULD use HTTP URIs."

Open: Say something here à la what Tim Bray said: "Don't build a world of resources that cannot be identified by URI."?

DescribeResource: Dereferencing a URI for an important abstract concept (for example, Internet protocol parameters) SHOULD return human and/or machine readable representations that describe the nature and purpose of those resources.

GETIsSafe: Dereferencing URIs is safe; i.e., agents do not incur obligations by following links. [TAG finding "URIs, Addressability, and the use of HTTP GET"]

Please refer to the TAG finding "URIs, Addressability, and the use of HTTP GET" for information about safe operations and using HTTP GET for addressability.

1.3 URIs and URI References

Section 4 of RFC 2396 [RFC2396] introduces the term URI Reference to include absolute URIs and two other constructs also used for identification:

  1. Relative URI references. A relative URI reference is a syntactic abbreviation for an absolute URI. An example of a relative URI reference is ../main.html. The meaning of a relative URI reference depends on the context where it is used, unlike absolute URIs, whose meaning is the same in any context.
  2. Fragment identifiers ("#" and what follows). According to RFC2396, a fragment identifier "is not part of a URI, but is often used in conjunction with a URI."

There are thus four classes of identifiers that comprise URI References:

  1. Absolute URI only, no fragment identifier. This is the class of URI Reference that is input to the dereference process.
  2. Absolute URI or relative URI Reference, no fragment identifier. As an example, XML 1.0 requires that SYSTEM identifiers belong to this class.
  3. Absolute URI only, with fragment identifier. In practice, almost all XML namespace names belong to this class. Open: This is the class of identifiers that forms the abstract information space.
  4. Unrestricted URI Reference.
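The two distinctions behind these classes (absolute versus relative, with or without a fragment identifier) can be sketched with the Python standard library. In the fragment below, the base URI under www.example.org, the fragment "intro", and the helper name describe are illustrative assumptions.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def describe(reference: str) -> tuple:
    """Return (is_absolute, has_fragment) for a URI reference."""
    without_fragment, fragment = urldefrag(reference)
    # A reference is absolute when it carries its own scheme name.
    is_absolute = bool(urlsplit(without_fragment).scheme)
    return is_absolute, bool(fragment)


# A relative reference only has meaning against a base URI (its context).
base = "http://www.example.org/docs/a/index.html"
print(urljoin(base, "../main.html"))

print(describe("../main.html"))                    # relative, no fragment
print(describe("http://www.w3.org/TR/SVG#intro"))  # absolute, with fragment
```

Resolving the same relative reference against a different base yields a different absolute URI, which is why class 1 above (absolute, no fragment) is the only input to the dereference process.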

Open:

UseOfURIReference: Authors of specifications MUST use the terms "URI" and "URI Reference" according to the definitions in RFC2396.

1.4 Fragment identifiers

URIs that can be dereferenced can end with a fragment identifier (to form a URI reference). Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation. For instance, if the representation is an HTML document, the fragment identifier designates a hypertext anchor. In the case of a graphics format, a URI reference might designate a circle or spline. In the case of RDF, a URI reference can designate anything, be it abstract (e.g., a dream) or concrete (e.g., my car). The plain text media type does not define semantics for fragment identifiers.

1.4.1 Design weakness: HTTP content negotiation and fragment identifiers

ConegFragment: Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.
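The principle can be sketched as a check an author might run before offering several media types for one URI. The table below is a hypothetical illustration built from the examples in section 1.4; its labels, the use of image/svg+xml as the graphics format, and the function name are assumptions, not registry data.

```python
# Hypothetical table: how each media type interprets a fragment identifier.
FRAGMENT_SEMANTICS = {
    "text/html": "hypertext anchor",
    "image/svg+xml": "graphical element",  # assumed example of a graphics format
    "text/plain": "none defined",
}


def shares_fragment_semantics(variants) -> bool:
    """True if all negotiable media types interpret fragments the same way.

    Raises KeyError for a media type missing from the table, since nothing
    can be concluded about a type whose fragment semantics are unknown.
    """
    semantics = {FRAGMENT_SEMANTICS[media_type] for media_type in variants}
    return len(semantics) <= 1


print(shares_fragment_semantics(["text/html"]))
print(shares_fragment_semantics(["text/html", "text/plain"]))
```

When the check fails, a URI reference such as one ending in a fragment that names an HTML anchor would mean something in one negotiated representation and nothing in another.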

Open: New access protocols should provide a means to convert fragment identifiers according to media type.

Chapter 2: Formats

@@Ideas:@@

Chapter 3: Protocols

As mentioned in the introduction, the Web is designed to create the large-scale effect of a shared information space that scales well and behaves predictably. The architectural style known as Representational State Transfer [REST] encapsulates this notion of a shared information space. According to Fielding:

REST provides a set of architectural constraints that, when applied as a whole, emphasizes scalability of component interactions, generality of interfaces, independent deployment of components, and intermediary components to reduce interaction latency, enforce security, and encapsulate legacy systems.
-- Roy Fielding, Section 5.5 of [REST]

HTTP has been specially designed for REST interactions. HTTP has a variety of methods designed to manipulate resource state through representation transfer between agents. These methods include GET (covered in section 1.2), POST, PUT, and DELETE.
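Section 9.1 of [RFC2616] distinguishes safe methods (GET and HEAD) from idempotent ones (which also include PUT and DELETE). The following Python fragment is a minimal sketch of that classification, limited to the methods named above plus HEAD; the constant and function names are illustrative assumptions.

```python
# Safe: the request is retrieval only, so the agent incurs no obligation.
SAFE_METHODS = frozenset({"GET", "HEAD"})

# Idempotent: repeating the request has the same effect as making it once.
IDEMPOTENT_METHODS = SAFE_METHODS | {"PUT", "DELETE"}


def is_safe(method: str) -> bool:
    """True if an agent may issue the method without incurring obligations."""
    return method.upper() in SAFE_METHODS


def is_idempotent(method: str) -> bool:
    """True if the method may be repeated without changing the outcome."""
    return method.upper() in IDEMPOTENT_METHODS


for method in ("GET", "POST", "PUT", "DELETE"):
    print(method, "safe:", is_safe(method), "idempotent:", is_idempotent(method))
```

POST is neither safe nor idempotent, which is why it is unsuitable as a dereference mechanism.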

This chapter uses the REST model to explain how Web protocols take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.

Relevant issues, findings:

  1. Consistency of media types and message contents (from "TAG Finding: Internet Media Type registration, consistency of use").
  2. Consistency of communicating character encoding (same source).
  3. HTTP as a substrate protocol [TAG issue HTTPSubstrate-16]

Appendix 1: Tips on URIs

A1.1 Spelling of URIs

Do not make assumptions about a resource based on the spelling of a URI that refers to it (other than what is defined in specifications for the URI scheme). Since URIs are opaque, it is an error to assume, for example, that a URI that happens to end with the string ".html" refers to a resource that has an HTML representation. Though people must not infer anything about the nature of a resource representation from a URI ending in ".html", resource owners must not create confusion by purposely misassigning suffixes and representation types.
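As a minimal sketch of this tip, the following Python fragment determines the media type of a representation from the Content-Type header of the response and never inspects the URI. The function name, the example URI under www.example.org, and the plain dictionary of headers (with the exact key "Content-Type") are illustrative assumptions; the fallback to application/octet-stream follows section 7.2.1 of [RFC2616].

```python
def media_type_from_headers(headers: dict) -> str:
    """Return the media type stated by the server, ignoring the URI spelling."""
    value = headers.get("Content-Type", "application/octet-stream")
    # Drop parameters such as "; charset=..." and normalize case.
    return value.split(";", 1)[0].strip().lower()


# The URI ends in ".html", yet the server states that the representation is a PNG image.
uri = "http://www.example.org/report.html"
response_headers = {"Content-Type": "image/png"}
print(uri, "->", media_type_from_headers(response_headers))
```

The function takes no URI argument at all, which makes the opacity rule hard to violate by accident.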

At times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case, good social behavior requires that the URI be easy to use. But in general, just as "children should be seen but not heard", URIs should be used but not seen. In general, URIs should be hidden from view since they are ugly to look at and they tend to lure us into thinking they hold definitive meaning about a resource.

Open: Canonical form of URIs. See issue URIEquivalence-15.

A1.2 Unique URIs

Authors should not use a URI to identify more than one resource.

Nothing prevents us from considering "a representation of the novel Moby Dick" to be a resource itself (and thus to have an assigned URI). Authors should not use the same URI to refer to the resource "Moby Dick" and to the particular representation of that resource. Similarly, authors should not use the same URI to refer to a person and to that person's mailbox.

Glossary

Agent
A client, server, or other program that exchanges information on the Web.
Dereference (a URI)
To request a representation of the resource designated by the URI.
Link
Media Type/Content Type/MIME Type
Resource
Anything with identity. A resource on the Web is one that has an assigned URI.
Uniform Resource Identifier (URI)
URI Reference
URI Scheme
Absolute/Relative URI
Persistence

References

Normative References

RFC2396
IETF "RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax", T. Berners-Lee, R. Fielding, L. Masinter, August 1998. Available at http://www.ietf.org/rfc/rfc2396.
RFC2616
IETF "RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1", R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, June 1999. Available at http://www.ietf.org/rfc/rfc2616.
RFC2046
IETF "RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", N. Freed, N. Borenstein, November 1996. Available at http://www.ietf.org/rfc/rfc2046.
IANASchemes
IANA's online registry of URI Schemes is available at http://www.iana.org/assignments/uri-schemes.
Dan Connolly's list of URI schemes is a useful resource for finding out which references define various URI schemes.

Non-Normative References

Axioms
"Universal Resource Identifiers - Axioms of Web Architecture", T. Berners-Lee, living document dated December 1996. Available at http://www.w3.org/DesignIssues/Axioms
Cool
"Cool URIs don't change", T. Berners-Lee, W3C, 1998. Available at http://www.w3.org/Provider/Style/URI
Fielding
"Principled Design of the Modern Web Architecture", R.T. Fielding and R.N. Taylor, UC Irvine, Available in PDF at http://www.cs.virginia.edu/~cs650/assignments/papers/p407-fielding.pdf
Fragments
"Fragment Identifiers on URIs", T. Berners-Lee, living document dated April 1997. Available at http://www.w3.org/DesignIssues/Fragment
HTML40
"HTML 4.01 Specification", D. Raggett, A. Le Hors, I. Jacobs, 24 December 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-html401-19991224/.
P3P10
"The Platform for Privacy Preferences 1.0 (P3P1.0) Specification", M. Marchiori, ed., 16 April 2002. This W3C Recommendation is available at http://www.w3.org/TR/2002/REC-P3P-20020416/.
REST
"Representational State Transfer (REST)", Chapter 5 of "Architectural Styles and the Design of Network-based Software Architectures", Doctoral Thesis of R. T. Fielding, 2000.
RFC2718
"Guidelines for new URL Schemes", L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, November 1999. Available at: http://www.ietf.org/rfc/rfc2718.txt.
RFC3236
IETF "RFC 3236: The 'application/xhtml+xml' Media Type", M. Baker, P. Stark, January 2002. Available at: http://www.rfc-editor.org/rfc/rfc3236.
RFC2141
IETF "RFC 2141: URN Syntax", R. Moats, May 1997. Available at http://www.ietf.org/rfc/rfc2141.txt.
UniqueDNS
"IAB Technical Comment on the Unique DNS Root", B. Carpenter, 27 Sep 1999.
XHTML1
"XHTML 1.0: The Extensible HyperText Markup Language: A Reformulation of HTML 4 in XML 1.0", S. Pemberton et al., 26 January 2000. The latest version of this W3C Recommendation is available at http://www.w3.org/TR/xhtml1/.
XML10
"Extensible Markup Language (XML) 1.0 (Second Edition)", T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, 6 October 2000. This W3C Recommendation is available at http://www.w3.org/TR/2000/REC-xml-20001006.
XMLNS
"Namespaces in XML", T. Bray, D. Hollander, A. Layman, 14 Jan 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-xml-names-19990114/.
W3CPROCESS
"W3C Process Document", 19 July 2001 Version.

To Do


Ian Jacobs
Last modified $Date: 2002/08/08 12:33:35 $ by $Author: ijacobs $