DRAFT: Architectural Principles of the World Wide Web

This version: http://www.w3.org/2001/tag/2002/0805-archdoc
Superseded by: http://www.w3.org/2001/tag/2002/0813-archdoc
Previous version: http://www.w3.org/2001/tag/2002/0701-intro
Editor: Ian Jacobs, W3C

Abstract

The World Wide Web is a networked information system. Web Architecture is the set of rules that all agents in the system follow that result in the large-scale effect of a shared information space. Identification, data formats, and protocols are the main technical components of Web Architecture, but the large-scale effect depends on social behavior as well.

This document strives to establish a reference set of rules for Web architecture. Some of these rules may conflict with current practice, and so education and outreach will be required to improve on that practice. Other rules may fill in gaps in published specifications or may highlight known weaknesses in those specifications.

Status of this document

This document has been superseded. See next version.

This document has been developed for discussion by the W3C Technical Architecture Group.

This draft is highly unstable. This draft represents substantial input from TAG participants, but does not yet represent consensus. It is a draft with no official standing. Once this document has undergone substantial revision, the TAG expects to develop it on the W3C Recommendation track.

Please send comments on this document to the public W3C TAG mailing list www-tag@w3.org (archive).

Publication of this document by W3C indicates no endorsement by W3C.

Introduction
Chapter 1: Identifiers and resources
Chapter 2: Formats
Chapter 3: Protocols
Appendix 1: Tips on URIs
Appendix 2: URI scheme properties
Glossary
References

Introduction

The World Wide Web ("Web" from here on) is a networked information system consisting of agents (clients, servers, and other programs) that exchange information. Open: Web Architecture is the set of rules that all agents in the system follow that result in the large-scale effect of a shared information space that scales well and behaves predictably.

This architecture consists of:

Identifiers. A single specification of the way in which objects in the system are identified: the Uniform Resource Identifier (URI) [RFC2396].
Formats. Specifications of a nonexclusive set of data formats designed for interchange between agents in the system. This includes several formats used in isolation or in combinations (e.g., XHTML, PNG, XLink, RDF, SMIL animation), as well as technologies for designing new formats (XML, XML namespaces).
Protocols. Specifications of a small and nonexclusive set of protocols for interchanging information between agents, including HTTP [RFC2616], SMTP and others. Several of these protocols share a reliance on the Internet Media Type (or, "MIME") metadata/packaging system [RFC2046].

Limits of this document

This document focuses on architectural principles specific to or fundamental to the Web. It does not address general principles of design, which are also important to the success of the Web. Indeed, behind many of the principles of Web Architecture lie these and other principles: minimal constraint (fewer rules make the system more flexible), modularity, minimum redundancy, extensibility, simplicity, robustness, etc.

This document does not address architectural design goals covered by targeted W3C specifications:

Internationalization; see W3C's Internationalization Activity.
Accessibility; see W3C's Web Accessibility Initiative.
Device independence; see W3C's Device Independence Activity.

Chapter 1: Identifiers and resources

The Web is a universe of resources. According to [RFC2396] a resource is "anything that has identity." A resource is part of the Web when it is identified by a URI.

Use URIs: All important resources SHOULD be identified by a URI.

Each valid use of a URI unambiguously identifies one resource.

Valid use of URI: If you are using a registered URI scheme and following all the other relevant protocol specifications, it is unambiguous what resource you are referring to. This goes for all URI references, not just URIs.

Some resources do not have URIs (and are not part of the Web). For instance, if we consider every real number to be a resource, we clearly cannot give every real number a URI without collisions, because there are only denumerably many URIs.

1.1 Some generalities about URIs

The following statements are useful generalities about some URIs. Some of these generalities do not hold for some URI schemes.

The authority over a URI determines which resource a URI designates.
It is not generally possible to inspect a URI and determine what resource it identifies. For example, in general, one cannot look at a URI and know that it refers to "my old car" or "the weather forecast for Oaxaca, Mexico."
In general, several URIs may designate the same resource.
It is not generally possible to inspect two URIs and determine that they identify the same resource.
It is possible to compare two URIs to see whether they are spelled equivalently; see the section on URI equivalence and comparison for more details. One scenario where it is useful to compare URIs is when they are used as XML namespace identifiers [XMLNS].

1.1.1 URIs and context-sensitivity

Each valid use of a URI identifies one resource, but the resource itself may be inherently context-sensitive. For instance, http://www.example.com/ identifies the same resource in any context. On the other hand, http://localhost/ and file:/etc/hosts each identify one resource, but that resource is "local" to a particular computer. It is valid to use a URI such as file:/etc/hosts on a given computer, and even on several computers, if you are confident that all of those computers are running the same type of operating system.

Context-insensitive URI: A URI SHOULD denote the same resource or concept independent of the context(s) in which the URI is used.

The same principle applies to URI references, which may be context-sensitive as well. For instance, the URI reference ../myFile is likely to be ambiguous and refer to two different resources (after resolution to an absolute URI given a base URI), depending on the context in which it is used.
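As a minimal sketch of this context sensitivity, the following Python fragment resolves the same relative URI reference against two different base URIs with the standard library function urllib.parse.urljoin. The two base URIs are hypothetical and chosen only for illustration; only the reference ../myFile comes from the text above.

```python
from urllib.parse import urljoin

# Hypothetical base URIs (assumptions, not part of this document).
BASE_A = "http://www.example.com/a/b/index.html"
BASE_B = "http://www.example.org/x/y/z.html"


def resolve(base_uri, reference):
    """Resolve a URI reference to an absolute URI, given a base URI."""
    return urljoin(base_uri, reference)


resolved_a = resolve(BASE_A, "../myFile")
resolved_b = resolve(BASE_B, "../myFile")

# The same reference yields a different absolute URI in each context.
print(resolved_a)
print(resolved_b)
```

The reference is spelled identically in both calls, yet the absolute URIs differ, so the resource identified depends on where the reference appears.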

Note: Within the Resource Description Framework (RDF), URI references are used to identify resources.

1.2 Resources and representations

A resource is an abstraction for which there is a conceptual mapping to a (possibly empty) set of representations. People and Web agents do not interact with a resource directly, but rather with representations of that resource. Interaction with a resource is governed by recursive application of a finite set of specifications, beginning with the specification that governs the scheme of the URI. One of the most important ways to interact with a resource is to request a representation of it. This is done by dereferencing the URI that identifies the resource.

For instance, suppose the URI http://weather.yahoo.com/forecast/MXOA0069.html identifies a resource that is "the weather forecast for Oaxaca, Mexico". Dereferencing the URI will result in a representation that may be encoded in any number of formats, including HTML, XHTML, SVG, etc.; see chapter 2 for more information about formats.
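One way to picture dereferencing an HTTP URI is as the construction of a GET request that states which formats the agent prefers. The sketch below only builds such a request with Python's urllib.request and does not send it; the function name and the chosen list of preferred formats are assumptions made for illustration, and whether a given server honors the Accept header is not guaranteed.

```python
from urllib.request import Request

# URI taken from the example above.
URI = "http://weather.yahoo.com/forecast/MXOA0069.html"


def build_dereference_request(uri, preferred_formats):
    """Build (but do not send) a GET request for a representation.

    preferred_formats is a list of media types, most preferred first
    (a hypothetical parameter for this sketch).
    """
    accept = ", ".join(preferred_formats)
    return Request(uri, headers={"Accept": accept}, method="GET")


request = build_dereference_request(URI, ["application/xhtml+xml", "text/html"])
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))

# Sending the request would require network access, for example:
#   urllib.request.urlopen(request)
```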

1.2.1 Consistent representations

The representations of a resource may vary as a function of factors including time, place, and the identity of the agent accessing the resource. For example, for the resource "the weather forecast for Oaxaca, Mexico," the representations depend on (at least) time, the expressed preference of the user for Fahrenheit or Celsius, and the identity of the user-agent software receiving the representation.

Consistent Representations: There is a strong expectation of consistency between the representations of a resource; to the extent possible, representations SHOULD be equivalent.

1.2.2 Consistent associations between URIs and resources

Note the difference between changes in representations of a resource and changes in the binding between a URI and a resource. Today, the URI http://www.w3.org/ designates the resource "the W3C home page." A representation you get today by dereferencing that URI is likely to differ from one you get tomorrow, since W3C updates its home page frequently with news items. These changes in representation are predictable, and the resource remains "the W3C home page".

On the other hand, if tomorrow, the same URI designated a different resource (for example, because the domain was sold and the new owner decided to assert a different URI-Resource relationship), the URI would lose value. This type of indiscriminate use of URIs undermines their value and interferes with people who relied on them (e.g., historians, court archives, new archive services, and anybody frustrated by a broken link).

1.3 Characteristics of URIs

A URI is a string of characters starting with a URI scheme. Some examples include:

http://www.w3.org/
ftp://ftp.w3.org/
irc://irc.openprojects.net/rdfig
urn:oasis:names:tc:SAML:1.0:assertion
tel:+1-913-555-1212

See [RFC2396] for more information about URI syntax. URIs do not include a "#", per RFC 2396. Only URI references do.

Persistence

There are strong social expectations that once a URI identifies a particular resource, it should continue indefinitely to refer to that resource. Persistence is always a matter of policy and commitment on the part of authorities assigning URIs rather than a constraint imposed by technological means.

Persistent URIs: Those who create and manage resources and their identifiers SHOULD design the identifiers in such a way as to ensure their persistence.

For example, each W3C technical report (e.g., "the SVG specification") is in fact a series of documents that represent the maturation of the technical report (Working Drafts, Candidate Recommendations, Proposed Recommendations, and a Recommendation). W3C assigns a URI to the "latest version" in the specification series (e.g., http://www.w3.org/TR/SVG). W3C also assigns a URI for each specification in the series (called the "this version URI", as in http://www.w3.org/TR/2001/PR-SVG-20010719/). W3C policy is that representations of the "latest version" resource will change over time (with each new publication of an SVG specification). W3C policy is also that representations of a specification designated by a "this version" URI will not change over time (to the best of W3C's ability to maintain its archives intact).

URI Schemes

A primary characteristic of a URI is its scheme, which is given by a colon-delimited prefix. For example the scheme of the URI http://www.example.com/ is "http", and for ftp://ftp.example.com/ it is "ftp". It is common to classify URIs by scheme, calling the two preceding examples respectively an "HTTP URI" and an "FTP URI".
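The classification by scheme can be sketched in a few lines of Python using urllib.parse.urlsplit. The helper names scheme_of and classify are hypothetical; the example URIs are the ones listed earlier in this section.

```python
from urllib.parse import urlsplit

# Example URIs from the list in section 1.3.
EXAMPLES = [
    "http://www.w3.org/",
    "ftp://ftp.w3.org/",
    "irc://irc.openprojects.net/rdfig",
    "urn:oasis:names:tc:SAML:1.0:assertion",
    "tel:+1-913-555-1212",
]


def scheme_of(uri):
    """Return the scheme: the part of the URI before the first colon."""
    return urlsplit(uri).scheme


def classify(uri):
    """Name a URI after its scheme, e.g. 'HTTP URI' or 'FTP URI'."""
    return scheme_of(uri).upper() + " URI"


for uri in EXAMPLES:
    print(classify(uri), uri)
```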

Many of the properties of URIs are scheme-dependent.

New Schemes Expensive: Since correct processing of URIs is often scheme-dependent, and since a huge range of software is expected to be able to process URIs, the cost of introduction of new URI schemes is very high. The introduction of new URI schemes SHOULD be avoided.

The IANA registry [IANASchemes] lists URI schemes and the specifications that define them. For instance, the HTTP URI scheme is defined in section 3.2.2 of the HTTP specification [RFC2616]. Refer to RFC2717 for information about registering a new URI scheme.

Scheme-specific Resource Classes

Some URI schemes are used for identifying specific classes of resources. For example, TELNET URIs represent telnet services and MAILTO URIs electronic mailboxes.

(Open:issue httpRange-14 : What is the range of the HTTP dereference function?).

Dereference Mechanisms

The procedure for dereferencing a URI may vary from scheme to scheme. For example, HTTP URIs are dereferenceable using the protocol of the same name, and the scheme is actually defined in section 3.2.2 of the HTTP specification [RFC2616].

On the other hand, the URN scheme [RFC2141] does not guarantee that a dereference procedure is defined for any given URN.

@@Does it work here to substitute "dereference" for GET?@@

Allow Dereference: Agents SHOULD be able to dereference URIs for important resources.

Describe Resources: Dereferencing a URI for an important abstract concept (for example, Internet protocol parameters) SHOULD return human and/or machine readable representations that describe the nature and purpose of those resources.

Dereference is Safe: Dereferencing URIs is safe; i.e., agents do not incur obligations by following links. [TAG finding "URIs, Addressability, and the use of HTTP GET"]

Open: "Since HTTP GET is defined and widely deployed, agents SHOULD use HTTP URIs."

Open: Say something here à la what Tim Bray said: "Don't build a world of resources that cannot be identified by URI."?

Social Governance

The deployment and use of different URI schemes may require varying degrees of central co-ordination and administration. For example, HTTP URIs depend (in practice at least) on the use of the DNS infrastructure. Also, there is a central registry of URN subclasses.

Equivalence and Comparison

Certain URI schemes provide syntactic rules for determining equivalence in URIs, and these rules vary from scheme to scheme.

For example, URNs begin with two colon-delimited fields, the first of which must be urn and the second identifies the subclass of URN, for example urn:ietf:example. In URNs, these two fields are to be compared in a case-insensitive fashion. The remainder of the URN following the second colon is subject to rules dependent on the content of the second field (following the first colon) - thus the equivalence rules may vary within subclasses of URNs.

Section 3.2.3 of the HTTP specification [RFC2616] states that, when comparing two HTTP URIs, the host name part must be considered case-insensitive, so http://WWW.EXAMPLE/ and http://www.example/ identify the same resource.
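The two comparison rules just described might be sketched as follows. This is a simplified illustration, not a complete implementation: the function names are hypothetical, the URN check compares the remainder exactly even though each URN subclass may define its own rules, and the HTTP check omits the treatment of escaped characters that [RFC2616] also specifies.

```python
from urllib.parse import urlsplit


def urn_equivalent(a, b):
    """Compare 'urn' and the subclass field case-insensitively.

    Assumption: the remainder is compared exactly; real URN
    subclasses may define different rules for that part.
    """
    parts_a = a.split(":", 2)
    parts_b = b.split(":", 2)
    if len(parts_a) != 3 or len(parts_b) != 3:
        return False
    if parts_a[0].lower() != "urn" or parts_b[0].lower() != "urn":
        return False
    return parts_a[1].lower() == parts_b[1].lower() and parts_a[2] == parts_b[2]


def http_equivalent(a, b):
    """Compare two HTTP URIs with a case-insensitive host name.

    An absent port is treated as port 80 and an empty path as "/".
    The path and query are compared case-sensitively.
    """
    ua, ub = urlsplit(a), urlsplit(b)
    if ua.scheme != "http" or ub.scheme != "http":
        return False
    # urlsplit already lowercases the scheme and the hostname attribute.
    same_host = (ua.hostname or "") == (ub.hostname or "")
    same_port = (ua.port or 80) == (ub.port or 80)
    same_rest = (ua.path or "/", ua.query) == (ub.path or "/", ub.query)
    return same_host and same_port and same_rest


print(http_equivalent("http://WWW.EXAMPLE/", "http://www.example/"))
print(http_equivalent("http://www.example/Page", "http://www.example/page"))
print(urn_equivalent("URN:IETF:example", "urn:ietf:example"))
```

Only the host name is folded to lower case in the HTTP comparison; a difference in case elsewhere in the URI makes the comparison fail.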

URI case sensitivity: People SHOULD NOT assume that two URIs that differ only in case can be used interchangeably.

Note: Equivalence of URIs is not the same as equivalence of representations of a resource.

1.4 Characteristics of URI References

Section 4 of RFC 2396 [RFC2396] introduces the term URI Reference to include absolute URIs and two other constructs also used for identification:

Relative URI references. A relative URI reference is a syntactic abbreviation for an absolute URI. An example of a relative URI reference is ../main.html. What a relative URI reference identifies depends on the context where it is used. RFC 2396 defines the algorithm for finding an absolute URI for a given relative URI reference; this algorithm is not scheme-dependent.
Fragment identifiers ("#" and what follows). According to RFC2396, a fragment identifier "is not part of a URI, but is often used in conjunction with a URI."

There are thus four classes of identifiers that comprise URI References:

Absolute URI only, no fragment identifier. This is the class of URI Reference that is input to the dereference process.
Absolute URI or relative URI Reference, no fragment identifier. As an example, XML 1.0 requires that SYSTEM identifiers belong to this class.
Absolute URI only, with optional fragment identifier. In practice, almost all XML namespace names belong to this class. Open: This is the class of identifiers that forms the abstract information space. @@In this case, what URI Reference designates depends on the representation and media type.@@
Unrestricted URI Reference.
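The distinctions drawn in this list (absolute or relative, with or without a fragment identifier) can be checked mechanically. The following Python sketch uses urllib.parse.urldefrag and urlsplit; the function names and the example references are assumptions introduced for illustration.

```python
from urllib.parse import urldefrag, urlsplit


def describe_reference(reference):
    """Split a URI reference into its URI part and fragment identifier.

    Note: an empty fragment ("uri#") is reported as None in this sketch.
    """
    uri_part, fragment = urldefrag(reference)
    return {
        "uri_part": uri_part,
        "is_absolute": bool(urlsplit(uri_part).scheme),
        "fragment": fragment or None,
    }


def is_dereference_input(reference):
    """True for the first class: an absolute URI with no fragment identifier."""
    return "#" not in reference and bool(urlsplit(reference).scheme)


with_fragment = describe_reference("http://www.example.com/doc.html#section2")
relative = describe_reference("../main.html")
print(with_fragment)
print(relative)
print(is_dereference_input("http://www.example.com/doc.html"))
```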

Open:

Use of URI Reference: Authors of specifications MUST use the terms "URI" and "URI Reference" according to the definitions in RFC2396.

Open: What should we call things in list items 1-3 above?

1.5 Fragment identifiers

In some URI schemes, URIs may end with a fragment identifier (to form a URI reference). Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation. For instance, if the representation is an HTML document, the fragment identifier designates a hypertext anchor. In the case of a graphics format, a URI reference might designate a circle or spline. In the case of RDF, a URI reference can designate anything, be it abstract (e.g., a dream) or concrete (e.g., my car). The plain text media type does not define semantics for fragment identifiers.

1.5.1 Design weakness: HTTP content negotiation and fragment identifiers

Coneg Fragment: Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.
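To make the weakness concrete, here is a small Python sketch of a check an author might run before offering several media types for one URI. The table of fragment semantics is purely illustrative: its descriptions paraphrase the prose of this section, the entry for application/xhtml+xml is an assumption made for the sake of the example, and none of it is taken from the media type registrations themselves.

```python
# Illustrative table only (assumption): how a fragment identifier is
# interpreted for a few media types. None means "no semantics defined".
FRAGMENT_SEMANTICS = {
    "text/html": "hypertext anchor",
    "application/xhtml+xml": "hypertext anchor",  # assumed for this sketch
    "image/svg+xml": "graphical object",
    "text/plain": None,
}


def fragment_meaning(media_type):
    """Look up the assumed fragment semantics for a media type."""
    return FRAGMENT_SEMANTICS.get(media_type)


def safe_to_negotiate(media_types):
    """True when all media types share one defined fragment semantics."""
    meanings = {fragment_meaning(m) for m in media_types}
    return len(meanings) == 1 and None not in meanings


print(safe_to_negotiate(["text/html", "application/xhtml+xml"]))
print(safe_to_negotiate(["text/html", "image/svg+xml"]))
print(safe_to_negotiate(["text/html", "text/plain"]))
```

When the check fails, a URI reference such as doc#intro could designate different things, or nothing at all, depending on which representation the server happens to return.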

Open: New access protocols should provide a means to convert fragment identifiers according to media type.

Frequently asked questions about URIs and resources

Can a URI designate a representation of a resource?

Can a URI reference designate a resource?

Can I use an unregistered URI scheme on the public Internet?

No. While "myscheme:blort" is a URI that satisfies the syntactic constraints of [RFC2396], if "myscheme" is not registered, you don't have license to use that URI in any Internet protocols; there aren't any valid uses of it. You can't expect anybody to know what you mean by it, and you aren't guaranteed that somebody else isn't already using it for something else.

Chapter 2: Formats

2.1 Scope

What is a format, and how does it relate to the concept of a document? Do all documents have a format? Is a document a collection of resources of different formats organised into a whole? Is a document the same as a resource? The same as a message body? As a non-multipart message body? What is the distinction between documents and data, if any? Does 'document' imply human readable and if so, does it imply presentation? Does it imply a hierarchically structured, report-like document with headings and subheadings? Is a catalog a document? Is a rave flyer a document?

Negotiation (stuff above might go here also) by network request, by listed alternatives in content any preference? Resource variants, foo.css and foo.html unlikely to be equivalent.

2.2 Model View Controller

Separation allows more easily composable specifications, allows multimodal access, clarifies the concept of multiple, synchronous views of a document, and enhances accessibility.

The model - document formats

Composability (ns-meaning). Use of XML for tree structured content. Linking in general v. idref in one document. Human readable v. machine data. Served or not (hidden behind server - semantic firewall, accessibility). Linking into parts of the model, transclusion of parts. Compound documents, components from multiple servers - scalability, deep linking. Processing models, error handling.

The view - presentation

Presentation by decoration, and by derivation (creation of html/svg/etc as presentation). Linking between view and model. Inheritance of properties across namespaces. Consistency of property names. Subsets. 'Applies to' as opposed to 'set on'. Specificity of properties as attributes, chaining styling, restyling. Timelines, linking to portions of a timeline.

The controller - animation, scripting, events, client/server interaction

Declarative vs script based - accessibility, power; formalisation of common functionality (loop animation, rollovers) in declarative form. DOM - making additional methods, add to rather than replacing XML DOM. Effect of script/programming language limitations on choice of element and attribute names. Linking to active components - XForms example with model and abstract form control, can be extended to presentational instantiation of form control.

@@Ideas:@@

For new format specifications, use XML family of specifications unless there's a good reason not to. Open: Which XML specifications? Open: which particular family members?
Format designers should use URIs without constraining content providers to particular URI schemes. Open: what does "use" mean? IDREF vs linking - web-wide rather than document-wide references.
Namespaces. Issues namespaceDocument-8, mixedNamespaceMeaning-13
Qnames: Issues rdfmsQnameUriMapping-6, qnameAsId-18 and finding "Using QNames as Identifiers in Content"
Formatting properties: Issue formattingProperties-19
Error handling: Issue errorHandling-20
MIME type registration: RFC3023Charset-21, finding Internet Media Type registration, consistency of use. Also, make sure to define fragment identifier semantics.
Effect of Mobile on architecture - size, complexity, memory constraints. Binary infosets, storage efficiency. Composable subsets.

Chapter 3: Protocols

As mentioned in the introduction, the Web is designed to create the large-scale effect of a shared information space that scales well and behaves predictably. The architectural style known as Representational State Transfer [REST] encapsulates this notion of a shared information space. According to Fielding:

REST provides a set of architectural constraints that, when applied as a whole, emphasizes scalability of component interactions, generality of interfaces, independent deployment of components, and intermediary components to reduce interaction latency, enforce security, and encapsulate legacy systems.
-- Roy Fielding, Section 5.5 of [REST]

HTTP has been specially designed for REST interactions. HTTP has a variety of methods designed to manipulate resource state through representation transfer between agents. These methods include GET (covered in section 1.2), POST, PUT, and DELETE.

This chapter uses the REST model to explain how Web protocols take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.

Relevant issues, findings:

Consistency of media types and message contents (from "TAG Finding: Internet Media Type registration, consistency of use").
Consistency of communicating character encoding (same source).
HTTP as a substrate protocol [TAG issue HTTPSubstrate-16]

Appendix 1: Tips on URIs

A1.1 Spelling of URIs

Do not make assumptions about a resource based on the spelling of a URI that refers to it (other than what is defined in specifications for the URI scheme). Since URIs are opaque, it is an error to assume, for example, that a URI that happens to end with the string ".html" refers to a resource that has an HTML representation. Though people must not infer anything about the nature of a resource representation from a URI ending in ".html", resource owners must not create confusion by purposely misassigning suffixes and representation types.
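A hedged illustration of this point: an agent that wants to know the format of a representation can read the media type declared with the representation instead of guessing from the URI's spelling. In the Python sketch below, the function name, the URI, and the header values are all hypothetical.

```python
def media_type_of(response_headers):
    """Return the media type declared in a Content-Type header, or None.

    response_headers is a plain dict of header names to values
    (a simplification assumed for this sketch).
    """
    declared = response_headers.get("Content-Type")
    if declared is None:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    return declared.split(";")[0].strip().lower()


# Hypothetical example: the URI ends in ".html", but the representation
# actually served is declared to be a PNG image.
uri = "http://www.example.com/report.html"
headers = {"Content-Type": "image/png"}

declared_type = media_type_of(headers)
print(uri, "->", declared_type)
```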

At times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case, good social behavior requires that the URI be easy to use. But in general, just as "children should be seen but not heard", URIs should be used but not seen. In general, URIs should be hidden from view since they are ugly to look at and they tend to lure us into thinking they hold definitive meaning about a resource.

Open: Canonical form of URIs. See issue URIEquivalence-15.

A1.2 Unique URIs

Authors should not use a URI to identify more than one resource.

Nothing prevents us from considering "a representation of the novel Moby Dick" to be a resource itself (and thus to have an assigned URI). Authors should not use the same URI to refer to the resource "Moby Dick" and to the particular representation of that resource. Similarly, authors should not use the same URI to refer to a person and to that person's mailbox.

Glossary

Agent: A client, server, or other program that exchanges information on the Web.
Dereference (a URI): To request a representation of the resource designated by the URI.
Link
Media Type/Content Type/MIME Type
Resource: Anything with identity. A resource on the Web is one that has an assigned URI.
Uniform Resource Identifier (URI)
URI Reference
URI Scheme
Absolute/Relative URI
Persistence

References

Normative References

RFC2396: IETF "RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax", T. Berners-Lee, R. Fielding, L. Masinter, August 1998. Available at http://www.ietf.org/rfc/rfc2396.
RFC2616: IETF "RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1", J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, June 1999. Available at http://www.ietf.org/rfc/rfc2616.
RFC2046: IETF "RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", N. Freed, N. Borenstein, November 1996. Available at http://www.ietf.org/rfc/rfc2046.
RFC2717: IETF "Registration Procedures for URL Scheme Names", R. Petke, I. King, November 1999. Available at http://www.ietf.org/rfc/rfc2717.
IANASchemes: IANA's online registry of URI Schemes is available at http://www.iana.org/assignments/uri-schemes. Dan Connolly's list of URI schemes is a useful resource for finding out which references define various URI schemes.

Non-Normative References

Axioms: "Universal Resource Identifiers - Axioms of Web Architecture", T. Berners-Lee, living document dated December 1996. Available at http://www.w3.org/DesignIssues/Axioms
Cool: "Cool URIs don't change", T. Berners-Lee, W3C, 1998. Available at http://www.w3.org/Provider/Style/URI
Fielding: "Principled Design of the Modern Web Architecture", R.T. Fielding and R.N. Taylor, UC Irvine, Available in PDF at http://www.cs.virginia.edu/~cs650/assignments/papers/p407-fielding.pdf
Fragments: "Fragment Identifiers on URIs", T. Berners-Lee, living document dated April 1997. Available at http://www.w3.org/DesignIssues/Fragment
HTML40: "HTML 4.01 Specification", D. Raggett, A. Le Hors, I. Jacobs, 24 December 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-html401-19991224/.
P3P10: "The Platform for Privacy Preferences 1.0 (P3P1.0) Specification", M. Marchiori, ed., 16 April 2002. This W3C Recommendation is available at http://www.w3.org/TR/2002/REC-P3P-20020416/.
REST: "Representational State Transfer (REST)", Chapter 5 of "Architectural Styles and the Design of Network-based Software Architectures", Doctoral Thesis of R. T. Fielding, 2000.
RFC2718: "Guidelines for new URL Schemes", L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, November 1999. Available at: http://www.ietf.org/rfc/rfc2718.txt.
RFC3236: IETF "RFC 3236: The 'application/xhtml+xml' Media Type", M. Baker, P. Stark, January 2002. Available at: http://www.rfc-editor.org/rfc/rfc3236.
RFC2141: IETF "RFC 2141: URN Syntax", R. Moats, May 1997. Available at http://www.ietf.org/rfc/rfc2141.txt.
UniqueDNS: "IAB Technical Comment on the Unique DNS Root", B. Carpenter, 27 Sep 1999.
XHTML1: "XHTML 1.0: The Extensible HyperText Markup Language: A Reformulation of HTML 4 in XML 1.0", S. Pemberton et al., 26 January 2000. The latest version of this W3C Recommendation is available at http://www.w3.org/TR/xhtml1/.
XML10: "Extensible Markup Language (XML) 1.0 (Second Edition)", T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, 6 October 2000. This W3C Recommendation is available at http://www.w3.org/TR/2000/REC-xml-20001006.
XMLNS: "Namespaces in XML", T. Bray, D. Hollander, A. Layman, 14 Jan 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-xml-names-19990114/.
W3CPROCESS: "W3C Process Document", 19 July 2001 Version.

Ian Jacobs
Last modified $Date: 2002/08/13 14:25:17 $ by $Author: ijacobs $