Copyright © 2002 W3C ® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
The World Wide Web is a networked information system. Web Architecture is the set of rules that all agents in the system follow that result in the large-scale effect of a shared information space. Identification, data formats, and protocols are the main technical components of Web Architecture, but the large-scale effect depends on social behavior as well.
This document strives to establish a reference set of rules for Web architecture. Some of these rules may conflict with current practice, and so education and outreach will be required to improve on that practice. Other rules may fill in gaps in published specifications or may highlight known weaknesses in those specifications.
This document has been superseded. See next version.
This document has been developed for discussion by the W3C Technical Architecture Group.
This draft is highly unstable. This draft represents substantial input from TAG participants, but does not yet represent consensus. It is a draft with no official standing. Once this document has undergone substantial revision, the TAG expects to develop it on the W3C Recommendation track.
Please send comments on this document to the public W3C TAG mailing list www-tag@w3.org (archive).
Publication of this document by W3C indicates no endorsement by W3C.
The World Wide Web ("Web" from here on) is a networked information system consisting of agents (clients, servers, and other programs) that exchange information. Open: Web Architecture is the set of rules that all agents in the system follow that result in the large-scale effect of a shared information space that scales well and behaves predictably.
This architecture consists of:
This document focuses on architectural principles specific to or fundamental to the Web. It does not address general principles of design, which are also important to the success of the Web. Indeed, behind many of the principles of Web Architecture lie these and other principles: minimal constraint (fewer rules makes the system more flexible), modularity, minimum redundancy, extensibility, simplicity, robustness, etc.
This document does not address architectural design goals covered by targeted W3C specifications:
The Web is a universe of resources. According to [RFC2396] a resource is "anything that has identity." A resource is part of the Web when it is identified by a URI.
Use URIs: All important resources SHOULD be identified by a URI.
Each valid use of a URI unambiguously identifies one resource.
Valid use of URI: If you are using a registered URI scheme and following all the other relevant protocol specifications, it is unambiguous what resource you are referring to. This goes for all URI references, not just URIs.
Some resources do not have URIs (and are not part of the Web). For instance, if we consider every real number a resource, we clearly cannot give every real number a URI without collisions, since there are only denumerably many URIs.
The following statements are useful generalities about some URIs. Some of these generalities do not hold for some URI schemes.
Each valid use of a URI identifies one resource, but the resource itself may be inherently context-sensitive. For instance, http://www.example.com/ identifies the same resource in any context. On the other hand, http://localhost/ and file:/etc/hosts each identify one resource, but that resource is "local" to a particular computer. It is valid to use a URI such as file:/etc/hosts on a given computer, and even on several computers, if you are confident that all of those computers are running the same type of operating system.
Context-insensitive URI: A URI SHOULD denote the same resource or concept independent of the context(s) in which the URI is used.
The same principle applies to URI references, which may be context-sensitive as well. For instance, the URI reference ../myFile is likely to be ambiguous and refer to two different resources (after resolution to an absolute URI given a base URI), depending on the context in which it is used.
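The context sensitivity of a relative URI reference can be illustrated with a minimal sketch. The two base URIs below are hypothetical, and Python's urllib.parse.urljoin is used as a stand-in for the RFC 2396 resolution algorithm (it follows the later URI syntax rules, which give the same result for this simple case).

```python
from urllib.parse import urljoin

# The same relative URI reference, used in two different contexts.
reference = "../myFile"

# Hypothetical base URIs, chosen only for illustration.
base_a = "http://www.example.com/a/b/index.html"
base_b = "http://www.example.com/x/y/index.html"

# Resolution against each base yields a different absolute URI,
# and therefore (potentially) a different resource.
resolved_a = urljoin(base_a, reference)
resolved_b = urljoin(base_b, reference)

print(resolved_a)  # http://www.example.com/a/myFile
print(resolved_b)  # http://www.example.com/x/myFile
```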
Note: Within the Resource Description Framework (RDF), URI references are used to identify resources.
A resource is an abstraction for which there is a conceptual mapping to a (possibly empty) set of representations. People and Web agents do not interact with a resource directly, but rather with representations of that resource. Interaction with a resource is governed by recursive application of a finite set of specifications, beginning with the specification that governs the scheme of the URI. One of the most important ways to interact with a resource is to request a representation of it. This is done by dereferencing the URI that identifies the resource.
For instance, suppose the URI http://weather.yahoo.com/forecast/MXOA0069.html identifies a resource that is "the weather forecast for Oaxaca, Mexico". Dereferencing the URI will result in a representation that may be encoded in any number of formats, including HTML, XHTML, SVG, etc.; see chapter 2 for more information about formats.
The representations of a resource may vary as a function of factors including time, place, and the identity of the agent accessing the resource. For example, for the resource "the weather forecast for Oaxaca, Mexico," the representations depend on (at least) time, the expressed preference of the user for Fahrenheit or Celsius, and the identity of the user-agent software receiving the representation.
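The relationship between one resource and several representations might be sketched as follows. This is a toy model, not a description of any real server: the table of representations, its placeholder contents, and the function select_representation are all assumptions made for illustration, and the agent's preferences are modeled as a simple ordered list of media types.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical representations of one resource, keyed by media type.
# The bodies are placeholders, not real forecast data.
FORECAST_REPRESENTATIONS: Dict[str, str] = {
    "text/html": "<p>forecast as HTML</p>",
    "image/svg+xml": "<svg xmlns='http://www.w3.org/2000/svg'/>",
}


def select_representation(
    available: Dict[str, str], preferred: Sequence[str]
) -> Optional[Tuple[str, str]]:
    """Return the first available (media type, body) pair the agent prefers.

    Returns None when no representation matches the stated preferences.
    """
    for media_type in preferred:
        if media_type in available:
            return media_type, available[media_type]
    return None


# An agent that prefers XHTML but also accepts SVG receives the SVG form.
chosen = select_representation(
    FORECAST_REPRESENTATIONS, ["application/xhtml+xml", "image/svg+xml"]
)
print(chosen[0] if chosen else "no acceptable representation")
```

The resource (and its URI) is the same in every case; only the representation handed to the agent differs.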
Consistent Representations: There is a strong expectation of consistency between the representations of a resource; to the extent possible, representations SHOULD be equivalent.
Note the difference between changes in representations of a resource and changes in the binding between a URI and a resource. Today, the URI http://www.w3.org/ designates the resource "the W3C home page." A representation you get today by dereferencing that URI is likely to differ from one you get tomorrow, since W3C updates its home page frequently with news items. These changes in representation are predictable, and the resource remains "the W3C home page".
On the other hand, if tomorrow, the same URI designated a different resource (for example, because the domain was sold and the new owner decided to assert a different URI-Resource relationship), the URI would lose value. This type of indiscriminate use of URIs undermines their value and interferes with people who relied on them (e.g., historians, court archives, new archive services, and anybody frustrated by a broken link).
A URI is a string of characters starting with a URI scheme. Some examples include:
http://www.w3.org/
ftp://ftp.w3.org/
irc://irc.openprojects.net/rdfig
urn:oasis:names:tc:SAML:1.0:assertion
tel:+1-913-555-1212
See [RFC2396] for more information about URI syntax. URIs do not include a "#", per RFC 2396. Only URI references do.
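The distinction between a URI and a URI reference that carries a "#" can be shown with a short sketch. The URI reference below is hypothetical; urllib.parse.urldefrag from the Python standard library is used here only to separate the two parts.

```python
from urllib.parse import urldefrag

# A hypothetical URI reference: a URI followed by "#" and a fragment identifier.
uri_reference = "http://www.example.com/doc#section2"

# Split the URI reference into the URI proper and the fragment identifier.
uri, fragment = urldefrag(uri_reference)

print(uri)       # http://www.example.com/doc
print(fragment)  # section2
```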
There are strong social expectations that once a URI identifies a particular resource, it should continue indefinitely to refer to that resource. Persistence is always a matter of policy and commitment on the part of authorities assigning URIs rather than a constraint imposed by technological means.
Persistent URIs: Those who create and manage resources and their identifiers SHOULD design the identifiers in such a way as to ensure their persistence.
For example, each W3C technical report (e.g., "the SVG specification") is in fact a series of documents that represent the maturation of the technical report (Working Drafts, Candidate Recommendations, Proposed Recommendations, and a Recommendation). W3C assigns a URI to the "latest version" in the specification series (e.g., http://www.w3.org/TR/SVG). W3C also assigns a URI for each specification in the series (called the "this version URI", as in http://www.w3.org/TR/2001/PR-SVG-20010719/). W3C policy is that representations of the "latest version" resource will change over time (with each new publication of an SVG specification). W3C policy is also that representations of a specification designated by a "this version" URI will not change over time (to the best of W3C's ability to maintain its archives intact).
A primary characteristic of a URI is its scheme, which is given by a colon-delimited prefix. For example, the scheme of the URI http://www.example.com/ is "http", and for ftp://ftp.example.com/ it is "ftp". It is common to classify URIs by scheme, calling the two preceding examples respectively an "HTTP URI" and an "FTP URI".
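As a minimal sketch, the scheme of a URI can be read off with the Python standard library. The helper name scheme_of is an assumption introduced for this example.

```python
from urllib.parse import urlsplit


def scheme_of(uri: str) -> str:
    """Return the scheme: the prefix before the first colon, lowercased."""
    return urlsplit(uri).scheme


for uri in ("http://www.example.com/", "ftp://ftp.example.com/"):
    print(uri, "->", scheme_of(uri))
```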
Many of the properties of URIs are scheme-dependent.
New Schemes Expensive: Since correct processing of URIs is often scheme-dependent, and since a huge range of software is expected to be able to process URIs, the cost of introduction of new URI schemes is very high. The introduction of new URI schemes SHOULD be avoided.
The IANA registry [IANASchemes] lists URI schemes and the specifications that define them. For instance, the HTTP URI scheme is defined in section 3.2.2 of the HTTP specification [RFC2616]. Refer to RFC2717 for information about registering a new URI scheme.
Some URI schemes are used for identifying specific classes of resources. For example, TELNET URIs represent telnet services and MAILTO URIs electronic mailboxes.
(Open:issue httpRange-14 : What is the range of the HTTP dereference function?).
The procedure for dereferencing a URI may vary from scheme to scheme. For example, HTTP URIs are dereferenceable using the protocol of the same name, and the scheme is actually defined in section 3.2.2 of the HTTP specification [RFC2616].
On the other hand, the URN scheme [RFC 2141] does not guarantee that a dereference procedure is defined for any given URN.
@@Does it work here to substitute "dereference" for GET?@@
Allow Dereference: Agents SHOULD be able to dereference URIs for important resources.
Describe Resources: Dereferencing a URI for an important abstract concept (for example, Internet protocol parameters) SHOULD return human and/or machine readable representations that describe the nature and purpose of those resources.
Dereference is Safe: Dereferencing URIs is safe; i.e., agents do not incur obligations by following links. [TAG finding "URIs, Addressability, and the use of HTTP GET"]
Open: "Since HTTP GET is defined and widely deployed, agents SHOULD use HTTP URIs."
Open: Say something here à la what Tim Bray said: "Don't build a world of resources that cannot be identified by URI."?
The deployment and use of different URI schemes may require varying degrees of central co-ordination and administration. For example, HTTP URIs depend (in practice at least) on the use of the DNS infrastructure. Also, there is a central registry of URN subclasses.
Certain URI schemes provide syntactic rules for determining equivalence in URIs, and these rules vary from scheme to scheme.
For example, URNs begin with two colon-delimited fields, the first of which must be urn and the second of which identifies the subclass of URN, for example urn:ietf:example. In URNs, these two fields are to be compared in a case-insensitive fashion. The remainder of the URN following the second colon is subject to rules dependent on the content of the second field (following the first colon); thus the equivalence rules may vary from one subclass of URN to another.
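The comparison rule for the two leading fields of a URN can be sketched as follows. The function name urn_comparison_key is hypothetical, and the sketch deliberately leaves the remainder untouched, since its equivalence rules depend on the subclass and are not modeled here.

```python
from typing import Tuple


def urn_comparison_key(urn: str) -> Tuple[str, str, str]:
    """Split a URN into (scheme, subclass, remainder) for comparison.

    The first two fields are case-folded because they compare
    case-insensitively. The remainder is returned unchanged: how to
    compare it depends on the subclass, which this sketch does not handle.
    Raises ValueError if the string does not have the three expected parts.
    """
    scheme, subclass, remainder = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError("not a URN: " + urn)
    return ("urn", subclass.lower(), remainder)


a = urn_comparison_key("URN:OASIS:names:tc:SAML:1.0:assertion")
b = urn_comparison_key("urn:oasis:names:tc:SAML:1.0:assertion")
print(a[:2] == b[:2])  # True: the two leading fields match regardless of case
```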
Section 3.2.3 of the HTTP specification [RFC2616] states that, when comparing two HTTP URIs, the host name part must be considered case-insensitive, so http://WWW.EXAMPLE/ and http://www.example/ identify the same resource.
URI case sensitivity: People SHOULD NOT assume that two URIs that differ only in case can be used interchangeably.
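A rough sketch of the host-name rule for HTTP URIs follows. The function same_http_uri is an assumption made for illustration; it folds case only in the scheme and host name and ignores the other comparison rules of the HTTP specification (such as default ports and empty paths).

```python
from urllib.parse import urlsplit


def same_http_uri(a: str, b: str) -> bool:
    """Compare two HTTP URIs, treating only scheme and host as case-insensitive.

    urlsplit lowercases the scheme, and its hostname attribute is the
    lowercased host; path and query are compared exactly as written.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.hostname, pa.port, pa.path, pa.query) == (
        pb.scheme, pb.hostname, pb.port, pb.path, pb.query
    )


# Host names differ only in case: treated as equivalent.
print(same_http_uri("http://WWW.EXAMPLE/", "http://www.example/"))

# Paths differ in case: not assumed to be interchangeable.
print(same_http_uri("http://www.example/Index", "http://www.example/index"))
```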
Note: Equivalence of URIs is not the same as equivalence of representations of a resource.
Section 4 of RFC 2396 [RFC2396] introduces the term URI Reference to include absolute URIs and two other constructs also used for identification:
../main.html. What a relative URI reference identifies depends on the context where it is used. RFC 2396 defines the algorithm for finding an absolute URI for a given relative URI reference; this algorithm is not scheme-dependent.
There are thus four classes of identifiers that comprise URI References:
SYSTEM identifiers belong to this class.
Open:
Use of URI Reference: Authors of specifications MUST use the terms "URI" and "URI Reference" according to the definitions in RFC2396.
Open: What should we call things in list items 1-3 above?
In some URI schemes, URIs may end with a fragment identifier (to form a URI reference). Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation. For instance, if the representation is an HTML document, the fragment identifier designates a hypertext anchor. In the case of a graphics format, a URI reference might designate a circle or spline. In the case of RDF, a URI reference can designate anything, be it abstract (e.g., a dream) or concrete (e.g., my car). The plain text media type does not define semantics for fragment identifiers.
Coneg Fragment: Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.
Open: New access protocols should provide a means to convert fragment identifiers according to media type.
No. While "myscheme:blort" is a URI that satisfies the syntactic constraints of [RFC2396], if "myscheme" is not registered, you don't have license to use that URI in any Internet protocols; there aren't any valid uses of it. You can't expect anybody to know what you mean by it, and you aren't guaranteed that somebody else isn't already using it for something else.
What is a format, and how does it relate to the concept of a document? Do all documents have a format? Is a document a collection of resources of different formats organised into a whole? Is a document the same as a resource? The same as a message body? As a non-multipart message body? What is the distinction between documents and data, if any? Does 'document' imply human readable and if so, does it imply presentation? Does it imply a hierarchically structured, report-like document with headings and subheadings? Is a catalog a document? Is a rave flyer a document?
Negotiation (stuff above might go here also) by network request, by listed alternatives in content any preference? Resource variants, foo.css and foo.html unlikely to be equivalent.
Separation allows more easily composable specifications, allows multimodal access, clarifies the concept of multiple, synchronous views of a document, and enhances accessibility.
Composability (ns-meaning). Use of XML for tree structured content. Linking in general v. idref in one document. Human readable v. machine data. Served or not (hidden behind server - semantic firewall, accessibility. Linking into parts of the model, transclusion of parts. Compound documents, components from multiple servers - scalability, deep linking. Processing models, error handling.
Presentation by decoration, and by derivation (creation of html/svg/etc as presentation). Linking between view and model. Inheritance of properties across namespaces. Consistency of property names. Subsets. 'Applies to' as opposed to 'set on'. Specificity of properties as attributes, chaining styling, restyling. Timelines, linking to portions of a timeline.
Declarative vs script based - accessibility, power; formalisation of common functionality (loop animation, rollovers) in declarative form. DOM - making additional methods, add to rather than replacing XML DOM. Effect of script/programming language limitations on choice of element and attribute names. Linking to active components - XForms example with model and abstract form control, can be extended to presentational instantiation of form control.
@@Ideas:@@
As mentioned in the introduction, the Web is designed to create the large-scale effect of a shared information space that scales well and behaves predictably. The architectural style known as Representational State Transfer [REST] encapsulates this notion of a shared information space. According to Fielding:
REST provides a set of architectural constraints that, when applied as a whole, emphasizes scalability of component interactions, generality of interfaces, independent deployment of components, and intermediary components to reduce interaction latency, enforce security, and encapsulate legacy systems.
-- Roy Fielding, Section 5.5 of [REST]
HTTP has been specially designed for REST interactions. HTTP has a variety of methods designed to manipulate resource state through representation transfer between agents. These methods include GET (covered in section 1.2), POST, PUT, and DELETE.
This chapter uses the REST model to explain how Web protocols take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.
Relevant issues, findings:
Do not make assumptions about a resource based on the spelling of a URI that refers to it (other than what is defined in specifications for the URI scheme). Since URIs are opaque, it is an error to assume, for example, that a URI that happens to end with the string ".html" refers to a resource that has an HTML representation. Though people must not infer anything about the nature of a resource representation from a URI ending in ".html", resource owners must not create confusion by purposely misassigning suffixes and representation types.
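One way to respect URI opacity is to take the media type from the metadata that accompanies a representation rather than from the spelling of the URI. The sketch below assumes the response headers are already available as a simple mapping; the function name media_type_from_headers and the example URI and header values are hypothetical.

```python
from email.message import Message
from typing import Mapping, Optional


def media_type_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the media type declared in a Content-Type header, if any.

    The URI is deliberately not consulted. Returns None when the header
    is absent instead of guessing a type.
    """
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    if "Content-Type" not in msg:
        return None
    # Parameters such as charset are dropped; only "type/subtype" remains.
    return msg.get_content_type()


# A URI that happens to end in ".html" may still yield a non-HTML representation.
uri = "http://www.example.com/report.html"
headers = {"Content-Type": "image/svg+xml"}
print(uri, "->", media_type_from_headers(headers))
```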
At times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case, good social behavior requires that the URI be easy to use. But in general, just as "children should be seen but not heard", URIs should be used but not seen. In general, URIs should be hidden from view since they are ugly to look at and they tend to lure us into thinking they hold definitive meaning about a resource.
Open: Canonical form of URIs. See issue URIEquivalence-15.
Authors should not use a URI to identify more than one resource.
Nothing prevents us from considering "a representation of the novel Moby Dick" to be a resource itself (and thus to have an assigned URI). Authors should not use the same URI to refer to the resource "Moby Dick" and to the particular representation of that resource. Similarly, authors should not use the same URI to refer to a person and to that person's mailbox.