Copyright © 2002 W3C ® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
The World Wide Web is a networked information system. Web Architecture is the set of principles that all agents in the system follow to create the large-scale effect of a shared information space. Identification, data formats, and protocols are the main technical components of Web Architecture, but the large-scale effect depends on social behavior as well.
This document strives to establish a reference set of principles for Web architecture.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
This draft incorporates comments on the first public Working Draft of "Architectural Principles of the World Wide Web." This document has been developed by W3C's Technical Architecture Group (TAG) (charter). A list of changes in this document is available.
This draft remains incomplete; sections 1 and 2 are the most developed, 3 and 4 the least. The TAG has published a number of findings that address specific architecture issues. Parts of those findings may appear in subsequent drafts. Please also consult the list of issues under consideration by the TAG.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than "work in progress."
The latest information regarding patent disclosures related to this document is available on the Web. As of this publication, there are no disclosures.
Please send comments on this document to the public W3C TAG mailing list www-tag@w3.org (archive).
A list of current W3C Recommendations and other technical documents can be found at the W3C Web site.
The World Wide Web (or, Web) is a networked information system consisting of agents (programs acting on behalf of another person, entity, or process) that exchange information.
This architecture consists of:
After this introduction, sections two, three, and four discuss identifiers, formats, and protocols, respectively. Each section highlights principles of Web architecture and notes on good practice. These principles and good practice notes are summarized at the end of the introduction.
The terms MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used in accordance with RFC 2119 [RFC2119].
This draft includes some editorial notes and also references to open TAG issues. These do not represent all open issues in the document. They are expected to disappear from future drafts.
@@Explain here how requirements and principles lead to constraints.@@
The principles in this document are based on experience. There has been some theoretical and modeling work in the area of Web Architecture, notably Roy Fielding's work on "Representational State Transfer" [REST].
The intended audience for this document includes:
The authors have made every effort to keep this document terse, with the expectation that additional documents will elaborate on the principles.
This document focuses on architectural principles specific to or fundamental to the Web. It does not address general principles of design, which are also important to the success of the Web. Indeed, behind many of the principles of Web Architecture lie these and other principles such as minimal constraint (fewer rules make the system more flexible), modularity, minimum redundancy, extensibility, simplicity, and robustness.
Other groups within W3C are addressing architectural design goals in the following areas:
For information about architectural principles of the Internet, refer to [RFC1958].
In the design of the Web, some design decisions, like the names of the <p> and <li> elements in HTML, or the choice of the colon character in URIs, are somewhat arbitrary; if <par>, <elt>, or * had been chosen instead, the large-scale result would, most likely, have been the same. Other design choices are more fundamental; these are the architectural principles of the Web:
Some of these principles may conflict with current practice, and so education and outreach will be required to improve on that practice. Other principles may fill in gaps in published specifications or may call attention to known weaknesses in those specifications.
This document suggests the following good practice:
The Web is a universe of resources. A resource is defined by [RFC2396] to be anything that has identity. Examples include documents, files, menu items, machines, and services, as well as people, organizations, and concepts. Web architecture starts with a uniform syntax for resource identifiers, so that we can refer to resources, access them, describe them, and share them. The Uniform Resource Identifier (URI) syntax employs an extensible set of URI schemes. Several URI schemes incorporate into this syntax some identification mechanisms that pre-date the Web:
mailto:nobody@example.org
ftp://example.org/aDirectory/aFile
news:comp.infosystems.www
tel:+1-816-555-1212
urn:uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882
Other URI schemes have been introduced since the advent of the Web, including those introduced as a consequence of new protocols. Examples of URIs for these schemes include:
http://www.example.org/something?with=arg1;and=arg2
ldap://ldap.itd.umich.edu/c=GB?objectClass?one
urn:oasis:SAML:1.0
One can append a fragment identifier to a URI to yield an identifier for part of, or a view of, a resource2:
ftp://example.org/aDirectory/aDocument#section1
http://www.example.org/aList#item1
http://www.example.org/states#texas
Note that while this composition is syntactically fully general, it is meaningless in some URI schemes. The URI mailto:nobody@example.org#abc is meaningless in practice.
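As a minimal sketch of the composition described above, the following Python splits a URI reference into the URI and its fragment identifier using the standard library's urllib.parse.urldefrag. The helper name split_fragment is an assumption made for illustration. The split is purely syntactic, so it succeeds even for the mailto example, where the fragment has no practical meaning.

```python
from urllib.parse import urldefrag


def split_fragment(uri_reference):
    """Split a URI reference into (uri, fragment) at the first '#'.

    Purely syntactic: it says nothing about whether a fragment is
    meaningful for the URI's scheme.
    """
    uri, fragment = urldefrag(uri_reference)
    return uri, fragment


if __name__ == "__main__":
    for ref in (
        "http://www.example.org/aList#item1",
        "mailto:nobody@example.org#abc",
    ):
        print(split_fragment(ref))
```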
A generic syntax for URIs is defined by [RFC2396]. The current document uses the term "URI" to mean, in RFC2396 terms, an absolute URI reference3 optionally followed by a fragment identifier. The TAG is working actively to convince the IETF to revise RFC2396 so that the definition of "URI" aligns with the current document.
When one resource refers to another via a URI, a link is formed. When many resources are linked this way, the large-scale effect is a shared information space, addressable by URI. The value of the Web increases with the number of resources addressable by URI. In turn, resources are more valuable when they are addressable in the Web. Hence:
Use URIs: All important resources SHOULD be identified by a URI.4
There are many benefits to making resources addressable by URI. Some are by design (e.g., linking and bookmarking), while others have arisen naturally (e.g., global search services). See the TAG finding URIs, Addressability, and the use of HTTP GET for some details about the interaction of this principle in HTTP application design.
The two primary operations on URIs are:
There may be applications (e.g., XML namespace names [XMLNS]) where comparison is expected to be the sole or primary operation on a URI. Certain URI schemes provide rules for determining the syntactic equivalence of URIs, i.e., whether two URIs are different spellings of the same identifier. These rules vary from scheme to scheme.
For example, URNs begin with two colon-delimited fields, the first of which is the string urn and the second of which identifies the subclass of URN, for example urn:ietf:example. In URNs, these two fields are to be compared in a case-insensitive fashion. The remainder of the URN following the second colon is subject to rules dependent on the content of the second field (following the first colon); thus the equivalence rules may vary within URN namespace identifiers.
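The URN comparison rule above can be sketched as follows. The function name urn_equivalent is hypothetical. Because the rules for the remainder depend on the second field, this sketch assumes the most conservative rule (exact, case-sensitive match) for that part; a particular URN namespace may treat more URNs as equivalent than this does.

```python
def urn_equivalent(a, b):
    """Compare two URNs field by field.

    The first two colon-delimited fields are compared case-insensitively.
    The remainder is compared exactly, which is an assumption of this
    sketch: the real rule depends on the namespace in the second field.
    """
    parts_a = a.split(":", 2)
    parts_b = b.split(":", 2)
    for parts in (parts_a, parts_b):
        if len(parts) != 3 or parts[0].lower() != "urn":
            raise ValueError("not a URN: %r" % ":".join(parts))
    same_namespace = parts_a[1].lower() == parts_b[1].lower()
    same_remainder = parts_a[2] == parts_b[2]
    return same_namespace and same_remainder
```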
Section 3.2.3 of the HTTP specification [RFC2616] states that, when comparing two HTTP URIs, the host name part must be considered case-insensitive, so http://WWW.EXAMPLE/ and http://www.example/ identify the same resource.
Good practice note. URI case: It SHOULD NOT be assumed that URIs which differ only in character case can be used interchangeably.
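A minimal sketch of a comparison that respects both points: the scheme and host name are compared case-insensitively, while the path, query, and fragment are compared exactly, so URIs that differ in case elsewhere are not assumed interchangeable. The function name http_uris_match is hypothetical, and the sketch deliberately ignores other equivalences (such as an explicit default port or percent-encoding variants).

```python
from urllib.parse import urlsplit


def http_uris_match(a, b):
    """Compare two HTTP URIs, ignoring case only in scheme and host name."""

    def key(uri):
        parts = urlsplit(uri)
        # .hostname is returned in lower case by urllib.parse.
        return (
            parts.scheme.lower(),
            parts.hostname,
            parts.port,
            parts.path,
            parts.query,
            parts.fragment,
        )

    return key(a) == key(b)
```

Under this sketch, http://WWW.EXAMPLE/ and http://www.example/ match, while http://www.example/Doc and http://www.example/doc do not.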
Note: Equivalence of URIs is not the same as consistent representations of a resource.
Issue: URIEquivalence-15: When are two URI variants considered equivalent? See also issue IRIEverywhere-27 - Should W3C specifications start promoting IRIs?
To dereference a URI is to interact with the resource it identifies. One interacts with a resource by the exchange of representations of resource state; a representation is a data object that represents or describes a resource state. A resource is an abstraction for which there is a conceptual mapping to a (possibly empty) set of representations. Representations, when transferred by a Web protocol, are often accompanied by metadata in the message (for example, HTTP headers). In particular, the value of the media type metadata is key to the correct interpretation of a resource representation, and governs the handling of fragment identifiers.
For instance, suppose the URI http://weather.yahoo.com/forecast/MXOA0069 identifies a resource that is "the weather forecast for Oaxaca, Mexico". A representation retrieved by means of that URI may be encoded in any number of formats, including HTML, XHTML, and SVG; see section 2 for more information about formats.
Interaction with a resource is governed by successive application of a finite set of specifications, beginning with the specification that governs the scheme of the URI. For example, suppose the URI for the weather forecast is used within an a element of an SVG document. The sequence of specifications applied is:
a link involves retrieving a representation of a resource, identified by the XLink href attribute: "By activating these links (by clicking with the mouse, through keyboard input, and voice commands), users may visit these resources." This means that the GET method defined in HTTP/1.1 is used to retrieve the representation of the resource.
It is important for the correct functioning of the Web that the mapping between URIs and resources be unambiguous.
URI ambiguity: Ambiguity in the relationship between URIs and resources is harmful for humans and machines.
The representations of a resource may vary as a function of factors including time, the identity of the agent accessing the resource, data submitted to the resource when interacting with it, and changes external to the resource. For example, for the resource "the weather forecast for Oaxaca, Mexico," the representations depend on (at least) time, the expressed preference of the user for Fahrenheit or Celsius, the identity of the user-agent software receiving the representation, and, presumably, the weather in Oaxaca.
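To illustrate how one resource can yield different representations, here is a small sketch. Everything in it is hypothetical: the function render_forecast, its parameters, and the output layout are invented for illustration and do not describe any real weather service. The same underlying state (a temperature in Celsius) is rendered differently depending on the user's unit preference and the requested media type.

```python
def render_forecast(celsius, units="C", media_type="text/plain"):
    """Return one representation of a hypothetical forecast resource.

    The resource state is the same in every case; only the
    representation changes with the stated preferences.
    """
    if units == "F":
        value, label = celsius * 9 / 5 + 32, "F"
    else:
        value, label = celsius, "C"
    text = f"Oaxaca: {value:.0f} {label}"
    if media_type == "text/html":
        return f"<p>{text}</p>"
    return text
```

For example, render_forecast(25) and render_forecast(25, units="F", media_type="text/html") are two representations of the same resource state.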
Since one interacts with a resource through its representations, ambiguity can manifest itself in a number of ways:
This does not mean that representations of a resource cannot change over time, only that indiscriminate reuse of identifiers undermines their value and interferes with people who relied on them.
There are strong social expectations that once a URI identifies a particular resource, it should continue indefinitely to refer to that resource; this is called the persistence of the URI. Persistence is always a matter of policy and commitment on the part of authorities assigning URIs rather than a constraint imposed by technological means.
For example, each W3C technical report (e.g., "the SVG specification") is in fact a series of documents that mature over time (from Working Drafts, Candidate Recommendations, Proposed Recommendations, to Recommendation). W3C assigns a URI to the "latest version" in the series (e.g., http://www.w3.org/TR/SVG). W3C also assigns a URI for each specification in the series (called the "this version URI", e.g., http://www.w3.org/TR/2001/PR-SVG-20010719/). W3C policy is that representations of the "latest version" resource will change over time (with each new publication of an SVG specification). W3C policy is also that representations of a specification designated by a "this version" identifier will not change over time, to the best of W3C's ability to maintain its archives intact.
HTTP [RFC2616] has been designed to help site managers maintain the relationship between a URI and a resource (e.g., through redirects).
For more discussion about persistence, refer to [Cool].5
Depending on the protocol used, there may be several ways to dereference a URI. One of the most important operations for the Web is to retrieve a representation of a resource (such as with HTTP GET), which means to retrieve a representation of the state of the resource. There are other ways to interact with a resource (such as with HTTP POST). Dereference mechanisms vary by URI scheme. For instance, the URN scheme [RFC2141] does not guarantee that a dereference procedure is defined for any given URN.
Resource descriptions: Owners of important resources (for example, Internet protocol parameters) SHOULD make available representations that describe the nature and purpose of those resources.
Issue: namespaceDocument-8: What should a "namespace document" look like?
Safe retrieval: Agents do not incur obligations by retrieving a representation.
For instance, a user does not incur an obligation by following an HTML link that causes the user agent to retrieve a representation. Tools such as proxies and search engines can retrieve representations without user interaction; it would be harmful to the Web if such operations incurred obligations. See the TAG finding "URIs, Addressability, and the use of HTTP GET" for more information about safe retrieval.
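As a rough sketch of how an automated agent (a crawler or prefetching proxy) might apply this principle, the check below only allows unattended requests that use methods HTTP/1.1 [RFC2616] classes as safe, namely GET and HEAD. The names SAFE_METHODS and may_issue_unattended are assumptions made for illustration.

```python
# Methods that HTTP/1.1 treats as safe, i.e. intended for retrieval only.
SAFE_METHODS = frozenset({"GET", "HEAD"})


def may_issue_unattended(method):
    """Return True if an agent may send this method without user interaction.

    HTTP method names are case-sensitive, so no case folding is done.
    """
    return method in SAFE_METHODS
```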
Issue: deepLinking-25: What to say in defense of principle that deep linking is not an illegal act?
Editor's note: Need to say something about difference between assertions about a resource and assertions about a representation. E.g., do not use the same URI to refer to the resource "Moby Dick" and to the particular representation of that resource, or do not use the same URI to refer to a person and to that person's mailbox. See issue httpRange-14.
One important characteristic of a URI is its scheme (the string that precedes the first colon in a URI). For example the scheme of the URI http://www.example.com/ is "http", and for ftp://ftp.example.com/ it is "ftp". It is common to classify URIs by scheme, calling the two preceding examples respectively an "HTTP URI" and an "FTP URI".
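The definition above translates directly into code. This is a minimal sketch; the function names uri_scheme and classify are hypothetical. The scheme is lower-cased on return because [RFC2396] asks programs to treat upper-case letters in scheme names as equivalent to lower case.

```python
def uri_scheme(uri):
    """Return the scheme: the string that precedes the first colon."""
    scheme, colon, _rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError("no scheme found in %r" % uri)
    return scheme.lower()


def classify(uri):
    """Label a URI by its scheme, e.g. 'HTTP URI' or 'FTP URI'."""
    return uri_scheme(uri).upper() + " URI"
```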
Since many aspects of URI processing are scheme-dependent, and since a huge range of software is expected to be able to process URIs, the cost of introduction of new URI schemes is very high.
New URI schemes: Authors of specifications SHOULD avoid introducing new URI schemes when existing schemes can be used to meet the goals of the specifications.
While "myscheme:blort" is a URI that satisfies the syntactic constraints of [RFC2396], if "myscheme" is not registered, you do not have license to use that URI in any Internet protocols; there are no valid uses of it. You can't expect anybody to know what you mean by it, and you are not guaranteed that somebody else isn't already using it for something else.
Unregistered URI schemes: Unregistered URI schemes MUST NOT be used on the public Internet for general purpose applications.
The IANA registry [IANASchemes] lists registered URI schemes and the specifications that define them. For instance, the IANA registry indicates that the "http" scheme is defined by [RFC2616]. Refer to RFC2717 for information about registering a new URI scheme.
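A tool could enforce this by checking a URI's scheme against the registry before use. The sketch below is only illustrative: KNOWN_REGISTERED is a small hand-copied subset of registered schemes (those used in this document's examples), not the IANA registry itself, and uses_registered_scheme is a hypothetical helper. A real implementation would consult the current registry.

```python
from urllib.parse import urlsplit

# Illustrative subset only; the authoritative list is the IANA registry.
KNOWN_REGISTERED = frozenset(
    {"http", "ftp", "mailto", "news", "tel", "urn", "ldap"}
)


def uses_registered_scheme(uri, registry=KNOWN_REGISTERED):
    """Return True if the URI's scheme appears in the given registry set."""
    return urlsplit(uri).scheme.lower() in registry
```

With this subset, "myscheme:blort" is rejected even though it is syntactically a valid URI.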
The deployment and use of different URI schemes may require varying degrees of central coordination and administration. For example, MAILTO, FTP, and HTTP URIs depend (in practice at least) on the use of the DNS infrastructure. Also, there is a central registry of URN subclasses6.
In some URI schemes it is meaningful for a URI to end with a fragment identifier. The fragment identifier is interpreted only after the retrieval of a representation. Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation.
For instance, if the representation is an HTML document, the fragment identifies a hypertext anchor. In the case of a graphics format, the fragment might identify a circle or spline. In the Resource Description Framework [RDF10], fragments can be used to identify anything, be it abstract (e.g., a dream) or concrete (e.g., an automobile).
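The order of operations described above (retrieve first, then interpret the fragment according to the media type) can be sketched for the HTML case. In this sketch the html_text argument stands in for an already-retrieved representation, so no network access occurs. The names AnchorFinder and locate_fragment are hypothetical, and matching on an id attribute or on the name attribute of an a element is a simplification of what an HTML user agent does.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorFinder(HTMLParser):
    """Record the first element that the fragment identifies."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.found = None  # tag name of the matching element, if any

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attributes = dict(attrs)
        if attributes.get("id") == self.fragment:
            self.found = tag
        elif tag == "a" and attributes.get("name") == self.fragment:
            self.found = tag


def locate_fragment(uri_reference, html_text):
    """Return (uri_to_retrieve, tag_identified_by_fragment_or_None).

    Only the part before '#' is used for retrieval; the fragment is
    applied afterwards to the HTML representation.
    """
    uri, fragment = urldefrag(uri_reference)
    if not fragment:
        return uri, None
    finder = AnchorFinder(fragment)
    finder.feed(html_text)
    return uri, finder.found
```

A representation in another media type, such as a graphics format, would need a different interpreter for the same fragment string.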
Good practice note. Coneg with fragments: Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.
Editor's note: There has been some discussion but no agreement that new access protocols should provide a means to convert fragment identifiers according to media type.
The following generalities about URIs are included to answer some frequently asked questions about URIs. Some of these generalities do not hold for all URI schemes.
http://www.example.com/lj45sr and know that it refers to "my old car" or "the weather forecast for Oaxaca."
Over time, we trust that some URIs will identify familiar resources, but that trust derives from social behavior, not the spelling of the identifier.
Data on the Web manifests itself through resource representations. A resource representation consists of:
A format specification describes the structure of the bit sequence.
Refer to other W3C format guidelines: Charmod, XAG, etc.
What is a format, and how does it relate to the concept of a document? Do all documents have a format? Is a document a collection of resources of different formats organized into a whole? Is a document the same as a resource? The same as a message body? As a non-multipart message body? What is the distinction between documents and data, if any? Does 'document' imply human readable and, if so, does it imply presentation? Does it imply a hierarchically structured, report-like document with headings and subheadings? Is a catalog a document? Is a rave flyer a document?
Negotiation (stuff above might go here also) by network request, by listed alternatives in content any preference? Resource variants, foo.css and foo.html unlikely to be equivalent.
On the interpretation and processing of formats (see namespaceDocument-8 and mixedNamespaceMeaning-13):
@@Incomplete sections on specification design.@@
On using XML:
When designing specifications that address independent functions of a system, avoidable references between the specifications are in general harmful. They are harmful because they impede the independent evolution of the specifications.
For example, it is a strength of XML that XPath cannot query the HTTP header. It is a strength of HTTP that it does not refer to details of the underlying TCP to the extent that it cannot be run over a different transport service. Similarly, the RDF data graph has a significance that is independent of the actual serialization. However, there is a flaw: the embedded XML parseType="Literal" data type.
Sometimes it is necessary (and good for a given application) to break layers. For example, it is good for an HTTP client to be aware of TCP speeds and round trip times to different mirror servers in order to optimize the choice of server. When designing a specification, identify the functionalities that break layers so it is clear when they are being used.
This section attempts to organize some areas of future discussion. Separating the concepts content, presentation, and interaction allows more easily composable specifications. For example, a markup language can be specified independently of a style sheet language. The separation facilitates alternate presentations of the same content, which is seen to have an accessibility advantage and to be more suited to the multiple modalities of Web access.
Issue: contentPresentation-26: Separation of semantic and presentational markup, to the extent possible, is architecturally sound.
Composability (ns-meaning). Use of XML for tree structured content. Linking in general v. idref in one document. Human readable v. machine data. Served or not (hidden behind server - semantic firewall, accessibility). Linking into parts of the content, transclusion of parts. Compound documents, components from multiple servers - scalability, deep linking. Processing models, error handling.
Presentation by decoration (application of CSS to XML as presentation), and by derivation (creation of html/svg/etc as presentation). Linking (bidirectionally) between content and presentations. Inheritance of properties across namespaces. Consistency of property names. Subsets. 'Applies to' as opposed to 'set on'. Specificity of properties as attributes, chaining styling, restyling. Time-lines, linking to portions of a time-line.
Animation, scripting, events, client/server interaction. Declarative v. script based - accessibility, power; formalization of common functionality (loop animation, rollovers) in declarative form. DOM - making additional methods, add to rather than replacing XML DOM. Effect of script/programming language limitations on choice of element and attribute names. Linking to active components - XForms example with model and abstract form control, can be extended to presentational instantiation of form control.
As mentioned in the introduction, the Web is designed to create the large-scale effect of a shared information space that scales well and behaves predictably.
http://example/dir1/dir2/file1, the relative URI reference ../file2 is a shortened form of http://example/dir1/file2 and the relative URI reference #abc is a shortened form for http://example/dir1/dir2/file1#abc. (Note 3 context.)
The authors of this document are the participants of W3C's Technical Architecture Group: Tim Berners-Lee (Chair, W3C), Tim Bray (Antarctica Systems), Dan Connolly (W3C), Paul Cotton (Microsoft), Roy Fielding (Day Software), Chris Lilley (W3C), David Orchard (BEA Systems), Norman Walsh (Sun), and Stuart Williams (Hewlett-Packard).
The TAG thanks people for their thoughtful contributions on the TAG's public mailing list, www-tag (archive).