Copyright © 2002 W3C ® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
The World Wide Web is a networked information system. Web Architecture is the set of principles that all agents in the system follow to create the large-scale effect of a shared information space. Identification, data formats, and protocols are the main technical components of Web Architecture, but the large-scale effect depends on social behavior as well.
This document strives to establish a reference set of principles for Web architecture.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
This is (almost) the first public Working Draft of "Architectural Principles of the World Wide Web." This document has been developed by W3C's Technical Architecture Group (TAG) (charter).
This draft represents substantial input from TAG participants, but does not yet represent consensus. It is also incomplete; sections 1 and 2 are the most developed, 3 and 4 the least. The TAG has published a number of findings that address specific architecture issues. Parts of those findings may appear in subsequent drafts. Please also consult the list of issues under consideration by the TAG.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than "work in progress."
The latest information regarding patent disclosures related to this document is available on the Web. As of this publication, there are no disclosures.
Please send comments on this document to the public W3C TAG mailing list www-tag@w3.org (archive).
A list of current W3C Recommendations and other technical documents can be found at the W3C Web site.
The World Wide Web (or, Web) is a networked information system consisting of agents (programs acting on behalf of another person, entity, or process) that exchange information.
This architecture consists of:
After this introduction, sections two, three, and four discuss identifiers, formats, and protocols, respectively. Each of those sections includes principles of Web architecture. Each principle has a title and is highlighted. The last section of the introduction lists all of the principles.
The terms MUST, SHOULD, MAY, etc. are used in accordance with RFC 2119 [RFC2119].
Some issues and editorial notes are indicated.
The intended audience for this document includes:
The authors have made every effort to keep this document terse, with the expectation that additional documents will elaborate on the principles.
This document focuses on architectural principles specific to or fundamental to the Web. It does not address general principles of design, which are also important to the success of the Web. Indeed, behind many of the principles of Web Architecture lie these and other principles: minimal constraint (fewer rules makes the system more flexible), modularity, minimum redundancy, extensibility, simplicity, robustness, etc.
This document does not address architectural design goals covered by targeted W3C specifications:
Some of these principles may conflict with current practice, and so education and outreach will be required to improve on that practice. Other principles may fill in gaps in published specifications or may call attention to known weaknesses in those specifications.
This document establishes the following principles:
The Web is a universe of resources. Resources are a generalization over documents, files, menu items, machines, and services, as well as people, organizations, concepts, etc. Web architecture starts with a uniform syntax for resource identifiers, so that we can refer to resources, access them, describe them, share them, etc. The Uniform Resource Identifier (URI) syntax employs an extensible set of URI schemes. Several URI schemes incorporate into this syntax some identification mechanisms that pre-date the Web:
mailto:nobody@example.org - The MAILTO scheme is for mailbox names.
ftp://example.org/aDirectory/aFile - The FTP scheme is for FTP file and directory names.
news:comp.infosystems.www - The NEWS scheme is for newsgroup names.
tel:+1-816-555-1212 - The TEL scheme is for telephone numbers.
urn:uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882 - The URN UUID scheme is for Universally Unique Identifiers.
Other URI schemes have been introduced since the advent of the Web, including those introduced as a consequence of new protocols:
http://www.example.org/something?with=arg1;and=arg2
ldap://ldap.itd.umich.edu/c=GB?objectClass?one
urn:oasis:SAML:1.0
One can append a fragment identifier to a URI to yield an identifier for part of, or a view of, a resource:
ftp://example.org/aDirectory/aDocument#section1
http://www.example.org/aList#item1
http://www.example.org/states#texas
Note that while this composition is syntactically fully general, it is meaningless in some URI schemes. The absolute URI reference mailto:nobody@example.org#abc is meaningless in practice.
To summarize, a Uniform Resource Identifier, or URI, is a character sequence starting with a scheme name, followed by a number of scheme-specific fields. An absolute URI reference is a URI followed optionally by a fragment identifier (see [RFC2396] for the complete list of syntactic constraints). URIs and absolute URI references identify Web resources. The principles in this document are expressed in terms of absolute URI references.
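As a rough illustration of this summary, the following sketch splits an absolute URI reference into its scheme, its URI, and its optional fragment identifier. It relies on Python's standard urllib.parse module; the helper name split_reference is hypothetical and is not defined by [RFC2396].

```python
from urllib.parse import urldefrag, urlsplit


def split_reference(reference):
    """Split an absolute URI reference into (scheme, uri, fragment).

    Hypothetical helper for illustration only.
    """
    # urldefrag separates the URI from the optional fragment identifier.
    uri, fragment = urldefrag(reference)
    # The scheme is the string that precedes the first colon.
    scheme = urlsplit(uri).scheme
    return scheme, uri, fragment


print(split_reference("http://www.example.org/aList#item1"))
print(split_reference("mailto:nobody@example.org"))
```

An empty fragment in the result means the reference is a plain URI with no fragment identifier.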
Note: The current URI specification, RFC 2396 [RFC2396], defines a URI reference to be either an absolute URI reference or a relative URI reference. The syntax for a relative URI reference is a shortened form of that for an absolute URI reference, where some prefix of the URI is missing and certain path components ("." and "..") have a special meaning when, and only when, interpreting a relative path. For example, in a document whose base URI is http://example/dir1/dir2/file1, the relative URI reference ../file2 is a shortened form of http://example/dir1/file2 and the relative URI reference #abc is a shortened form for http://example/dir1/dir2/file1#abc.
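The resolution of relative URI references against a base URI can be tried out with a short sketch. It uses urljoin from Python's standard urllib.parse module, which is one implementation of this resolution and not the normative definition; the wrapper name resolve and the constant BASE are assumptions made for the example.

```python
from urllib.parse import urljoin

# Base URI taken from the example in the note.
BASE = "http://example/dir1/dir2/file1"


def resolve(relative, base=BASE):
    """Resolve a relative URI reference against a base URI (illustrative wrapper)."""
    return urljoin(base, relative)


print(resolve("../file2"))  # http://example/dir1/file2
print(resolve("#abc"))      # http://example/dir1/dir2/file1#abc
```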
Editor's note: While people agree that URIs identify resources (per RFC 2396 [RFC2396]), there is not yet consensus that absolute URI references with fragment identifiers may be used to identify resources. Some people contend that an absolute URI reference with a fragment identifier identifies a portion of a representation.
When one resource refers to another via an absolute URI reference, a link is formed. When many resources are linked this way, the large-scale effect is a shared information space, addressable by absolute URI reference. The value of the Web increases with the number of resources addressable by absolute URI reference. In turn, resources are more valuable when they are addressable in the Web. Hence:
Use URIs: All important resources SHOULD be identified by an absolute URI reference.
There are many benefits to making resources addressable by absolute URI reference. Some are by design (e.g., linking and bookmarking), while others are serendipitous (e.g., global search services). See the TAG finding URIs, Addressability, and the use of HTTP GET for some details about the interaction of this principle in HTTP application design.
The two primary operations on absolute URI references are:
There may be applications (e.g., of XML namespace names [XMLNS]) where comparison is expected to be the sole or primary operation on an absolute URI reference. In such cases, it does not matter whether one has chosen a URI or an absolute URI reference to identify a resource.
When one expects to interact with a resource, there are some advantages to identifying that resource with a URI rather than an absolute URI reference: only URIs work with intermediaries in the Web architecture (e.g., proxies) or with redirection (in HTTP, for example).
Note: When an absolute URI reference with a fragment identifier is used to refer to a resource, one may refer to a portion of that resource with a different absolute URI reference. There are no special conventions for choosing such absolute URI references.
Certain URI schemes provide rules for determining the syntactic equivalence of absolute URI references, i.e., whether two absolute URI references are different spellings of the same identifier. These rules vary from scheme to scheme.
For example, URNs begin with two colon-delimited fields, the first of which must be urn and the second of which identifies the subclass of URN, for example urn:ietf:example. In URNs, these two fields are to be compared in a case-insensitive fashion. The remainder of the URN following the second colon is subject to rules dependent on the content of the second field (following the first colon); thus the equivalence rules may vary within subclasses of URNs.
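A minimal sketch of this comparison rule follows. The first two fields are lowercased before comparison; the remainder is compared character for character, which is an assumption made for the example, since the actual rule for the remainder depends on the URN subclass. The function names are hypothetical.

```python
def urn_comparison_key(urn):
    """Return a tuple suitable for comparing two URNs (illustrative only)."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn":
        raise ValueError("not a URN: %r" % urn)
    scheme, subclass, remainder = parts
    # The first two fields are case-insensitive.
    # ASSUMPTION: the remainder is compared exactly; real rules vary by subclass.
    return (scheme.lower(), subclass.lower(), remainder)


def urns_equivalent(a, b):
    """Compare two URNs under the assumptions stated above."""
    return urn_comparison_key(a) == urn_comparison_key(b)


print(urns_equivalent("urn:ietf:example", "URN:IETF:example"))
print(urns_equivalent("urn:ietf:example", "urn:ietf:EXAMPLE"))
```

The second comparison is reported as not equivalent only because of the exact-match assumption for the remainder; a subclass could define it differently.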
Section 3.2.3 of the HTTP specification [RFC2616] states that, when comparing two HTTP URIs, the host name part must be considered case-insensitive, so http://WWW.EXAMPLE/ and http://www.example/ identify the same resource.
Do not rely on URI case insensitivity: People SHOULD NOT assume that two URIs that differ only in case can be used interchangeably.
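The difference between the case-insensitive host name and the rest of an HTTP URI can be sketched as follows. The comparison treats only the scheme and host name as case-insensitive and leaves the path and query untouched; it deliberately ignores other equivalences (such as default ports), so it is a simplification and not a full implementation of the HTTP rules. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def http_comparison_key(uri):
    """Build a comparison key for an HTTP URI (simplified sketch)."""
    parts = urlsplit(uri)
    # urlsplit lowercases the scheme, and .hostname is returned lowercased.
    # Path and query are kept as-is because their case may be significant.
    return (parts.scheme, parts.hostname, parts.port, parts.path, parts.query)


def http_uris_equivalent(a, b):
    """Compare two HTTP URIs using the simplified key above."""
    return http_comparison_key(a) == http_comparison_key(b)


print(http_uris_equivalent("http://WWW.EXAMPLE/", "http://www.example/"))
print(http_uris_equivalent("http://www.example/Doc", "http://www.example/doc"))
```

The second pair differs only in the case of the path, so the sketch does not treat the two URIs as interchangeable.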
Note: Equivalence of URIs is not the same as consistent representations of a resource.
Issue: URIEquivalence-15: When are two URI variants considered equivalent?
To dereference an absolute URI reference is to interact with the resource it identifies. One interacts with a resource by the exchange of representations of the resource. A resource is an abstraction for which there is a conceptual mapping to a (possibly empty) set of representations. The interpretation of any such representation is determined by its media type (and possibly metadata that accompanies the representation when transferred by a Web protocol). See the section on formats for more information about constructing a representation.
For instance, suppose the URI http://weather.yahoo.com/forecast/MXOA0069 identifies a resource that is "the weather forecast for Oaxaca, Mexico". A representation retrieved by means of that URI may be encoded in any number of formats, including HTML, XHTML, SVG, etc.; see section 2 for more information about formats.
Interaction with a resource is governed by successive application of a finite set of specifications, beginning with the specification that governs the scheme of the URI. For example, suppose the absolute URI reference for the weather forecast is used within an a element of an SVG document. The sequence of specifications applied is:
a link involves retrieving a representation of a resource, identified by the XLink href attribute: "By activating these links (by clicking with the mouse, through keyboard input, voice commands, etc.), users may visit these resources." This means that the GET method defined in HTTP/1.1 is used to retrieve the representation of the resource.
Absolute URI references are unambiguous: Each absolute URI reference unambiguously identifies one resource. If you use a registered URI scheme and follow all the other relevant protocol specifications, the identity of that resource is unambiguous.
There may be several ways to interact with a resource. One of the most important operations for the Web is to retrieve a representation of a resource (such as with HTTP GET), which means to retrieve an electronic snapshot of a state of a resource. There are other ways to interact with a resource (such as with HTTP POST).
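As a sketch of representation retrieval, the following code builds (but does not send) an HTTP GET request for the weather forecast URI used earlier in this section. It uses the Request class from Python's standard urllib.request module. The Accept header value and the helper name build_retrieval_request are assumptions made for the example; they are not prescribed by this document.

```python
from urllib.request import Request

# ASSUMPTION: an illustrative list of acceptable media types.
DEFAULT_ACCEPT = "text/html, application/xhtml+xml, image/svg+xml"


def build_retrieval_request(uri, accept=DEFAULT_ACCEPT):
    """Build an HTTP GET request asking for a representation of the resource."""
    return Request(uri, headers={"Accept": accept}, method="GET")


request = build_retrieval_request("http://weather.yahoo.com/forecast/MXOA0069")
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))
```

Passing the request object to urllib.request.urlopen would perform the retrieval; the sketch stops short of that so that it runs without network access.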
Allow dereference: Agents SHOULD be able to dereference absolute URI references for important resources.
Describe resources: Owners of important resources (for example, Internet protocol parameters) SHOULD make available representations that describe the nature and purpose of those resources.
Issue: namespaceDocument-8: What should a "namespace document" look like?
Representation retrieval is safe: Agents do not incur obligations by retrieving a representation (e.g., by following a link).
Note: See the TAG finding "URIs, Addressability, and the use of HTTP GET" for more information about safe retrieval.
Issue: deepLinking-25: What to say in defense of principle that deep linking is not an illegal act?
Editor's note: Need to say something about difference between assertions about a resource and assertions about a representation. E.g., do not use the same URI to refer to the resource "Moby Dick" and to the particular representation of that resource, or do not use the same URI to refer to a person and to that person's mailbox.
Each absolute URI reference unambiguously identifies one resource, but the resource itself may be defined in a context-sensitive manner. For resources of this type, the result of a dereference operation may vary by context. Thus, http://example.org/nearest/gas/ may unambiguously identify "the nearest gas station", but the result of a retrieval operation may vary (e.g., it may change with the geographical position of the retrieving agent). Similarly, http://localhost/ and file:/etc/hosts each identify one resource, but that resource is local to a particular computer, so dereference results will vary.
Context-sensitive absolute URI references can be useful (e.g., when one needs to find a gas station or talk about host names in Unix environments). However, on the public Internet, an identifier such as file:/etc/hosts is a poor choice for the generic resource "host information" because, in many contexts (i.e., most non-Unix operating systems), host information is not maintained in a file named /etc/hosts.
Avoid context-sensitive absolute URI references: An absolute URI reference SHOULD denote the same resource or concept independent of the context(s) in which the identifier is used.
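A tool that checks links before publication might flag the most obvious context-sensitive references. The sketch below is a heuristic only: it flags the file scheme and a small, assumed set of loopback host names, and it cannot detect references such as http://example.org/nearest/gas/ whose context sensitivity lies in the resource definition. The names LOCAL_HOSTS and looks_context_sensitive are hypothetical.

```python
from urllib.parse import urlsplit

# ASSUMPTION: an illustrative set of host names that are local to one machine.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def looks_context_sensitive(reference):
    """Heuristically flag references that are local to a particular computer."""
    parts = urlsplit(reference)
    if parts.scheme == "file":
        return True
    return parts.hostname in LOCAL_HOSTS


for ref in ("file:/etc/hosts", "http://localhost/", "http://www.w3.org/"):
    print(ref, looks_context_sensitive(ref))
```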
The representations of a resource may vary as a function of factors including time, the identity of the agent accessing the resource, data submitted to the resource when interacting with it, and changes external to the resource. For example, for the resource "the weather forecast for Oaxaca, Mexico," the representations depend on (at least) time, the expressed preference of the user for Fahrenheit or Celsius, the identity of the user-agent software receiving the representation, and, presumably, the weather in Oaxaca.
Use consistent representations: There is a strong expectation of consistency between the representations of a resource; to the extent possible, representations SHOULD be equivalent.
Editor's note: Need to clarify what "equivalent" means in the previous sentence.
There is a difference between changes in representations of a resource and changes in the binding between an absolute URI reference and a resource. The absolute URI reference http://www.w3.org/ identifies the resource "the W3C home page." A representation retrieved today for that absolute URI reference is likely to differ from one you get tomorrow, since W3C updates its home page frequently with news items. Though the news changes, the resource remains "the W3C home page".
On the other hand, if tomorrow, the same absolute URI reference identified a different resource (for example, because the domain was sold and the new owner decided to assert a different URI-Resource relationship), the identifier would lose value. This type of indiscriminate reuse of identifiers undermines their value and interferes with people who relied on them.
There are strong social expectations that once an absolute URI reference identifies a particular resource, it should continue indefinitely to refer to that resource; this is called the persistence of the absolute URI reference. Persistence is always a matter of policy and commitment on the part of authorities assigning URIs rather than a constraint imposed by technological means.
Support persistence: Those who create and manage resources and their identifiers SHOULD design the identifiers in such a way as to ensure their persistence.
For example, each W3C technical report (e.g., "the SVG specification") is in fact a series of documents that mature over time (from Working Drafts, Candidate Recommendations, Proposed Recommendations, to Recommendation). W3C assigns an absolute URI reference to the "latest version" in the series (e.g., http://www.w3.org/TR/SVG). W3C also assigns an absolute URI reference for each specification in the series (called the "this version URI", e.g., http://www.w3.org/TR/2001/PR-SVG-20010719/). W3C policy is that representations of the "latest version" resource will change over time (with each new publication of an SVG specification). W3C policy is also that representations of a specification designated by a "this version" identifier will not change over time, to the best of W3C's ability to maintain its archives intact.
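The distinction between the two kinds of identifier can be sketched in code. The pattern below is inferred solely from the single "this version" example given above (year, status code, short name, and publication date); it is an assumption for illustration and not a statement of W3C's actual URI policy. The names THIS_VERSION and parse_this_version are hypothetical.

```python
import re

# ASSUMPTION: pattern inferred from http://www.w3.org/TR/2001/PR-SVG-20010719/
THIS_VERSION = re.compile(
    r"^http://www\.w3\.org/TR/(?P<year>\d{4})/"
    r"(?P<status>[A-Z]+)-(?P<shortname>.+)-(?P<date>\d{8})/?$"
)


def parse_this_version(uri):
    """Return the dated components of a "this version" URI, or None."""
    match = THIS_VERSION.match(uri)
    return match.groupdict() if match else None


print(parse_this_version("http://www.w3.org/TR/2001/PR-SVG-20010719/"))
print(parse_this_version("http://www.w3.org/TR/SVG"))
```

Under this assumed pattern, the "latest version" URI yields None because it carries no date, which mirrors the policy that its representations are expected to change.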
For more discussion about persistence, refer to [Cool].
One important characteristic of a URI is its scheme (the string that precedes the first colon in a URI). For example the scheme of the URI http://www.example.com/ is "http", and for ftp://ftp.example.com/ it is "ftp". It is common to classify URIs by scheme, calling the two preceding examples respectively an "HTTP URI" and an "FTP URI".
Correct processing of URIs is often scheme-dependent, and since a huge range of software is expected to be able to process URIs, the cost of introduction of new URI schemes is very high.
Avoid unnecessary new URI schemes: Authors of specifications SHOULD avoid introducing new URI schemes when existing schemes can be used to meet the same goals.
While "myscheme:blort" is a URI that satisfies the syntactic constraints of [RFC2396], if "myscheme" is not registered, you don't have license to use that URI in any Internet protocols; there aren't any valid uses of it. You can't expect anybody to know what you mean by it, and you aren't guaranteed that somebody else isn't already using it for something else.
Do not use unregistered URI schemes: People MUST NOT use an unregistered URI scheme on the public Internet.
The IANA registry [IANASchemes] lists URI schemes and the specifications that define them. For instance, the HTTP URI scheme is defined in section 3.2.2 of the HTTP specification [RFC2616]. Refer to RFC 2717 for information about registering a new URI scheme.
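A simple check against a list of schemes might look like the following sketch. The set below contains only schemes mentioned in this document and is an illustrative stand-in for the IANA registry, not a copy of it; the names ILLUSTRATIVE_SCHEMES and uses_listed_scheme are hypothetical.

```python
from urllib.parse import urlsplit

# ASSUMPTION: a small illustrative subset, not the actual IANA registry.
ILLUSTRATIVE_SCHEMES = frozenset(
    {"http", "ftp", "mailto", "news", "tel", "urn", "ldap", "telnet", "file"}
)


def uses_listed_scheme(uri, registry=ILLUSTRATIVE_SCHEMES):
    """Report whether the URI's scheme appears in the given set of schemes."""
    # The scheme is the string that precedes the first colon.
    return urlsplit(uri).scheme in registry


print(uses_listed_scheme("http://www.example.com/"))
print(uses_listed_scheme("myscheme:blort"))
```

A production check would consult the current registry rather than a hard-coded set.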
The deployment and use of different URI schemes may require varying degrees of central coordination and administration. For example, MAILTO, FTP, and HTTP URIs depend (in practice at least) on the use of the DNS infrastructure. Also, there is a central registry of URN subclasses.
The following sections discuss some properties of URIs that are scheme-dependent.
Some URI schemes are used to identify specific classes of resources. For example, TELNET URIs identify telnet services and MAILTO URIs electronic mailboxes.
Issue: httpRange-14 : What is the range of HTTP URIs? Two views held within the TAG are that the range is (1) anything or (2) "documents," used in a very broad sense.
The procedure for retrieving a representation may vary from scheme to scheme. For example, HTTP URIs are dereferenceable using the protocol of the same name, and the dereferencing procedure is defined in section 3.2.2 of the HTTP specification [RFC2616].
On the other hand, the URN scheme [RFC2141] does not guarantee that a dereference procedure is defined for any given URN.
Editor's note: There has been discussion but no agreement to the proposed principle "Since HTTP GET is defined and widely deployed, agents SHOULD use HTTP URIs."
In some URI schemes it is meaningful for an absolute URI reference to end with a fragment identifier. The fragment identifier is interpreted only after the retrieval of a representation. Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation.
For instance, if the representation is an HTML document, the fragment identifies a hypertext anchor. In the case of a graphics format, the fragment might identify a circle or spline. In the case of RDF, the fragment can identify anything, be it abstract (e.g., a dream) or concrete (e.g., my car).
Be aware of content negotiation and fragment semantics: Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.
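The two-step handling of fragment identifiers can be sketched as follows: the fragment is first separated from the URI used for retrieval, and is interpreted only once the media type of the retrieved representation is known. The media type names in the table are assumptions chosen to match the HTML, graphics, and RDF examples above; the table and function names are hypothetical.

```python
from urllib.parse import urldefrag

# ASSUMPTION: illustrative media types for the HTML, graphics, and RDF cases.
FRAGMENT_MEANING = {
    "text/html": "hypertext anchor",
    "image/svg+xml": "graphical object such as a circle or spline",
    "application/rdf+xml": "anything, abstract or concrete",
}


def plan_dereference(reference):
    """Separate the URI to retrieve from the fragment to interpret afterwards."""
    uri, fragment = urldefrag(reference)
    return uri, fragment


def describe_fragment(media_type):
    """Describe what a fragment identifies for a given media type, if known."""
    return FRAGMENT_MEANING.get(media_type)


uri, fragment = plan_dereference("http://www.example.org/states#texas")
print(uri, fragment)
print(describe_fragment("text/html"))
print(describe_fragment("application/rdf+xml"))
```

Because the same fragment maps to different meanings under different media types, serving both through content negotiation on one URI would make the reference ambiguous, which is the concern behind the principle.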
Editor's note: There has been some discussion but no agreement that new access protocols should provide a means to convert fragment identifiers according to media type.
The following generalities about absolute URI references are included to answer some frequently asked questions about URIs. Some of these generalities do not hold for all URI schemes.
http://www.example.com/lj45sr and know that it refers to "my old car" or "the weather forecast for Oaxaca."
Over time, we trust that some absolute URI references will identify familiar resources, but that trust derives from social behavior, not the spelling of the identifier.
What is a format, and how does it relate to the concept of a document? Do all documents have a format? Is a document a collection of resources of different formats organized into a whole? Is a document the same as a resource? The same as a message body? As a non-multipart message body? What is the distinction between documents and data, if any? Does 'document' imply human readable and if so, does it imply presentation? Does it imply a hierarchically structured, report-like document with headings and subheadings? Is a catalog a document? Is a rave flyer a document?
Negotiation (stuff above might go here also) by network request, by listed alternatives in content any preference? Resource variants, foo.css and foo.html unlikely to be equivalent.
Separation allows more easily composable specifications, allows multimodal access, clarifies the concept of multiple, synchronous views of a document, and enhances accessibility.
Composability (ns-meaning). Use of XML for tree structured content. Linking in general v. idref in one document. Human readable v. machine data. Served or not (hidden behind server - semantic firewall, accessibility). Linking into parts of the model, transclusion of parts. Compound documents, components from multiple servers - scalability, deep linking. Processing models, error handling.
Presentation by decoration (application of CSS to XML as presentation), and by derivation (creation of html/svg/etc as presentation). Linking between view and model. Inheritance of properties across namespaces. Consistency of property names. Subsets. 'Applies to' as opposed to 'set on'. Specificity of properties as attributes, chaining styling, restyling. Time-lines, linking to portions of a time-line.
Declarative v. script based - accessibility, power; formalization of common functionality (loop animation, rollovers) in declarative form. DOM - making additional methods, add to rather than replacing XML DOM. Effect of script/programming language limitations on choice of element and attribute names. Linking to active components - XForms example with model and abstract form control, can be extended to presentational instantiation of form control.
As mentioned in the introduction, the Web is designed to create the large-scale effect of a shared information space that scales well and behaves predictably. The architectural style known as Representational State Transfer [REST] encapsulates this notion of a shared information space. According to Fielding:
REST provides a set of architectural constraints that, when applied as a whole, emphasizes scalability of component interactions, generality of interfaces, independent deployment of components, and intermediary components to reduce interaction latency, enforce security, and encapsulate legacy systems.
-- Roy Fielding, Section 5.5 of [REST]
HTTP has been specially designed for REST interactions. HTTP has a variety of dereference methods, including GET, POST, PUT, and DELETE.
The following sections use the REST model to explain how Web protocols take into account the properties of resources and URIs, as well as real-world time and space constraints, in order to improve the user's Web experience.
The REST constraints are:
REST focuses on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements. REST ignores the details of component implementation and protocol syntax. REST components communicate by transferring a representation of a resource, selected dynamically based on the capabilities or desires of the recipient and the nature of the resource. Whether the representation is in the same format as the raw source, or is derived from the source, remains hidden behind the interface.
Typical hypertext systems support one of three possible styles of data representation:
The Web provides a hybrid of all three options by focusing on a shared understanding of data types with metadata, but limiting the scope of what is revealed to a standardized interface.
Web components perform various roles in interactions. User agents, gateways, proxies, and origin servers are the main roles that a component can act in. A component may act in different roles depending upon the interaction.
The authors of this document are the participants of W3C's Technical Architecture Group: Tim Berners-Lee (Chair, W3C), Tim Bray (Antarti.ca), Dan Connolly (W3C), Paul Cotton (Microsoft), Roy Fielding (Day Software), Chris Lilley (W3C), David Orchard (BEA Systems), Norman Walsh (Sun), and Stuart Williams (Hewlett-Packard).
The TAG thanks people for their thoughtful contributions on the TAG's public mailing list, www-tag (archive).