Architecture of the World Wide Web

Editor's Draft 21 February 2003

This version:: http://www.w3.org/2001/tag/2003/webarch-20030221
Latest editor's draft:: http://www.w3.org/2001/tag/webarch/
Previous version:: http://www.w3.org/2001/tag/2002/webarch-20030206
Latest version:: http://www.w3.org/TR/webarch/

Editor:: Ian Jacobs, W3C
Authors:: See acknowledgments.

Copyright © 2002-2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.

Abstract

The World Wide Web is a networked information system. Web Architecture consists of the requirements, constraints, principles, and design choices that influence the design of the system and the behavior of agents within the system. When followed, the large-scale effect is that of a shared information space. This document organizes the technical discussion of the system in three parts: identification, representation, and interaction. This document also addresses some non-technical (social) issues that contribute to the shared information space.

This document strives to establish a reference set of requirements, constraints, principles, and design choices for Web architecture.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This draft includes a modified introduction and some text based on resolutions by the TAG in January and February 2003. It does not represent consensus within the TAG. This document has been developed by W3C's Technical Architecture Group (TAG) (charter). A list of changes in this document is available.

This draft remains incomplete; sections 1 and 2 are the most developed, 3 and 4 the least. The TAG has published a number of findings that address specific architecture issues. Parts of those findings may appear in subsequent drafts. Please also consult the list of issues under consideration by the TAG.

This draft includes some editorial notes and also references to open TAG issues. These do not represent all open issues in the document. They are expected to disappear from future drafts.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than "work in progress."

The latest information regarding patent disclosures related to this document is available on the Web. As of this publication, there are no disclosures.

Please send comments on this document to the public W3C TAG mailing list www-tag@w3.org (archive).

A list of current W3C Recommendations and other technical documents can be found at the W3C Web site.

1. Introduction
2. About this document
- 2.1. Scope of this document
3. Identification and resources
4. Representations
5. Machine-to-machine interaction
6. General design principles
- 6.1. Information hiding
7. Glossary
- 7.1. Principles, constraints, etc.
- 7.2. Technical terms
8. References
- 8.1. Normative References
- 8.2. Non-Normative References
9. End notes
10. Acknowledgments

1. Introduction

The World Wide Web (or, Web) is a networked information system. In this document we discuss how the system is built out of identifiers, formats for exchanging data, and protocols for the exchange.

1.1. Identifiers

Something is "on the Web" if it is identified by a Uniform Resource Identifier (URI), defined in RFC 2396 [RFC2396]. The URI is the bedrock of Web architecture. The Web relies on a worldwide agreement to follow the syntactic and semantic rules of URIs so that we can refer to things on the Web, access them, describe them, and share them.

URIs identify resources. The term "resource" encompasses all those things that populate the Web: documents, services ("the weather forecast for Oaxaca"), people, organizations, physical objects, and concepts. A resource can be anything that has identity ([RFC2396], 1.1).

Web resources are abstractions, which is why it is possible for a person or a car to be "on the Web." Web agents (programs acting on behalf of a person, entity, or process) identify resources with URIs and communicate about them through representations.

In general, it is not possible to inspect a URI and determine what resource it identifies. For example, in general, one cannot look at http://www.example.com/lj45sr and know that it refers to "my old car" or "the weather forecast for Oaxaca." On the Web, information about the nature or state of a resource is communicated through representations, not URIs. Consequently, the party authorized to make those representations available determines the meaning of the resource, and which URIs refer to it.

In section 3 we discuss the syntax of URIs, who is authorized to create them, and important operations on URIs.

1.2. Representations

A representation is a data object that represents or describes a resource state. A representation consists of:

Data about the resource, conveyed by formats (e.g., XHTML, CSS, PNG, XLink, RDF/XML, and SMIL animation) used separately or in combination.
Metadata about the representation, such as the Internet Media Type (defined in RFC 2046 [RFC2046]). The Internet Media Type is the key to the correct interpretation of a resource representation, and governs the handling of fragment identifiers. When transferred by a Web protocol, a representation often includes metadata about both the representation and the message bearing the representation (for example, some HTTP headers).
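As an informal illustration of the two parts listed above, a representation can be modeled as data plus metadata. This is only a sketch: the class name, field names, and sample values are assumptions made for this example and are not defined by any specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Representation:
    """Hypothetical model: data about a resource plus metadata about that data."""
    data: bytes        # the bits exchanged, e.g. an HTML document
    media_type: str    # Internet Media Type, e.g. "text/html"
    headers: dict = field(default_factory=dict)  # other message metadata (illustrative)


# A made-up representation of "the weather forecast for Oaxaca".
forecast = Representation(
    data=b"<html><body>Sunny</body></html>",
    media_type="text/html",
    headers={"Content-Language": "en"},
)

# The media type, not the bytes alone, tells an agent how to interpret the data.
print(forecast.media_type)
```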

Web agents do not just read resource state through representations; they also modify resource state through representations, such as when an author publishes a new document on the Web, or when the user completes the purchase of a new fishing rod through the Web.

For some resources, representations may vary over time (and other parameters). One would expect a representation of the weather in Oaxaca to vary over time.

In real life as well, communication takes place through representations. For instance, "Moby Dick" is an abstract literary work that can be printed in hardcover volumes, made available electronically, or read aloud. We access "Moby Dick" through these representations. We do not exchange the abstract work itself.¹

In section 3 we discuss: how one learns about the meaning of a resource, how to interact with a resource given a URI, and how the value of a URI depends on how well it is serviced. In section 4 we discuss how data formats are used to build representations.

1.3. Machine interactions

Web agents exchange representations via protocols, including HTTP [RFC2616], FTP, and SMTP². Several of these protocols share a reliance on the Multipurpose Internet Mail Extensions (MIME) standards for the format of message bodies [RFC2045] and for Internet Media Types [RFC2046].

In section 5 we discuss protocols.

1.4. Summary of required properties, constraints, principles, and good practice notes

The terms used in the following list are elaborated on in the document, as are the categories of principle, constraint, etc.

The terms MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used in accordance with RFC 2119 [RFC2119].

Spelling URIs: [practice]

If you want to refer to a resource and you know of a URI that refers to it, you SHOULD spell the URI the same way.

New URI schemes: [practice]

Authors of specifications SHOULD avoid introducing new URI schemes when existing schemes can be used to meet the goals of the specifications.

Conneg with fragments: [practice]

Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.

Resource descriptions: [practice]

Owners of important resources SHOULD make available representations that describe the nature and purpose of those resources.

Use URIs: [principle]

All important resources SHOULD be identified by a URI.

Safe retrieval: [principle]

Agents do not incur obligations by retrieving a representation.

Service URIs: [practice]

Parties responsible for a URI SHOULD service that URI predictably.

Understand REST: [practice]

Designers of protocols SHOULD invest time in understanding the REST paradigm and consider the extent to which its principles should guide their design:

statelessness
clear assignment of roles to parties
uniform address space
limited uniform set of verbs

Some of the items in the above list may conflict with current practice, and so education and outreach will be required to improve on that practice. Other items may fill in gaps in published specifications or may call attention to known weaknesses in those specifications.

The motivation behind some of the good practice notes is economic: there is a benefit to doing things a standard way and a cost to doing things differently. For instance:

It is prohibitively expensive to use an identification mechanism other than URIs.
It is quite costly to introduce a new URI scheme.
It is costly to introduce a new protocol or format.

This document promotes reuse of existing standards when suitable, and gives some guidance on how to innovate (when necessary) in a manner consistent with the Web architecture.

2. About this document

The intended audience for this document includes:

Participants in W3C groups.
Other groups and individuals developing technologies to be integrated into the Web.
Implementers of W3C specifications, and those who use the resulting products.

The authors have made every effort to keep this document terse, with examples. The TAG expects that additional documents such as the TAG findings will elaborate on the required properties, constraints, and principles, and will provide rationale and additional examples.

Readers will benefit from familiarity with the Requests for Comments (RFC) series from the IETF, some of which define pieces of the architecture discussed in this document.

The architecture described in this document is the result of experience. There has been some theoretical and modeling work in the area of Web Architecture, notably Roy Fielding's work on "Representational State Transfer" [REST].

2.1. Scope of this document

This document focuses on the architecture of the Web. The authors assume the reader is familiar with the rationale for some of the general design principles: minimal constraint (fewer rules make the system more flexible), modularity, minimum redundancy, extensibility, simplicity, and robustness.

Other groups within W3C are addressing architectural design goals in the following areas:

Internationalization; see W3C's Internationalization Activity.
Accessibility; see W3C's Web Accessibility Initiative.
Device independence; see W3C's Device Independence Activity.

For information about architectural principles of the Internet, refer to [RFC1958].

3. Identification and resources

Web architecture starts with Uniform Resource Identifiers (URI), whose generic syntax is defined by RFC 2396 [RFC2396].

Technical usage note: The current document uses the term "URI" to mean, in RFC2396 terms, an absolute URI reference³ optionally followed by a fragment identifier. The TAG is working actively to convince the IETF to revise RFC2396 so that the definition of "URI" aligns with the current document.

Editor's note: The TAG is following work on "Internationalized Resource Identifiers (IRIs)" [IRI], which the TAG considers to be a valuable mechanism for writing down URIs in an international context. Please refer to TAG issue IRIEverywhere-27.

3.1. Comparing identifiers

Communication requires that two parties have a way of knowing they are referring to the same resource. On the Web, if two parties use the same URI, the parties are referring to the same resource.

If two parties use two different URIs, it is generally not possible to know whether the identified resources are different. Web architecture does not constrain resources to be uniquely named. In some cases, however, one can tell by inspecting two different URIs (and knowing additional syntactic rules imposed by URI schemes) that they identify the same resource. For instance, one can tell that http://example.com/ and http://example.com:80/ identify the same resource by comparing the URIs and knowing the rules in section 3.2.2 of [RFC2616].

Emerging Semantic Web technologies, including "DAML+OIL" [DAMLOIL] and "Web Ontology Language (OWL)" [OWL10], define RDF properties such as equivalentTo and FunctionalProperty to state -- or at least claim -- formally that two URIs identify the same resource. Whether such claims are to be trusted is a matter of local policy.

When software is required to compare two URIs, it does so for some particular purpose. Different software modules with different purposes might reasonably come to different conclusions about the same pair of URIs. Software modules performing such comparisons differ in their requirements and therefore their URI equivalence criteria. Refer to the draft TAG finding "How to Compare Uniform Resource Identifiers" for detailed information about URI comparison.
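To make the point about differing equivalence criteria concrete, the following sketch contrasts two comparison functions. Both function names are assumptions made for this example. The second applies a simplified subset of the HTTP comparison rules (case-insensitive host, optional default port, empty path treated as "/"); it is not a complete normalization procedure.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # default port for the "http" scheme


def same_by_string(a: str, b: str) -> bool:
    """Strictest test: character-for-character equality."""
    return a == b


def same_by_http_rules(a: str, b: str) -> bool:
    """Looser test for http URIs (simplified): ignore host case and an explicit
    default port, and treat an empty path as "/". Paths stay case-sensitive."""
    def key(uri: str):
        parts = urlsplit(uri)
        if parts.scheme != "http":
            return (uri,)  # this sketch applies no rules for other schemes
        port = parts.port or DEFAULT_HTTP_PORT
        return ("http", parts.hostname, port, parts.path or "/",
                parts.query, parts.fragment)
    return key(a) == key(b)


a, b = "http://example.com/", "http://example.com:80/"
print(same_by_string(a, b))      # False
print(same_by_http_rules(a, b))  # True
```

A cache keyed on exact strings and a link checker that knows the "http" scheme rules could each reasonably use a different one of these functions, and so reach different conclusions about the same pair of URIs.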

Good practice

Spelling URIs: If you want to refer to a resource and you know of a URI that refers to it, you SHOULD spell the URI the same way.

Producers of URIs should be conservative by maximizing the consistency of identifiers used to refer to any given resource, and by ensuring sufficient difference between identifiers used for different resources.

Consumers of URIs, on the other hand, should be liberal in allowing URI producers maximum freedom in choosing URIs. Even though producers should not use http://example.com/MyStuff and http://example.com/myStuff to identify different resources, they may, and clients that assume that these URIs refer to the same resource do so at their own risk.

Issue: URIEquivalence-15: When are two URI variants considered equivalent? See also issue IRIEverywhere-27 - Should W3C specifications start promoting IRIs?

3.2. URI Schemes

The first syntactic component of a URI is the "URI scheme," the string before the first ":". For example the scheme of the URI http://www.example.com/ is "http", and for ftp://ftp.example.com/ it is "ftp". The URI scheme is important because it is the first piece of information evaluated when using the URI to access a representation of the identified resource; see the section on dereferencing a URI below for more information.
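As a minimal sketch of the rule just stated, the scheme can be recovered by taking the text before the first ":". The helper name scheme_of is an assumption made for this example; the standard library's urlsplit is used only as a cross-check.

```python
from urllib.parse import urlsplit


def scheme_of(uri: str) -> str:
    """Return the text before the first ":" (hypothetical helper for this sketch)."""
    scheme, sep, _rest = uri.partition(":")
    if not sep:
        raise ValueError("no scheme found in %r" % uri)
    return scheme.lower()


for uri in ("http://www.example.com/", "ftp://ftp.example.com/"):
    # The standard library parser agrees with the hand-written helper here.
    assert scheme_of(uri) == urlsplit(uri).scheme
    print(uri, "->", scheme_of(uri))
```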

A URI scheme may specify semantics or syntax constraints beyond those of [RFC2396]. For instance, a URI scheme might specify the type of resource identified by such URIs, the desired persistence of such URIs, or a default character encoding for URIs of that scheme.

It is therefore common to classify URIs by scheme, calling the two preceding examples respectively an "HTTP URI" and an "FTP URI". Several URI schemes incorporate identification mechanisms that pre-date the Web into this syntax:

MAILTO URIs identify mailboxes:
mailto:nobody@example.org
FTP URIs identify FTP files and directories:
ftp://example.org/aDirectory/aFile
NEWS URIs identify USENET newsgroups:
news:comp.infosystems.www
TEL URIs identify terminals on the telephone network:
tel:+1-816-555-1212

Other URI schemes have been introduced since the advent of the Web, including those introduced as a consequence of new protocols. Examples of URIs for these schemes include:

http://www.example.org/something?with=arg1;and=arg2
ldap://ldap.itd.umich.edu/c=GB?objectClass?one
urn:oasis:SAML:1.0

Since many aspects of URI processing are scheme-dependent, and since a huge range of software is expected to be able to process URIs, the cost of introduction of new URI schemes is quite high.

Good practice

New URI schemes: Authors of specifications SHOULD avoid introducing new URI schemes when existing schemes can be used to meet the goals of the specifications.

While "myscheme:blort" is a URI that satisfies the syntactic constraints of [RFC2396], unless it is registered, using it is problematic for a number of reasons, including:

Someone else may be using the scheme for other purposes.
You should not expect that software will do anything useful with URIs of this scheme.

The IANA registry [IANASchemes] lists registered URI schemes and the specifications that define them. For instance, the IANA registry indicates that the "http" scheme is defined by [RFC2616]. Refer to RFC2717 for information about registering a new URI scheme.

The deployment and use of different URI schemes may require varying degrees of central coordination and administration. For example, MAILTO, FTP, and HTTP URIs depend (in practice at least) on the use of the DNS infrastructure. Also, there is a central registry of URN namespace identifiers.

Editor's note: Say something here or in another section about the authority component of a URI, since that is part of what determines who gets to send back authoritative representations?

3.3. Fragment identifiers

In some URI schemes it is meaningful for a URI to end with a fragment identifier to yield an identifier for part of, or a view of, a resource⁴. The following URIs include fragment identifiers:

ftp://example.org/aDirectory/aDocument#section1
http://www.example.org/states#texas

Note that while this composition is syntactically fully general, it is meaningless in some URI schemes. The URI mailto:nobody@example.org#abc is meaningless in practice.

The fragment identifier is interpreted only after the retrieval of a representation. Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation.

For instance, if the representation is an HTML document, the fragment identifies a hypertext anchor. In the case of a graphics format, the fragment might identify a circle or spline. In the Resource Description Framework [RDF10], fragments can be used to identify anything, be it abstract (e.g., a dream) or concrete (e.g., an automobile).
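The following sketch illustrates the two points above: the fragment is separated from the part of the URI used for retrieval, and its meaning depends on the media type of what comes back. The function describe_fragment and its return strings are assumptions made purely for illustration; real agents follow the rules of each media type's specification.

```python
from urllib.parse import urldefrag

uri = "http://www.example.org/states#texas"

# Split off the fragment; only the first part is used to retrieve a representation.
retrievable, fragment = urldefrag(uri)
print(retrievable)  # http://www.example.org/states
print(fragment)     # texas


def describe_fragment(media_type: str, fragment: str) -> str:
    """Illustrative only: the same fragment read under different media types."""
    if media_type == "text/html":
        return "anchor named %r in the HTML document" % fragment
    if media_type == "image/svg+xml":
        return "graphical element labeled %r" % fragment
    return "meaning of %r is defined by %s" % (fragment, media_type)


print(describe_fragment("text/html", fragment))
```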

Good practice

Conneg with fragments: Authors SHOULD NOT use HTTP content negotiation for different media types that do not share the same fragment identifier semantics.

Editor's note: There has been some discussion but no agreement that new access protocols should provide a means to convert fragment identifiers according to media type.

3.4. Dereferencing a URI

A Web agent initiates interaction with a resource by dereferencing a URI that identifies the resource. To dereference a URI is to apply in succession a finite set of relevant specifications, beginning with the specification that governs the scheme of the URI. Available dereference mechanisms vary by URI scheme and protocol used (e.g., HTTP GET and HTTP POST). For instance, the URN scheme [RFC2141] does not specify a dereference procedure.

In general, information about which dereference mechanism to use for a given URI is not part of the URI, but specified by the context in which the URI is used. Many format specifications include ways to refer to other resources; agents processing those URIs generally dereference them.

TAG issue metadataInURI-31: Should metadata (e.g., versioning information) be encoded in URIs?

3.4.1. Retrieving a representation

One of the most important actions on the Web is to retrieve a representation of a resource (by using, for example, HTTP GET).

Good practice

Resource descriptions: Owners of important resources SHOULD make available representations that describe the nature and purpose of those resources.

As an example of dereferencing a URI to retrieve a representation, suppose that http://weather.yahoo.com/forecast/MXOA0069 is used within an a element of an SVG document. The sequence of specifications applied is:

The URI specification [RFC2396]. This specification says (in section 3.1) that the scheme "define the semantics for the remainder of the URI string." In this case, the URI scheme is HTTP.
The HTTP/1.1 protocol. Section 3.2.2 of RFC2616 [RFC2616] explains the semantics of HTTP URIs.
The SVG 1.0 Recommendation [SVG10], which imports the link semantics defined by XLink 1.0 [XLink10]. Section 17.1 of the SVG specification suggests that interaction with an a link involves retrieving a representation of a resource, identified by the XLink href attribute: "By activating these links (by clicking with the mouse, through keyboard input, and voice commands), users may visit these resources." This means that the GET method defined in HTTP/1.1 is used to retrieve the representation of the resource.
Once the representation has been retrieved, the media type of the representation governs its interpretation (here, for rendering). Note that, in general, one cannot determine the media type(s) of representation(s) of a resource by inspecting a URI for that resource. For example, do not assume that a URI that ends with the string ".html" refers to a resource that has an HTML representation.
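The last step above can be sketched as follows: an agent reads the media type from the metadata that accompanies the representation rather than guessing from the spelling of the URI. The helper name, the URI, and the header value are all hypothetical; no network access is performed.

```python
from email.message import Message


def media_type_from_header(content_type_header: str) -> str:
    """Extract the media type from a Content-Type header value, ignoring parameters."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


uri = "http://www.example.com/report.html"  # hypothetical URI ending in ".html"
served_as = "image/png"                     # hypothetical Content-Type from the server

# The suffix of the URI suggests HTML, but the metadata governs interpretation.
print(media_type_from_header(served_as))
print(media_type_from_header("text/html; charset=UTF-8"))
```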

When a representation refers to a resource (by means of a URI), a link is formed. The networked information system is built of linked resources, and the large-scale effect is a shared information space. The value of the Web increases with the number of linked resources (the "network effect"). The value of a resource also increases when it is identifiable on the Web.

Principle

Use URIs: All important resources SHOULD be identified by a URI.⁵

There are many benefits to making resources identifiable by URI. Some are by design (e.g., linking and bookmarking), while others have arisen naturally (e.g., global search services).⁶

3.4.2. Safe retrieval

Principle

Safe retrieval: Agents do not incur obligations by retrieving a representation.

For instance, a user does not incur an obligation by following an HTML link. Tools such as proxies and search engines can retrieve representations without user interaction; it would be harmful to the Web if such operations incurred obligations. See the TAG finding "URIs, Addressability, and the use of HTTP GET" for more information about safe retrieval.
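As a small sketch of how an automated agent might respect this principle, the check below lets a crawler or prefetcher issue only methods that HTTP/1.1 describes as safe (GET and HEAD). The function name and the policy itself are assumptions made for this example.

```python
SAFE_METHODS = frozenset({"GET", "HEAD"})  # methods HTTP/1.1 describes as safe


def may_fetch_without_user(method: str) -> bool:
    """Hypothetical policy check for an automated agent such as a crawler."""
    return method in SAFE_METHODS


print(may_fetch_without_user("GET"))   # True
print(may_fetch_without_user("POST"))  # False
```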

3.4.3. Identification is not access

Consider the difference between identifying a resource on the Web and retrieving a representation (or otherwise interacting with it). It is reasonable to control access to the resource (e.g., for security reasons), but it is unreasonable to prohibit others from merely identifying the resource.

As an analogy: A building might have a policy that the public may only enter via the main front door, and only during business hours. People employed in the building and in making deliveries to it might use other doors as appropriate. Such a policy would be enforced by a combination of security personnel and mechanical devices such as locks and pass-cards. One would not enforce this policy by hiding some of the building entrances, nor by requesting legislation requiring the use of the front door and forbidding anyone to reveal the fact that there are other doors to the building.

The Web provides several mechanisms to control access to resources, none of which relies on hiding or suppressing URIs for those resources. For more information on identification and access control, please refer to the TAG finding "'Deep Linking' in the World Wide Web."

3.4.4. Servicing a URI

The value of a URI increases with the predictability of interactions via that URI.

Good practice

Service URIs: Parties responsible for a URI SHOULD service that URI predictably.

Service breakdowns include:

No service available (i.e., dereferencing fails)
Inconsistent representations served

Inconsistency may be caused by a number of factors. For example, representations may vary as a function of factors including time, the identity of the agent accessing the resource, data submitted to the resource when interacting with it, and changes external to the resource. Consider the URI http://weather.yahoo.com/forecast/MXOA0069: representations for the designated resource (the weather in Oaxaca) depend on (at least) time, the expressed preference of the user for Fahrenheit or Celsius, the identity of the user-agent software receiving the representation, and, presumably, the weather in Oaxaca.

Serving two images as equivalents through HTTP content negotiation, where one image represents a square and the other a circle, would similarly constitute inconsistent service.

Any description of what a URI identifies should be unambiguous. For instance, saying that the URI http://www.example.com/moby identifies "Moby Dick" can lead to confusion because this might be interpreted as any one of the following very distinct resources: a particular printing of this work (say, by ISBN), or the work itself in an abstract sense (for example, using RDF), or the fictional white whale, or a particular copy of the book on the shelves of a library (via the Web interface of the library's online catalog), or the record in the library's electronic catalog which contains the metadata about the work, or the Gutenberg project's online version. Similarly, one should not use the same URI to refer to a person and to that person's mailbox.

Ambiguous descriptions of what a URI identifies increase the likelihood that two parties will think the same URI identifies different resources, and thus that the parties will use the URI inconsistently. This can be costly, as in the case of two databases in which the same URI is used inconsistently; merging the two databases might lead to confusion or errors.

There are thus strong social expectations that once a URI identifies a particular resource, it should continue indefinitely to refer to that resource; this is called the persistence of the URI. Persistence is always a matter of policy and commitment on the part of authorities assigning URIs rather than a constraint imposed by technological means.

HTTP [RFC2616] has been designed to help Webmasters service URIs. For example, HTTP redirection (via some of the 3xx response codes) permits servers to tell a client that further action needs to be taken by the client in order to fulfill the request (e.g., the resource has been assigned a new URI). Content negotiation also promotes consistency: a site manager is not required to define a new URI for each new format that is supported, as would be the case with protocols that do not support content negotiation, such as FTP.
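The effect of redirection on persistence can be sketched with an in-memory table that maps old URIs to new ones, much as a server might answer requests for a moved resource with a 3xx response. The function name, the table, and the example URIs are hypothetical, and no HTTP traffic is involved.

```python
def resolve(uri: str, redirects: dict, max_hops: int = 5) -> str:
    """Follow entries in a hypothetical old-URI -> new-URI table, as a client
    would follow a chain of 3xx responses, for at most max_hops hops."""
    seen = [uri]
    while uri in redirects:
        uri = redirects[uri]
        if uri in seen or len(seen) > max_hops:
            raise RuntimeError("redirect loop or too many hops")
        seen.append(uri)
    return uri


moved = {"http://www.example.com/old": "http://www.example.com/new"}
print(resolve("http://www.example.com/old", moved))  # http://www.example.com/new
```

Because the old URI keeps working, links and bookmarks that use it remain serviceable even after the resource has been assigned a new URI.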

For more discussion about persistence, refer to [Cool].⁷

4. Representations

Editor's note: Refer to other W3C format guidelines: Charmod, XAG, etc.

4.1. Scope

What is a format, and how does it relate to the concept of a document? Do all documents have a format? Is a document a collection of resources of different formats organized into a whole? Is a document the same as a resource? The same as a message body? As a non-multipart message body? What is the distinction between documents and data, if any? Does 'document' imply human readable and if so, does it imply presentation? Does it imply a hierarchically structured, report-like document with headings and subheadings? Is a catalog a document? Is a rave flyer a document?

Negotiation (stuff above might go here also) by network request, by listed alternatives in content any preference? Resource variants, foo.css and foo.html unlikely to be equivalent.

4.2. Processing model

On the interpretation and processing of formats (see namespaceDocument-8):

It's useful to say what xml:lang means in a very large number of cases, without too much effort. Same for xml:base.
We also need to allow other specs to use xml:lang and xml:base in other ways (e.g., xslt outputting it).

4.3. Format specification design guidelines

Editor's note: This section is in its early stages.

A format specification describes the semantics and structure of the format - either directly as a bit sequence or with some indirection (as when multiple character encodings are permitted).

4.3.1. When to use XML

For hierarchically structured information which does not contain a large proportion of binary information, XML is a natural choice. It is widely but not universally applicable for format specifications. For example, an audio or video format is unlikely to be well suited to representation in XML. Advantages of using XML include:

Explicit representation of the hierarchical structure
Persistence; there is lots of redundancy
Facilitates internationalization
Clean error-handling; early detection of errors
Mix of structure and text or data content
Composability of multiple namespaces

Refer also to "Guidelines For The Use of XML in IETF Protocols" [IETFXML] for information about the use of XML within IETF standards-track protocols and "XML Accessibility Guidelines" [XAG] for help designing XML formats that lower barriers to Web accessibility for people with disabilities.

4.3.2. XML Namespaces and Namespace Documents

XML vocabularies use URIs, per "Namespaces in XML" [XMLNS], as "namespace names" to create globally unique element and attribute names. These URIs are identified in an XML document with a namespace declaration.
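As a brief sketch of how namespace declarations yield globally unique names, the example below parses a small document with Python's standard XML parser, which reports each element name qualified by its namespace URI. Both namespace URIs and the vocabulary itself are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies, each identified by a namespace name (a URI).
doc = """<catalog xmlns="http://www.example.org/ns/catalog"
                  xmlns:r="http://www.example.org/ns/review">
  <r:rating>5</r:rating>
</catalog>"""

root = ET.fromstring(doc)
print(root.tag)     # {http://www.example.org/ns/catalog}catalog
print(root[0].tag)  # {http://www.example.org/ns/review}rating
```

Two vocabularies could both define an element called "rating" without conflict, because the namespace name is part of each element's full name.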

Editor's note: Possible practice note: Designers SHOULD use XML Namespaces when they use XML.

Editor's note: The following text is all well and good, but mainly discusses namespace documents. Need some prior discussion on why using XML namespaces is good (composability, modularity) and what the drawbacks are (DTD hacks, etc) particularly to back up the suggested practice note.

Although "Namespaces in XML" makes it clear that it is not necessary for the namespace name to be a retrievable resource, the "resource description" principle suggests that it SHOULD be a retrievable resource.

When a person or application is presented with a namespace name that it does not recognize, the URI is the only key it has to find out more about the namespace. The natural way to find out more about a resource identified by a URI is to dereference it.

There are many reasons why a person or agent might want more information about the namespace. A person might want to:

understand its purpose,
find out who controls it,
request authority to access schemas or collateral material about it, or
report a bug or situation that could be considered an error in some collateral material.

An agent might undertake to retrieve that information for a user, or it might be searching for other kinds of information, such as:

a schema to use for validation,
a stylesheet to use for presentation,
an ontology to use for making inferences, or
any number of other application-specific details.

It follows that there is, in general, no single type of resource that can be returned in response to a request for the namespace name that will always be the most appropriate.

Consequently, it often makes sense to use some sort of hybrid document that indirectly provides access to a variety of resources as the document available from the namespace name. One example of such a hybrid document is RDDL [@@Reference?@@].

Note, however, that RDDL, or a document like it, is no more universally correct than any other type of resource. For any particular namespace, there might be a single best answer (schema, ontology, HTML documentation, etc.), as determined by the developers of that namespace.

Issue: namespaceDocument-8: What should a "namespace document" look like?

Editor's note: Where should we put a section on mixing namespaces; is the section on processing model more appropriate? See issue mixedNamespaceMeaning-13.

Editor's note: Mixing namespaces can however be done without agreeing on a processing model so although any processing model would affect mixed namespaces, it is not the same issue.

4.3.3. Use of fragment identifiers in XML

TAG issue fragmentInXML-28: Do fragment identifiers refer to a syntactic element (at least for XML content), or can they refer to abstractions? See TAG issue.

TAG issue xmlIDSemantics-32: How should the problem of identifying ID semantics in XML languages be addressed in the absence of a DTD?

4.3.4. XML subsets and composability

TAG issue xmlProfiles-29: When, whither and how to profile W3C specifications in the XML Family?

TAG issue mixedUIXMLNamespace-33: Composability for user interface-oriented XML namespaces

TAG issue xmlFunctions-34: XML Transformation and composability (e.g., XSLT, XInclude, Encryption)

TAG issue RDFinXHTML-35: Syntax and semantics for embedding RDF in XHTML

4.3.5. Compressed XML

Effect of Mobile on architecture - size, complexity, memory constraints. Binary infosets, storage efficiency. See TAG issue binaryXML-30.

4.4. Separating Content and Presentation

Issue: contentPresentation-26: Separation of semantic and presentational markup, to the extent possible, is architecturally sound.

Separating the concepts of content, presentation, and interaction makes specifications easier to compose. For example, a markup language can be specified independently of a style sheet language. The separation facilitates alternate presentations of the same content, which benefits accessibility and better suits the multiple modalities of Web access.

There is no hard and fast division between what is 'purely semantic content' and what is 'just presentation'. The term "semantics" is often used or misused in this context; however, any structured format is likely to have some semantics, and some semantics refer to precise details of presentation. Thus, 'semantics' is not used in this section. Instead, a given format may be seen to reside somewhere on a continuum from highly abstract to highly concrete. The more concrete, the less presentational flexibility remains. Highly abstract formats often require extensive transformation before being presented. Less abstract formats can often be presented directly just by decorating the source tree with formatting properties, for example using CSS. Highly concrete formats may still retain some presentational flexibility, for example restyling a pie diagram to fit into a different presentation.

In general, moving from more abstract to more concrete can be done (with loss of some abstractions) and is frequently done (for presentation); moving from a more concrete to a more abstract representation (e.g., HTML to RDF) is sometimes possible in specific cases but may require extra information and is not possible in the general case.

It is sometimes asserted that XML is content and styling is presentation. However, presentational information may itself be complex and structured, and thus a good candidate for encoding in XML. Examples of such highly concrete XML formats are XSL Formatting Objects, SVG, the presentational part of MathML, and Voice XML.

As an example, an abstract dataset may contain relationships between sales areas, individual sales people, individual products, and time periods. A more concrete representation might be an HTML table comparing total sales per area over a three-year period by quarter. Abstractions concerning individuals, dates of sales, and popularity of products have been lost, and the decision has been made to display the information in a two-dimensional tabular format. However, the table may be styled in different fonts and colors, columns or rows may be added, the table may be transposed, and the table may be serialized to a voice browser or screen reader. An even more concrete representation might be a pie chart in SVG of total sales over the whole period by sales area. Seasonal patterns have been lost, but there is still a limited degree of presentational flexibility in terms of color and size and access to the descriptions. Creation of a report on the sales performance of individuals or the growth in popularity of different products would require access to the most abstract form of the data.
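
The loss of abstraction in the sales example can be sketched in a few lines of Python. The records, names, and amounts below are invented for illustration; the two functions stand in for the derivation of the table and of the pie chart, and the point is only that each step discards information that cannot be recovered from its output.

```python
from collections import defaultdict

# Hypothetical abstract dataset: (area, salesperson, product, quarter, amount).
SALES = [
    ("North", "Ana", "widget", "2002-Q1", 100),
    ("North", "Ben", "gadget", "2002-Q1", 50),
    ("North", "Ana", "widget", "2002-Q2", 70),
    ("South", "Cy", "widget", "2002-Q1", 30),
    ("South", "Cy", "gadget", "2002-Q2", 90),
]


def table_by_area_and_quarter(sales):
    """More concrete: total per (area, quarter); people and products are lost."""
    table = defaultdict(int)
    for area, _person, _product, quarter, amount in sales:
        table[(area, quarter)] += amount
    return dict(table)


def pie_by_area(table):
    """Even more concrete: total per area; the quarterly pattern is lost."""
    pie = defaultdict(int)
    for (area, _quarter), amount in table.items():
        pie[area] += amount
    return dict(pie)


if __name__ == "__main__":
    table = table_by_area_and_quarter(SALES)
    print(table)
    print(pie_by_area(table))
```

A report on individual sales people can be computed from SALES but not from either derived form, which is why such a report needs the most abstract data.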

4.4.1. Content, Presentation, and Interaction

This section attempts to organize some areas of future discussion.

4.4.1.1. Content

Composability (ns-meaning). Use of XML for tree structured content. Linking in general v. idref in one document. Human readable v. machine data. Served or not (hidden behind server - semantic firewall, accessibility). Linking into parts of the content, transclusion of parts. Compound documents, components from multiple servers - scalability, deep linking. Processing models, error handling.

4.4.1.2. Presentation

Presentation by decoration (application of CSS to XML as presentation), and by derivation (creation of html/svg/etc as presentation). Linking (bidirectionally) between content and presentations. Inheritance of properties across namespaces. Consistency of property names. Subsets. 'Applies to' as opposed to 'set on'. Specificity of properties as attributes, chaining styling, restyling. Time-lines, linking to portions of a time-line.

4.4.1.3. Interactivity

Animation, scripting, events, client/server interaction. Declarative v. script based - accessibility, power; formalization of common functionality (loop animation, rollovers) in declarative form. DOM - making additional methods, add to rather than replacing XML DOM. Effect of script/programming language limitations on choice of element and attribute names. Linking to active components - XForms example with model and abstract form control, can be extended to presentational instantiation of form control.

4.5. Ideas and issues

For new format specifications, use the XML family of specifications unless there's a good reason not to. Which XML specifications? Which particular family members?
Format designers should use URIs without constraining content providers to particular URI schemes.
Allow for Web-wide linking, not just internal document linking.
Qnames: Issues rdfmsQnameUriMapping-6, qnameAsId-18 and finding "Using QNames as Identifiers in Content"
Formatting properties: Issue formattingProperties-19, contentPresentation-26
Error handling: Issue errorHandling-20
Media type registration: RFC3023Charset-21, finding "Internet Media Type registration, consistency of use." Also, make sure to define fragment identifier semantics.
What is the scope of using XLink? xlinkScope-23
Can a specification include rules for overriding HTTP content type parameters? contentTypeOverride-24
Create formats that allow authors to hide URIs from view (e.g., behind link text). For authors: at times it is useful or necessary to reveal a URI (e.g., in an advertisement on the side of a bus), in which case, good social behavior requires that the URI be easy to use.

5. Machine-to-machine interaction

As mentioned in the introduction, the Web is designed to create the large-scale effect of a shared information space that scales well and behaves predictably. It is also not static - it is primarily used by people to get information, and that process is not one of passive consumption. Besides selecting which documents to read, people also interact with them - scrolling, zooming, filling in forms, following hyperlinks, and viewing and interacting with animations and scripting. A document is thus not merely a piece of XML markup and an associated stylesheet but also descriptions of hyperlinking, scripting through the Document Object Model, declarative animation, associated audio and visual media, forms, etc.

5.1. Device Independence and Multimodal Interaction

Although much interaction to date has taken place through fairly similar desktop and laptop computers - with a keyboard, mouse, and screen - interaction increasingly involves other devices and modes - PDAs, cellphones, spoken interaction - as well as accessibility aids ranging from screen magnifiers to Braille devices. The increasing necessity for such multimodal interaction informs the architectural principles and best practices relating to Web interaction.

5.2. HTTP and REST

Good practice

Understand REST: Designers of protocols SHOULD invest time in understanding the REST paradigm and consider the extent to which its principles should guide their design:

statelessness
clear assignment of roles to parties
uniform address space
limited uniform set of verbs
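
To make the four principles above slightly more tangible, here is a minimal in-memory Python sketch; it is not an HTTP implementation. The function name handle, the choice of three verbs, and the example URIs are assumptions made for illustration, and the numeric status codes are borrowed from HTTP only because they are familiar. The caller plays the client role and handle plays the server role; resources are addressed only by URI; the verb set is small and fixed; and each request carries everything needed to process it, so no per-client session is kept.

```python
from typing import Dict, Optional, Tuple

# The server's only state is the resources themselves, keyed by URI.
# It keeps no per-client session, so every request is self-contained.
Resources = Dict[str, str]

VERBS = ("GET", "PUT", "DELETE")  # limited, uniform set of verbs


def handle(resources: Resources, method: str, uri: str,
           body: Optional[str] = None) -> Tuple[int, Optional[str]]:
    """Apply one request to the resource store; return (status, body)."""
    if method not in VERBS:
        return 405, None
    if method == "PUT":
        created = uri not in resources
        resources[uri] = body or ""
        return (201 if created else 204), None
    if uri not in resources:
        return 404, None
    if method == "GET":
        return 200, resources[uri]
    del resources[uri]  # the only remaining verb is DELETE
    return 204, None


if __name__ == "__main__":
    store: Resources = {}
    print(handle(store, "PUT", "http://example.org/a", "hello"))
    print(handle(store, "GET", "http://example.org/a"))
```

Because the same verbs apply to every URI, a new kind of resource needs a new address but no new operations.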

5.3. Ideas and issues

Consistency of media types and message contents (from the TAG finding "Internet Media Type registration, consistency of use").
Consistency of communicating character encoding (same source).
HTTP as a substrate protocol [TAG issue HTTPSubstrate-16]

Editor's note: Per their 27 Jan 2003 teleconf, the TAG expects to add text about proper interpretation of protocol headers, as discussed in "Internet Media Type registration, consistency of use."

6. General design principles

Editor's note: There may be some general principles that hold across all three previous chapters. Should we put them in this appendix and refer to them from each section? Tim Bray has expressed the opinion that we should not have a separate section on general design principles.

6.1. Information hiding

When designing specifications that address independent functions of a system, avoidable references between the specifications are in general harmful. They are harmful because they impede the independent evolution of the specifications.

For example, it is a strength of XML that XPath cannot query the HTTP header. It is a strength of HTTP that it does not depend on details of the underlying TCP so heavily that it could not be run over a different transport service. Similarly, the RDF data graph has a significance that is independent of the actual serialization. However, there is a flaw: the embedded XML parseType="Literal" data type.

Sometimes it is necessary (and good for a given application) to break layers. For example, it is good for an HTTP client to be aware of TCP speeds and round trip times to different mirror servers in order to optimize the choice of server. When designing a specification, identify the functionalities that break layers so it is clear when they are being used.
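
The layering argument can be sketched in Python under some loud assumptions: the Transport interface, the RecordingTransport stand-in, and the send_get helper are all hypothetical names, and the request produced is only a minimal HTTP/1.1 GET. The request-formatting code refers to nothing below the "send these bytes" interface, so the transport can be replaced without touching it.

```python
from typing import List, Protocol


class Transport(Protocol):
    """Anything that can carry bytes; the request code knows nothing more."""

    def send(self, data: bytes) -> None: ...


class RecordingTransport:
    """Hypothetical stand-in for TCP or any other transport service."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    def send(self, data: bytes) -> None:
        self.sent.append(data)


def send_get(transport: Transport, host: str, path: str) -> None:
    """Format a minimal HTTP/1.1 GET request and hand it to the transport."""
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
    transport.send(request.encode("ascii"))


if __name__ == "__main__":
    transport = RecordingTransport()
    send_get(transport, "example.org", "/")
    print(transport.sent)
```

A client that deliberately breaks the layer, for instance by measuring round trip times to choose a mirror, would do so through a separate and clearly named interface rather than by reaching into the transport from send_get.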

7. Glossary

This section is non-normative.

7.1. Principles, constraints, etc.

Editor's note: The TAG is still experimenting with the categorization of points in this document. This list is likely to change. It has also been suggested that the categories clearly indicate their primary audience.

The important points of this document are categorized as follows:

Constraint: An architectural constraint is a restriction in behavior or interaction within the system. Constraints may be imposed for technical, policy, or other reasons.
Design Choice: In the design of the Web, some design choices, like the names of the <p> and <li> elements in HTML, or the choice of the colon character in URIs, are somewhat arbitrary; if <par>, <elt>, or * had been chosen instead, the large-scale result would, most likely, have been the same. Other design choices are more fundamental; these are the focus of this document.
Good practice: Good practice -- by software developers, content authors, site managers, users, and specification writers -- increases the value of the Web.
Principle: An architectural principle is a fundamental law that applies to a large number of situations and variables. Architectural principles include "separation of concerns", "generic interface", "self-descriptive syntax," "visible semantics," "network effect" (Metcalfe's Law), and Amdahl's Law: "The speed of a system is determined by its slowest component."
Property: Architectural properties include both the functional properties achieved by the system, such as accessibility and global scope, and non-functional properties, such as relative ease of evolution, reusability of components, efficiency, and dynamic extensibility.

7.2. Technical terms

Dereference: To dereference a URI is to apply in succession a finite set of relevant specifications, beginning with the specification that governs the scheme of the URI.
Link: When a representation refers to a resource (by means of a URI), a link is formed.
MIME: standards for the format of message bodies [RFC2045] and for Internet Media Types [RFC2046].
Persistence: There are strong social expectations that once a URI identifies a particular resource, it should continue indefinitely to refer to that resource; this is called the persistence of the URI.
Resource: A resource can be anything that has identity ([RFC2396], 1.1).

8. References

8.1. Normative References

IANASchemes: IANA's online registry of URI Schemes is available at http://www.iana.org/assignments/uri-schemes. Dan Connolly's list of URI schemes is a useful resource for finding out which references define various URI schemes.
RFC2045: IETF "RFC 2045: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", N. Freed, N. Borenstein, November 1996. Available at http://www.ietf.org/rfc/rfc2045.txt.
RFC2046: IETF "RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", N. Freed, N. Borenstein, November 1996. Available at http://www.ietf.org/rfc/rfc2046.txt.
RFC2119: IETF "RFC 2119: Key words for use in RFCs to Indicate Requirement Levels", S. Bradner, March 1997. Available at http://www.ietf.org/rfc/rfc2119.txt.
RFC2396: IETF "RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax", T. Berners-Lee, R. Fielding, L. Masinter, August 1998. Available at http://www.ietf.org/rfc/rfc2396.txt.
RFC2616: IETF "RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1", R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, June 1999. Available at http://www.ietf.org/rfc/rfc2616.txt.
RFC2717: IETF "RFC 2717: Registration Procedures for URL Scheme Names", R. Petke, I. King, November 1999. Available at http://www.ietf.org/rfc/rfc2717.txt.

8.2. Non-Normative References

Axioms: "Universal Resource Identifiers - Axioms of Web Architecture", T. Berners-Lee, living document dated December 1996. Available at http://www.w3.org/DesignIssues/Axioms.
Cool: "Cool URIs don't change", T. Berners-Lee, W3C, 1998. Available at http://www.w3.org/Provider/Style/URI.
CSS2: "Cascading Style Sheets, level 2", B. Bos, H. Lie, C. Lilley, I. Jacobs, 12 May 1998. This W3C Recommendation is available at http://www.w3.org/TR/1998/REC-CSS2-19980512/.
DAMLOIL: "DAML+OIL (March 2001) Reference Description", D. Connolly, F. van Harmelen, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, 18 Dec 2001. This W3C Note is available at http://www.w3.org/TR/2001/NOTE-daml+oil-reference-20011218.
Eng90: "Knowledge-Domain Interoperability and an Open Hyperdocument System", D. C. Engelbart, June 1990.
Fielding: "Principled Design of the Modern Web Architecture", R.T. Fielding and R.N. Taylor, UC Irvine. In Proceedings of the 2000 International Conference on Software Engineering (ICSE 2000), Limerick, Ireland, June 2000, pp. 407-416. This document is available at http://www.ics.uci.edu/~fielding/pubs/webarch_icse2000.pdf.
Fragments: "Fragment Identifiers on URIs", T. Berners-Lee, living document dated April 1997. Available at http://www.w3.org/DesignIssues/Fragment.
HTML40: "HTML 4.01 Specification", D. Raggett, A. Le Hors, I. Jacobs, 24 December 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-html401-19991224/.
IETFXML: IETF "Guidelines For The Use of XML in IETF Protocols," S. Hollenbeck, M. Rose, L. Masinter, eds., 2 November 2002. This IETF Internet Draft is available at http://www.imc.org/ietf-xml-use/xml-guidelines-07.txt. If this document is no longer available, refer to the ietf-xml-use mailing list.
IRI: IETF "Internationalized Resource Identifiers (IRIs)", M. Duerst, M. Suignard, Nov 2002. This IETF Internet Draft is available at http://www.w3.org/International/iri-edit/draft-duerst-iri.html. If this document is no longer available, refer to the home page for Editing 'Internationalized Resource Identifiers (IRIs)'.
OWL10: "Web Ontology Language (OWL) Reference Version 1.0", M. Dean, D. Connolly, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, L. A. Stein, eds., 12 Nov 2002. This W3C Working Draft is available at http://www.w3.org/TR/2002/WD-owl-ref-20021112/.
P3P10: "The Platform for Privacy Preferences 1.0 (P3P1.0) Specification", M. Marchiori, ed., 16 April 2002. This W3C Recommendation is available at http://www.w3.org/TR/2002/REC-P3P-20020416/.
RDF10: "Resource Description Framework (RDF) Model and Syntax Specification", O. Lassila, R. R. Swick, eds., 22 February 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/.
REST: "Representational State Transfer (REST)", Chapter 5 of "Architectural Styles and the Design of Network-based Software Architectures", Doctoral Thesis of R. T. Fielding, 2000. Available at http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.
RFC1958: IETF "RFC 1958: Architectural Principles of the Internet", B. Carpenter, June 1996. Available at http://www.ietf.org/rfc/rfc1958.txt.
RFC2141: IETF "RFC 2141: URN Syntax", R. Moats, May 1997. Available at http://www.ietf.org/rfc/rfc2141.txt.
RFC2718: IETF "RFC 2718: Guidelines for new URL Schemes", L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, November 1999. Available at http://www.ietf.org/rfc/rfc2718.txt.
RFC3236: IETF "RFC 3236: The 'application/xhtml+xml' Media Type", M. Baker, P. Stark, January 2002. Available at http://www.rfc-editor.org/rfc/rfc3236.txt.
SVG10: "Scalable Vector Graphics (SVG) 1.0 Specification", J. Ferraiolo, ed., 4 Sep 2001. This W3C Recommendation is available at http://www.w3.org/TR/2001/REC-SVG-20010904/.
UniqueDNS: "IAB Technical Comment on the Unique DNS Root", B. Carpenter, 27 Sep 1999. Available at http://www.icann.org/correspondence/iab-tech-comment-27sept99.htm.
XAG: "XML Accessibility Guidelines", Daniel Dardailler et al., 3 October 2002. Available at http://www.w3.org/TR/xag.
XHTML10: "XHTML 1.0: The Extensible HyperText Markup Language: A Reformulation of HTML 4 in XML 1.0", S. Pemberton et al., 26 January 2000, revised 1 August 2002. Available at http://www.w3.org/TR/2002/REC-xhtml1-20020801/.
XLink10: "XML Linking Language (XLink) Version 1.0", S. DeRose, E. Maler, D. Orchard, 27 June 2001. This W3C Recommendation is available at http://www.w3.org/TR/2001/REC-xlink-20010627/.
XML10: "Extensible Markup Language (XML) 1.0 (Second Edition)", T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, 6 October 2000. This W3C Recommendation is available at http://www.w3.org/TR/2000/REC-xml-20001006.
XMLNS: "Namespaces in XML", T. Bray, D. Hollander, A. Layman, 14 Jan 1999. This W3C Recommendation is available at http://www.w3.org/TR/1999/REC-xml-names-19990114/.
W3CPROCESS: "W3C Process Document", 19 July 2001 Version. Available at http://www.w3.org/Consortium/Process-20010719/.

9. End notes

Roy Fielding cites a number of advantages to the "resource as abstraction" model in section 5.2.1.1 of his thesis [REST]. (Note 1 context.)
@@Text here on why SMTP part of Web@@ (Note 2 context.)
[RFC2396] defines a URI reference to be either an absolute URI reference or a relative URI reference. The syntax for a relative URI reference is a shortened form of that for an absolute URI reference, where some prefix of the URI is missing and certain path components ("." and "..") have a special meaning when, and only when, interpreting a relative path. For example, in a document whose base URI is http://example/dir1/dir2/file1, the relative URI reference ../file2 is a shortened form of http://example/dir1/file2 and the relative URI reference #abc is a shortened form for http://example/dir1/dir2/file1#abc. (Note 3 context.)
When comparison is expected to be the sole or primary operation on a URI, it does not matter whether one has chosen a URI with or without a fragment identifier. However, when one expects to interact with a resource, there are some advantages to using a URI without a fragment identifier: only URIs without fragment identifiers work with intermediaries in the Web architecture (e.g., proxies) or with redirection (in HTTP, for example). (Note 4 context.)
This principle dates back at least as far as Douglas Engelbart's seminal work on open hypertext systems; see section Every Object Addressable in [Eng90]. (Note 5 context.)
See the TAG finding "URIs, Addressability, and the use of HTTP GET" for some details about the interaction of this principle in HTTP application design. (Note 6 context.)
The title is somewhat misleading. It's not the URIs that change, it's what they identify. (Note 7 context.)
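
The relative URI reference example in the note citing [RFC2396] above can be checked with Python's standard urllib.parse module, which is used here only as a convenient resolver and is not part of this document. The base URI and references are the ones given in that note; the last lines also split a URI reference into the URI proper and its fragment identifier.

```python
from urllib.parse import urldefrag, urljoin

BASE = "http://example/dir1/dir2/file1"

# Resolve the two relative URI references from the note against the base URI.
print(urljoin(BASE, "../file2"))  # http://example/dir1/file2
print(urljoin(BASE, "#abc"))      # http://example/dir1/dir2/file1#abc

# Split a URI reference into the URI proper and its fragment identifier.
uri, fragment = urldefrag("http://example/dir1/dir2/file1#abc")
print(uri, fragment)
```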

10. Acknowledgments

The authors of this document are the participants of W3C's Technical Architecture Group: Tim Berners-Lee (Chair, W3C), Tim Bray (Antarctica Systems), Dan Connolly (W3C), Paul Cotton (Microsoft), Roy Fielding (Day Software), Chris Lilley (W3C), David Orchard (BEA Systems), Norman Walsh (Sun), and Stuart Williams (Hewlett-Packard).

The TAG thanks people for their thoughtful contributions on the TAG's public mailing list, www-tag (archive).
