Dereferencing HTTP URIs

Draft TAG Finding 31 May 2007

This version:
http://www.w3.org/2001/tag/doc/httpRange-14/2007-05-31/HttpRange-14.html
Latest version:
http://www.w3.org/2001/tag/doc/httpRange-14/HttpRange-14.html
Previous version:
Editor:
Rhys Lewis, Volantis Systems Ltd. <rhys@volantis.com>

Abstract

Editorial note 
The title is deliberately vague at this point. We mentioned a number of ways of making this finding available, including a direct response to httpRange-14, material in an updated version of AWWW, or as part of another finding. Currently this is written as a stand-alone finding.
This ....

Status of this Document

This document is an editors' copy that has no official standing. In particular, it does not yet necessarily reflect consensus within the working group or within the wider community.

This document has been produced by the W3C Technical Architecture Group (TAG). This finding addresses TAG issue httpRange-14.

This version of the document is a first editor's draft.

Additional TAG findings, both accepted and in draft state, may also be available. The TAG may incorporate this and other findings into future versions of the [AWWW].

The terms must, must not, required, shall, shall not, should, should not, recommended, may, and optional are used in this document in accordance with [IETF RFC 2119].

Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org (archive).

Table of Contents

1 Background
2 Information Resources
3 Other Web Resources
4 Associating Information Resources with Other Resources
    4.1 Using HTTP to Represent Associations
    4.2 Using URIs to Represent Individual Non-Information Resources
    4.3 Using URIs for Groups of Non-Information Resources
        4.3.1 Hash Namespaces

Appendix

A References


1 Background

The World Wide Web (WWW, or simply Web) is an information space in which the items of interest, referred to as resources, are identified by global identifiers called Uniform Resource Identifiers (URI) [AWWW].

The vast majority of users of the Web think of URIs as links to human-readable information. To them, the Web appears as a very large number of interlinked documents (Web pages or simply pages). Pages themselves usually contain URIs rendered in ways that allow user interaction. For example, activating a rendered link may make the content of the page to which it refers available to the user. Pages may directly convey information. They may also allow access to a huge variety of operations, from purchasing a book to remotely controlling a robot.

For these users, who account for the vast majority of today's accesses to the Web, there is a clear expectation that links point to other pages in the Web and that those pages will be available to them when the link is activated. Links that do not behave this way are typically considered to be 'broken'. The sense that the inability to traverse a link represents a failure is so strong that the resulting error message is usually passed directly to the user of the application that encounters it. The authority responsible for links exhibiting this behavior will often request to be informed of such errors so that appropriate corrections can be made.

Editorial note 
Stuart made the following comment about the previous paragraph: "For some people (I think) not all references made with URI’s (hyper)links. I think that there is a tension in what you write borne of whether you think that you are speaking of an articfact in a serialisation or a conceptual connection between two resources. At least from my POV links arise between resources – and (some) references made using URI that appear in serialised representations are ‘just’ part of the mechanics of communicating such relations."
Rhys responds: "Personally, I agree about 'links' arising between resources. What I was trying to do with this paragraph was to point out that there is a sizable community of Web users who think that 404 means things have gone horribly wrong. And indeed for their particular use case, they are probably correct. These are the kinds of user who are likely to support a view that every HTTP URI must point at a document. I think we need to make this point."

While the architecture of the Web clearly supports the sort of use we've just been discussing, it also supports many other kinds. Some of these uses may not directly involve a human user at all. In some of these alternative uses, it may not be necessary for URIs to identify resources with which human users can interact directly. Indeed, for some uses, there may be no representation associated with the resource identified by a URI. There is an apparent dichotomy between URIs whose primary purpose is to provide a way to access human-readable representations and those whose primary purpose is different. This dichotomy led to TAG issue httpRange-14 being raised. The issue essentially asks whether HTTP URIs must always identify resources that the vast majority of users would consider to be documents or Web pages.

2 Information Resources

Information resources are resources that are identified by URIs and whose essential characteristics can be conveyed in a message [AWWW]. The pages and documents familiar to users of the Web are information resources. Information resources typically have one or more representations that can be accessed using HTTP. It is these representations of the resource that flow in messages. The act of retrieving a representation of a resource identified by a URI is known as dereferencing that URI. Applications, such as browsers, render the retrieved representation so that it can be perceived by a user. Most Web users do not distinguish between a resource and the rendered representation they receive by accessing it.

Information resources make up the vast majority of the Web today. Their behavior is well understood. In particular, information resources have representations which are, in some sense, 'obvious'. The essence of an information resource is information. Consequently, the act of creating a representation is simply a transformation of that information into an appropriate form. Often that transformation will include formatting that allows the rendered representation to be used conveniently by a Web user.

As an example, let's consider the creation of a statement of activity for a particular month for a particular bank account. We'll suppose that a URI identifies the resource which, in this case, is a particular set of binary data held in a relational database. To create a representation of the resource, the appropriate data is first extracted from the database and converted to textual form. Then it is embedded in a stream of HTML markup that also references appropriate styling information. This representation flows across the Web to a browser, where it is rendered. A user is able to perceive the rendered form and to understand the activity on the account for the month in question.

The process of creating and rendering representations from information resources is so common that it is often either overlooked or considered to be completely ubiquitous. However, not all Web resources are necessarily associated with obvious representations.

3 Other Web Resources

While the behavior of information resources on the Web is well understood, the behavior of other kinds of resource is potentially problematic. In some instances, it can be desirable to use HTTP URIs to refer to resources that are outside the information space of the Web. Let's look at an example to explore some of the potential issues.

Story

Angela is creating an OWL ontology that defines specific characteristics of devices used to access the Web. Some of these characteristics represent physical properties of the device, such as its length, width and weight. As a result, the ontology includes concepts such as unit of measure, and specific instances, such as meter and kilogram. Angela uses URIs to identify these concepts.

Having chosen a URI for the concept of the meter, Angela faces the question of what should be returned if that URI is ever dereferenced. There is general advice that owners of URIs should provide representations [AWWW] and Angela is keen to comply. However, the choices of possible representations appear legion. Given that the URI is being used in the context of an OWL ontology, Angela first considers a representation that consists of some RDF triples that allow suitable computer systems to discover more information about the meter. She then worries that these might be less useful to a human user, who might prefer the appropriate Wikipedia entry. Perhaps, she reasons, a better approach would be to create a representation which itself contains a set of URIs to a range of resources that provide related representations. Perhaps content negotiation can help? She could return different representations based on the content type specified in the request.

Angela's dilemma is, of course, based on the fact that none of the representations she is considering are actually representations of the units of measure themselves. Even if the Web could deliver a platinum-iridium bar with two marks a meter apart at zero degrees Celsius, or 1,650,763.73 wavelengths of the orange-red emission line in the electromagnetic spectrum of the krypton-86 atom in a vacuum [METRE], or even two marks a meter apart on a screen, such representations are probably less than completely useful in the context of an information space. The representations that Angela is considering are not representations of the meter itself. Instead, they are representations of information resources related to the meter.

It is not appropriate for any of the individual representations that Angela is considering to be returned by dereferencing the URI that identifies the concept of the meter. Not only do the representations she is considering fail to represent the concept of the meter, they each have a different essence and so they should each have their own URI. As a consequence, it would also be inappropriate to use content negotiation as a way to provide them as alternate representations when the URI for the concept of the meter is dereferenced.

4 Associating Information Resources with Other Resources

The representations of information resources associated with other kinds of resource can be extremely useful. However, it would be misleading to claim that they are representations of the resource itself. In the previous example, the information resources convey information about the meter, but not its essence, which is a particular distance.

Information resources associated with a non-information resource need to have their own URIs. They are themselves distinct resources and provide representations. They may have uses other than providing additional information about the non-information resource. However, the fact that they are associated with a non-information resource is important.

4.1 Using HTTP to Represent Associations

HTTP itself provides one means of representing associations between resources. Instead of returning a representation when the URI of a non-information resource is dereferenced, it is possible to return the HTTP response code 303 (known as 'See Other'). This indicates that there is other, related information available concerning the URI that was dereferenced. In addition to the code, the response includes the URI of the related resource. This mechanism provides a way to draw attention to the related information without the need to return a representation, which might mislead the person or system making the request.

There is one nuance we should address before looking at this in a little more detail. Strictly, according to [IETF 2616], HTTP reserves the term representation for entities that are subject to content negotiation. We'll use the term here the way it is used in [AWWW]. In this slightly broader sense, the term representation is used to mean the contents of a message which carries the essential characteristics of a resource, whether or not that material was subject to content negotiation.

Editorial note 
I think the previous paragraph captures the variation between the way HTTP uses 'representation' and the way that AWWW uses it but I'm perfectly prepared to be told that I'm wrong.

The HTTP specification [IETF 2616] provides normative definitions of the meanings of the various HTTP response codes. Three basic sets of response codes are of interest in this particular case.

Response Code 200

According to the HTTP specification, when a code of 200 is received in response to an HTTP GET request, it indicates that "an entity corresponding to the requested resource" has been returned in the response. The content of this entity is what we understand as a representation of the resource. This correspondence between a resource and a representation is defined in [AWWW] as characterising an information resource. Consequently, we can assume that if we receive this particular response code in response to an HTTP GET request, we have also received a representation and that the URI references an information resource.

Response Code 303

According to the HTTP specification, a response code of 303 indicates that "the response to the request can be found under a different URI ...". It provides the URI where we can look for that response. It's worth noting that although 303 has the role of redirecting user agents after script processing following POST requests, the specification does not limit it to that role.

Importantly, the specification also states that "The new URI is not a substitute reference for the originally requested resource." In other words, responses containing this code direct us to related material. If we dereference the supplied URI and receive a representation, it is clear that the representation relates to the URI we were given in the 303 response, and not to the URI that led to the 303 response. In particular, we're not being misled into thinking that the original URI itself has representations.

Of course, there is no guarantee that the URI returned in the 303 will lead to a representation, although often it will. We need to dereference it and react to the resulting response. One possibility is that the URI returned in the 303 might itself lead to further redirections. However, if we are eventually able to access a representation, we can conclude that the information is related to the URI that originally led to the 303 response code.
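The behavior described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not part of the guidance in this finding: the `Response` type, the `follow_see_other` function, the `fetch` callable and the `max_hops` limit are all names introduced here, the two example.com URIs are placeholders, and the network is simulated with a dictionary so that the example is self-contained.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Response(NamedTuple):
    """Hypothetical, minimal view of an HTTP response."""
    status: int
    location: Optional[str] = None  # URI supplied with a 303, if any


def follow_see_other(
    uri: str,
    fetch: Callable[[str], Response],
    max_hops: int = 5,
) -> Tuple[List[str], Response]:
    """Dereference `uri`, following any chain of 303 responses.

    Returns the list of URIs visited and the final (non-303) response.
    """
    visited = [uri]
    response = fetch(uri)
    while response.status == 303 and response.location is not None:
        if len(visited) > max_hops:
            raise RuntimeError("too many 303 redirections")
        visited.append(response.location)
        response = fetch(response.location)
    return visited, response


# Simulated server behavior (assumed purely for illustration).
SIMULATED = {
    "http://www.example.com/thing": Response(303, "http://www.example.com/about/thing"),
    "http://www.example.com/about/thing": Response(200),
}

visited, final = follow_see_other("http://www.example.com/thing", SIMULATED.__getitem__)
# Any representation obtained belongs to the resource at visited[-1];
# it is only related to the resource identified by visited[0].
print(visited, final.status)
```

The sketch keeps the whole chain of visited URIs rather than only the last one, because the point of the 303 response is that the first and last URIs identify distinct resources.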

Editorial note 
There is a question in my mind about how much we can infer from receiving a 303, and in particular the fact that we didn't get a 200. We say that a response of 200 means that the dereferenced resource is an information resource. Also, the HTTP spec says that the URI returned with a 303 is "not a substitute reference for the originally requested resource" (quote from RFC2616). Hence the URIs are distinct, and so are the resources. Does that mean we can never get a representation from the original URI? If so, we should be able to infer that it can't refer to an information resource? The question is: are there any reasons for getting a 303 from an information resource? Can the behaviour of a 303 ever be associated with an information resource? For example, could a 303 ever be a valid response from a failure of conneg? I don't think so, because it would be being used to say "I don't have a representation of this but you might find one over there". This is at odds with the 303 statement about the redirected URI not being a substitute for the originally requested one.

Any of the 4XX and 5XX Response Codes

The HTTP specification defines codes in the range 4XX and 5XX to indicate errors of various kinds. Generally, codes in the 4XX range indicate errors that are likely to be due to the client, and codes in the 5XX range are errors likely to be due to the server.

Regardless of the source of the error, nothing can be determined about the URI that led to the error. In particular, it is not possible to determine whether or not the resource referenced by that URI is an information or non-information resource.

Based on this discussion, Table 1 summarizes the information that can be inferred from the results of dereferencing a URI.

Table 1: Summary of inferences that can be made when dereferencing an HTTP URI
HTTP Response Code | Material Returned | Inference
200 (success) | A representation | The resource is an information resource and a representation of it has been returned.
303 (see other) | A URI | The resource could be any resource. There is an associated resource whose URI has been returned. The associated resource might or might not be an information resource.
4XX or 5XX (error) | Nothing | Nothing can be inferred about the nature of the resource.
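As a rough illustration, the inferences in Table 1 could be expressed as a small Python function. The function name `infer_from_status` and the wording of the returned strings are assumptions made for this sketch. Status codes that the table does not discuss are reported as such rather than guessed at.

```python
def infer_from_status(status: int) -> str:
    """Return the inference that Table 1 associates with an HTTP status code."""
    if status == 200:
        return "information resource; a representation of it was returned"
    if status == 303:
        return "any kind of resource; the URI of an associated resource was returned"
    if 400 <= status <= 599:
        return "nothing can be inferred about the nature of the resource"
    # Table 1 is silent about every other code.
    return "not covered by Table 1"


for code in (200, 303, 404, 500):
    print(code, "->", infer_from_status(code))
```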

The discussion also suggests the following guidance for authorities who wish to create URIs for non-information resources using HTTP response codes.

Good Practice

Authorities MAY create HTTP URIs for non-information resources in addition to those for information resources.

If a URI identifies an information resource, the URI owner SHOULD provide representations of that resource. This is based on the 'available representation' good practice in section 3.5 of [AWWW].

If a URI identifies a non-information resource, the URI owner SHOULD provide an associated information resource which, when dereferenced, provides additional information about the original resource. In addition, the URI owner SHOULD make the URI of an associated information resource available using the mechanism based on returning an HTTP response code of 303 to the original request.

The following sections look at how this guidance applies when using URIs both with and without fragment identifiers. Using URIs with fragment identifiers allows information associated with a set of non-information resources to be kept together in a single resource.

4.2 Using URIs to Represent Individual Non-Information Resources

Story

Angela decides to provide information related to the meter as part of her work on the ontology. She configures her web server to return an HTTP 303 response code when the URI for the meter (http://www.example.com/ontology/meter) is dereferenced. She arranges for the URI returned with the HTTP 303 response (http://www.example.com/ontology/related/meter) to refer to an information resource that can provide multiple, equivalent representations via content negotiation. In particular she arranges for representations in HTML and RDF to be available to requests that specify the appropriate content type.
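One way to picture Angela's configuration is as a function from a request to a response. The following Python sketch is an assumption-laden illustration, not a description of any particular web server: the `respond` function, the placeholder representation bodies, the choice of `application/rdf+xml` as the RDF media type, and the simplified handling of the `Accept` header (quality values are ignored and `*/*` is treated as a request for HTML) are all introduced here for the example.

```python
from typing import Dict, Tuple

BASE = "http://www.example.com"
METER = "/ontology/meter"              # identifies the concept of the meter
RELATED = "/ontology/related/meter"    # associated information resource

# Placeholder bodies; real representations would carry actual content.
REPRESENTATIONS: Dict[str, str] = {
    "text/html": "<html><body><p>Information about the meter.</p></body></html>",
    "application/rdf+xml": "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
}


def respond(path: str, accept: str = "text/html") -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a GET request on `path`."""
    if path == METER:
        # Non-information resource: no representation, only a pointer.
        return 303, {"Location": BASE + RELATED}, ""
    if path == RELATED:
        # Simplified content negotiation: first acceptable type wins.
        for part in accept.split(","):
            media_type = part.split(";")[0].strip()
            if media_type == "*/*":
                media_type = "text/html"
            if media_type in REPRESENTATIONS:
                headers = {"Content-Type": media_type, "Vary": "Accept"}
                return 200, headers, REPRESENTATIONS[media_type]
        return 406, {}, ""
    return 404, {}, ""


print(respond(METER))
print(respond(RELATED, "application/rdf+xml")[:2])
```

Note that content negotiation is applied only at the URI of the associated information resource. The URI for the meter itself never returns a representation.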

Angela's approach uses the good practice described above to provide information related to the meter at a specific URI. This approach allows the information concerning this specific concept to be retrieved. Where a small number of non-information resources is involved or where they are associated with very different knowledge domains, this approach is attractive. However, where a larger number of related concepts is involved, it may be more convenient to arrange for the associated information concerning a number of concepts to share a single representation. Some approaches are described in the following sections.

4.3 Using URIs for Groups of Non-Information Resources

Story

Angela needs to extend her ontology. She finds that she needs to add URIs that represent a significant number of additional SI units of measure (see for example [SI]). She realises that extending her current approach will require the creation of an HTML representation and an RDF representation of the associated information for each unit of measure. She also notes that a considerable amount of additional configuration will be required on her web server.

Angela wonders if there is an approach that could reduce the amount of effort needed to support these additional representations.

One mechanism for reducing the amount of effort involved is to use a single resource and associated representations to provide information associated with a number of non-information resources. This type of approach may be appropriate where a significant number of related, non-information resources are concerned.

Fragment identifiers give one possible mechanism for such groupings. The simplification in configuration arises because fragment identifiers are not used when dereferencing a URI to retrieve a representation. They are used only after a representation has been successfully returned. Consequently, they do not need to be an explicit part of the server side configuration.

Story

Angela decides to group references to non-information resources by using fragment identifiers. She modifies her ontology to use the URI http://www.example.com/ontology/SI#meter to refer to the concept of the meter and http://www.example.com/ontology/SI#kilogram for the concept of the kilogram. She uses similar URIs to refer to the concepts of the second, the candela and the ampere. She configures her web server to return HTTP response code 303 for any request to http://www.example.com/ontology/SI together with the URI http://www.example.com/ontology/SI/Information. Finally, she arranges to return HTML or RDF representations from the URI http://www.example.com/ontology/SI/Information using content negotiation. In either case, the representations contain elements that can be referenced by the fragment identifiers associated with each of the units of measure in Angela's ontology.

In Angela's revised system, any reference to a URI such as http://www.example.com/ontology/SI#kilogram actually results in the URI http://www.example.com/ontology/SI being dereferenced. The configuration Angela has chosen causes a 303 response to be returned together with the URI http://www.example.com/ontology/SI/Information. This URI does return a representation when dereferenced, and the fragment identifier is applied to locate the particular information with which it is associated.
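The client-side steps in this flow can be sketched in Python using the standard library function `urllib.parse.urldefrag`, which separates a URI from its fragment identifier. The `REDIRECTS` dictionary and the `locate` function are assumptions made for this illustration. They stand in for the 303 response from Angela's server so that the example needs no network access.

```python
from typing import Tuple
from urllib.parse import urldefrag

# Stand-in for the 303 configuration on Angela's server (illustrative only).
REDIRECTS = {
    "http://www.example.com/ontology/SI": "http://www.example.com/ontology/SI/Information",
}


def locate(uri: str) -> Tuple[str, str, str]:
    """Return (URI dereferenced, URI that yields a representation, fragment)."""
    # The fragment identifier is removed before the request is made.
    base, fragment = urldefrag(uri)
    # The server answers 303 with the URI of the associated resource.
    target = REDIRECTS.get(base, base)
    # The fragment is applied only once a representation has been retrieved.
    return base, target, fragment


print(locate("http://www.example.com/ontology/SI#kilogram"))
print(locate("http://www.example.com/ontology/SI#meter"))
```

Both calls dereference the same URI and are sent to the same associated resource. Only the fragment differs, which is why a single server-side rule covers every unit of measure.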

This approach is simpler to configure and requires fewer individual representations. However, each representation is bigger because it contains information about multiple non-information resources. Consequently, it may take longer to retrieve the information about a specific non-information resource and it may require more network traffic.

4.3.1 Hash Namespaces

The use of fragment identifiers in connection with grouping URIs for non-information resources is compatible with the so-called 'hash namespace' approach to vocabularies (see for example [Pub RDF]). In this approach, individual concepts and properties in vocabularies are identified via fragment identifiers used within specific RDF or OWL serialisations that act as representations. Such a representation would fulfil the role of the RDF variant of the associated information resource that Angela arranges to return in the 303 response from references to the non-information resources. By the way, it's worth noting that the example says nothing about the representation of Angela's ontology itself.

A References

AWWW
Architecture of the World Wide Web, I. Jacobs and N. Walsh, 2004, W3C. (See http://www.w3.org/TR/webarch/.)
IETF RFC 2119
RFC 2119: Key words for use in RFCs to Indicate Requirement Levels, Internet Engineering Task Force, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
IETF 2616
RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1, Internet Engineering Task Force, 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)
METRE
Metre. Description in Wikipedia. (See http://en.wikipedia.org/wiki/Metre.)
SI
The NIST Reference on Constants, Units and Uncertainty. (See http://physics.nist.gov/cuu/Units/.)
Pub RDF
Best Practices Recipes for Publishing RDF Vocabularies. (See http://www.w3.org/TR/swbp-vocab-pub/.)