Some thoughts on resources, information resources and representations

Noah Mendelsohn

W3C Technical Architecture Group

23 February 2008

Status of this note

This note has no formal standing at the W3C or in the TAG. Consider it as having the same status as an extended email to www-tag. I am posting it in HTML as I think it will be more readable and more easily shared in this form. This note is being offered in the hope that it will contribute to the TAG's progress on issues such as TAG Issue 57 (HttpRedirections-57).

Summary

This note explores a new Web architecture direction for dealing with what have been called Information Resources. This approach is motivated by the observation that the Information Resource abstraction isn't quite serving its intended purpose anyway. That purpose was to avoid ambiguities as to what's identified by a URI for which HTTP returns status code 200. In summary, the proposal explored here is:

  1. To identify a class of Web document resources that are immutable and do not engage in content negotiation. These resources provide the identical representation (if any) each time they are referenced. For these resources, there is thus very little ambiguity as to what the corresponding URI identifies. Call these "Immutable Document Resources" (IDR). As we'll see below, these are used to help distinguish between Web resources and the information provided as their representations.
  2. To encourage resource providers to associate with each representation that they serve an IDR (i.e. a resource with its own URI) that would, if asked, serve that same content as its representation. Stated differently, we don't in general assign a URI to each representation, which is a fleeting thing on the wire; we assign a URI to a resource that would serve the same representation if asked.
  3. To create a new HTTP header, tentatively called: Representation-source. This header is very similar in use to Content-location, but with the crucial difference that, if used, Representation-source MUST identify an IDR (i.e. an immutable resource) that serves the same representation content as was just retrieved. Stated differently, a GET response provides a URI that's associated with the content, i.e. the headers and entity body, of the provided representation.
  4. As a special case of the rule above, to specify that when responding to a GET an IDR SHOULD identify itself as its own Representation-source. So, you can tell you're speaking to an IDR if the Representation-source comes back the same as the original Request-URI. In this case, you know there's no ambiguity as to what the URI represents, and you can make very rigorous semantic Web statements about it.
  5. To back off from the advice to give 303s for non-IRs. You can give a 200 for (what we used to call) a resource that's not an Information Resource, but you SHOULD provide in the Representation-source header the URI of the IDR that corresponds to the representation served. The user agent can then distinguish semantic Web statements about the document retrieved (the IDR) from semantic Web statements about what may prove to be "not an information resource".
  6. To investigate mechanisms such as HTTP link headers or Resource-Description headers that can be used, perhaps in the same responses that carry Representation-source headers, to indicate where information about a resource can be found.

Some of the rationale for this proposal is explained in more detail below. As a quick summary: the intuition is to acknowledge that due to conneg and just general lack of consensus in the community, the current deployed use of 200 isn't sufficiently consistent and reliable for rigorous reasoning in the semantic Web. This is dealt with by introducing a new header that can signal either (case 3) that the resource referenced is an immutable document with only one form, and the representation you have is the only information you'll get from accessing it now or in the future or (case 2) the server explicitly tells you that the representation you have is only one of many that this resource can serve, either because of conneg or because its state changes over time. In this case, you are given the URI of a resource that stands for just the content you have been given, allowing you to either make unambiguous semantic Web statements about that, or to continue to probe the original resource for more information should you wish to. You may also be given either an HTTP link header or a Resource-description header (TBD) that will tell you where to find information about the resource.

I consider this proposal very rough and preliminary. There are some significant drawbacks to it, and I doubt that it will prove the correct answer in all details. If nothing else, working through it has clarified some issues for me, and perhaps others will find the same. Anyway, the sections below explore the reasoning behind this proposal in some more detail, analyse some of the pros & cons, etc.

Why Information Resource isn't the right abstraction

A key requirement of the Semantic Web is that URIs be used to identify resources unambiguously. Indeed, a particular use case that's caused concern is one in which an http-scheme URI is used to identify something tangible, such as a person. In these cases, any representation returned from an HTTP GET would clearly be at best indirectly related to the resource itself. One might return a picture of a person, for example, but clearly not the person herself. Returning such a picture would cause ambiguity since, as observed from the outside, it would be difficult to tell whether the URI referred to the picture or the person. A semantic Web statement with the URI as subject ("Resource R is too big") might be ambiguous: is the picture too big, or the person? For this reason, the TAG's decision on the issue httpRange-14 was to prohibit the use of HTTP status code 200 for resources that are not information resources.

As noted above, the concept of Information Resource was introduced by the TAG in the Architecture of the World Wide Web. From AWWW:

It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as resources. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as information resources.

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transferred in a message. In the case of this document, the message payload is the representation of this document.

Information resources are those for which it's possible, at least in principle, to avoid the ambiguity discussed above, which is why the TAG's resolution of issue httpRange-14 suggests that HTTP status code 200 is appropriate only for information resources. Ironically, having gone to all this trouble, we then allow for ambiguity anyway! While the TAG carefully discourages use of HTTP 200 for non-Information Resources, 200 seems to be allowed for resources that engage in content negotiation, e.g. to select a language. If I request a press release http://example.org/pressRelease, the server may use various heuristics to decide to serve me a copy in French, in English or in Greek. Does a Web Statement "http://example.org/pressRelease is hard to understand" apply to the press release in the abstract, or to the French representation that was served to me?

The obvious rejoinder to this concern is that HTTP provides a Content-Location header that could have identified the particular French variant served back to me. Indeed Content-Location seems to be pretty close to what we need to resolve the ambiguity, but I wonder whether it's defined quite carefully enough for this purpose. For example, let's assume that the press release is being revised, and that http://example.org/pressRelease identifies the current version, albeit in multiple languages. I do a GET at 2PM and a Content-Location is provided with URI http://example.org/pressRelease.French. I make the Semantic Web statement "http://example.org/pressRelease.French is hard to understand". Is it unambiguous whether I'm referring to the 2PM version, or to the French versions independent of time? I'm not quite convinced that Content-Location as defined removes that ambiguity. This note explores the thesis that if we had a header that was defined more carefully, i.e. specifically to be used only for resources that would return the same representation, independent of time, conneg, and other complications, we might be able to move away from talking about Information Resources, and to have a stronger story about unambiguous identification on the Web.

From information resources to immutable documents

As described in the summary above, this proposal attempts to focus first on the case for which none of these ambiguities arise: i.e. resources for which the resource owner is willing to warrant that all GET requests now and in the future will provide representations with exactly the same content, if retrieval is successful at all. These seem to be the cases for which one can make truly unambiguous Semantic Web statements like "that picture is too big". We call such resources "Immutable Document Resources" (IDRs). Because they are only observable through HTTP, and since the resource owner warrants that they are immutable in the sense just described, there is a sense in which the URI of such a resource is particularly closely associated with the content of the representations it serves.

Not URIs for Representations, but URIs for the content of Representations

As mentioned above, a representation is something fleeting on the wire. Even for an IDR, if you do two successive HTTP GETs, and if those both respond with 200, you will have two separate representations. For IDRs, however, the two representations will have the same headers, and the same content for the entity body.

Now consider the case of a resource R that's not an IDR. That resource may do conneg, or it may be something like an article at a news site that is revised from time to time. Two successive GETs may retrieve different representation content. If we want to say "That article is confusing" or "That article is copyright by the New York Times", are we referring to the particular document that came through from one of the retrievals, or to the resource R in general? This proposal deals with this ambiguity by encouraging the owner of the article to serve, along with each representation of the article, a Representation-source header. To do this, the owner of R must undertake to at least assign URIs to, and perhaps to serve content for, two or more additional resources, which we'll call IDR1, IDR2 ... IDRN. A new one must be created for each variation of representation that's served for R (perhaps English vs. French) and for each version that's served over time. It's acceptable for IDR1...IDRN to respond 404, but if any of them responds 200, it MUST be with the associated immutable content. Furthermore, each time a 200 is returned for R, the Representation-source header should identify the IDR corresponding to the representation served. Now we can respond to the challenge: if we want to say "that article is confusing", we probably use the IDR as the subject, since that's the one we've just read. If we want to say "that article is copyright by the New York Times", we probably make a statement about R, since it's likely to be the article in general rather than the particular representation content that we're talking about.
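
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definitive implementation: the class name ArticleResource, the "/idr/N" URI layout, and the idea that the caller passes in the current body are all hypothetical and are not part of the proposal.

```python
class ArticleResource:
    """Hypothetical stand-in for a mutable, content-negotiated resource R."""

    def __init__(self, uri):
        self.uri = uri
        self._idr_for = {}   # (language, body) -> IDR URI already assigned
        self._content = {}   # IDR URI -> (language, body) it must always serve

    def get(self, language, body):
        """Serve R. 'body' is whatever R's current state yields for 'language'."""
        key = (language, body)
        if key not in self._idr_for:
            # New variant or new version: assign a fresh IDR URI (assumed layout).
            idr_uri = f"{self.uri}/idr/{len(self._content) + 1}"
            self._idr_for[key] = idr_uri
            self._content[idr_uri] = key
        headers = {
            "Content-Language": language,
            "Representation-source": self._idr_for[key],
        }
        return 200, headers, body

    def get_idr(self, idr_uri):
        """Serve an IDR: 404 is acceptable, but a 200 must carry the same content."""
        if idr_uri not in self._content:
            return 404, {}, ""
        language, body = self._content[idr_uri]
        # An IDR names itself as its own Representation-source.
        headers = {"Content-Language": language, "Representation-source": idr_uri}
        return 200, headers, body
```

In this sketch, serving the same English text twice yields the same Representation-source both times, while the French variant or a revised English text gets a different one.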

Why not Content-location?

One of the frustrating things about this proposal is that Content-location, which is widely used, is almost but not quite what we want. I believe there's nothing that prevents one from returning in Content-location the URI of a resource that has mutable content, or indeed one that's not a document or information resource at all. Part of what's important about Representation-source is that the server is making a statement about the nature of the resource it's identifying, and it's warranting that over time a particular relationship will be maintained between that IDR and the representation content just served. Indeed, it would be very unusual to find a proper use of Representation-source in which the original resource (R in the previous example) and the Representation-source were not controlled by the same or closely related organizations.

Will anyone bother to deploy this?

I'm not sure. I think the positive value comes mainly through the enablement of more rigorous Semantic Web processing. Presuming, for the moment, that this is an architecturally sound approach (of which I'm not quite sure), the easy case is the one in which your resource is known to be immutable anyway, and that's true for many important legal and other documents (e.g. many W3C dated documents). For such resources, one merely has to put the original URI R into a Representation-source header. That's non-trivial as a practical matter on many servers, but there's nothing deeply complex about it either. The case where a resource is immutable but does conneg on, e.g., a few language translations is also not too hard in principle.

The common case of changing resources is harder. To use the architecture right, you have to mint a unique URI each time the representation changes, and preferably, use that same URI in all cases where multiple representations share the same content. In cases where content changes monotonically, a base IDR URI with a counter in it is one solution. At least in principle, something like an MD5 hash in the URI could be used in other cases, and in principle could be generated automatically by server code. There are, of course, some performance and complexity issues with that. Note that there is no absolute requirement to deploy anything for these many IDR URIs. The identification function, and hence the core Semantic Web requirement, is served even if they are 404. Of course, it's very valuable if a server can retain the content of all representations it has ever served, but in many practical cases that will be onerous. When someone makes a Semantic Web statement about content of a resource that has changed thousands of times, it's quite likely that a user of this architecture will find that an unambiguous identifier has been used for each version, but that content can be retrieved only for the most recent or important versions. I think that's a fine compromise.
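
As a rough sketch of the two minting strategies mentioned above (a counter for content that changes monotonically, and an MD5 hash otherwise), something like the following Python could be used. The function names, the "/vN" and "/md5/..." URI layouts, and the choice to hash the headers together with the entity body are assumptions made for illustration; the proposal does not prescribe any of them.

```python
import hashlib
from itertools import count


def counter_minter(base_uri):
    """Return a function that mints base_uri/v1, base_uri/v2, ... on each call."""
    counter = count(1)

    def mint():
        return f"{base_uri}/v{next(counter)}"

    return mint


def hash_idr_uri(base_uri, headers, body):
    """Mint an IDR URI from an MD5 digest of the headers and entity body (bytes)."""
    digest = hashlib.md5()
    # Header names are lower-cased and sorted so that equal content hashes equally.
    for name, value in sorted((n.lower(), v) for n, v in headers.items()):
        digest.update(f"{name}:{value}\n".encode("utf-8"))
    digest.update(b"\n")
    digest.update(body)
    return f"{base_uri}/md5/{digest.hexdigest()}"
```

With the hash-based form, two representations that share the same content automatically share the same URI, which is the preferred outcome described above; the counter form needs the server to remember which content it has already numbered.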

What's this business about status code 200 for "Resources that aren't Information Resources?"

The reason for limiting code 200 to information resources was to avoid ambiguity. For the reasons explained above, I don't think common practice avoids the ambiguity very well anyway. First of all, both conneg and mutable resources are allowed. Secondly, even if that weren't a problem, it's not at all clear that the architecture is being followed sufficiently carefully today that we can count on much from just a 200 in any case, regardless of what the TAG may have hoped.

So, this proposal assigns to today's common practice no more and no less than the semantics given by RFC 2616 (which I won't attempt to restate.) What it does do is to provide a new header, and to associate much more rigorous semantics with the use of the new header.

As mentioned above, either HTTP Link headers (if the expired Internet Draft goes forward to RFC) or Resource-description headers, as proposed in recent email to the TAG, may provide a better means than status code 303 of suggesting where metadata about a resource can be found. I believe that these proposals are complementary to the proposal for Representation-source headers. The former allow you to find information about a resource; the latter allows one to determine whether the Representation fully captures the state of a resource (for all time), and if not, ensures that a URI is available both for the state retrieved and for the resource as a whole. The following table explains what a client can discover from various responses.

Results for HTTP GET of Resource with URI==R

Status code | Representation-source | Resource-description or HTTP link header | Implication
200 | R (i.e. same as the Request-URI) | (none) | The resource R is immutable. The provided representation carries the full state of the resource, and is the same as what will be returned for subsequent references.
200 | R2 (different from the Request-URI) | (none) | The resource R may serve different representations on subsequent accesses. R2 is the URI for a resource that will always serve a representation with the same content as that just retrieved.
200 | R | D | The resource R is immutable. The provided representation carries the full state of the resource, and is the same as what will be returned for subsequent references. D is the URI for a resource that provides metadata about or descriptions of R.
200 | R2 | D | The resource R may serve different representations on subsequent accesses. R2 is the URI for a resource that will always serve a representation with the same content as that just retrieved. D is the URI for a resource that provides metadata about or descriptions of R.
303 | (none) | (none) | See RFC 2616. The URI provided with the 303 could be for metadata about the resource, or could be for some other resource of interest.
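
The table above can be read as a small decision procedure for a client. The following Python is a minimal sketch of that reading, not a definitive implementation. The names Interpretation and interpret_get are hypothetical, URIs are compared as plain strings with no normalization, and the raw value of a Resource-description or Link header is passed through unparsed, since neither header's syntax is settled in this note.

```python
from typing import NamedTuple, Optional


class Interpretation(NamedTuple):
    immutable: Optional[bool]    # True, False, or None if the response does not say
    content_uri: Optional[str]   # URI to use for statements about the content retrieved
    description: Optional[str]   # raw header value pointing at metadata about R, if any


def interpret_get(request_uri, status, headers):
    """Summarize what a client may conclude from a GET of request_uri."""
    h = {name.lower(): value for name, value in headers.items()}
    if status != 200:
        # For 303 and everything else, fall back to RFC 2616; nothing is concluded here.
        return Interpretation(None, None, None)
    description = h.get("resource-description") or h.get("link")
    source = h.get("representation-source")
    if source is None:
        # A plain 200 carries only its existing RFC 2616 semantics.
        return Interpretation(None, None, description)
    # Same URI as requested: R is an IDR. Different URI: R may vary, source is the IDR.
    return Interpretation(source == request_uri, source, description)
```

A response whose Representation-source equals the Request-URI comes back with immutable set to True; one naming a different URI comes back with immutable set to False and that URI as the subject for statements about the content just read.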

A use case: Namespace documents

One way to test an approach like this is to see how it applies to use cases that are tricky. I've never been happy with our story on XML Namespaces in particular. Everyone on the TAG seems happy to have the resource that is an XML namespace, i.e. the one that has the namespace name as its URI, respond with a 200. With our traditional approach, this always seemed a stretch to me. One rationale is that the URI doesn't actually identify the namespace, but rather some sort of document that's descriptive of the namespace. I believe I've heard Tim Berners-Lee argue for this interpretation. I've never quite liked it, because it seems to me that an XML Namespace is not a document. I tend to think of it as a set of names, or if you prefer, the infinite set of all possible expanded names that share a namespace name, with some given distinguished meanings. It seems to me that there are lots of documents that could be equally good descriptions of the same namespace.

With the approach advocated here, some of the conflict goes away. We can say that the namespace is just the collection of names (either all of them or just the ones assigned specific uses, as you prefer.) We no longer have to debate whether this set is an information resource, because we can return a 200 either way. Returned with 200 will indeed be a representation in the form of some particular document, perhaps as RDDL, perhaps as RDF. Either way, the Representation-source header can give a URI for that particular RDDL or RDF document. Now we have what we want: we can use the namespace name URI to make statements about the namespace itself, and the Representation-source URI if we want to make statements about that particular namespace description document.
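
To make the last point concrete, here is a small Python sketch that attaches each statement to the appropriate subject URI. The function name describe, the example URIs, and the predicate names are all hypothetical; statements are modeled as plain (subject, predicate, object) tuples rather than with any particular RDF library.

```python
def describe(request_uri, headers, about_resource, about_document):
    """Build (subject, predicate, object) triples with the right subject for each claim.

    about_resource and about_document are lists of (predicate, object) pairs.
    """
    # Claims about the namespace itself use the namespace name URI.
    triples = [(request_uri, p, o) for p, o in about_resource]
    doc_uri = headers.get("Representation-source")
    if doc_uri is not None:
        # Claims about the description document use the Representation-source URI.
        triples += [(doc_uri, p, o) for p, o in about_document]
    # Without the header there is no unambiguous subject for document claims,
    # so this sketch simply leaves them out.
    return triples
```

For a namespace http://example.org/ns/orders whose 200 response names http://example.org/ns/orders/2008-02-23.rddl as its Representation-source, a claim such as "defines the name purchaseOrder" would be attached to the former and a claim such as "is in RDDL format" to the latter.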