Interoperability of referential uses of hashless URIs

Jonathan A. Rees, 22 Aug 2011 (revised 29 Aug 2011 and 15 Feb 2012)

This note is about the continuing lack of consensus over the referential use of 'hashless' URIs (i.e. without #) for which retrieval behavior is defined. Two conventions for using such URIs referentially are being put forward, each as the preferred way to meet a notational need. The two conventions are incompatible, creating an interoperability risk.

Here are some hashless URIs for which retrieval is defined:

   http://www.w3.org/
   urn:ietf:rfc:2648
   data:,Moby%20Dick

(The documentation for the data: scheme is not explicit about retrieval, but based on what RFC 3986 says about retrieval, combined with what Web browsers actually do with data: URIs, it seems reasonable to say that obtaining the data from a data: URI constitutes retrieval.)

"Referential use" is where you use a URI in something resembling a sentence, such that it seems to refer to something, similarly to how the string (word?) "Paris" often refers to the capital of France in English sentences. For example (the notation used here is Turtle):

    <urn:ietf:rfc:2648> dc:creator "R. Moats".

Here the URI "urn:ietf:rfc:2648" is used referentially. It refers to RFC 2648, in a sentence that says that someone named "R. Moats" is a "creator" of the RFC (the precise sense of "creator" being given by the Dublin Core vocabulary).

In principle one could use a URI referentially in any language or notation, but for the purposes of this discussion, the only languages that currently matter in practice are RDF and its derivatives.

There is a natural inclination to align "identification" in the sense used in RFC 3986 and so on with "reference" in the sense used above. It is not clear that this idea commands consensus, or is forced by recognized specifications. Because the nature of "identification" in the RFCs is unclear I'll stay away from it.

Needs

Notational need D1: Referring to documents on the Web

There is a document on the web, and we want to refer to it (maybe in an RDF statement). E.g. we want to know what to write for ___ in

      ___ dc:title "HTML 4.01 Specification".

when we want ___ to refer to the document that's retrieved using the URI "http://www.w3.org/TR/html401/".

Applications: provenance, metadata, FOAF, annotation, content ratings (e.g. POWDER), etc. - any time you want to say something about a document.

In general, for any hashless URI permitting retrieval, we'd like a way to refer to the document "at" that URI. Ideally there is a deterministic procedure that goes from such a URI to the way-of-referring.

This is not a precise problem statement since "the document that's retrieved using" is vague. But that is something a concrete proposal would tighten up.

Notational need S1: Creating a way to refer to an arbitrary thing

There is something we want to refer to, but we need to refer to it somehow; that is, we need to decide what to call it. At the same time we need to put an explanation that we're using this manner of referring (i.e. what we called it) somewhere, so that others will know what we're referring to when they see the reference.

That's pretty abstruse, but the idea is simple. Here's an example. We want to know what to put for ___ in

       ___ eq:magnitude 6.9 .

when we want some ___ to refer to the Loma Prieta earthquake; and we need to know where to put the explanation that ___ is meant to so refer, so that others can find it.

Applications: linked data / semantic web / RDF generally.

Answers

Now each of these needs can be met in various ways, and when methods are chosen that don't conflict, there is no issue. Following are the particular methods that, considered together, create the difficulty.

Notational answer D2: A hashless URI refers to the document at that URI, when there is one

For any hashless URI permitting retrieval, use that URI to refer to the document found there. In the example, we'd say

      <http://www.w3.org/TR/html401/> dc:title "HTML 4.01 Specification".

That is, we transform a URI into the way of referring to its document by putting pointy brackets around it (in Turtle; or the corresponding notation in other notations). The document it refers to is the one for which retrieved representations are specific versions (instances, specializations).
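As a minimal sketch of this transformation (Python, written for illustration and not taken from any specification), the D2 way of referring can be computed from the URI alone. The function names d2_reference and d2_statement are hypothetical, and the sketch checks only that the URI is hashless; it makes no attempt to check that retrieval is defined for it or to escape the literal.

```python
def d2_reference(uri: str) -> str:
    """Return the Turtle term that, under D2, refers to the document at `uri`.

    Hypothetical helper: D2 as described here covers hashless URIs only,
    so a URI containing '#' is rejected rather than silently trimmed.
    """
    if "#" in uri:
        raise ValueError(f"not a hashless URI: {uri!r}")
    # In Turtle, the way of referring is the URI in pointy brackets.
    return f"<{uri}>"


def d2_statement(uri: str, predicate: str, literal: str) -> str:
    """Build one Turtle statement about the document at `uri`.

    Assumes `literal` contains no characters that need escaping.
    """
    return f'{d2_reference(uri)} {predicate} "{literal}".'


print(d2_statement("http://www.w3.org/TR/html401/",
                   "dc:title", "HTML 4.01 Specification"))
```

The point of the sketch is that the procedure is deterministic and needs no information beyond the URI itself, which is what need D1 asks for.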

Tim Berners-Lee has described the D2 approach in a 2002 email, a 2004 email, [+8/28] a 2007 email, a design note, and several other places. My memo "Information resources and Web metadata" gives a rigorous treatment of "the document found there" based on what I think Tim must be trying to say.

Notational answer S2: A hashless URI permitting retrieval refers to something described by what's retrieved

Choose any hashless URI, say "http://example/eq108", arrange for retrievals using that URI to yield an explanation of what that URI is supposed to refer to, and then use the resulting URI to refer to it.

For example, one would arrange for retrievals using "http://example/eq108" to yield a description similar to

     The URI "http://example/eq108" refers to the earthquake of 17 October
     1989 with epicenter near Loma Prieta, California.

or, expressed in RDF,

     <http://example/eq108> eq:epicenter <geo:37.040,-121.877> .
     <http://example/eq108> eq:date "1989-10-17" .

either of which would suffice.

Then to talk about the Loma Prieta earthquake, write

     <http://example/eq108> eq:magnitude 6.9 .

For one formulation and defense of this idea see Ian Davis, "Is 303 Really Necessary?"

There are numerous details to be filled in here: are there any non-explanation retrievals; if so how are they distinguished from explanations; what languages are permitted; and so on.

It is possible to assume notation S2 and then employ notation D2 as an opt-in, by providing an explanation that says (or implies) that the URI refers to the document found at the URI. But this is not adequate to satisfy need D1, which requires a way to refer, given an arbitrary hashless URI, to the document found at that URI.

The Conflict

It should be clear that these answers are mutually exclusive. If for some URI you think, based on D2, that its referent has certain properties, then you will be confused or make mistakes when you talk to someone who, based on S2, thinks its referent (something different) has different properties.

What would an actual skirmish look like? To construct a scenario in which the result is not just confusion (a type error) but an incorrect answer, we need a URI that is interpretable according to both D2 and S2, and a statement expressed using the URI that has consequentially different meanings depending on whether D2 or S2 is assumed. The setup is as follows:

A document, call it X, is found at "http://example/x". X is a description of a different document, Y. Suppose there is RDF in X that uses the URI "http://example/x" to refer to document Y.

(This is a very frequent occurrence, for example for journals, digital repositories, media sites such as Flickr, catalogs such as Amazon, and so on; the description page is usually called a "landing page" and it normally gives metadata and perhaps an offer to sell a document, and a link to a PDF or some other way to obtain the intended document.)

Now consider the statement

   <http://example/x> xhv:license 
       <http://creativecommons.org/licenses/by/3.0/>.

According to D2, this says that document X is licensed. According to S2, this says that document Y is licensed. Which of these is the case matters a lot, since if you get it wrong you could be misled into making an unlicensed use of either X or Y.
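The divergence can be made concrete with a small sketch (Python, for illustration only). The two lookup tables and the function name referent are hypothetical stand-ins for "the document found at the URI" and "the thing the retrieved explanation says the URI refers to"; a real client would have to derive both from an actual retrieval.

```python
# Hypothetical stand-ins for the scenario: X is found at the URI,
# and X explains that the URI refers to Y.
DOCUMENT_AT = {"http://example/x": "document X (the landing page)"}
DESCRIBED_BY = {"http://example/x": "document Y (the described document)"}


def referent(uri: str, convention: str) -> str:
    """Return what `uri` refers to under the given convention ("D2" or "S2")."""
    if convention == "D2":
        # D2: the URI refers to the document found at the URI.
        return DOCUMENT_AT[uri]
    if convention == "S2":
        # S2: the URI refers to whatever the retrieved explanation says.
        return DESCRIBED_BY[uri]
    raise ValueError(f"unknown convention: {convention!r}")


# The license statement from the text, as a (subject, predicate, object) triple.
statement = ("http://example/x", "xhv:license",
             "http://creativecommons.org/licenses/by/3.0/")

for convention in ("D2", "S2"):
    subject, _predicate, license_uri = statement
    print(f"{convention}: {referent(subject, convention)} "
          f"is licensed under {license_uri}")
```

The same triple yields two different licensed subjects, and nothing in the triple itself tells a reader which convention its author assumed.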

It doesn't matter where the statement is placed; it could be in X (perhaps placed there by a D2-assuming tool unaware that X's author assumed S2), in Y, or in some other document (such as a manifest).

There are many examples of this particular conflict in the wild, e.g. see any Flickr photo page or any music page at Jamendo. Retrievals using one of these URIs contain RDF that uses the URI to refer to something other than the document found at the URI. A D2 client such as Tabulator will infer something other than what the author of the page meant.

[+8/28] There is some evidence that S2, rather than D2, is meant to apply in the case of some resolvable URNs as well. For example, when "urn:nbn:de:0002-938" is resolved using a resolving service, the result is a landing page, not the named document. It is not completely clear that what the "resolution" service does constitutes a retrieval in the sense of RFC 3986, but the specification governing the NBN URN namespace is clear that a description (landing page) may be substituted for content.

[+8/28] In the case of the URI "data:,Moby%20Dick", we would have a reference to something having two words according to D2, or a reference to something having over two hundred thousand words (the novel) according to S2.
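The D2 reading of the data: URI can be checked mechanically. The following is a minimal Python sketch, assuming that decoding the percent-encoded payload is what "retrieval" amounts to for this scheme; the function name data_uri_payload is hypothetical, and base64 payloads are deliberately left out.

```python
from urllib.parse import unquote


def data_uri_payload(uri: str) -> str:
    """Decode the payload of a simple, non-base64 data: URI."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "data":
        raise ValueError("not a data: URI")
    mediatype, comma, payload = rest.partition(",")
    if not comma:
        raise ValueError("data: URI has no comma separating the payload")
    if mediatype.endswith(";base64"):
        raise ValueError("base64 payloads are outside this sketch")
    # Percent-decoding the payload yields the retrieved content.
    return unquote(payload)


text = data_uri_payload("data:,Moby%20Dick")
print(text)               # the document that D2 says the URI refers to
print(len(text.split()))  # its word count
```

Under S2 there is no comparable computation: what the URI refers to depends on reading the retrieved text as an explanation, here as the title of a novel.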

If the conflict is resolved in favor of D2, need S1 can be met by some other method, such as fragment identifiers, a new "thing explained by" URI scheme, blank node notation, etc. (see the ISSUE-57 memo). If resolved in favor of S2, need D1 can be met using URNs, DOIs, the handle system, a new URI scheme, blank node notation, etc. But the conflict persists without resolution because the same notation is the strongly preferred answer to both needs.

Many arguments for S2 over D2 have the form of criticism of D1 or D2. [+8/28] The legitimacy of need D1 has been questioned; it is sometimes seen as marginal compared to what "ordinary" linked data practice needs, and therefore not deserving of such valuable syntactic space. Furthermore, D2 is considered to be either confusing, or useless, or wrong in detail. D2 is also seen as an uphill battle. It is very easy to disregard (as evidenced by the Flickr case), is not the outcome of a formal consensus process, and has weak support, if any, in applicable specifications. In addition it currently has no enforcement points, so its continued use has to be maintained by social, not technical, pressure.

[+8/28] S2 is often defended by citing the lack (in the applicable RFCs) of any restriction on what a URI can "identify," by pointing to the sovereignty of the "URI owner" according to AWWW, and by noting the very loose interpretation of "representation" implied by Roy Fielding's REST writings.

The above summary is of course just a tiny and shallow sampling of what has been expounded on the subject over the years.

Current situation

D2 has widespread uptake for two reasons. First, it is entirely natural in some situations, such as the HTML4 example above and for POWDER. Second, it was promoted indirectly by the TAG's httpRange-14 resolution.

(D2 and clause (a) of the httpRange-14 resolution are closely related, but not equivalent. D1/D2 as formulated here are my own belief about what the httpRange-14 resolution was intended to accomplish, and how it has been interpreted in practice; but it does not actually accomplish what I think it was meant to.)

D2 was promoted in turn by those who were persuaded by it (e.g. Cool URIs for the Semantic Web, the pedantic-web group, Creative Commons).

D2 uptake takes the form not only of use of hashless URIs to refer to documents, but also of the avoidance of hashless URIs when D2's implication that the referent is the document is to be avoided. For example, Crossref uses 303 redirects to avoid the implication that its URIs refer to what's retrieved: when there is a landing page, the URI needs to refer to the document linked or described from the landing page, not the landing page itself.

S2 likely had uptake prior to the httpRange-14 debate because it too is entirely natural. It is often assumed to be the way things should work (in RDF) by people who have never heard of the httpRange-14 idea.

The httpRange-14 resolution suggested the 303 redirect response to GET as a way out for those wanting to use hashless (fragment-free) URIs to meet need S1. The 303 status signals the absence of retrieval behavior for the URI. This method provides the desired syntax, but makes discovery more difficult than S2 because it's hard for some publishers to deploy and because of performance worries (typically two network round trips required instead of one per S2).
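The client-side rule this implies can be sketched as follows (Python, for illustration; not taken from the text of the resolution). The enum, its labels, and the function name classify are assumptions of this sketch, status codes the discussion does not cover are simply treated as inconclusive, and the round-trip table only restates the "two instead of one" estimate above.

```python
from enum import Enum


class Referent(Enum):
    DOCUMENT = "the document retrieved (D2 applies)"
    ANYTHING = "possibly anything; consult the redirect target for a description"
    UNKNOWN = "nothing can be concluded from this response"


def classify(status: int) -> Referent:
    """Map the status of a GET on a hashless URI to what the URI may refer to."""
    if 200 <= status < 300:
        # Retrieval succeeded, so the URI is taken to refer to a document.
        return Referent.DOCUMENT
    if status == 303:
        # 303 signals the absence of retrieval behavior for this URI.
        return Referent.ANYTHING
    # Other codes (including other redirects) are not handled by this sketch.
    return Referent.UNKNOWN


# Typical number of requests before a client holds a description of the
# referent, as estimated in the text; actual counts depend on deployment.
ROUND_TRIPS = {"S2": 1, "303 redirect": 2}

for status in (200, 303, 404):
    print(status, classify(status).value)
```

A publisher who wants a hashless URI for an earthquake therefore has to configure the server to answer 303 for that URI, which is the deployment burden mentioned above.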

The main purpose of my recent ISSUE-57 report was to start working toward sweetening answer D2 for S2 advocates, so that they might be more willing to accept D2. A gap in my personal understanding of S2 is why the widely used method of fragment identifiers is rejected, but I am assured that there are important situations in which it is unacceptable.

Note: D2 and S2 are not mutually exclusive in every case - in the special case of documents that contain their own metadata (and can be determined to do so), they can be realized at the same time.

What's at stake?

Harry Halpin (email), Manu Sporny (email), Ian Davis (blog post), and others point the finger at the httpRange-14 resolution as a major impediment to linked data uptake - perhaps so much so that without official sanction of S2, RDF-based linked data may lose out to some other emerging generic data notation.

D2, being in some sense the dominant position, has had few active defenders other than TimBL, although my personal guess is that defenders would materialize should a move be made to amend it. What would be lost without it: an easy and natural way to refer to Web documents. Either Web architecture would require an overhaul, or reference in RDF would have to be divorced from webarch "identification".

[+8/28] A compromise approach would consist of a way to recognize D2 URIs, a way to recognize S2 URIs, and a method to construct a reference to the document at any retrieval-enabled hashless URI (including those not admitting D2). Some compromises have been put forth, but further work needs to be done before any is ready for advancement through anything like a consensus process.

It has also been suggested that neither D2 nor S2 is a good solution to the problem it is intended to address, and that a new architecture for reference is needed in both cases. If this were agreed, a revolution would be required for RDF content, which at this point would be costly and probably impractical.

The issue remains mainly academic, since for the most part the two worlds do not intersect. We don't currently refer to landing pages very much, and most documents don't look like explanations of what the URIs where they are found refer to. It is possible that D2 and S2 can be used side by side by different communities for quite a while before a collision of the sort described above becomes a serious interoperability problem. On the other hand, when the conflict does happen, it will be very painful.

The way forward

I feel that in spite of the enormous volume of words spilled over this, communication between the "factions" has been poor and there has been little progress in developing mutual sympathy and understanding. More writing probably won't help, since nearly every contribution so far has focused on defending and/or attacking particular points of view. A teleconference or in-person meeting probably will be required if this thing is going to budge, because it's so difficult to develop joint solutions to complex problems in email. This is just my opinion.

[+9/3] (Since writing that I've thought of one thing that might be done, and that's to articulate requirements so that proposals can be compared. Will discuss at F2F.)

TAG members please review:

Thanks once again to Alan Ruttenberg for helpful comments on a draft, and to Martin Dürst for correcting my terminology.