AWWSW task group's final report to the TAG (draft)

A draft report prepared by Jonathan Rees for consideration by the AWWSW task group, to be presented to the TAG for its consideration.

What the TAG asked us to do

The group's charter with respect to the TAG was always unclear, as can be seen by reviewing the minutes that led to the group's creation. The effort was never the TAG's initiative, but rather sprang from concerns brought to the TAG by SWHCLSIG (the Semantic Web Health Care and Life Sciences Interest Group). SWHCLSIG was having difficulty understanding how to follow TAG advice in its Semantic Web efforts, specifically:

  1. using http: URIs (as opposed to, say, LSIDs) for making clear and persistent references in RDF
  2. correctly using http: URIs for which there are GET/200 exchanges (as opposed to, say, 303 responses, or hash URIs).

The context, in part, was that LSIDs set a certain bar for persistence, clarity of reference, and discoverability, and SWHCLSIG didn't understand how to reach this bar while following the advice (attributed to the TAG and attractive in many ways) to use http: URIs instead of LSIDs.

SWHCLSIG considered these issues to be the TAG's responsibility. TAG members did not disagree, but in effect said the TAG did not have the answers and pushed the problem back to SWHCLSIG, while also creating the task group. Although the task group's charter was never clearly articulated anywhere, it is probably safe to say that TAG members felt that these concerns were too difficult and/or distracting for treatment by the TAG as a whole, and that the TAG was therefore asking the group (which ought to be motivated by having a stake in the matter) to suggest its own answers, which the TAG might then consider for endorsement.

The group convened in November 2007, and did not reach agreement on any significant matter related to this goal by the time it suspended discussion in February 2012.

The group's teleconferences and mailing list were used informally as a sounding board for many neighboring issues, such as the appropriate interpretation of HTTP redirects and other status codes. See the group's home page for links to list archives, wiki, etc.

Despite the lack of overall consensus on candidate resolutions, the group's work has produced a number of useful outputs, which are summarized below.

Coordinating meaning

According to both AWWW and RDF, a URI "identifies" a "resource", but when used within RDF, it seems important for communicating parties to agree to some extent on which resource is identified. Just as the meaning of, say, an HTML element name in HTML, a media type name in HTTP, or a fragment identifier within a document might involve a specification or registry to ensure interoperability, so the meaning of a URI in RDF has to involve some kind of publicly mediated coordination.

The members of the group came into the project with a shared understanding of how meaning is in general coordinated, which might be summarized as follows:

Providing an exposition, or more technically a set of axioms (sentences, statements), in which a token (media type name, URI, etc.) occurs, is the best you can ever do to explain what the token "identifies" or means (or is supposed to mean, etc.). The designated axiom set may be formal or informal, and may include as much or as little information and context as deemed necessary for successful use. For example, the axioms for the 'p' tag in HTML do not explain what is meant by a "paragraph".

Such an axiom set is sometimes called a "description" because, in the usual case, interpreting the axioms as true will imply that you've interpreted the URI (which occurs in them), i.e. "identified" what it refers to, and once you've done this the axioms will read like a description of that thing. Equivalently, if you take the URI to be an unknown, the axioms will read like a description of some (initially unknown) thing, and you can then take the URI as identifying that thing in other contexts.

Because of rdfs:label and rdfs:comment values, as well as meaning-laden URI choices often used in RDF, some axioms expressed in RDF will be taken to have consequences expressible in natural language, and that fact will help coordinate the interpretation of URIs (or the statements in which they occur) beyond what one might understand from formal RDF or OWL semantics. The importance and place of natural language description and informal interpretation in the architecture was a topic of debate.

The question is where such descriptions might be discovered (the URI equivalent of the HTML specification or media type registry). Unfortunately the specifications governing the http: scheme only answer this question in the case of URIs with fragment identifiers (and not very satisfactorily there). Users of RDF could restrict their use of URIs to hash URIs, but this seems like a lost opportunity, if such a restriction can be avoided. For example, one might, in some circumstances at least, be able to use a hashless URI to refer to the content retrieved using that URI, for the purpose of assigning Dublin Core metadata, asserting document type or schema information, and so on in RDF. The TAG's httpRange-14 advice gives hope that perhaps the community might somehow make use of hashless URIs, either those for Web documents or new ones created for the purpose.
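
For hash URIs, at least, the discovery rule is mechanical, and can be sketched as follows (a minimal illustration in Python using a hypothetical example.org URI; the actual retrieval step is omitted):

```python
from urllib.parse import urldefrag

def defining_document_uri(uri: str) -> str:
    """For a hash URI, the specifications direct you to the document
    retrieved from the URI with its fragment stripped; the meaning of
    the fragment is then a matter for that document (and its media
    type).  For a hashless URI no comparable rule is specified."""
    base, fragment = urldefrag(uri)
    if not fragment:
        raise ValueError("hashless URI: no discovery rule is specified")
    return base

# The axioms for http://example.org/ontology#Gene would be sought in
# the document retrieved from http://example.org/ontology.
```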

Several suggestions for mechanisms to support definition discovery were offered by group members. There was no consensus on any of them, and in fact no agreement on evaluation criteria with which to rank them.

The suggestions fell into three broad categories:

1. The httpRange-14 advice is an adequate answer, if we replace "information resource" with "generic resource"

I.e. parts of the HTTP protocol itself can be interpreted as expressing axioms involving the URI. A 200 response to a GET signals to communicating parties that they can take the URI to "identify" a generic resource whose representations are the ones retrieved using the URI. That is, GET/200 is (by proposed consensus) equivalent to the presence of an axiom saying that what's identified is such a generic resource.

It was agreed that a 303 response never by itself implies that what's identified is not an information (or generic) resource.
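
Under this proposal, the inference licensed by an HTTP exchange can be sketched as a simple rule over status codes (an illustrative sketch of the reading described above, not an implementation of any specification):

```python
def httprange14_interpretation(status: int) -> str:
    """Sketch of what a GET on a hashless URI licenses a client to
    conclude, on the reading of the httpRange-14 advice given above."""
    if status == 200:
        # A 200 response licenses the conclusion that the URI
        # identifies an information (or, per this proposal, generic)
        # resource whose representations are the ones retrieved.
        return "identifies an information/generic resource"
    if status == 303:
        # A 303 by itself licenses no conclusion about what kind of
        # thing is identified; it only points elsewhere.
        return "no conclusion about the kind of resource"
    return "no inference licensed by the advice"
```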

Disagreement centered on whether the definition of "generic resource" from the design note was adequate, what "representation" means in that context, etc.

2. The httpRange-14 advice is an adequate answer assuming "information resource" is only an arbitrary label

Same as the above, but any connection between "information resource" as used in the httpRange-14 advice and its AWWW definition would be severed. The position was that the interpretation of "information resource" (i.e. whether any particular thing is or is not one) must be left up to "the application layer", and that obtaining consensus would be harmful.

See Draft N3 rules for HTTP inferences (a product of a group member, not of the group).

Disagreement centered on the extent to which this simply moved the problem, leaving the proposal as vulnerable to competing interpretations (now at the application layer) as the status quo.

3. Representation invariants

One member created a theory of representation invariants, which would give a rule for agreeing on useful properties (such as author or title) of the identified resource based on the totality of possible HTTP responses for GET of the URI. See Generic resources and Web metadata (a product of a group member, not of the group). The theory was received with skepticism by nearly everyone, including its author.

Persistence

While it was agreed that persistence of content and its discoverability (i.e. convincing someone to hang on to the content and make it available through well-known channels) is a social problem that neither LSIDs nor http: URIs do anything to address, some limited kind of persistence could be addressable through a standards process, since by design LSIDs meet a certain persistence bar that http: URIs don't. The difference is that http: URIs are in general available for reuse for various purposes as domain name ownership changes hands and so on, while all users of LSIDs agree that provision of different content for an LSID at different times is simply not OK. The specification doesn't tell you who is to blame if there is a conflict, but this is a purely theoretical problem at present. The social contract is clear, even if nobody knows how to implement it rigorously, and this satisfies current users of LSIDs.

Because of the approach adopted by IANA (as authorized by the IETF), under which the domain name registry is treated as mutable, rather than persistently as the other IANA registries are, this difficulty is inherent in any URI that uses domain names (other than those under .arpa) authoritatively.

Although the persistence issue was one of the motivators for creating the group, it was not discussed at any length.

Relation of retrieved representations to meaning, for hashless URIs

If an LSID has associated retrievable content, then it refers to that content. Otherwise, it is like a hash URI: one obtains other content (called "metadata" for an LSID) through an alternative channel, and takes that content to be an axiom set for the URI. (This is not specified, but is close to what people do.)

From the TAG's httpRange-14 advice, reading between the lines, it seems that if one obtains a 303 response to a GET of a hashless URI, a document at the redirect target can be taken as an axiom set for the hashless URI. This reduces the hashless http: URI to the hash URI or LSID case. This practice was already in widespread use in 2007 and the task group did not question it. Rather the question was whether there were any semantically clear uses of non-303 URIs where "what the URI identifies" matters in communication.
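
The reduction can be sketched as follows, using a hypothetical table of example.org responses in place of live HTTP exchanges:

```python
# Hypothetical responses, standing in for live HTTP exchanges:
# URI -> (status code, Location header or None).
RESPONSES = {
    "http://example.org/alice": (303, "http://example.org/doc/alice"),
    "http://example.org/doc/alice": (200, None),
}

def axiom_document_for(uri):
    """On this reading of the httpRange-14 advice: given a 303
    response, take the redirect target as the location of an axiom set
    (description) for the original, hashless URI.  Any other response
    yields no description location."""
    status, location = RESPONSES.get(uri, (None, None))
    return location if status == 303 else None
```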

For non-303 responses (specifically 200, but more generally any kind of retrieval response using any protocol), there was general agreement that the specification (RFC 3986) says that a representation retrieved using a hashless URI is a representation of the identified resource, and that it's usually a good idea to agree with the governing specification. But whether such agreement helps to guide what applications do in any way, or helps you tell "what a URI identifies" beyond it being some resource that has the retrieved representation as one of its representations (whatever that means), could never be discerned by the group.

One use case for SWHCLSIG was whether a URI that, upon retrieval, yields only the landing page (a description) of a journal article is appropriate for use in "identifying" (naming) the article itself. For example, was it good practice to use a digital object identifier URI (http://dx.doi.org/10....) yielding only a landing page on retrieval, to identify its corresponding digital object? (In 2007 DOIs used 302 responses to redirect to landing pages; since then, the DOI foundation has been transitioning from 302 redirects to 303 redirects, thus sidestepping this question. But the question remains in other contexts.) There is no written specification that is clear on this point.

Even the apparently simple case where retrieval yields the document directly and there is no content negotiation or change over time is fraught with difficulty. If a hashless URI were used with, say, Dublin Core, in this way, how could a text mining application know that it was the retrieved document, and not some other one, that was meant? The specifications simply don't say.

There was some disagreement over whether any community (RDF or broader) could be convinced to use any particular notion of "representation", or even whether any kind of convincing should even be attempted.

Beyond this only one answer was offered, the representation invariant idea given above, and it did not gain support within the group.

Link: header and .well-known

These discovery options came up at various times but were not discussed in depth.

Relationship to TAG issues

The AWWSW's work was made necessary by the difficulty of interpreting the TAG's httpRange-14 advice, and so in a sense its creation reopened TAG ISSUE-14 (later tracked under ISSUE-57). It is important to note that SWHCLSIG's purpose was not to challenge the advice, but to seek clarification of it.

The question of whether to use http: URIs or some other kind of token such as LSIDs falls under TAG ISSUE-50, which was raised in 2005 and remains open.

Next steps

The group, with its current unspecific charter and with its current membership, has gone as far as it would like to go. Further progress, if any, will have to come from a different source.

The TAG has taken up for itself most of the questions originally delegated to this group.