W3C

How to refer to something using a URI

W3C Editor's Draft 13 March 2011

This version:
http://www.w3.org/2001/tag/awwsw/issue57/20110313/
Latest version:
http://www.w3.org/2001/tag/awwsw/issue57/latest/
Editor:
Jonathan A. Rees <jar@creativecommons.org>

This document is also available in these non-normative formats: XML.


Abstract

Semantic Web and Linked Data applications require URIs that refer to arbitrary entities. Deployment and performance difficulties with current practice have led to a search for new mechanisms. This report brings together in one place information on current and proposed practices, with analysis of the benefits and shortcomings of each.

The purpose of this report is not to make recommendations but to initiate discussion that might lead to some.

Status of this Document

This document has been developed by the AWWSW Task Force of the W3C Technical Architecture Group in order to provide background material for further discussion among those affected by this architectural question, and to help drive TAG issue 57 [issue-57] to a conclusion.

This version is an editor's draft with no standing. It has not received review within either the task force or the TAG.

Publication of this draft finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.

Please send comments on this document to the publicly archived TAG mailing list www-tag@w3.org (archive).

Table of Contents

1 Introduction
    1.1 Information resources
    1.2 Glossary
2 Use case scenarios
    2.1 Preparing and consuming metadata for a Web-accessible information resource
    2.2 Choosing a phrase, providing an account of the phrase, using the phrase
    2.3 Referring to the primary topic of a document
3 Conventions in current use
    3.1 URI scheme and URN namespace registrations
    3.2 The LSID URN namespace
    3.3 A dereferenceable URI refers to the information resource at that URI
    3.4 Non-URI phrase
    3.5 Cite your sources
    3.6 'Hash URI'
    3.7 'Slash URI' with HTTP 303 See Other redirect
4 Critique of the current solution suite
    4.1 Registration, citing your sources, and non-URI phrases are too hard
    4.2 Fragment identifiers get lost
    4.3 The common fragment identifier pattern fails with large namespaces
    4.4 Fragment identifiers aren't seen by servers
    4.5 303 is difficult, sometimes impossible, to deploy
    4.6 303 leads to too many round trips
    4.7 303 makes the URI difficult to bookmark
5 Possible new conventions
    5.1 Syntactic sugar
    5.2 'Hash URI' with fixed suffix
    5.3 'Slash URI' with chimera entity
    5.4 'Slash URI' with site-specific discovery rules
    5.5 'Slash URI' with new HTTP request or response
    5.6 Refer to information resources in some other way
    5.7 Overload dereference, and use response properties to distinguish the two cases
6 Summary
7 References


1 Introduction

The emergence of languages such as OWL and RDF that pervasively use URI-based vocabularies brings to prominence the problem of referring, in those languages, to things one has to refer to, in such a way that the reference will be understood by those encountering the reference. These references either are URIs or are built on URIs, so the problem of referring reduces to that of either knowing, or influencing, the way that readers will interpret URIs.

[Advice welcome on what needs to go in an intro]

1.1 Information resources

"Information resources" figure prominently in this narrative both as providers of information and as subjects of metadata. The following explains the particular theory of "information resources" assumed in this document.

Each information resource has one or more associated versions each having fixed content (octet sequence) and interpretation directives (media type, language). An information resource having more than one version is said to be generic.

No particular meaning is implied by the word "version;" the word is chosen as suggestive of its most common use.

One can attribute metadata properties such as author, title, and topic to versions in the obvious way. These properties extend to generic information resources in a systematic way: if a property is shared by all of an information resource's versions, then we attribute that property to the information resource, and vice versa.

Operationally, this means that based on knowledge of its versions one can write metadata using an information resource as subject, and someone reading this metadata can then apply that metadata to whatever version they access.

Information resources need not be accessible at a URI; they might exist only inside a local database, or they may be ephemeral.

1.2 Glossary

[All terminology choices are provisional; for most of them I am testing the waters to see how well the word works, but I'm prepared to change.]

accessible via
When a URI is dereferenceable, "the information resource accessible via a URI" (abbreviated IR(that URI), see below) is the information resource whose specializations are obtained by dereferencing that URI. If there is only one such specialization, it is that specialization; otherwise the information resource is "generic".
account
A document or document part that provides information about the meaning of a URI or other phrase.
dereferenceable
This term is defined as per RFC 3986 [rfc-3986]. URIs possessing fragment identifiers are never considered dereferenceable. Non-fragid http: URIs are dereferenceable if a GET method (or equivalent) yields a success (2xx) response. Some URIs belonging to some other schemes are also dereferenceable.
information resource
An "information resource" is either specific, i.e. a document or other replicable entity such as an image or sound recording, or generic, with specializations (specific information resources) that have something in common. Metadata that is true of every specialization (e.g. that a specialization has a certain author) of a generic information resource is considered to be true of the generic information resource itself.
IR(u)
IR(u) is shorthand for the information resource accessible via URI u. For example, IR('http://example/image23') is the information resource accessible via the URI 'http://example/image23'.
metadata
Information about an information resource. In RDF, metadata might be expressed using vocabularies such as Dublin Core, FOAF, or CC REL.
phrase
A URI or other symbol or symbol sequence that can serve as subject or object in a statement. In an RDF serialization, for example, a phrase might be a qname, URI, or blank node label. In Turtle, a phrase might be any Turtle term, including one written using blank node [...] notation. In natural language, it might be a URI or a noun phrase.
version
An information resource with fixed content (octet sequence) together with directives, such as media type and language, intended to guide the interpretation of the content. [Cf. TimBL 'fixed resource.']
WS(u)
WS(u) is shorthand for the meaning of a URI u as accounted for in the information resource IR(u). For example, if IR('http://example/fred') says that 'http://example/fred' refers to Fred the mynah, then WS('http://example/fred') is Fred the mynah. ('WS' = 'whatever it says'.)

2 Use case scenarios

2.1 Preparing and consuming metadata for a Web-accessible information resource

Bob is preparing a bibliography. He finds a report on spoonwings provided by Alice at the URI http://example/spoonwing and wishes to refer to the report for the purpose of composing metadata such as its title, author, and publication date. He selects a phrase to use to refer to the report, then composes the metadata, using the chosen phrase as the subject of each statement.

Subsequently Carol encounters an entry from Bob's bibliography. Wanting to know what the subject phrase refers to, she is led somehow to dereference http://example/spoonwing, and is led to understand that IR('http://example/spoonwing') is the document that Bob is talking about.

Variant: Bob's bibliography includes a number of RDF documents, and his metadata includes information relevant for making use of those RDF documents.

Variant: Instead of being a person, Bob is a tool that is charged with updating all the documents on a Web site with license metadata.

2.2 Choosing a phrase, providing an account of the phrase, using the phrase

Alice wants to refer to Fred, a mynah living at a local zoo. Alice "mints" a new phrase (one that is not yet in use; either a new URI or a phrase built on one) with the purpose of using that phrase to refer to Fred. Alice publishes a document that would lead a reader to realize that the phrase refers to Fred.

Bob then learns of Alice's phrase and uses it in a document of his own.

Subsequently Carol encounters Bob's document. Wanting to know what the phrase means, she is led to Alice's published account, which she reads. She is enlightened.

Variant: instead of Fred, the referent of the phrase is to be an information resource that is not accessible on the Web, or at least not at any URI known to Alice.

Variant: instead of Fred, the referent is to be an information resource that is accessible, via a URI known to Alice. The referent is not the account that Alice publishes, it is the document that Alice's account describes. (In this situation, which is common in the publishing industry and digital archives, Alice's account is often called a "landing page".)

2.3 Referring to the primary topic of a document

Bob desires to refer to Chicago. He finds a Web page on the Web at http://example/chicago (provided by, say, Alice) that consists of a description of Chicago. Somehow he comes up with a phrase that will be understood as referring to the primary topic of Alice's Web page.

Carol encounters the phrase Bob used, is led to Alice's description of Chicago, and then somehow discovers that the phrase is meant to refer to Chicago.

[This use case keeps coming up (e.g. tdb:) but I don't think anyone is seriously interested in it. TBD: Explain how it differs from the previous one.]

3 Conventions in current use

This section describes how people currently implement the "somehows" in the use cases.

3.1 URI scheme and URN namespace registrations

A URI scheme registration helps to account for the meaning of URIs using that scheme. For example, the registration for the data: URI scheme fully explains the meaning of URIs that use that scheme.

Most URI scheme registrations, such as that for http:, only provide a partial ('schematic' you might say) account, and other sources of information must be consulted in order to understand a URI using that scheme.

Registering a new URI scheme requires community and IETF Expert Review; see RFC 4395.

3.2 The LSID URN namespace

[Not exactly common - is this worthy of mention? But it is used. Maybe rule out all non-linked-data solutions up front?]

urn:lsid: has an associated protocol that has separate methods for dereference and discovery.

3.3 A dereferenceable URI refers to the information resource at that URI

To refer to the information resource accessible via a given URI, use the URI: http://example/ir refers to IR('http://example/ir'). Those who encounter the reference can dereference the URI, and on seeing that the dereference is successful, will take the URI to be a reference to the information resource accessible via that URI.
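As a rough illustration, the test a reader applies here can be written down using the glossary's definition of "dereferenceable" (no fragment identifier, and a 2xx response to GET). The following Python sketch is only that: the function name is invented for this example, the status code is passed in as an argument so that no network access is needed, and only http: and https: URIs are handled.

```python
from urllib.parse import urlsplit


def is_dereferenceable(uri: str, get_status: int) -> bool:
    """Apply the glossary's test for a dereferenceable http(s) URI.

    `get_status` is the status code that a GET of `uri` produced.
    Other schemes may also be dereferenceable, but they are outside
    the scope of this sketch, so they are rejected rather than guessed at.
    """
    if urlsplit(uri).scheme not in ("http", "https"):
        raise ValueError("this sketch only covers http: and https: URIs")
    # Any '#' means a fragment identifier is present, even an empty one.
    if "#" in uri:
        return False
    return 200 <= get_status <= 299
```

Under this convention, a True result is what licenses a reader to take the URI as referring to IR(that URI).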

3.4 Non-URI phrase

URIs are just one kind of phrase that might be used to refer to something. In RDF serializations, for example, we have blank node notation:

[ foaf:isPrimaryTopicOf <http://example/about-fred> ]

The problem of figuring out (or documenting) the meaning of the overall phrase reduces to that of figuring out (or documenting) the meaning of the URIs that occur in it.

3.5 Cite your sources

Whenever using a URI to refer to something, provide a link to the document that carries an account of the URI's meaning. This is the approach taken by OWL (owl:imports). The rdfs:definedBy property could also be used for this purpose.

Both of these properties beg the question in that they do not say how to figure out what the target URI refers to.

3.6 'Hash URI'

To refer to something, mint a URI with a fragment identifier, and provide an account of the intended meaning at the pre-fragment stem of the URI. That is, if the URI is http://example/vocabulary#term, then put an account of that URI in the document at http://example/vocabulary . [mention 3986 and AWWW?]

Those encountering http://example/vocabulary#term will access http://example/vocabulary and read the account.
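The step from a hash URI to the document expected to carry its account is purely syntactic. A minimal Python sketch, using the standard library's urldefrag (the function name account_location is an assumption made for this example):

```python
from urllib.parse import urldefrag


def account_location(hash_uri: str) -> str:
    """Return the pre-fragment stem, where the account is expected to live."""
    if "#" not in hash_uri:
        raise ValueError("not a hash URI: " + hash_uri)
    stem, _fragment = urldefrag(hash_uri)
    return stem


print(account_location("http://example/vocabulary#term"))
# prints http://example/vocabulary
```

Because the fragment is stripped by the client before the request is made, the server never sees which term was wanted.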

This approach and the following one are completely generic mechanisms and may be used in situations where a dereferenceable URI would also be correct. The choice would be based on weighing the importance of dereference against the importance of a more explicit account (usually involving metadata).

3.7 'Slash URI' with HTTP 303 See Other redirect

To refer to something, mint an http: or https: URI without a fragment identifier (say http://example/fred), make an account of it accessible via a second URI (say http://example/fred.account), and arrange for a GET of http://example/fred to yield a 303 response carrying a Location: header with http://example/fred.account as its target.

Those encountering http://example/fred will dereference, but this will fail with a 303 redirect, indicating that http://example/fred does not refer to an information resource at http://example/fred, but rather that the document at http://example/fred.account accounts for the URI's meaning. [see HTTPbis]
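One way to picture the consumer's side of this convention is as a small decision function over the GET response. The Python sketch below is illustrative only: the function name and the three result labels are invented here, and the response is passed in as a status code plus a header mapping rather than fetched.

```python
from typing import Mapping, Optional, Tuple
from urllib.parse import urljoin


def interpret_get(uri: str, status: int,
                  headers: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Classify a GET response under the 303 convention.

    Returns a pair (kind, target):
      ("information-resource", uri)   for a 2xx response
      ("see-account", account_uri)    for a 303 with a Location header
      ("unknown", None)               for anything else
    """
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if 200 <= status <= 299:
        return ("information-resource", uri)
    if status == 303 and "location" in lowered:
        # Location may be relative; resolve it against the request URI.
        return ("see-account", urljoin(uri, lowered["location"]))
    return ("unknown", None)
```

For the example above, interpret_get('http://example/fred', 303, {'Location': 'http://example/fred.account'}) yields the "see-account" case, and the reader then performs a second GET to read the account.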

[Is anyone, in practice, deploying 303 redirects to a "primary topic" page not mentioning the URI to be accounted for, rather than to be a document that explicitly mentions the URI?]

With any of these conventions other than dereferenceable URIs, the URI may refer to anything at all, including an information resource. [COMMON MISUNDERSTANDING, not sure where this goes in the document. This email gets it totally wrong, it's not about IR vs. NIR, it's about which thing the URI is to refer to, the one generalizing what you get, or the one accounted for by what you get.]

4 Critique of the current solution suite

4.1 Registration, citing your sources, and non-URI phrases are too hard

Enough said.

4.2 Fragment identifiers get lost

"People forget to put it there when writing and cut and pasting URIs." (Harry) [More information needed.]

4.3 The common fragment identifier pattern fails with large namespaces

When a large number of URIs are formed by combining a fixed "namespace" prefix with a single suffix using hash as a connector, there will be a single underlying document that must provide accounts of all of the large number of URIs. This is an unacceptable performance hit for the server, the network, and the client. "Slash" URIs don't have this problem as the response can be specific to each URI.

4.4 Fragment identifiers aren't seen by servers

(1) The document provided by the server must account for all hash URIs based on the document's URI. This could be a large number. (2) Hash URIs don't work with HTTP PUT, POST, or DELETE methods. (Manu)

4.5 303 is difficult, sometimes impossible, to deploy

Many hosting solutions do not support Apache .htaccess or any equivalent.

The Chicago use case is an extreme version of this - the entity providing access to the Chicago document (Alice) does not even care about providing URIs that refer to Chicago; it is someone having no control over how the URI dereferences (Bob) who needs a reference to Chicago.

4.6 303 leads to too many round trips

To get accounts of N URIs provided by redirecting through 303 responses, you need to do 2N HTTP requests.

4.7 303 makes the URI difficult to bookmark

[See JAR's "tempolink" blog post]

5 Possible new conventions

With fragment identifiers and the 303 redirect identified as the sources of current difficulties, a number of alternative mechanisms have been suggested to get around these problems.

5.1 Syntactic sugar

Use a new kind of non-URI phrase, for example

*<http://example/about-fred>

the asterisk being suggestive of indirection in languages derived from C.

[This idea derives from JAR's TAG slides. This is mainly to get people thinking: the problem is notational engineering, not philosophy.]

5.2 'Hash URI' with fixed suffix

This idea attempts to address one reason for using "slash URIs" instead of fragment identifiers. Suppose you want to combine a large number of local names a, b, c, ... into a namespace. The usual solutions would be to write http://example/namespace#a (a "hash namespace") or http://example/namespace/a (a "slash namespace").

In the "singleton fragid" approach one would write http://example/namespace/a# (a null fragment identifier) or http://example/namespace/a#_, using a fixed suffix for every URI and varying the part between the namespace prefix and the suffix.

As in the 303 approach, each URI in the namespace would (or could) have its own document, providing an account for that single URI rather than every URI in the namespace.

The choice of fixed fragment identifier (null, "_", or something else) is largely a matter of taste.

A null fragid precludes the use of qnames to abbreviate such URIs. (In particular it would not be possible to use them as predicate names in RDF/XML.) However, SPARQL, Turtle, and RDFa are being extended to admit CURIEs that include #, making this a newly attractive option.

To address the "hash gets lost" problem we could explore heuristics to automatically replace http://example/fred with http://example/fred# (or http://example/fred#_) when needed.
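The three URI shapes discussed in this section, and the simplest possible form of the "restore the suffix" heuristic, can be sketched in Python as follows. All function names are invented for this example, and restore_suffix is deliberately naive: it appends the suffix to any URI lacking a '#', whereas deciding when that is actually "needed" is the open question.

```python
def hash_namespace_uri(namespace: str, local: str) -> str:
    """'Hash namespace': one document accounts for every local name."""
    return f"{namespace}#{local}"


def slash_namespace_uri(namespace: str, local: str) -> str:
    """'Slash namespace': each local name gets its own fragment-free URI."""
    return f"{namespace}/{local}"


def singleton_fragid_uri(namespace: str, local: str, suffix: str = "") -> str:
    """'Singleton fragid': a per-name stem plus one fixed fragment identifier.

    The default suffix is the null fragment identifier; "_" is the other
    choice mentioned in the text.
    """
    return f"{namespace}/{local}#{suffix}"


def restore_suffix(uri: str, suffix: str = "") -> str:
    """Naive repair for a lost fragment: add '#suffix' if no '#' is present."""
    if "#" in uri:
        return uri
    return f"{uri}#{suffix}"


print(singleton_fragid_uri("http://example/namespace", "a"))       # http://example/namespace/a#
print(singleton_fragid_uri("http://example/namespace", "a", "_"))  # http://example/namespace/a#_
print(restore_suffix("http://example/fred", "_"))                  # http://example/fred#_
```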

5.3 'Slash URI' with chimera entity

[Ed Summers's favorite]

In this approach we use IR('http://example/fred'), which seems to say that http://example/fred refers to Fred, as a proxy for Fred. We attribute to IR('http://example/fred') information that seems to be about Fred, and then interpret it to be either Fred or itself as the need arises. We call this the "chimera" approach because we have a single entity that has two different personalities. In effect:

IR('http://example/fred') = WS('http://example/fred')

Ways that this can fail:

  1. "Alice likes http://example/fred" and "http://example/fred says to turn on the radio" are ambiguous
  2. If WS('http://example/fred') is an information resource other than IR('http://example/fred') we will end up with nonsense, i.e. inconsistent metadata attributed to a single entity [this is the CC REL lossage case]
  3. If we infer, e.g. due to functional properties, that WS('http://example/fred') = WS('http://example/george'), this will induce IR('http://example/fred') = IR('http://example/george') and, again, inconsistent metadata

To make the "chimera" approach work, strategies are needed for avoiding each of these pitfalls. E.g. (1) could be addressed by a prohibition on the use of predicates that might apply to either IRs or non-IRs, or by a priority system explaining which subject is meant, (2) by saying that the account of the URI must not lead to the URI being understood to refer to any IR other than IR('http://example/fred'), (3) by having the community agree that axioms that enable equational inferences shouldn't be written for these entities.

5.4 'Slash URI' with site-specific discovery rules

For http://example/fred, obtain the host-meta file for its host via http://example/.well-known/host-meta. (See [hostmeta] and [rfc-5988].) Then look in the host-meta file for a link-template rule that maps http://example/fred to another URI, say http://example/fred.about, and then look for an account by dereferencing http://example/fred.about.

When the host-meta file is cached, and many accounts are sought from the same host, this reduces the number of round trips from two (in the 303 case) to one.
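The saving can be made concrete with a little arithmetic. The Python sketch below compares the 2N requests of the 303 pattern (section 4.6) with the host-meta pattern, under stated assumptions: every host involved actually serves a host-meta file, the file is fetched once per host and then cached, and no request fails. The function names are invented for this example.

```python
from typing import Iterable
from urllib.parse import urlsplit


def requests_with_303(uris: Iterable[str]) -> int:
    """Each URI costs one GET answered by a 303 plus one GET of the account."""
    return 2 * len(list(uris))


def requests_with_host_meta(uris: Iterable[str]) -> int:
    """One host-meta GET per distinct host (cached afterwards),
    plus one GET per account."""
    uris = list(uris)
    hosts = {urlsplit(uri).netloc for uri in uris}
    return len(hosts) + len(uris)


uris = [f"http://example/item{i}" for i in range(100)]
print(requests_with_303(uris))        # 200
print(requests_with_host_meta(uris))  # 101
```

The advantage shrinks as the URIs spread over more hosts: with one URI per host, both patterns cost two requests per account.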

Such rules could augment or replace the use of 303 (or even 404) responses in order to reduce the number of round trips required to obtain accounts of URIs.

Looking for a host-meta file for every host that has URIs for which accounts need to be discovered would be expensive if only a few of them have such files, so some cleverness would be required to reduce the expected number of round trips. The details would have to be worked out, but this could be a boon to bulk consumers of "slash" URIs.

5.5 'Slash URI' with new HTTP request or response

To reduce the number of round trips, we might use a new HTTP method to request an account of a URI's meaning, or the server could use a new status code to indicate that what it is returning is an account of a URI's meaning.

The URIQA specification proposes such an HTTP request method. Unfortunately URIQA sacrifices the works-in-browser property enjoyed by 303.

Possibilities for HTTP response status codes: 203, new 2xx (e.g. 209), new 3xx (e.g. 308), 404. 301, 302, 303, and 307 redirects are problematic as the entity in the response is not displayed in a browser.

Any of these options would mean fewer round trips than a 303 redirect. Unfortunately they are generally as difficult as, or more difficult than, 303 redirects to deploy.

5.6 Refer to information resources in some other way

[I've been calling this one "just be clear"; it was suggested first by Harry, then echoed by others.]

Currently we use a dereferenceable URI http://example/fred to refer to the information resource at that URI, IR('http://example/fred'). But we could free it up to refer to whatever that information resource says it refers to, that is, WS('http://example/fred'), by switching to a different notation for referring to IR('http://example/fred'). That is, if IR('http://example/fred') said that 'http://example/fred' referred to Fred the mynah, then it would. This would permit unrestricted use of "slash URIs" without the use of 303: every dereferenceable URI would refer to WS(that URI).

To make this work all that's needed is a standard way to write IR('http://example/fred') in each affected language. For example, the Turtle phrase

[ :accessibleVia "http://example/fred"^^xsd:anyURI ]

could be one way to refer to IR('http://example/fred'). A local name could be defined to the same effect:

:fred-doc :accessibleVia "http://example/fred"^^xsd:anyURI .

Or the referring document could just assert that it's using the URI to refer to the IR in question:

<http://example/fred> :accessibleVia "http://example/fred"^^xsd:anyURI .

which would constitute an explicit opt-in to the httpRange-14 rule.

(I'll refer to any of these three as "clumsy notation" below.)

5.7 Overload dereference, and use response properties to distinguish the two cases

Under this approach, some HTTP responses are 'marked' in a recognizable way that declares that the request URI (say http://example/fred) refers to WS('http://example/fred'), as opposed to the usual IR('http://example/fred').

To refer to IR('http://example/fred'), an agent would either uniformly use clumsy notation (above), or it would test, as an optimization, for the presence of the marker in the HTTP response. With no marker, the agent would use the URI to refer to the IR; when the marker is present it would revert to clumsy notation.

One candidate for such a marker would be the presence of a Link: header with some particular link relation. But other headers or even the content might serve.
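To make the Link: header variant concrete, here is a minimal Python sketch of the test an agent might apply. It is not a proposal: the link relation name "x-accounts-for" is hypothetical (no such relation is registered or suggested in this report), the function names are invented, and the header parsing is deliberately naive rather than a full implementation of the Link header grammar.

```python
import re
from typing import Mapping

# Hypothetical marker relation, chosen only for this illustration.
MARKER_REL = "x-accounts-for"

# Naive matcher for rel parameters such as: ; rel="a b"  or  ; rel=a
_REL_PARAM = re.compile(r';\s*rel\s*=\s*"?([^";,]+)"?', re.IGNORECASE)


def is_marked(headers: Mapping[str, str]) -> bool:
    """Return True if some Link header value carries the marker relation."""
    for name, value in headers.items():
        if name.lower() != "link":
            continue
        for match in _REL_PARAM.finditer(value):
            relations = {rel.lower() for rel in match.group(1).split()}
            if MARKER_REL in relations:
                return True
    return False


def referent(uri: str, headers: Mapping[str, str]) -> str:
    """Say which reading of `uri` the response selects, in this report's notation."""
    return f"WS('{uri}')" if is_marked(headers) else f"IR('{uri}')"
```

With a response carrying Link: <http://example/fred.account>; rel="x-accounts-for", referent('http://example/fred', headers) gives WS('http://example/fred'); with no such header it gives IR('http://example/fred'), which is today's default reading.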

As a further refinement, instead of clumsy notation to refer to IR('http://example/fred') in the presence of the marker, an agent could look for a second kind of HTTP header, and that would provide a second URI that refers to IR('http://example/fred'). The second URI might also be discovered in other ways.

[need to explain the Content-type: idea, which I don't understand - exchange on public-lod]

6 Summary

[Here's a similar analysis - not the same problem, but a related one - with a matrix.]

Table summarizing the options. Could be as many as 14 rows (one for each current approach + one for each suggested approach) and as many as seven columns (one for each critique).

7 References

issue-14-resolved
[httpRange-14] Resolved. Email to www-tag list, 2005. (See http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html.)
issue-57
Issue 57. W3C Technical Architecture Group, 2007-2011. (See http://www.w3.org/2001/tag/group/track/issues/57.)
rfc-3986
Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, IETF, 2005. (See http://www.ietf.org/rfc/rfc3986.txt.)
rfc-5988
Web linking. RFC 5988, IETF, 2010. (See http://www.ietf.org/rfc/rfc5988.txt.)
hostmeta
Web Host Metadata. Internet-draft, IETF, 2010. (See http://tools.ietf.org/html/draft-hammer-hostmeta-13.)