This document is also available in these non-normative formats: XML.
Semantic Web and Linked Data applications require URIs that refer to arbitrary entities. Deployment and performance difficulties with the existing conventions have led to a search for new mechanisms that avoid them. This report brings together in one place information on current and proposed practices, with an analysis of the benefits and shortcomings of each.
The purpose of this report is not to make recommendations but to initiate discussion that might lead to some.
This document has been developed by the AWWSW Task Force of the W3C Technical Architecture Group in order to provide background material for further discussion among those affected by this architectural question, and to help drive TAG issue 57 [issue-57] to a conclusion.
This version is an editor's draft with no standing. It has not received review within either the task force or the TAG.
Publication of this draft finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.
1.1 Information resources
2 Use case scenarios
2.1 Preparing and consuming metadata for a Web-accessible information resource
2.2 Choosing a phrase, providing an account of the phrase, using the phrase
2.3 Referring to the primary topic of a document
3 Conventions in current use
3.1 URI scheme and URN namespace registrations
3.2 The LSID URN namespace
3.3 A dereferenceable URI refers to the information resource at that URI
3.4 Non-URI phrase
3.5 Cite your sources
3.6 'Hash URI'
3.7 'Slash URI' with HTTP 303 See Other redirect
4 Critique of the current solution suite
4.1 Registration, citing your sources, and non-URI phrases are too hard
4.2 Fragment identifiers get lost
4.3 The common fragment identifier pattern fails with large namespaces
4.4 Fragment identifiers aren't seen by servers
4.5 303 is difficult, sometimes impossible, to deploy
4.6 303 leads to too many round trips
4.7 303 makes the URI difficult to bookmark
5 Possible new conventions
5.1 Syntactic sugar
5.2 'Hash URI' with fixed suffix
5.3 'Slash URI' with chimera entity
5.4 'Slash URI' with site-specific discovery rules
5.5 'Slash URI' with new HTTP request or response
5.6 Refer to information resources in some other way
5.7 Overload dereference, and use response properties to distinguish the two cases
The emergence of languages such as OWL and RDF that pervasively use URI-based vocabularies brings to prominence the problem of referring, in those languages, to things one has to refer to, in such a way that the reference will be understood by those encountering the reference. These references either are URIs or are built on URIs, so the problem of referring reduces to that of either knowing, or influencing, the way that readers will interpret URIs.
[Advice welcome on what needs to go in an intro]
"Information resources" figure prominently in this narrative both as providers of information and as subjects of metadata. The following explains the particular theory of "information resources" assumed in this document.
Each information resource has one or more associated versions each having fixed content (octet sequence) and interpretation directives (media type, language). An information resource having more than one version is said to be generic.
No particular meaning is implied by the word "version;" the word is chosen as suggestive of its most common use.
One can attribute metadata properties such as author, title, and topic to versions in the obvious way. These properties extend to generic information resources in a systematic way: if a property is shared by all of an information resource's versions, then we attribute that property to the information resource, and vice versa.
Operationally, this means that based on knowledge of its versions one can write metadata using an information resource as subject, and someone reading this metadata can then apply that metadata to whatever version they access.
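The rule for generic information resources can be illustrated with a minimal sketch. It assumes, purely for illustration, that each version's metadata is a flat dictionary; the function name and the sample values are hypothetical and not part of this report.

```python
# Sentinel used to tell "property absent" apart from any real value.
_MISSING = object()


def shared_properties(versions):
    """Return the properties that hold, with the same value, for every version.

    `versions` is a list of dicts, one per version. Under the rule above,
    these are the properties that may be attributed to the information
    resource itself.
    """
    if not versions:
        return {}
    first, *rest = versions
    return {
        key: value
        for key, value in first.items()
        if all(other.get(key, _MISSING) == value for other in rest)
    }


# Two versions of one report: same author, title, and language,
# but different media types.
versions = [
    {"author": "Alice", "title": "Spoonwings", "language": "en",
     "media_type": "text/html"},
    {"author": "Alice", "title": "Spoonwings", "language": "en",
     "media_type": "application/pdf"},
]

resource_metadata = shared_properties(versions)
```

Here `resource_metadata` keeps the author, title, and language but drops the media type, since the versions disagree on it. A reader of the metadata can then apply the surviving properties to whichever version they happen to access.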
Information resources need not be accessible at a URI; they might exist only inside a local database, or they may be ephemeral.
[All terminology choices are provisional; for most of them I am testing the waters to see how well the word works, but I'm prepared to change.]
Bob is preparing a bibliography. He finds a report on spoonwings provided by Alice at the URI http://example/spoonwing and wishes to refer to the report for the purpose of composing metadata such as its title, author, and publication date. He selects a phrase to use to refer to the report, then composes the metadata, using the chosen phrase as the subject of each statement.
Subsequently Carol encounters an entry from Bob's bibliography. Wanting to know what the subject phrase refers to, she is led somehow to dereference http://example/spoonwing, and is led to understand that IR('http://example/spoonwing') is the document that Bob is talking about.
Variant: Bob's bibliography includes a number of RDF documents, and his metadata includes information relevant for making use of those RDF documents.
Variant: Instead of being a person, Bob is a tool that is charged with updating all the documents on a Web site with license metadata.
Alice wants to refer to Fred, a mynah living at a local zoo. Alice "mints" a new phrase (one that is not yet in use; either a new URI or a phrase built on one) with the purpose of using that phrase to refer to Fred. Alice publishes a document that would lead a reader to realize that the phrase refers to Fred.
Bob then learns of Alice's phrase and uses it in a document of his own.
Subsequently Carol encounters Bob's document. Wanting to know what the phrase means, she is led to Alice's published account, which she reads. She is enlightened.
Variant: instead of Fred, the referent of the phrase is to be an information resource that is not accessible on the Web, or at least not at any URI known to Alice.
Variant: instead of Fred, the referent is to be an information resource that is accessible, via a URI known to Alice. The referent is not the account that Alice publishes, it is the document that Alice's account describes. (In this situation, which is common in the publishing industry and digital archives, Alice's account is often called a "landing page".)
Bob desires to refer to Chicago. He finds a Web page on the Web at http://example/chicago (provided by, say, Alice) that consists of a description of Chicago. Somehow he comes up with a phrase that will be understood as referring to the primary topic of Alice's Web page.
Carol encounters the phrase Bob used, is led to Alice's description of Chicago, and then somehow discovers that the phrase is meant to refer to Chicago.
[This use case keeps coming up (e.g. tdb:) but I don't think anyone is seriously interested in it. TBD: Explain how it differs from the previous one.]
This section describes how people currently implement the "somehows" in the use cases.
A URI scheme registration helps to account for the meaning of URIs using that scheme. For example, the registration for the data: URI scheme fully explains the meaning of URIs that use that scheme.
Most URI scheme registrations, such as that for http:, only provide a partial ('schematic' you might say) account, and other sources of information must be consulted in order to understand a URI using that scheme.
Registering a new URI scheme requires community and IETF Expert Review; see RFC 4395.
[Not exactly common - is this worthy of mention? But it is used. Maybe rule out all non-linked-data solutions up front?]
urn:lsid: has an associated protocol that has separate methods for dereference and discovery.
To refer to the information resource accessible via a given URI, use the URI: http://example/ir refers to IR('http://example/ir'). Those who encounter the reference can dereference the URI, and on seeing that the dereference is successful, will take the URI to be a reference to the information resource accessible via that URI.
URIs are just one kind of phrase that might be used to refer to something. In RDF serializations, for example, we have blank node notation:
[ foaf:isPrimaryTopicOf <http://example/about-fred> ]
The problem of figuring out (or documenting) the meaning of the overall phrase reduces to that of figuring out (or documenting) the meaning of the URIs that occur in it.
Whenever using a URI to refer to something, provide a link to the document that carries an account of the URI's meaning. This is the approach taken by OWL (owl:imports). The rdfs:definedBy property could also be used for this purpose.
Both of these properties beg the question in that they do not say how to figure out what the target URI refers to.
To refer to something, mint a URI with a fragment identifier, and provide an account of the intended meaning at the pre-fragment stem of the URI. That is, if the URI is http://example/vocabulary#term, then put an account of that URI in the document at http://example/vocabulary . [mention 3986 and AWWW?]
Those encountering http://example/vocabulary#term will access http://example/vocabulary and read the account.
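As a rough illustration of the lookup step, the following sketch splits a hash URI into the stem to be accessed and the fragment to look for in the retrieved account. It uses `urllib.parse.urldefrag` from the Python standard library; the function name `account_location` is hypothetical.

```python
from urllib.parse import urldefrag


def account_location(uri):
    """Split a 'hash URI' into (stem, fragment).

    The stem is the URI of the document expected to carry the account;
    the fragment is never sent to the server and is only used by the
    client when reading that document.
    """
    stem, fragment = urldefrag(uri)
    return stem, fragment


stem, fragment = account_location("http://example/vocabulary#term")
```

In this example `stem` is `http://example/vocabulary` and `fragment` is `term`. Actually fetching the stem and finding the account of the term inside it are left out of the sketch.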
This approach and the following one are completely generic mechanisms and may be used in situations where a dereferenceable URI would also be correct. The choice would be based on weighing the importance of dereference against the importance of a more explicit account (usually involving metadata).
To refer to something, mint an http: or https: URI without a fragment identifier (say http://example/fred), make an account of it accessible via a second URI (say http://example/fred.account), and arrange for a GET of http://example/fred to yield a 303 response carrying a Location: header with http://example/fred.account as its target.
Those encountering http://example/fred will attempt to dereference it, but this will fail with a 303 redirect, indicating that http://example/fred does not refer to an information resource at http://example/fred, but rather that the document at http://example/fred.account accounts for the URI's meaning. [see HTTPbis]
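The client-side reading of the two conventions (successful dereference versus 303 redirect) might be sketched as follows. This is a simplified illustration only: it takes the status code and Location value as plain arguments rather than performing a request, and the function name and the result labels are hypothetical.

```python
def interpret_response(request_uri, status, location=None):
    """Classify the response to a GET under the conventions described above.

    Returns a (kind, uri) pair. The kind labels are made up for this sketch.
    """
    if 200 <= status < 300:
        # Successful dereference: the URI is taken to refer to the
        # information resource accessible via that URI.
        return ("information-resource", request_uri)
    if status == 303 and location:
        # See Other: the document at the Location target is taken to
        # account for the meaning of the request URI.
        return ("account-at", location)
    # Anything else tells us nothing under these two conventions.
    return ("undetermined", None)


fred = interpret_response(
    "http://example/fred", 303, "http://example/fred.account"
)
report = interpret_response("http://example/spoonwing", 200)
```

Note that the 303 case costs a second request before the account can be read, since the agent still has to fetch the Location target.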
[Is anyone, in practice, deploying 303 redirects to a "primary topic" page not mentioning the URI to be accounted for, rather than to be a document that explicitly mentions the URI?]
With any of these conventions other than dereferenceable URIs, the URI may refer to anything at all, including an information resource. [COMMON MISUNDERSTANDING, not sure where this goes in the document. This email gets it totally wrong, it's not about IR vs. NIR, it's about which thing the URI is to refer to, the one generalizing what you get, or the one accounted for by what you get.]
"People forget to put it there when writing and cut and pasting URIs." (Harry) [More information needed.]
When a large number of URIs are formed by combining a fixed "namespace" prefix with a single suffix using hash as a connector, there will be a single underlying document that must provide accounts of all of the large number of URIs. This is an unacceptable performance hit for the server, the network, and the client. "Slash" URIs don't have this problem as the response can be specific to each URI.
(1) The document provided by the server must account for all hash URIs based on the document's URI. This could be a large number. (2) Hash URIs don't work with HTTP PUT, POST, or DELETE methods. (Manu)
Many hosting solutions do not support Apache .htaccess or any equivalent.
The Chicago use case is an extreme version of this - the entity providing access to the Chicago document (Alice) does not even care about providing URIs that refer to Chicago; it is someone having no control over how the URI dereferences (Bob) who needs a reference to Chicago.
To get accounts of N URIs provided by redirecting through 303 responses, you need to do 2N HTTP requests.
With fragment identifiers and the 303 redirect identified as the sources of current difficulties, a number of alternative mechanisms have been suggested to get around these problems.
Use a new kind of non-URI phrase, for example
the asterisk being suggestive of indirection in languages derived from C.
[This idea derives from JAR's TAG slides. This is mainly to get people thinking: the problem is notational engineering, not philosophy.]
This idea attempts to address one reason for using "slash URIs" instead of fragment identifiers. Suppose you want to combine a large number of local names a, b, c, ... into a namespace. The usual solutions would be to write http://example/namespace#a (a "hash namespace") or http://example/namespace/a (a "slash namespace").
In the "singleton fragid" approach one would write http://example/namespace/a# (a null fragment identifier) or http://example/namespace/a#_, using a fixed suffix for every URI and varying the part between the namespace prefix and the suffix.
As in the 303 approach, each URI in the namespace would (or could) have its own document, providing an account for that single URI rather than every URI in the namespace.
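The difference in document granularity between a hash namespace and the fixed-suffix pattern can be seen by grouping URIs by their pre-fragment stem. The sketch below is illustrative only; the function name is hypothetical and the "_" suffix is just one of the choices discussed here.

```python
from urllib.parse import urldefrag


def documents_needed(uris):
    """Map each document URI (the pre-fragment stem) to the URIs it must account for."""
    docs = {}
    for uri in uris:
        stem, _fragment = urldefrag(uri)
        docs.setdefault(stem, []).append(uri)
    return docs


local_names = ["a", "b", "c"]

# Hash namespace: every term shares one stem, hence one document.
hash_namespace = [f"http://example/namespace#{n}" for n in local_names]

# Fixed suffix: the stem varies per term, so each term gets its own document.
fixed_suffix = [f"http://example/namespace/{n}#_" for n in local_names]

hash_docs = documents_needed(hash_namespace)
suffix_docs = documents_needed(fixed_suffix)
```

With three local names, `hash_docs` has a single entry that must account for all three URIs, while `suffix_docs` has three entries with one URI each.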
The choice of fixed fragment identifier (null, "_", or something else) is largely a matter of taste.
A null fragid precludes the use of qnames to abbreviate such URIs. (In particular it would not be possible to use them as predicate names in RDF/XML.) However, SPARQL, Turtle, and RDFa are being extended to admit CURIEs that include #, making this a newly attractive option.
To address the "hash gets lost" problem we could explore heuristics to automatically replace http://example/fred with http://example/fred# (or http://example/fred#_) when needed.
[Ed Summers's favorite]
In this approach we use IR('http://example/fred'), which seems to say that http://example/fred refers to Fred, as a proxy for Fred. We attribute to IR('http://example/fred') information that seems to be about Fred, and then interpret it to be either Fred or itself as the need arises. We call this the "chimera" approach because we have a single entity that has two different personalities. In effect:
IR('http://example/fred') = WS('http://example/fred')
Ways that this can fail:
To make the "chimera" approach work, strategies are needed for avoiding each of these pitfalls. For example, (1) could be addressed by a prohibition on the use of predicates that might apply to either IRs or non-IRs, or by a priority system explaining which subject is meant, (2) by saying that the account of the URI must not lead to the URI being understood to refer to any IR other than IR('http://example/fred'), (3) by having the community agree that axioms that enable equational inferences shouldn't be written for these entities.
For http://example/fred, obtain the host-meta file for its host via http://example/.well-known/host-meta. (See [hostmeta] and [rfc-5988].) Then look in the host-meta file for a link-template rule that maps http://example/fred to another URI, say http://example/fred.about, and then look for an account by dereferencing http://example/fred.about.
When the host-meta file is cached, and many accounts are sought from the same host, this reduces the number of round trips per URI from two (in the 303 case) to one.
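The caching behavior can be sketched as follows. Several things here are assumptions made for illustration: the link-template is represented as a plain string with a `{uri}` placeholder that is replaced by the percent-encoded URI, retrieval and parsing of the host-meta file are hidden behind a caller-supplied function, and the class name, method names, and the `/about?uri=` template are hypothetical.

```python
from urllib.parse import quote, urlsplit


def expand_template(template, uri):
    """Substitute the percent-encoded URI for the {uri} placeholder (assumed syntax)."""
    return template.replace("{uri}", quote(uri, safe=""))


class AccountLocator:
    """Find account URIs via per-host link-templates, fetching each host's rule once."""

    def __init__(self, fetch_template):
        # fetch_template: callable taking a host name and returning that
        # host's link-template string. It stands in for retrieving and
        # parsing /.well-known/host-meta, which is not shown here.
        self._fetch_template = fetch_template
        self._cache = {}
        self.host_meta_fetches = 0

    def account_uri(self, uri):
        host = urlsplit(uri).netloc
        if host not in self._cache:
            # One extra round trip per host, not per URI.
            self._cache[host] = self._fetch_template(host)
            self.host_meta_fetches += 1
        return expand_template(self._cache[host], uri)


# Stub standing in for the network: every host uses the same made-up rule.
def stub_fetch(host):
    return "http://" + host + "/about?uri={uri}"


locator = AccountLocator(stub_fetch)
fred_account = locator.account_uri("http://example/fred")
barney_account = locator.account_uri("http://example/barney")
```

After both lookups `locator.host_meta_fetches` is 1: the rule for the host was retrieved once and reused, so N accounts from one host cost N + 1 requests in total, compared with 2N when each URI answers with a 303.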
Such rules could augment or replace the use of 303 (or even 404) responses in order to reduce the number of round trips required to obtain accounts of URIs.
Looking for a host-meta file for every host that has URIs for which accounts need to be discovered would be expensive if only a few of them have such files, so some cleverness would be required to reduce the expected number of round trips. The details would have to be worked out, but this could be a boon to bulk consumers of "slash" URIs.
To reduce the number of round trips, we might use a new HTTP method to request an account of a URI's meaning, or the server could use a new status code to indicate that what it is returning is an account of a URI's meaning.
The URIQA specification proposes such an HTTP request method. Unfortunately URIQA sacrifices the works-in-browser property enjoyed by 303.
Possibilities for HTTP response status codes: 203, new 2xx (e.g. 209), new 3xx (e.g. 308), 404. 301, 302, 303, and 307 redirects are problematic as the entity in the response is not displayed in a browser.
Any of these options would mean fewer round trips than a 303 redirect. Unfortunately they are generally as difficult to deploy as 303 redirects, or more so.
[I've been calling this one "just be clear"; it was suggested first by Harry, then echoed by others.]
Currently we use a dereferenceable URI http://example/fred to refer to the information resource at that URI, IR('http://example/fred'). But we could free it up to refer to whatever that information resource's account says it refers to, that is, WS('http://example/fred'), by switching to a different notation for referring to IR('http://example/fred'). That is, if IR('http://example/fred') said that 'http://example/fred' referred to Fred the mynah, then it would. This would permit unrestricted use of "slash URIs" without the use of 303: every dereferenceable URI would refer to WS(that URI).
To make this work all that's needed is a standard way to write IR('http://example/fred') in each affected language. For example, the Turtle phrase
[ :accessibleVia "http://example/fred"^^xsd:anyURI ]
could be one way to refer to IR('http://example/fred'). A local name could be defined to the same effect:
:fred-doc :accessibleVia "http://example/fred"^^xsd:anyURI .
Or the referring document could just assert that it's using the URI to refer to the IR in question:
<http://example/fred> :accessibleVia "http://example/fred"^^xsd:anyURI .
which would constitute an explicit opt-in to the httpRange-14 rule.
(I'll refer to any of these three as "clumsy notation" below.)
Under this approach, some HTTP responses are 'marked' in a recognizable way that declares that the request URI (say http://example/fred) refers to WS('http://example/fred'), as opposed to the usual IR('http://example/fred').
To refer to IR('http://example/fred'), an agent would either uniformly use clumsy notation (above), or it would test, as an optimization, for the presence of the marker in the HTTP response. With no marker, the agent would use the URI to refer to the IR; when the marker is present it would revert to clumsy notation.
One candidate for such a marker would be the presence of a Link: header with some particular link relation. But other headers or even the content might serve.
As a further refinement, instead of clumsy notation to refer to IR('http://example/fred') in the presence of the marker, an agent could look for a second kind of HTTP header, and that would provide a second URI that refers to IR('http://example/fred'). The second URI might also be discovered in other ways.
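The marker test described above might look like the following sketch, taking the Link: header as the marker. The link relation name `x-account-of` is invented for this example and is not registered or proposed anywhere; the header parsing is deliberately simplified and is not a full Link header parser.

```python
import re

# Hypothetical link relation serving as the marker.
MARKER_REL = "x-account-of"


def link_relations(link_header):
    """Collect rel values from a Link header value (simplified parsing)."""
    rels = set()
    for match in re.finditer(r'rel\s*=\s*"?([^";,]+)"?', link_header or ""):
        # A single rel parameter may list several space-separated relations.
        rels.update(match.group(1).split())
    return rels


def referent(request_uri, headers):
    """Decide what the request URI refers to, given response headers.

    `headers` is assumed to be a dict keyed by the exact name "Link".
    With the marker present the URI refers to WS(uri); otherwise it
    refers to IR(uri) as usual.
    """
    if MARKER_REL in link_relations(headers.get("Link")):
        return f"WS('{request_uri}')"
    return f"IR('{request_uri}')"


marked = referent(
    "http://example/fred",
    {"Link": '<http://example/fred.account>; rel="x-account-of"'},
)
unmarked = referent("http://example/fred", {"Content-Type": "text/html"})
```

An agent wanting to refer to the information resource would use the bare URI in the `unmarked` case and fall back to clumsy notation, or to a second URI if one is advertised, in the `marked` case.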
[need to explain the Content-type: idea, which I don't understand - exchange on public-lod]
[Here's a similar analysis - not the same problem, but a related one - with a matrix.]
Table summarizing the options. Could be as many as 14 rows (one for each current approach + one for each suggested approach) and as many as seven columns (one for each critique).