Defining and discovering the meaning of a URI

W3C Editor's Draft 27 March 2011

This version:
http://www.w3.org/2001/tag/awwsw/issue57/20110327/
Latest version:
http://www.w3.org/2001/tag/awwsw/issue57/latest/
Previous version:
http://www.w3.org/2001/tag/awwsw/issue57/20110320/
Editor:
Jonathan A. Rees <rees@mumble.net>

This document is also available in these non-normative formats: XML.


Abstract

The specification governing Uniform Resource Identifiers (URIs) [rfc-3986] allows URIs to mean anything at all, and this feature is exploited in a variety of contexts, notably the Semantic Web and Linked Data. To use a URI to mean something, an agent (a) selects a URI, (b) provides a definition of the URI in a manner that permits discovery by agents who encounter the URI, and (c) uses the URI. Subsequently other agents may not only understand the URI (by discovering and consulting the definition) but may use it themselves.

A few widely known methods are in use to help agents provide and discover URI definitions, including RDF fragment identifier resolution and the HTTP 303 redirect. However, difficulties in using these methods have led to a search for new methods that are easier to deploy, and perform better, than the established ones. This report brings together in one place information on current and proposed practices, with analysis of benefits and shortcomings of each.

The purpose of this report is not to make recommendations but to provide a foundation for a discussion that might lead to consensus on the use of current and/or new methods.

Status of this Document

This document is an editors' copy that has no official standing.

This report has been developed by the AWWSW Task Force of the W3C Technical Architecture Group in order to provide background material for further discussion among those affected by this architectural question, and to help drive TAG issue 57 [issue-57] to a conclusion.

This version has not received review within either the task force or the TAG.

Publication of this draft finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.

Please send comments on this document to the editor at rees@mumble.net. The development of this report is discussed on the public-awwsw@w3.org mailing list, with archives at http://lists.w3.org/Archives/Public/public-awwsw/.

Table of Contents

1 Introduction
    1.1 Glossary
2 Use case scenarios
    2.1 Choosing a URI, providing a definition of the URI, using the URI
    2.2 Using a document as a definition by reference to its primary topic
3 General definition methods in current use
    3.1 Colocate definition and use
    3.2 Link to documents containing definitions
    3.3 Register a URI scheme or URN namespace
    3.4 Use the LSID getMetadata() method
    3.5 'Hash URI'
    3.6 'Slash URI' with HTTP 303 See Other redirect
4 Critique of the current solution suite
    4.1 Fragment identifiers are fragile
    4.2 The common fragment identifier pattern fails with large namespaces
    4.3 Fragment identifiers aren't seen by servers
    4.4 303 is difficult, sometimes impossible, to deploy
    4.5 303 leads to too many round trips
    4.6 303 makes the URI difficult to bookmark
    4.7 The normative specifications are incomplete
5 Possible mitigations
    5.1 Use something other than a URI
    5.2 'Hash URI' with fixed suffix
    5.3 'Slash URI' with site-specific discovery rules
    5.4 'Slash URI' with new HTTP request or response
    5.5 Dereferenceable URI refers to chimera entity
    5.6 Dereferenceable URI refers to FV(u) or IR(u), depending
    5.7 Do we need interoperability?
6 Summary
7 Appendix. About information resources
    7.1 Use case: Preparing and consuming metadata for a Web-accessible information resource
    7.2 Natural history of information resources
    7.3 Using a URI to refer to the information resource accessible via that URI
8 Acknowledgments
9 References

End Notes


1 Introduction

This is an old issue, and people are tired of it. — Sandro Hawke, January 2003 [disambiguating]

In any kind of discourse it is very useful for an agent to be able to provide a definition of a term, in such a way that other agents can discover and use that definition in order to make sense of sentences that use that term, and to compose new sentences that use the term.

[Draft note: still thrashing on terminology "definition" vs. "documentation" vs. "account"]

Consider the following scenario. Suppose that Alice, in communication with Bob, uses the term "Peak XV" to mean Mount Everest, as in "Alice would like to climb Peak XV next summer". If Bob does not know what "Peak XV" means, he will have to find out. He might be able to ask Alice directly, although in many cases this will be impossible - Alice might be too busy, or otherwise unavailable. Lacking that option he must do some research, consulting dictionaries, encyclopedias, or search engines in the hope of obtaining the correct explanation of Alice's use of the term "Peak XV".

The essential idea is that there are one or more methods available to Bob by which he can discover bits of writing that explain what Alice means by "Peak XV". In this report, these bits are called "definitions", and the terms being defined are URIs.

The nature of definitions need not concern us here - many forms are familiar, including translation between languages (e.g. providing an English or Spanish term equivalent to a given term), descriptions (the term refers to an entity possessing some set of properties), explanation by example, axiomatic systems, and so on. Also not of concern here are the many ways in which meaning can fail as a result of what a definition says about the term in question, and how it is used. Our concern is only with the method by which enabling information is conveyed.

When the term in question is a URI such as 'http://example/everest', discovery methods include, in addition to those already mentioned, network protocols that involve the URI directly.

Definition discovery is not the same as Web dereference, since dereference gives you a version of some information resource, while discovery yields a definition of a URI. Care must be taken to avoid confusing the two operations. In theory a version of an information resource could play a role in explaining the meaning of the URI that refers to the information resource, but this is not common.

[David asked what problem we are trying to solve and whether it was worth it.] We only need universal agreement on methods such as the ones surveyed here if URIs are to be shared between communities. If agents that use a URI in one way never use it in communication with agents that use it in another way, then it is OK for the URI to have distinct senses in the two communities, and there is no problem to be solved - each community can use the URI in its own way, and there will be no confusion.

This report presents discovery methods in current use, reports some criticisms of them, and presents some new FYN methods that have been proposed to address the criticisms.

Note: For brevity, we say "http: URI" when we really mean "http: or https: URI".

[Maybe talk in the introduction about alternatives to defining a URI: using non-URI phrases and syntactic sugar (these used to be sections). Discussion currently relegated to 5.1 Use something other than a URI.]

1.1 Glossary

This section defines terms that are used in this report. An attempt has been made to avoid gratuitous differences from the way these terms are used elsewhere, but in a few cases choice of terminology has been difficult and words with other meanings (such as "definition") are given technical definitions. These definitions are not being proposed for general adoption.

[Draft comment: All terminology choices are provisional; for most of them I am testing the waters to see how well the word works, and am prepared to change.]

accessible via
When a URI is dereferenceable, "the information resource accessible via a URI" (abbreviated IR(that URI), see below) is the information resource whose versions are the versions obtained by dereferencing that URI.
ADI(u,v)
The meaning of URI v, as defined in IR(u). [not sure we need this one.]
definition
A document or document part that provides information about the meaning of a URI or other kind of term. This term is not meant to be either rigorous or exclusive. The "information" could be prose, RDF, OWL, or some combination. It needn't be successful, specific, or comprehensive in defining the term in the ordinary sense of "defining". Rather, the term as used here refers to the role it plays in discovery. We might more accurately say "putative definition". [Alan R: Is sound recording possible definition?]
dereferenceable
A URI is dereferenceable if it may be used with a standard access mechanism to retrieve information, or to perform some other action on an associated resource ([rfc-3986] section 1.2.2). URIs possessing fragment identifiers (#) are by definition not dereferenceable. http: URIs without fragment identifiers are dereferenceable if some HTTP method (or equivalent) is successful (2xx response). Some URIs belonging to some other URI schemes are also dereferenceable.
fixed information resource
A document, image, sound recording, or other replicable entity as encoded in an octet sequence, together with optional brief annotations, such as media type and language, intended to guide the interpretation of the content.
FV(u)
FV(u) is shorthand for the meaning of a URI u according to the definition of u in (a version of) the information resource IR(u). For example, if IR('http://example/p16') says that 'http://example/p16' refers to Alice's canoe, then FV('http://example/p16') is Alice's canoe. ('FV' stands for 'take at face value'.)
information resource
Roughly speaking, something that is appropriate as the subject of metadata. Because of the controversy around this term we will not attempt to define it, but rather say that: (1) An information resource is associated with a set of fixed information resources (its versions). (2) An information resource is "similar" to its versions in that metadata that applies to each version of an information resource applies to the information resource itself, and vice versa.
IR(u)
IR(u) is shorthand for the information resource accessible via URI u. For example, if 'http://example/image23' is dereferenceable, then IR('http://example/image23') is the information resource accessible via that URI.
metadata
Information about information, or about an information resource. In RDF, metadata might be written using vocabularies such as Dublin Core, FOAF, or CC REL.
term
A URI, word, name, or phrase that can serve in subject or object position in a statement. In an RDF serialization, for example, a term might be a qname, URI, or blank node label. In Turtle, a term might be any Turtle term, including one written using blank node [...] notation.
refer
For the purposes of this report, reference is just one way to mean. There may be ways to mean other than to refer, but none are specified here.
version (of an information resource)
A fixed information resource associated with an information resource is a version of the information resource. [1]

2 Use case scenarios

Use cases need to be presented as being independent of any particular solution to be used, in order that the solution space can be explored more objectively. This leads to some frustrating vaguenesses in the following, but the vagueness is intentional and necessary.

2.1 Choosing a URI, providing a definition of the URI, using the URI

Alice wants to refer to a particular canoe being offered for sale. Alice "mints" a new URI (one that is not yet in use) with the purpose of using that URI to refer to her canoe. Alice publishes a document that would lead a reader to understand that the URI refers to the canoe.

Bob then learns of Alice's URI and uses it in a document of his own.

Subsequently Carol encounters Bob's document. Wanting to know what the URI means, she is led to Alice's published definition, which she reads. She is enlightened.

2.2 Using a document as a definition by reference to its primary topic

Bob desires to refer to Chicago. He finds a Web page on the Web at 'http://example/about-chicago' (provided by, say, Alice) that consists of a description of Chicago, and wants to use it for the purpose of referring to Chicago. He chooses a URI and associates it with Alice's Web page in such a way that Bob's URI will be understood as referring to Chicago.

Carol encounters Bob's URI, is led to 'http://example/about-chicago' and thence to Alice's description of Chicago, and then somehow understands that Bob's URI is meant to refer to Chicago.

This differs from the previous use case in that the definition (the description of Chicago) was not written with the purpose of defining Bob's URI - in fact Bob's URI doesn't even occur in it.

[This use case keeps coming up (e.g. tdb:) but I don't think anyone is seriously interested in it. Need text to admit that it's important but not important enough to talk about.]

3 General definition methods in current use

This section describes how people currently implement the "somehows" in the use cases.

(Any of these methods may be used to define a URI that refers to something for which a more specialized system exists, for example a mailbox (for which there is the mailto: URI scheme) or an information resource (for which there is http:, gopher:, data:, and so on). In theory, an information resource could be specified in a URI definition by spelling out the details of its versions, perhaps in RDF. However, this is ordinarily not necessary, since usually the specialized naming system can be used.)

3.1 Colocate definition and use

Put the definition in the document in which the URI occurs.

3.2 Link to documents containing definitions

Whenever using a URI to refer to something, provide a link to the document that carries a definition of the URI. This is the approach taken by OWL (owl:imports). The rdfs:definedBy property could also be used for this purpose.

Both of these properties beg the question in that they do not say how to figure out what the URI that is the target of owl:imports or rdfs:definedBy refers to.

3.3 Register a URI scheme or URN namespace

Each URI scheme, e.g. mailto:, http:, ftp:, and so on, has its own URI scheme registration, accessible via a registry maintained by IANA [rfc-4395]. A URI scheme registration helps to define the meaning of URIs using that scheme. For example, the registration for the data: URI scheme fully explains the meaning of URIs that use that scheme.

Most URI scheme registrations, such as that for http:, only provide a partial definition, and other sources of information must be consulted in order to understand a particular URI using that scheme. For example, to understand an http: URI, one generally needs to dereference it (and even then one only knows a single version of it; but that's a topic for another day).

In theory, to define a URI to refer to Mount Everest, one could invent a new URI scheme, say mountain:, and publish a registration for it via IETF and IANA that says that 'mountain:everest' refers to Mount Everest. Practically speaking this is challenging due to the rigor of the review process (see [rfc-4395]). Furthermore, Web clients will not understand the new URI scheme, making the definition of the URI effectively inaccessible for most agents encountering the URI.

URN namespaces work in a similar way.

3.4 Use the LSID getMetadata() method

[Not exactly common - is this worthy of mention? Maybe rule out all non-linked-data solutions up front? But it is used and I'd like some of those users to read this report.]

The urn: URI scheme is subdivided into 'namespaces', some of which might be suitable for general definition methods. One URN namespace used for this purpose is the 'lsid' namespace.[2] URIs beginning 'urn:lsid:' are called LSIDs. LSIDs have an associated SOAP-based protocol that has separate methods for dereference (getData) and discovery (getMetadata). According to the LSID specification, an LSID for which the getData method yields nonempty content refers to what is here called a fixed information resource, while the LSID could refer to anything at all if getData yields empty content. In the latter case the information yielded by the getMetadata method generally constitutes, or at least contains, a definition of the LSID.

3.5 'Hash URI'

To refer to something, mint a URI with a fragment identifier, and provide a definition of that URI at the pre-fragment stem of the URI. That is, if the URI is 'http://example/vocabulary#term', then publish a definition of that URI at 'http://example/vocabulary'. [ADI('http://example/vocabulary', 'http://example/vocabulary#term')]

The interpretation of a URI possessing a fragment identifier, say 'http://example/vocabulary#term', is governed by the media type of some version of the information resource accessible at its stem URI 'http://example/vocabulary', which for RDF-enabled media types defers to the content of the version. (See AWWW section 3.2. [webarch])

If the information resource http://example/vocabulary has multiple versions, it is important that all versions provide definitions of every URI that needs one, and that corresponding definitions in different versions be compatible with one another.
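The discovery step for a hash URI can be sketched in a few lines. The following is only an illustration, not part of any specification: the function name definition_document is hypothetical, and the sketch assumes that the part of the URI before the first '#' is the stem at which a definition is published.

```python
def definition_document(uri: str) -> str:
    """Return the stem URI at which a definition of a hash URI would be sought.

    Hypothetical helper: the fragment identifier starts at the first '#',
    and everything before it is the stem that is actually dereferenced.
    """
    stem, separator, _fragment = uri.partition("#")
    if not separator:
        raise ValueError(f"{uri!r} has no fragment identifier")
    return stem


if __name__ == "__main__":
    # Both terms are defined by the same underlying document.
    for term in ("http://example/vocabulary#term", "http://example/vocabulary#other"):
        print(term, "is defined at", definition_document(term))
```

Because every URI that shares a stem leads to the same document, that document has to carry a definition for each of them.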

3.6 'Slash URI' with HTTP 303 See Other redirect

To refer to something, mint an http: URI without a fragment identifier (say 'http://example/p16'), make a definition of it accessible via a second URI (say 'http://example/about-p16'), and arrange for a GET request with target 'http://example/p16' to yield a 303 response carrying a Location: header with 'http://example/about-p16' as its value.

Those encountering 'http://example/p16' will attempt dereference, but this will fail, with a 303 redirect delivered instead. The 303 redirect indicates that the URI does not refer to IR('http://example/p16'), but rather that the document at 'http://example/about-p16' provides a definition of 'http://example/p16'. [see HTTPbis]
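As a rough sketch of the client side of this pattern, the function below decides where to look for a definition given the status code and Location: value of the first response. The name definition_uri is hypothetical, and the handling of relative Location values via urljoin is an assumption of this sketch rather than something the text above prescribes.

```python
from typing import Optional
from urllib.parse import urljoin


def definition_uri(requested_uri: str, status: int, location: Optional[str]) -> Optional[str]:
    """Return the URI of a document expected to define requested_uri, if any.

    Hypothetical helper. A 303 response with a Location value points at a
    document that provides a definition; any other response tells us nothing
    about where a definition lives under this method.
    """
    if status == 303 and location:
        # Resolve a relative Location against the requested URI (an assumption).
        return urljoin(requested_uri, location)
    return None


if __name__ == "__main__":
    print(definition_uri("http://example/p16", 303, "http://example/about-p16"))
    print(definition_uri("http://example/image23", 200, None))
```

A second GET on the returned URI is then needed to retrieve the definition itself.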

Another pattern is to use 303 redirect to a document whose primary topic is the intended referent, similar to the Chicago example above. This could, in theory, lead to ambiguities, as the primary topic and the entity referred to using the URI might be different. [Is anyone, in practice, deploying 303 redirects to a "primary topic" page not mentioning the URI to be defined, rather than to a document that explicitly mentions the URI?]

4 Critique of the current solution suite

[TBD: Take care of some of the listed solutions quickly, including URI scheme registration, colocation, and "cite your sources".]

4.1 Fragment identifiers are fragile

"People forget to put it there when writing and cut and pasting URIs." (Harry) [More information needed.]

4.2 The common fragment identifier pattern fails with large namespaces

When a large number of URIs are formed by combining a fixed "namespace" prefix with a single suffix using hash as a connector, there will be a single underlying document that must provide definitions of all of the large number of URIs. This is an unacceptable performance hit for the server, the network, and the client. "Slash" URIs don't have this problem as the response can be specific to each URI.

4.3 Fragment identifiers aren't seen by servers

(1) The document provided by the server must define all of the hash URIs that are based on the document's URI. This could be an impracticably large number. (2) Hash URIs don't work with HTTP PUT, POST, or DELETE methods. (Manu)

4.4 303 is difficult, sometimes impossible, to deploy

Deploying a 303 redirect requires giving the correct directive to a web server, for example adding a Redirect line to .htaccess in Apache. Unfortunately many hosting solutions do not allow this.

The Chicago use case is an extreme version of this - the entity providing access to the Chicago document (Alice) does not even care about providing URIs that refer to Chicago; it is someone having no control over how the URI dereferences (Bob) who needs a reference to Chicago.

4.5 303 leads to too many round trips

To get definitions of N URIs by redirecting through 303 responses, you need to do 2N HTTP requests.

4.6 303 makes the URI difficult to bookmark

"the user enters one URI into their browser and ends up at a different one, causing confusion when they want to reuse the URI of the toucan. Often they use the document URI by mistake." (Ian Davis)

"Redirection has in fact very confusing side effects; as we expect the semantic web to work seamlessly with the web, it is very odd that a semantic web uri cannot be copy pasted to a browser without seeing it change to something that is not the same as before." (Giovanni Tumarello, July 2007)

4.7 The normative specifications are incomplete

[Harry's complaint, TBD]

[Talk about conneg, media type, and FYN woes here?]

5 Possible mitigations

With fragment identifiers and the 303 redirect identified as the sources of current difficulties, a number of alternative methods have been suggested to get around these problems.

5.1 Use something other than a URI

[This section derives from JAR's TAG F2F presentation slides. The purpose of talking about this idea is mainly to remind people that the problem is one of notational engineering, not philosophy. This doesn't work very well, though, and I will probably flush this section.]

URIs are just one kind of term that might be used to refer to something. If defining a URI is too difficult or costly, then perhaps one might do without. In RDF serializations such as Turtle, for example, we have blank node notation:

         [ foaf:isPrimaryTopicOf <http://example/about-chicago> ] 

Here we have managed to refer to Chicago without defining a new URI; we have simply referred indirectly using a URI that refers to an information resource according to a generic method (see 7.3 Using a URI to refer to the information resource accessible via that URI).

A more concise alternative is syntactic sugar:

           *<http://example/about-chicago> 

(The asterisk is meant to be suggestive of indirection in the C programming language.)

If supported in various RDF serializations, such syntactic sugar could be a concise way to write something like

         [ foaf:isPrimaryTopicOf <http://example/about-chicago> ] 

5.2 'Hash URI' with fixed suffix

This idea attempts to address one reason for using "slash URIs" instead of fragment identifiers. Suppose you want to combine a large number of local names a, b, c, ... into a namespace. The usual solutions would be to write 'http://example/namespace#a' (a "hash namespace") or 'http://example/namespace/a' (a "slash namespace").

In the "singleton fragid" approach one would write 'http://example/namespace/a#' (a null fragment identifier) or 'http://example/namespace/a#_', using a fixed suffix for every URI and varying the part between the namespace prefix and the suffix.

As in the 303 approach, each URI in the namespace would (or could) have its own document, providing a definition for that single URI rather than for every URI in the namespace.
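The difference between the hash namespace and the fixed-suffix pattern can be illustrated by computing which document each URI leads to. This is a minimal sketch; the helper names (hash_uri, fixed_suffix_uri, retrieved_document) are hypothetical, and the namespace and local names are the placeholder values used above.

```python
def hash_uri(namespace: str, name: str) -> str:
    """Hash namespace: the local name is the fragment identifier."""
    return f"{namespace}#{name}"


def fixed_suffix_uri(namespace: str, name: str, suffix: str = "_") -> str:
    """Fixed-suffix pattern: the local name is a path segment, the fragment is constant."""
    return f"{namespace}/{name}#{suffix}"


def retrieved_document(uri: str) -> str:
    """The URI actually sent to the server: everything before the first '#'."""
    return uri.partition("#")[0]


namespace = "http://example/namespace"
names = ["a", "b", "c"]

# One shared document must define every term in a hash namespace...
hash_documents = {retrieved_document(hash_uri(namespace, n)) for n in names}
# ...while the fixed-suffix pattern can give each term its own document.
fixed_suffix_documents = {retrieved_document(fixed_suffix_uri(namespace, n)) for n in names}

if __name__ == "__main__":
    print(sorted(hash_documents))
    print(sorted(fixed_suffix_documents))
```

With the hash namespace all three names map to one document; with the fixed suffix there are three distinct documents, one per name.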

The choice of fixed fragment identifier (null, "_", or something else) is largely a matter of taste.

A null fragid precludes the use of qnames to abbreviate such URIs. (In particular it would not be possible to use them as predicate names in RDF/XML.) However, SPARQL, Turtle, and RDFa are being extended to admit CURIEs that include #, making this a newly attractive option.

To address the "hash gets lost" problem we could explore heuristics to automatically replace 'http://example/p16' with 'http://example/p16#' (or 'http://example/p16#_') when needed.

5.3 'Slash URI' with site-specific discovery rules

For 'http://example/p16', obtain the host-meta file for its host via 'http://example/.well-known/host-meta'. (See [hostmeta] and [rfc-5988].) Then look in the host-meta file for a link-template rule that maps 'http://example/p16' to another URI, say 'http://example/p16.about', and then look for a definition by dereferencing 'http://example/p16.about'.
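The two steps of this lookup can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the template syntax shown (a literal "{uri}" placeholder substituted verbatim) is a simplification chosen to reproduce the example mapping above. The actual template syntax and any encoding rules are governed by [hostmeta] and are not reproduced here.

```python
from urllib.parse import urlsplit


def host_meta_uri(uri: str) -> str:
    """Location of the host-meta file for the host of the given URI."""
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}/.well-known/host-meta"


def apply_link_template(template: str, uri: str) -> str:
    """Map a URI to the URI of its definition using a link-template rule.

    Assumption: the rule uses a literal '{uri}' placeholder that is replaced
    verbatim; real rules may differ.
    """
    return template.replace("{uri}", uri)


if __name__ == "__main__":
    uri = "http://example/p16"
    print("host-meta file:", host_meta_uri(uri))
    # Hypothetical rule that would be read from the host-meta file.
    print("definition at:", apply_link_template("{uri}.about", uri))
```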

When the host-meta file is cached, and many definitions are sought from the same host, this reduces the number of round trips from two (in the 303 case) to one.
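The round-trip arithmetic can be made explicit. The sketch below assumes one host-meta fetch per host (cached thereafter) plus one request per URI, against two requests per URI for 303; the function names are hypothetical.

```python
def requests_with_303(n_uris: int) -> int:
    """Each URI costs a GET answered by 303, then a GET on the Location target."""
    return 2 * n_uris


def requests_with_host_meta(n_uris: int, n_hosts: int = 1) -> int:
    """One host-meta fetch per host (assumed cached), then one GET per URI."""
    return n_hosts + n_uris


if __name__ == "__main__":
    n = 100
    print("303:", requests_with_303(n), "requests")
    print("host-meta:", requests_with_host_meta(n), "requests")
    # Amortized cost per URI approaches one as n grows.
    print("per URI:", requests_with_host_meta(n) / n)
```

For many URIs on one host the amortized cost per URI is slightly above one request, since the single host-meta fetch is spread over all of them.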

Such rules could augment or replace the use of 303 responses in order to reduce the number of round trips required to obtain definitions of URIs.

Looking for a host-meta file for every host that has URIs for which definitions need to be discovered would be expensive if only a few of them have such files, so some cleverness would be required to reduce the expected number of round trips. The details would have to be worked out, but this could be a boon to bulk consumers of "slash" URIs.

5.4 'Slash URI' with new HTTP request or response

To reduce the number of round trips, we might use a new HTTP method to request a definition of a URI, or the server could use a new status code to indicate that what it is returning is a definition of a URI.

The URIQA specification [uriqa] defines MGET, a new HTTP request method. An MGET request on a URI yields a response containing a definition of that URI.

In response to GET of a nondereferenceable URI, a server might provide a definition in the response. Possibilities for HTTP response status codes that might signal this situation: 203, a new 2xx status (e.g. 209), a new 3xx status (e.g. 308), or 404. Placing the definition in the content of a redirect response (status code 301, 302, 303, and 307) is unsatisfactory as the content would not be displayed in a Web browser.

Any of these options would mean fewer round trips than following a 303 redirect. A downside is that they are all generally as difficult, or more difficult, to deploy than 303 redirects.

5.5 Dereferenceable URI refers to chimera entity

In this approach a URI dereferences to a definition of the URI. The goal is to accept the use of a URI u both in metadata statements and in statements that seem to be about something that might not be suitable as the subject of metadata statements. In order to do so, we must come up with some interpretation of the URI and the statements in which it occurs.

If this approach is to be consistent, there will have to exist a single entity CH(u) for which many (or maybe all) statements that apply to IR(u) are true, and at the same time for which many (maybe all) statements that apply to FV(u) are true. ('CH' stands for 'chimera'.)

For example, one might have to reconcile metadata that applies to IR('http://example/p16'), stated using the URI 'http://example/p16', with information retrieved from IR('http://example/p16') that would apply to FV('http://example/p16'), stated using the same URI:

          <http://example/p16> dc:creator "Carol".
          <http://example/p16> dc:title "All about Alice's canoe".

          <http://example/p16> foo:mass 2140.
          <http://example/p16> foaf:name "Assabet Angler". 

Metadata statements applied to the chimera CH(u) would be true according to whether they are true of IR(u), while other statements applied to CH(u) would be true according to whether they are true of FV(u).
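One way to picture this interpretation is as a routing step that decides, for each statement about CH(u), which underlying entity it is evaluated against. The sketch below is illustrative only: the set METADATA_PREDICATES and the function subject_meant are hypothetical, and deciding which predicates count as metadata is exactly the kind of assumption a real deployment would have to settle.

```python
# Hypothetical classification of predicates as "metadata" properties.
METADATA_PREDICATES = {"dc:creator", "dc:title"}


def subject_meant(u: str, predicate: str) -> str:
    """Name the entity a statement about CH(u) is evaluated against.

    Metadata statements are checked against IR(u); all others against FV(u).
    """
    if predicate in METADATA_PREDICATES:
        return f"IR('{u}')"
    return f"FV('{u}')"


# The four example statements above, as (predicate, object) pairs.
statements = [
    ("dc:creator", "Carol"),
    ("dc:title", "All about Alice's canoe"),
    ("foo:mass", 2140),
    ("foaf:name", "Assabet Angler"),
]

if __name__ == "__main__":
    u = "http://example/p16"
    for predicate, value in statements:
        print(predicate, repr(value), "is evaluated against", subject_meant(u, predicate))
```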

Why would one be dealing with both kinds of statements at the same time? Well, the two groups of statements might be inserted as RDFa into a single HTML document by different tools, or by different modules in a content management system. Or the statements might be combined in a single triple store from multiple sources.

The chimera approach presents a number of challenges.

First, not all properties are easily classified as metadata properties vs. non-metadata properties. For example, "Alice likes http://example/p16" and "http://example/p16 is located on Iroquois Rd." are not obviously about an information resource as opposed to a canoe, or vice versa. This problem could be addressed by a prohibition on the use of ambiguous predicates of this sort, or by a priority system explaining which of the two subjects is meant in case the URI refers to a chimera.

Second, if FV('http://example/p16') happens to be an information resource (other than IR('http://example/p16')), we will end up with nonsense, since metadata for two distinct information resources would be attributed to a single entity. Consider, for example, the case where copyright license A applies to IR('http://example/p16') and copyright license B applies to FV('http://example/p16'). This would lead to both licenses being applied to CH('http://example/p16'), which would be impossible to interpret correctly, as neither subject is such that both licenses apply to it. We would have to say that the definition of 'http://example/p16' must not lead to the URI being understood to refer to an information resource other than IR('http://example/p16') itself.

Third, care must be taken to ensure that inferences are sound with respect to the chimeras, and not just with respect to either of their parts. Especially troubling would be if two chimeras were equated due to, say, use of a functional property applying to their FV() parts, when their information resource parts were distinct. To rule this out would require adoption of practices and conventions designed to prevent contradictions, such as avoiding the use of functional properties and owl:sameAs in conjunction with chimeras.

5.6 Dereferenceable URI refers to FV(u) or IR(u), depending

Currently we use a dereferenceable URI 'http://example/p16' to refer to the information resource at that URI, IR('http://example/p16') (see 7.3 Using a URI to refer to the information resource accessible via that URI). To use an http: scheme 'slash URI' to refer to anything else, one uses a 303 redirect. To address performance and deployment difficulties with 303, it has been suggested that in some or all cases a dereferenceable URI u should be used to refer to FV(u) instead of IR(u). Then we would be able to arrange for u to mean anything we like by simply making a definition accessible via u, avoiding the difficulties with 303 redirects altogether.

This would be an incompatible change, as tools that assume, following the current convention, that u refers to IR(u) will misunderstand uses of u where u is meant to refer to FV(u), and tools that assume that u refers to FV(u) will misunderstand uses of u where u is meant to refer to IR(u). However, most URIs do not dereference to definitions of themselves - that is, there is no such thing as FV(u), since IR(u) doesn't contain a definition of u. So it might make sense for most or all such URIs to refer to their information resources. This would maintain backward compatibility for those URIs, at least.

The problem is how to distinguish the two situations. The criterion "provides a definition of URI u" is not machine actionable as stated, both because the definition might be couched in an arbitrary language or notation, and because it is not obvious how to distinguish content that contains a definition of a particular URI from content that doesn't. But perhaps some approximation to the criterion could be made actionable, based on some combination of media type and aspects of the content. One simple rule might be: If IR(u) has a version with media type 'application/rdf+xml', then take u to refer to FV(u), otherwise take u to refer to IR(u). This rule would generate false positives and false negatives, but it illustrates the idea.
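The simple media-type rule just described could be written as follows. This is a sketch of that one illustrative rule, not a proposal: the function name referent and the representation of versions as a collection of media type strings are assumptions.

```python
from typing import Iterable


def referent(u: str, version_media_types: Iterable[str]) -> str:
    """Apply the illustrative rule: an RDF/XML version means u refers to FV(u).

    version_media_types holds the media types of the known versions of IR(u).
    """
    if "application/rdf+xml" in set(version_media_types):
        return f"FV('{u}')"
    return f"IR('{u}')"


if __name__ == "__main__":
    print(referent("http://example/p16", ["application/rdf+xml", "text/html"]))
    print(referent("http://example/image23", ["image/png"]))
```

As noted, a rule of this shape misclassifies in both directions, for example an RDF/XML document that never mentions u, or a definition published only as HTML.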

Some machine-actionable rule is desirable since otherwise there would be no reliable way to use any dereferenceable URI u to refer to IR(u). There would always be the possibility that u might be understood to mean FV(u) instead.

Whatever rule is adopted, for those URIs u whose meaning would be changed incompatibly from IR(u) to FV(u), another way would have to be provided to refer to IR(u), so that metadata applicable to IR(u) can be written. This could be done in RDF given a standard way to write the predicate corresponding to what we've been calling 'is accessible via'. For example, the Turtle term

          [ :accessibleVia "http://example/p16"^^xsd:anyURI ] 

could be a new way to refer to IR('http://example/p16') (since, under the proposal, Turtle '<http://example/p16>' would refer to FV('http://example/p16')). A local name could be defined to the same effect:

          :about-p16 :accessibleVia "http://example/p16"^^xsd:anyURI . 

Or the referring document could just assert that it's using the URI to refer to the IR in question:

          <http://example/p16> :accessibleVia "http://example/p16"^^xsd:anyURI . 

which would constitute an explicit opt-in to the httpRange-14 rule [issue-14-resolved]. [3]

Let's call this approach "clumsy notation". To avoid the need for clumsy notation, some convention could be used to provide a URI (other than u) to refer to IR(u), when one is available. This could be done using a Link: header, or via an RDF statement such as

          <http://example/p16#ir> :accessibleVia "http://example/p16"^^xsd:anyURI . 
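
The Link: header alternative might look like the following sketch. No link relation for this purpose is defined anywhere in this report, so the relation URI 'http://example/ns#informationResource' and the function name `ir_link_header` are hypothetical, chosen only to show the shape of such a header in the syntax of [rfc-5988].

```python
def ir_link_header(ir_uri, rel="http://example/ns#informationResource"):
    """Format a Link: header naming a URI for IR(u), to be sent when u is dereferenced.

    `rel` is a hypothetical extension relation type; RFC 5988 requires
    extension relation types to be URIs.
    """
    return 'Link: <{}>; rel="{}"'.format(ir_uri, rel)


print(ir_link_header("http://example/p16#ir"))
```

A server answering requests for 'http://example/p16' would include this header in its response, letting a client learn that 'http://example/p16#ir' names the information resource without parsing the response body.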

5.7 Do we need interoperability?

[Draft note: This section added after David asked what problem we are trying to solve and whether it was worth it.]

We only need widely understood methods such as the above if URIs are to be shared between communities. If agents that use u to mean IR(u) never use u in communication with agents that use u to mean FV(u), and vice versa, then it is OK for u to have distinct senses in the two communities, and there is no problem to be solved - each community can use the URI in its own way, and there will be no confusion.

6 Summary

The following table summarizes the candidate new discovery methods, evaluating each against a set of criteria, as described below.

Method                     compatible?  robust?  easy to deploy?  min round trips  ns scales?  >1 definition?
Hash                       +            -        +                1                -           +
Slash + 303                +            +        -                2                +           +
Hash + fixed suffix        +            -        +                1                +           +
Slash + hostmeta rule      +            +        ?                1+ε              +           +
Slash + new HTTP           +            +        -                1                +           +
Chimera, IR(u) + FV(u)     +            +        +                1                +           -
Depends, IR(u) or FV(u)    -            +        +                1                +           +

compatible?
Does it avoid assigning a new, incompatible definition to existing URIs?
robust?
Is the URI free of fragment identifiers that can get lost?
easy to deploy?
Can a publisher with a file-upload-only hosting solution use this method?
min round trips
How many network round trips are needed to find a definition, assuming (a) the definition is not cached and (b) the /.well-known/host-meta cache misses with probability ε ?
ns scales?
Can definition-containing document sizes be bounded as namespaces grow in size?
>1 definition?
Can distinct definitions give the same meaning to distinct URIs?
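
The entry 1+ε for the hostmeta rule can be read as an expected value. The following assumes (as our reading of the criterion, not something stated in it) that a host-meta cache hit costs one round trip to fetch the definition, and that a miss costs one additional round trip to fetch /.well-known/host-meta first:

```latex
\[
E[\text{round trips}] = (1 - \varepsilon)\cdot 1 + \varepsilon \cdot 2 = 1 + \varepsilon
\]
```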

[For reference, here's a similar analysis - not the same problem, but a related one - with its own matrix.]

7 Appendix. About information resources

"Information resources" figure in this story both as providers of definitions, and as things that one refers to (metadata subjects). As the desire to refer to information resources using their URIs interferes with the proposal to refer to other things using dereferenceable URIs, it is important to understand what kinds of things one says about information resources, and with what justification.

7.1 Use case: Preparing and consuming metadata for a Web-accessible information resource

Bob is preparing a bibliography. He finds a report on cicadas provided by Alice at the URI 'http://example/cicada' and wishes to refer to the report for the purpose of composing metadata such as its title, author, and publication date. He selects a URI, blank node, or other term to use to refer to the report, then composes the metadata, using his term in the metadata to refer to the report. (Bob's term might be 'http://example/cicada' but could be something else, if there is the possibility that 'http://example/cicada' does not refer to Alice's document.)

Subsequently Carol encounters an entry from Bob's bibliography. Wanting to know what the subject of the entry is, she is led somehow (depending on discovery method) from Bob's term to Alice's URI, and from there to Alice's document IR('http://example/cicada'), which is the document that Bob's term refers to.

7.2 Natural history of information resources

The following explains the particular theory of "information resources" assumed in this report. The theory is independent of how one refers to information resources. More elaborate theories are certainly possible, but this is all we need to assume here.

Each information resource has one or more associated versions, where each version has fixed content (octet sequence) and additional information (media type, language) affecting the interpretation of the content. Different versions may be appropriate at different times or in different interaction contexts. No particular meaning is implied by the word "version;" the word is chosen as suggestive of its most common use.

Metadata statements such as those giving authorship, title, and topic are true or false of versions in the obvious way. Such statements also apply to arbitrary information resources in a systematic way, as follows: If a statement is true of all of an information resource's versions, then the statement is true of the information resource, and vice versa.

Operationally, this means that if you have knowledge of its versions, you can write metadata using an information resource as subject, and someone reading this metadata can then apply that metadata to whatever version they access.
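
As a minimal sketch of this rule, a statement can be modeled as a predicate on versions, and it holds of an information resource exactly when it holds of every version. The dictionary representation of a version and the names `holds_of_resource`, `created_by_alice`, and `is_html` are assumptions made for the example.

```python
def holds_of_resource(versions, statement):
    """True if `statement` holds of every version of an information resource."""
    versions = list(versions)
    if not versions:
        # The theory assumes each information resource has one or more versions.
        raise ValueError("an information resource must have at least one version")
    return all(statement(version) for version in versions)


# Two hypothetical versions of one report: same creator, different media types.
versions = [
    {"media_type": "text/html", "language": "en", "creator": "Alice"},
    {"media_type": "application/pdf", "language": "en", "creator": "Alice"},
]


def created_by_alice(version):
    return version["creator"] == "Alice"


def is_html(version):
    return version["media_type"] == "text/html"


print(holds_of_resource(versions, created_by_alice))  # True
print(holds_of_resource(versions, is_html))           # False
```

Here "the creator is Alice" can be asserted of the information resource, because it is true of both versions, while "the media type is text/html" cannot, because it is true of only one.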

Information resources need not be accessible via a URI, or even have any associated URI at all. An information resource might exist only inside a local file system or database, or it might be ephemeral.

7.3 Using a URI to refer to the information resource accessible via that URI

To refer to the information resource accessible via a URI when that URI is dereferenceable, one generally uses the URI itself. E.g. 'http://example/ir' refers to IR('http://example/ir'), if 'http://example/ir' is dereferenceable. One might use such a URI in a metadata statement, for example: "The creator of http://example/ir is Carol", or, expressed equivalently in Turtle,

          <http://example/ir> dc:creator "Carol". 

If one wants to refer to an information resource, but it isn't accessible via any URI, one might choose a URI, publish the information resource's versions at that URI, and then use the URI to refer to the information resource.

An agent who encounters a URI and wants to know what it means can dereference it, and if the dereference is successful (HTTP 2xx status as opposed to 303 or 404 or anything else), [4] the agent can take the URI to be a reference to the information resource that is accessible via that URI. [5]
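
The agent's decision can be sketched as a function of the final HTTP status code. This is an illustration under stated assumptions: the name `interpret_dereference` is hypothetical, the status is assumed to be the one obtained after any simple redirects have been followed, and no network access is performed.

```python
def interpret_dereference(uri, status):
    """Return what `uri` may be taken to refer to under the convention, or None.

    A 2xx status licenses reading the URI as a reference to IR(uri).
    Any other status (303, 404, ...) licenses no such conclusion here.
    """
    if 200 <= status <= 299:
        return "IR('{}')".format(uri)
    return None


print(interpret_dereference("http://example/ir", 200))
print(interpret_dereference("http://example/p16", 303))
```

A result of None does not say that the URI has no meaning, only that this convention by itself does not supply one; for a 303 response, for instance, the agent would look to the redirect target for a definition.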

8 Acknowledgments

David Booth (affiliation), Nathan Rixham (affiliation), and Alan Ruttenberg (University at Buffalo) contributed to the creation of this report.

9 References

issue-14-resolved
[httpRange-14] Resolved. Email to www-tag list, 2005. (See http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html.)
issue-57
Issue 57. W3C Technical Architecture Group, 2007-2011. (See http://www.w3.org/2001/tag/group/track/issues/57.)
rfc-3986
Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, IETF, 2005. (See http://www.ietf.org/rfc/rfc3986.txt.)
rfc-4395
Guidelines and Registration Procedures for New URI Schemes. RFC 4395, IETF, 2006. (See http://www.ietf.org/rfc/rfc4395.txt.)
rfc-5988
Web linking. RFC 5988, IETF, 2010. (See http://www.ietf.org/rfc/rfc5988.txt.)
hostmeta
Web Host Metadata. Internet-draft, IETF, 2010. (See http://tools.ietf.org/html/draft-hammer-hostmeta-13.)
webarch
Architecture of the World Wide Web, Volume One. W3C Recommendation, December 2004. (See http://www.w3.org/TR/webarch/.)
disambiguating
Disambiguating RDF Identifiers. W3C, January 2003. (See http://www.w3.org/2002/12/rdf-identifiers/.)
uriqa
The URI Query Agent Protocol. Nokia, 2010. (See http://sw.nokia.com/uriqa/URIQA.html.)

End Notes

[1]
"Version of" as used here is similar to one of the senses in which the relationship "representation of" is used in discussions of Web architecture. Unfortunately these discussions have been waylaid by arguments over the meaning of the word "representation," due to the different ways in which Roy Fielding (in his REST work) and Tim Berners-Lee [citation needed] use the word. It seems better to avoid the word entirely and use a new word to specifically mean the Tim Berners-Lee sense.
[2]
Unfortunately the 'lsid' URN namespace is not in the IANA registry. Someone encountering an LSID may need to do a search to locate the LSID specification and consequently determine what the LSID means.
[3]

One might think that the notation IR(u) could relate the information resource to the referent of u (written '<http://example/p16>' in Turtle) instead of to u itself (written '"http://example/p16"^^xsd:anyURI'):

	    [ :accessibleVia <http://example/p16> ] 

But the meaning of this expression is then sensitive to the interpretation of the URI 'http://example/p16', which is exactly what the notation is meant to avoid. It is also ambiguous according to RDF semantics. If two URIs, say 'http://example/p16' and 'http://example/canoe571', both refer to the same thing (whatever it is), there might be two distinct information resources IR('http://example/p16') and IR('http://example/canoe571') satisfying this relationship, with no way to choose between them.

[4]
Simple redirects (301, 302, 307) are generally taken as transparent with respect to dereference, but this is a side issue that we don't want to take up in this report.
[5]
The "u refers to IR(u)" convention is a common and intuitive interpretation of the HTTP specification and is in widespread use. In 2005 the W3C TAG confirmed this interpretation (in contrast to "u refers to FV(u)") in its "httpRange-14 resolution" [issue-14-resolved].