Providing and discovering definitions of URIs

1 Introduction

This is an old issue, and people are tired of it. — Sandro Hawke, January 2003 [disambiguating]

In any kind of discourse it is very useful for an agent to be able to provide a definition of a term, in such a way that other agents can discover and use that definition in order to make sense of utterances that use that term, and to compose new ones.

Example: Definition discovery

Suppose that Alice, in communication with Bob, uses the term "EQ 018" to mean the Loma Prieta earthquake, as in "Alice was in the laboratory during EQ 018". If Bob does not know what "EQ 018" means, he will have to find out. He might be able to ask Alice directly, although this may be impossible, as Alice might be too busy, or otherwise unavailable. Lacking that option he does some research, consulting a dictionary or similar resource (reference book, database, search engine) in order to obtain the explanation of Alice's use of the term "EQ 018".

In this report, the terms to be defined are assumed to be URIs. URIs can be used to mean all sorts of things in many different technical contexts. Contexts of special interest to this report are those processed by machine, including the RDF and OWL family of languages. The question may appear to be limited to RDF and its derivatives, but to the extent that there is supposed to be a single meaning for each URI common to RDF and Web architecture [webarch], the issue transcends RDF.

The nature of definitions need not concern us here - many forms are familiar, including translation between languages (e.g. providing an English or Spanish phrase equivalent to a URI), descriptions (the URI refers to an entity possessing some set of properties), explanation by example, axiomatic method, and so on. Also not of concern here are the many ways in which meaning can fail as a result of what a definition says or doesn't say about the URI in question, or the particular way in which a URI is used. Our concern is only with the method by which definitions are conveyed, and with meaning only to the extent the method impinges on interpretation.

Definitions are typically carried in documents. No assumptions are made about what else might be in such a document; there could be additional related information, definitions of other URIs, and so on. Nor is it important here that a definition be delimited or set off from the other information in the document. As in an encyclopedia, the definition part blurs into the other-information parts of the document.

When the term to be defined is a URI, discovery methods include, in addition to those already mentioned, network protocols such as HTTP that involve the URI as a protocol element.

Definition discovery is similar to Web dereference in that in both cases one starts with a URI and ends with a document. The two must not be confused, however, since dereference often yields a document that either does not define the URI or is not recognized as doing so. At present, by convention, a dereferenceable absolute URI refers to the information resource on the Web at that URI (see [ir]), independent of anything that the information resource says about what the URI means.

The reason we define definition discovery methods is interoperability: so that there is agreement on how each URI is to be understood. In principle, we only need consensus on methods such as the ones surveyed here for URIs that are to be shared widely. If agents in one community never use the URI in communication with agents in another community, then it is OK for the URI to have distinct senses in the two communities, and there is no problem to be solved. Each community can use the URI in its own way, and there will be no confusion.

The operative word here is "if". Isolation is fragile and means lost opportunities for synergy and unintended reuse. All the arguments in favor of a World Wide Web, which depends on the global nature of the URI vocabulary, apply here.

This report presents discovery methods in current use, reports some criticisms of them, and describes some additional discovery methods that have been proposed to address the criticisms.

1.1 Success criteria

The ideal definition discovery method would have the following properties:

Simple. Having too many options or too many things to remember makes discovery fragile and impedes uptake.
Easy to deploy on Web hosting services. Uptake of linked data depends on the technology being accessible to as many Web publishers as possible, so deployment should not require control over Web server behavior that is not provided by typical hosting services.
Easy to deploy using existing Web client stacks. Discovery should employ a widely deployed network protocol in order to avoid the need to deploy new protocol stacks.
Efficient. Accessing a definition should require at most one network round trip, and definitions should be cacheable.
Browser-friendly. It should be possible to configure a URI that has a discoverable definition so that 'browsing' to it yields information useful to a human.
Compatible with Web architecture. A URI should have a single agreed meaning globally, whether it's used as a protocol element, hyperlink, or name.

It is not certain that all of these goals can be met simultaneously.

2 Use case scenarios

Use cases need to be presented independently of any particular solution, so that the solution space can be explored without bias. This leads to some frustrating vagueness in the following, but the vagueness is intentional and necessary.

2.1 Choosing a URI, providing a definition of the URI, using the URI

Alice wants to refer to a particular earthquake. Alice "mints" a new URI (one that is not yet in use) with the purpose of using that URI to refer to the earthquake. Alice publishes a document containing a definition of the URI, i.e. a document that would lead a reader to understand that the URI refers to the earthquake.

Bob then learns of Alice's URI and its definition, and uses the URI in a document of his own.

Subsequently Carol encounters Bob's document. Wanting to know what the URI means, she is led somehow to Alice's published definition, which she reads. She is enlightened.

Any method for implementing this use case would need to explain: what kind of URI Alice should use (syntactic constraints); where and how Alice should publish the definition so that it can be found; and how Carol might come to discover Alice's definition, given the URI.

2.2 Using a document as a definition by reference to its primary topic

Editorial note	2011-04-14
Consider dropping this use case, and explain the situation in some less prominent way. The only evidence we have for this situation is from Hugh Glaser's message, and most of the discussion in this document does not apply to this case.

Bob desires to refer to Chicago. He finds a Web page on the Web at 'http://example/about-chicago' (provided by, say, Alice) that consists of a description of Chicago, and wants to use it for the purpose of referring to Chicago. He chooses a URI and associates it with Alice's Web page in such a way that Bob's URI will be understood as referring to Chicago.

Carol encounters Bob's URI, is led to 'http://example/about-chicago' and thence to Alice's description of Chicago, and then somehow understands that Bob's URI is meant to refer to Chicago.

Any method for implementing this use case would need to explain: what are the syntactic constraints on the URI Bob chooses; what Bob needs to do to associate his URI with the document about Chicago; and how Carol comes to discover and use that association.

(This differs from the previous use case in that the document about Chicago was not written with the purpose of defining Bob's URI. In fact Bob's URI doesn't even occur in it. Rather than look in the document for a definition mentioning Bob's URI, Carol must determine the topic of the document and take the topic as the meaning of Bob's URI.)

3 General definition methods in current use

This section describes currently accepted methods for providing and discovering definitions of URIs.

3.1 Colocate definition and use

One way to lead someone encountering a URI to a definition of the URI is to make sure that the definition of the URI occurs in each document in which the URI occurs. This makes the definition easy to find, since anyone who encounters the URI will have in hand the definition that they need. The form of the URI in this case is arbitrary.

This method treats URIs similarly to blank nodes in RDF, which have to stay close to their own definition, since they are scoped to a graph. An example of the application of this approach would be the use of a URI in an OWL ontology file that defines that URI.

Criticism: In RDF, this method is fragile in the same way as are blank nodes, because use and definition can get separated, e.g. when uses of the URI are deposited into a triple store and then retrieved by a query. Carrying a definition around with a reference does not help in the common case where an out-of-context reference is needed (as one would want in, say, a Semantic Web).

3.2 Point to the document that contains the URI's definition

When using a URI, provide, again in the document in which the URI occurs, a reference to a document that carries a definition of the URI. This is the approach taken by OWL; the document containing the URI is related to the one from which the definition of the URI should be obtained via the owl:imports relation.^[1]

The rdfs:isDefinedBy property might also be used for this purpose, but in practice it probably isn't.

Criticism: Like the previous approach, this one is good so far as it goes, but it suffers in similar ways. The URI and the link to its definition can get separated, or keeping the definition link close to the occurrence of the URI may prove to be too difficult for applications.

3.3 Register a URI scheme or URN namespace

In principle, one could create a new URI scheme or URN namespace, in which case the registration document would constitute a definition (although perhaps not on its own; often there is delegation of some kind to other documents). A recent example is RFC 5870 for URIs defined to name geographic locations. Another is the definition of the URI about:blank, which is in progress as of this writing. A "tdb:" (thing-described-by) URI scheme has also been proposed, [TBD: cite Masinter] as has "xri:" for "extensible resource identifiers" (n.b. xri: has been deprecated in favor of http: and Web Linking). See [rfc4395] and [rfc3406] for details.

Criticism: The review process for new URI schemes and URN namespaces is probably too stringent for all but a very few definition discovery applications. There would likely be poor protocol support for discovering definitions in a new URI scheme or URN namespace. It is possible, manually, to look up a scheme or namespace in the appropriate registry, but few client applications are able to do this, and the resulting document is not machine actionable in any standard way. One could attempt to modify all Web clients to understand the new scheme, but this would be difficult.

3.4 Use the LSID getMetadata() method

A URN namespace for which there is a general definition method is the 'lsid' namespace. URIs beginning 'urn:lsid:' are called LSIDs. [lsid] LSIDs have an associated SOAP-based protocol that has separate methods for dereference (getData) and discovery (getMetadata). According to the LSID specification, an LSID for which the getData method yields nonempty content refers to a representation, while the LSID could refer to anything at all if getData yields empty content. In the latter case the information yielded by the getMetadata method generally constitutes, or at least contains, a definition of the LSID.

For clients lacking an LSID protocol implementation, HTTP/LSID gateways are available.

The LSID protocol improves on 303 redirects (see below) in that only one round trip is required to obtain a definition.

Criticism: LSIDs rely on an unregistered URN namespace, calling their consensus status into question and making them impossible to understand through the usual chain of IETF URI specifications. The LSID protocol itself is poorly deployed. As currently used, LSIDs rely on DNS for both authority and resolution, and therefore have the same vulnerabilities as http: URIs. LSIDs do not meet the "browser friendly" criterion.

3.5 'Hash URI'

With this method, the URI must be a 'hash URI', i.e. must contain a hash character '#'. (For historical reasons the part of the URI following '#' is called the 'fragment identifier', even when it is null.) The definition of the URI is placed in the document on the Web at the URI that is the pre-hash stem of the URI.

Example: 'Hash URI'

The interpretation of a 'hash URI', say 'http://example/eq#eq018', depends (according to [rfc3986]) on the media types of representations of the information resource on the Web at its stem URI 'http://example/eq'. For media type application/rdf+xml, the media type registration defers to the content of the representation — that is, the representation itself gets to arbitrarily define what the 'hash' URI means.^[2]
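
As a minimal sketch of how a client arrives at the stem URI, the following Python uses the standard library's urllib.parse.urldefrag to split the example URI. Only the stem would be sent in an HTTP request; the fragment identifier is interpreted on the client side against whatever representation comes back. The helper name stem_and_fragment is introduced here for illustration only.

```python
from urllib.parse import urldefrag


def stem_and_fragment(uri):
    """Split a 'hash URI' into its pre-hash stem and its fragment identifier."""
    stem, fragment = urldefrag(uri)
    return stem, fragment


stem, fragment = stem_and_fragment("http://example/eq#eq018")
print(stem)      # the URI to dereference when looking for a definition
print(fragment)  # interpreted by the client, according to the media type
```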

Criticism: Using 'hash URIs' in this way is a retrofit of an existing architecture intended for locating parts (fragments) of documents to definition discovery. As such the mechanism has some rough edges. Some of the objections to the use of 'hash URIs' are as follows.

3.5.1 'Hash URI' semantics is sensitive to media type

If there is content negotiation, session sensitivity, etc., then the definition that is intended and sought may not be present in the representation that is accessed. Worse, the definition that is found may be incompatibly different from the one that is meant. For example, if there is an application/rdf+xml representation and a text/html representation, then the former may define the URI to name an earthquake, while the latter may define it to name an HTML element.

Response: The answer to this objection is that a server that wants to avoid risking such confusion shouldn't do this. A server should either avoid content negotiation (CN) completely, or, if it must negotiate, make sure that the URI is defined in all representations, and in the same way in all of them.

At present the only media type registration that supports defining 'hash URIs' in arbitrary ways is application/rdf+xml. Since this media type has no human-friendly presentation and is not enabled for XSLT, many providers (e.g. FOAF, dx.doi.org) use CN between HTML and RDF so that access in a browser delivers information that is useful to a human. E.g. if you access FOAF without special CN parameters you will not get discoverable definitions of its non-element fragids.

The advent of RDFa, which should eliminate the need for HTML/RDF CN, may create an opportunity to smooth this inconsistency over.

3.5.2 The common 'hash URI' pattern fails with large namespaces

When a large number of URIs are formed by combining a fixed "namespace" prefix with many suffixes using hash as a connector, there will be a single underlying document at the pre-hash URI that must provide definitions of all of the large number of URIs. This is an unacceptable performance hit for the server, the network, and the client. Absolute URIs don't have this problem as the response can be specific to each URI.

Response: The answer to this has been reported a number of times [degraauw]. For a set of namespace members a, b, c, ... instead of using URIs

  http://example/ns#a  http://example/ns#b  http://example/ns#c ...

use URIs that look like

  http://example/ns/a#_  http://example/ns/b#_  http://example/ns/c#_ ...

where _ is a common suffix of your choice. (One might consider an empty suffix:

  http://example/ns/a#  http://example/ns/b#  http://example/ns/c# ...

but, while technically correct, this approach interacts badly with many deployed tools.)
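
To make the difference between the two patterns concrete, here is a small sketch that groups a set of 'hash URIs' by the stem document that would have to define them. The namespace members and the suffix '_' follow the examples above; the function name group_by_stem is an assumption made for this illustration.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_stem(uris):
    """Map each stem document URI to the hash URIs it would have to define."""
    groups = defaultdict(list)
    for uri in uris:
        stem, _fragment = urldefrag(uri)
        groups[stem].append(uri)
    return dict(groups)


members = ["a", "b", "c"]
single_document = group_by_stem(f"http://example/ns#{m}" for m in members)
per_term = group_by_stem(f"http://example/ns/{m}#_" for m in members)

print(len(single_document))  # documents needed under the first pattern
print(len(per_term))         # documents needed under the second pattern
```

Under the first pattern every definition lands in one document, however large the namespace; under the second, each term gets its own stem and so its own, smaller, response.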

3.5.3 Fragment identifiers are easily lost

Harry Halpin [halpin] says that fragment identifiers are often lost during document preparation and cut/paste operations.

Rumor has it that some MVC-based web frameworks (Django?, Sinatra?) are not good about preserving fragids. But this is just rumor; it needs to be verified.

Response: It's not obvious that this should be the case. More detail is needed on this objection. Concrete scenarios would help. This is really important because without the anti-hash arguments, there is no need to use absolute URIs.

3.5.4 'Hash URIs' don't support REST architecture

Manu Sporny says that hash URIs should work with HTTP PUT, POST, and DELETE methods; they don't.

Response: More information needed. Why not use a separate dereferenceable URI for REST controls related to the referent and/or documentation of a hash URI?

3.5.5 'Hash URIs' are unattractive, silly, and/or vestigial

?

3.6 Absolute URI with HTTP 303 See Other redirect

Initially (around 2000) 'hash URIs' were advanced as the recommended method for definition provision and discovery. In the 2002-2005 time period demand arose for a discovery method applicable to absolute URIs. This led to the invention of a new protocol for use in situations where 'hash URIs' are considered unacceptable.

In this approach, one mints an absolute (i.e. hashless) http: URI, puts a definition of it on the Web at a second URI, and then arranges for a GET request of the first URI to redirect, using a 303 'See Other' status code, to the second URI. The first URI is not dereferenceable, and therefore does not name the information resource at that URI (since there is none). The first URI then gets its meaning by interpreting the document on the Web at the second URI, which presumably contains a definition of the first URI. The document may carry definitions of other URIs as well, so the referent of the URI is not necessarily the document's primary topic - it may be only one of many things "described by" the document. [Draft note: TBD: cite HTTPbis]

Example: 303 redirect

Alice chooses 'http://example/eq018' as the way she will refer to a particular earthquake. At 'http://example/about-eq018' she publishes text and/or RDF that defines 'http://example/eq018', explaining the URI by providing details about the earthquake (date, location). For the URI 'http://example/eq018', which will not be dereferenceable (since otherwise, it would refer to the information resource at that URI [ir], not the earthquake), she arranges that a GET request yields a 303 redirect with a Location: header specifying 'http://example/about-eq018' as the redirect target.

Those encountering 'http://example/eq018' will attempt to dereference it, but this will fail, with a 303 redirect delivered instead. The 303 redirect indicates that the document at 'http://example/about-eq018' provides a definition of the URI 'http://example/eq018'.
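
The exchange can be sketched with Python's standard library alone. This is an illustration under stated assumptions, not a recommended deployment: the server runs on the local machine on an arbitrary free port, the paths '/eq018' and '/about-eq018' stand in for Alice's two URIs, and the DEFINED_BY table and the get helper are names invented for this sketch. The client deliberately does not follow redirects, so that both round trips are visible.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

# Hypothetical mapping from the path of a defined URI to the path of the
# document that carries its definition.
DEFINED_BY = {"/eq018": "/about-eq018"}
DEFINITION_TEXT = b"/eq018 refers to the Loma Prieta earthquake.\n"


class DefinitionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in DEFINED_BY:
            # Not dereferenceable: answer 303 See Other with an absolute
            # Location that names the defining document.
            host = self.headers.get("Host")
            self.send_response(303)
            self.send_header("Location", f"http://{host}{DEFINED_BY[self.path]}")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path in DEFINED_BY.values():
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(DEFINITION_TEXT)))
            self.end_headers()
            self.wfile.write(DEFINITION_TEXT)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet


def get(port, path):
    """Issue one GET without following redirects."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("GET", path)
    response = conn.getresponse()
    result = (response.status, response.getheader("Location"), response.read())
    conn.close()
    return result


server = HTTPServer(("127.0.0.1", 0), DefinitionHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

status, location, _ = get(port, "/eq018")                # first round trip
status2, _, body = get(port, urlsplit(location).path)    # second round trip
print(status, location, status2)

server.shutdown()
server.server_close()
```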

Another pattern is to use a 303 redirect to a document whose primary topic is the intended referent, similar to the Chicago use case (2.2 Using a document as a definition by reference to its primary topic). This could, in theory, lead to ambiguities, as the primary topic of the document and the entity referred to using the URI might be different things.

Criticism: Again, a number of objections to this approach have been raised:

3.6.1 303 is difficult, sometimes impossible, to deploy

Deploying a 303 redirect requires giving the correct directive to a web server, for example adding a Redirect line to .htaccess in Apache HTTPD. Unfortunately many hosting solutions do not allow this, putting this manner of publishing definitions off limits to many who would otherwise like to use it.

Response: Web publishers whose ISP does not permit them to set up a 303 redirect, or for whom the overhead such as expertise acquisition is prohibitive in some other way, could choose to use a service that provides 303 redirects to a location of their choosing. One such service is purl.org, operated by OCLC, which permits anyone to set up a 303 or other redirect from a URI in the purl.org domain. The URI to be defined would have to have the form http://purl.org/..., while the URI for the document carrying the definition could be anything at all.

Unfortunately, use of a redirect service makes one dependent on two service providers instead of one, making one's definitions more vulnerable than if only one provider were involved.

3.6.2 303 leads to too many round trips

To get definitions of N URIs by redirecting through 303 responses, you need to do 2N HTTP requests. This is a frustrating and apparently gratuitous performance hit for those interested in publishing and accessing large numbers of definitions.

Response: See 5.1 Absolute URI with site-specific discovery rules.

3.6.3 303 responses aren't cached

RFC 2616 [rfc2616] says that 303 responses shouldn't be cached. Some caching software obeys this directive, with negative consequences for the performance of GET/303 exchanges.

Response: This problem was recognized quite early on as a mistake in RFC 2616 [rfc2616], and an erratum was circulated. This is one of many changes made in HTTPbis, which is being developed by the IETF HTTP working group and should be published some time soon. Any software that fails to cache 303 responses when allowed to by HTTPbis needs to be fixed.

3.6.4 303 makes the URI difficult to bookmark

"Redirection has in fact very confusing side effects; as we expect the semantic web to work seamlessly with the web, it is very odd that a semantic web uri cannot be copy pasted to a browser without seeing it change to something that is not the same as before." [tumarello]

Response: The location bar issue is discussed here. [TBD: citation] The content from the redirect target does not originate from the referent of the original URI, so an interface that suggests otherwise is guilty of misattribution. The best answer to this is that an additional user interface element should be added to browsers that provides access to the original URI.

3.6.5 This use of 303 has no consensus specification

HH: "The hash 303 redirect method in common use has not received adequate review such as W3C recommendation track; in fact it is not really documented at all in any adequate form." [halpin]

Response: The IETF HTTP working group has taken on this issue. HTTPbis's new text for GET/303 specifies the pattern, which is now in common use in RDF deployment. There is no issue of incompatibility with prior usage because the current HTTP specification [rfc2616] only defines what 303 means in conjunction with POST and says nothing about what it means with GET.

4 Don't do it: Potential workarounds

If issues around 'hash URIs' and 303 redirects render them unacceptable, it is worth considering alternatives. In this section we reconsider ways in which definition discovery can be bypassed altogether. In the following section potential new discovery methods are considered.

4.1 Use something other than a URI

Editorial note	2011-04-14
This section derives from JAR's TAG F2F presentation slides. The purpose of talking about this idea would be mainly to remind people that the problem is one of notational engineering, not philosophy. I have been asked to remove this section.

URIs are just one kind of term that might be used to refer to something. If defining a URI is too difficult or costly, then perhaps one might do without. In RDF serializations such as Turtle, for example, we have blank node notation:

  [ foaf:isPrimaryTopicOf <http://example/about-chicago> ]

Here we have managed to refer to Chicago without defining a new URI; we have simply referred indirectly using a URI that refers to an information resource according to a generic method (see [ir]).

A concise alternative would be syntactic sugar:

  *<http://example/about-chicago>

which might be supported in a hypothetical new RDF serialization as a shorthand for the previous example. (The asterisk is meant to be suggestive of indirection in the C programming language.)

Criticism: These are good as far as they go, but they do not meet the demand for defined URIs. In particular, it can be difficult to detect that blank nodes in separate graphs are meant to refer to the same thing. Data integration is easier when shared URIs are used.

In the case of syntactic sugar, there would be adoption overhead in publishing new RDF serialization specifications and getting them implemented.

4.2 Express data in terms of information resources

[Or, "parallel properties."] The idea here is that you don't need to define a URI if you are willing to use properties that are defined or understood as indirecting through information resources. Instead, just use a URI that refers to the information resource at that URI, and use it as the subject of such properties.

Assume that each information resource can have an associated entity, which we'll call its "designated subject".^[3] Information about the designated subject is expressed using properties whose subject is the information resource.

Example: Combining metadata and data using the same URI

Suppose that Alice wants to record some information about an earthquake. She publishes a definition containing the following so that it's on the Web at the URI 'http://example/eq018':

  <http://example/eq018> eq:magnitude 6.9.
  <http://example/eq018> eq:epicenter <geo:37.040,-121.877>.

Bob then comes along and writes the following metadata about IR('http://example/eq018') in the usual way, i.e. using the URI to refer to the information resource, based on what information is accessed via that URI:

  <http://example/eq018> dc:creator "Alice".
  <http://example/eq018> dc:title 
    "Loma Prieta earthquake URI definition".

Suppose that Carol encounters both bits of RDF (or either) and needs to make sense of them. She is aware that 'http://example/eq018' might be used in both kinds of statement - in metadata, with the intent that the metadata is about IR('http://example/eq018'); and also in statements that relate to an earthquake.

Instead of defining eq:epicenter to be a property relating an earthquake to its epicenter, one defines eq:epicenter to be a property that relates an information resource to the epicenter of its designated subject. Then, as long as you have a URI for the IR, you don't need a URI for the earthquake. If property eq:epicenter has domain eq:Earthquake, then the members of eq:Earthquake are IRs whose designated subjects are earthquakes.

The nature of the designated subject is inferred from information found in the IR. For example, if the IR says that its eq:epicenter is E, then you can infer that the designated subject has epicenter E.

The overall effect when reading the RDF is that the information resources, being ubiquitous, seem to disappear, and one focuses naturally on information about their designated subjects without being aware of the indirection.

All considerations that apply to the subject of a property also apply to the object, making the situation more complex in ways that we won't work out in detail here.

[via TimBL] This pattern has some degree of uptake. Using the open graph protocol on Facebook, you can get a page about a movie. The RDF references <>, which is of class Movie. (<> is equivalent to a reference via the base URI, the one from which the page was retrieved, and therefore refers to an information resource.) The members of class Movie are information resources whose designated subjects are movies.

Criticism: If a property that refers directly to movies also needs to be used, then two properties have to be defined (with distinct URIs), one relating to the movie and one relating to the Movie. This results in clerical overhead and potential user confusion.

4.3 Rely on implicit coercion from an information resource to its designated subject

[Draft note: We are trying to represent Ed Summers's proposal, which others have echoed, in this section. This is sometimes called "punning".]

If one's domain of discourse mixes information resources (used as above) and entities that might be their designated subjects, then maintaining parallel properties, one set that applies the 'designated subject' coercion and one that doesn't, might be considered an unacceptable cognitive and clerical burden. (There is quite a lot of variation in opinion on this point.) In this case one might try combining the two properties into a single property that can be used in either way. Suppose that P is the initial property (not defined via designated subject coercion) and Q is the overloaded property we'd like to define and write. Then an obvious definition of Q would be

Q(x,y)
if and only if
P(x,y) OR P(designated-subject(x),y)

For example, taking P = dc:creator as defined by the Dublin Core definition, and Q = dc:creator as overloaded, the statement

  <http://example/eq018> dc:creator "Alice".

could be taken to imply that P(<http://example/eq018>, "Alice") as long as it is agreed ahead of time that earthquakes don't have creators.

This manner of overloading can make correct recovery of P-relationships impossible when a designated subject is an information resource, so it's probably better to use a "tie breaking" rule such as

Q(x,y)
if and only if
P(x,y) OR {P(designated-subject(x),y) AND designated-subject(x) is not an information resource}

There may be better tie-breakers than this one; this is just for illustration.
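
The two definitions of Q can be compared on toy data. The sketch below is only illustrative: the identifiers (doc:eq018, doc:review, thing:eq018), the designated-subject table, and the two sample properties are all made up for the purpose, and a property is modeled simply as a set of (subject, value) pairs.

```python
# Toy data; every identifier here is hypothetical.
INFORMATION_RESOURCES = {"doc:eq018", "doc:review"}
DESIGNATED_SUBJECT = {
    "doc:eq018": "thing:eq018",  # a document about an earthquake
    "doc:review": "doc:eq018",   # a document about another document
}

# Two non-overloaded properties P, each a set of (subject, value) pairs.
CREATOR = {("doc:eq018", "Alice"), ("doc:review", "Bob")}
EPICENTER = {("thing:eq018", "E")}


def q_naive(p, x, y):
    """Q(x,y) iff P(x,y) OR P(designated-subject(x),y)."""
    subject = DESIGNATED_SUBJECT.get(x)
    return (x, y) in p or (subject is not None and (subject, y) in p)


def q_tie_break(p, x, y):
    """As q_naive, but coerce only when the designated subject is not an IR."""
    subject = DESIGNATED_SUBJECT.get(x)
    coerced = (
        subject is not None
        and subject not in INFORMATION_RESOURCES
        and (subject, y) in p
    )
    return (x, y) in p or coerced


# Coercion through the designated subject works under both rules.
print(q_naive(EPICENTER, "doc:eq018", "E"), q_tie_break(EPICENTER, "doc:eq018", "E"))

# Without the tie-breaker, the review appears to have been created by Alice
# as well as by Bob, so the original P-relationship cannot be recovered.
print(q_naive(CREATOR, "doc:review", "Alice"), q_tie_break(CREATOR, "doc:review", "Alice"))
```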

All considerations that apply to the subject of a property also apply to the object, making the coercion rules that much more complex.

Criticism: This approach presents a couple of challenges.

First, any tie-breaking rule is going to be fragile and will make the "losing" side of the race difficult to express. One can expect many mistakes where the designated subject was the intended subject of some metadata but the tie-breaking rule implicated the other information resource.

Second, this method, by design, creates the illusion that the URI actually refers to the designated subject, not the information resource. If predicates that already possess meaning are being reinterpreted as overloaded properties, there is risk that an agent will draw unsound conclusions. For example, if two URIs u, v refer to distinct information resources with the same designated subject, and one then writes <u> owl:sameAs <v> having their designated subjects in mind, then a reader can incorrectly infer that the two information resources are identical. A similar situation holds with functional properties, which induce equations.

5 Potential new discovery methods

5.1 Absolute URI with site-specific discovery rules

The network round-trip (303 redirect) used to map the URI whose definition is to be discovered to the URI of the information resource that defines it can be avoided if we know a general rule that maps the one kind of URI to the other, as such a rule can be applied on the client without server involvement. It is probably too much to hope for that a single rule could work uniformly for all URIs whose definition might be sought, but an individual host may have a rule that applies for URIs at that host.

The "well known URIs" protocol gives a place where a file containing such rules can be stored [rfc5785]. The rule might be stored in a well-known file 'definition-rule', as in 'http://example/.well-known/definition-rule'. To obtain a definition of 'http://example/eq018', obtain the definition-rule file for its host. Then if the rule says to map 'http://example/{path}' to, say, 'http://example/{path}.about', a definition of 'http://example/eq018' can be sought by dereferencing 'http://example/eq018.about'.

When the mapping rule is cached, this reduces the number of round trips from two (in the 303 case) to one.
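
A client-side sketch of this idea follows. Since neither the name nor the format of the definition-rule file has been specified, everything here is an assumption: the rule is taken to be a single template string containing '{path}', the helper names are invented, and fetching the rule file is delegated to a caller-supplied function so that the sketch runs without network access.

```python
from urllib.parse import urlsplit

_rule_cache = {}  # host -> rule template, filled on first use


def well_known_rule_uri(uri):
    """URI of the (hypothetical) definition-rule file for the host of `uri`."""
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}/.well-known/definition-rule"


def apply_rule(template, uri):
    """Substitute the path of `uri` (without its leading slash) for '{path}'."""
    path = urlsplit(uri).path.lstrip("/")
    return template.replace("{path}", path)


def definition_uri(uri, fetch_rule):
    """Map `uri` to the URI of its defining document, caching rules per host."""
    host = urlsplit(uri).netloc
    if host not in _rule_cache:
        _rule_cache[host] = fetch_rule(well_known_rule_uri(uri))
    return apply_rule(_rule_cache[host], uri)


# Stand-in for an HTTP fetch of the rule file; records what was requested.
fetched = []


def fake_fetch_rule(rule_uri):
    fetched.append(rule_uri)
    return "http://example/{path}.about"


print(definition_uri("http://example/eq018", fake_fetch_rule))
print(definition_uri("http://example/eq019", fake_fetch_rule))
print(len(fetched))  # how many times the rule file was requested
```

After the first lookup for a host, each further definition costs only the one request needed to dereference the mapped URI.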

This would be a new protocol and the name and format of the definition-rule file would have to be pinned down. One option might be to use the link-template feature of the host-meta file, but registering a new well-known file name would also be a viable option.

Looking for a definition-rule file for every host that has URIs for which definitions need to be discovered would be expensive if only a few of them have such files, but with some cleverness the number of such failed requests can probably be kept small. The details would have to be worked out, but this approach could be a boon to bulk consumers of absolute URI definitions.

For compatibility with clients that are not aware of discovery rules, 303 redirects for these URIs should be retained when possible.

Criticism: Web site authors without write access to the appropriate .well-known file would not be able to take advantage of this facility.

Jeni says: "the disadvantage is that you lose the distinction between status codes for the thing [described] and the document" -- but JAR doesn't understand this. Any information that would have been conveyed by the status code from a GET on the original URI could be conveyed in the document retrieved by definition discovery.

Jeni says: "in some cases the mapping from thing URI to document URI can be complex or change over time in ways that make it hard to use a definition rule file; in legislation.gov.uk for example, we return a 303 redirection from a legislation item to either an as-enacted version or the most recently revised version, depending on what is available for that particular item of legislation (which changes as new revised versions are added). It would be quite hard to create a definition-rule file in those circumstances (we would have to solve it by having a simple mapping with some URIs 307 redirecting to others)."

5.2 Absolute URI with new HTTP method or status code

To reduce the number of round trips relative to the 303 redirect, we might use a new HTTP status code to indicate that what is being returned is a definition of the request URI, rather than a representation associated with the information resource at that URI. Alternatively, we could define an HTTP method to request a definition of a URI.

New status code: In response to GET of a URI, a server might provide a definition of the URI directly in a response with a status other than 200 OK, as opposed to indirectly via a 303 redirect. (The definition can't go in a 200 OK response to GET since that would mean that the URI refers to the information resource at the URI.) Possibilities for HTTP response status codes that might signal this situation: 203 Non-Authoritative Information; a new 2xx status (maybe 209); a new 3xx status (maybe 308); or a variety of 4xx codes. Placing the definition in the content of a redirect response (status codes 301, 302, 303, or 307) is unsatisfactory as the content would not be displayed in a Web browser; the same situation might apply to any 3xx or 4xx response, making a 2xx status code the most attractive.
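To make the client-side consequence concrete, here is a minimal Python sketch of how a client might interpret a GET response under the new-status-code option. The choice of 209 as the definition-bearing status and the names DEFINITION_STATUS and interpret_response are assumptions for illustration; no such status code is registered.

```python
# Hypothetical status code meaning "the body is a definition of the request URI".
DEFINITION_STATUS = 209


def interpret_response(status, location=None):
    """Classify a GET response for the purpose of definition discovery.

    Returns a (kind, detail) pair:
      ("representation", None)  - body is a representation of the information
                                  resource at the request URI
      ("definition", None)      - body is itself a definition (one round trip)
      ("see-definition", uri)   - a definition may be found by a second GET on uri
      ("unknown", None)         - nothing can be concluded from the status alone
    """
    if status == 200:
        return ("representation", None)
    if status == DEFINITION_STATUS:
        return ("definition", None)
    if status == 303 and location:
        return ("see-definition", location)
    return ("unknown", None)


if __name__ == "__main__":
    print(interpret_response(209))
    print(interpret_response(303, "http://example/eq018.about"))
```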

New method: The URIQA specification [uriqa] defines MGET, a new HTTP request method. An MGET request on a URI yields a response containing information about the referent of the URI. If the URI is dereferenceable, then the URI refers to the information resource at that URI, so the MGET result is metadata for that information resource. Otherwise, the MGET result might be a definition of the URI. A GET in that case would yield a 303 See Other linking to the same definition obtained by MGET, or maybe a 405 Method Not Allowed response.

Either of these options would mean fewer round trips than following a 303 redirect.

The Link: HTTP header [rfc5988] is useful for indicating a metadata source for an information resource (see POWDER spec, citation needed). In case a URI is not dereferenceable, Link: could be used for directing a client to a definition of a URI. However, the advantage of Link: over a 303 redirect is unclear, since a second network round trip would be required in either case.

Criticism: Although they reduce the expected number of round trips, HTTP extensions are generally as difficult to deploy as 303 redirects, or more so. It is also not clear which status codes play nicely with the "browser friendly" goal. We would have to check that proxies, caches, and Web clients do something reasonable with the proposed status code.

5.3 Repurpose some or all dereferenceable absolute URIs

Under this approach, some or all dereferenceable absolute URIs - call them "indirect" URIs - would get their meaning according to a definition found in the information resource (document, usually) at the URI; they would no longer refer to their information resource [ir]. This approach avoids the deployment and performance difficulties of 303 redirects. Defining an indirect URI is easy — it is the same as publishing any Web document — and access to its definition is also easy, not requiring an indirection step.

How does one learn whether a URI is indirect or not? One might like to say that an indirect URI is one that dereferences to a definition of itself, and that all others are direct. But this criterion is not machine actionable as stated, both because the definition might be couched in an arbitrary language or notation (the number of RDF serializations is increasing steadily), and because even for a known notation it may not be obvious how to distinguish content that contains a definition of a particular URI from content that doesn't. One actionable approximation that has been proposed is as follows: If IR(u) has an associated representation with media type 'application/rdf+xml', then take u to be indirect; otherwise take u to be direct. This rule would generate false positives (e.g. RDF/XML documents not containing u) and false negatives (e.g. those defining the URI only in an associated text/owl-manchester representation), but it illustrates the idea.

In order to compose or use metadata, agents would first check whether a URI is direct by requesting an application/rdf+xml representation. If the URI is direct, agents could compose or use metadata in the usual way (at some risk that the URI might change status in the future from direct to indirect). If the URI is indirect, agents would have to write or interpret the metadata in some new way (see below).
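The approximation rule can be sketched in a few lines of Python. The function names are illustrative assumptions, and the input is taken to be the Content-Type header values of whatever representations the agent has obtained for the URI; how those are obtained (for example by content negotiation) is left out.

```python
RDF_XML = "application/rdf+xml"


def media_type(content_type):
    """Strip parameters such as charset from a Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()


def is_indirect(content_types):
    """Apply the approximation: u is indirect if IR(u) has an RDF/XML representation.

    content_types is an iterable of Content-Type header values observed for u.
    The rule only looks at media types, so it inherits the false positives and
    false negatives noted in the text.
    """
    return any(media_type(ct) == RDF_XML for ct in content_types)


if __name__ == "__main__":
    print(is_indirect(["text/html", "application/rdf+xml; charset=utf-8"]))
    print(is_indirect(["text/owl-manchester"]))
```

The second call shows the false negative mentioned above: a URI defined only in a text/owl-manchester representation is classified as direct.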

Criticism: Currently it is easy to write and interpret Web metadata (meaning metadata written using a dereferenceable absolute URI to refer to the information resource at that URI). This proposal makes metadata more complicated, fragile, and costly, and forces all existing producers and consumers of Web metadata to be updated to be aware of indirect URIs.

It is likely that there is deployed content that would be interpreted differently under the proposed rule than at present. This would be hard to know, and inconsistencies could be consequential, such as the assignment of authorship or a copyright license to the wrong information resource. (Think about the case where an information resource at URI U defines U to be a different information resource.) More complex and costly heuristics than those given above might eliminate some kinds of misinterpretation, but would never eliminate it.

As most of the Web (e.g. HTTP clients and servers) will continue to adhere to the current interpretation of dereferenceable absolute URIs, the proposed rule introduces a split in the URI namespace, with two communities interpreting the same URIs in incompatible ways. Having multiple namespaces imposes an overall system cost in that one has to determine which one to use in each instance (see [webarch] 2.2.1).

5.3.1 How to refer to information resources, then?

Any proposal that displaces the current meanings of some URIs from those URIs has to compensate by providing new homes for those meanings. That is, some rule must be specified that yields a way to refer to IR(u), given any dereferenceable absolute URI u. This is not a matter of semantics or philosophy; it is just notational engineering.

There are many applications that need such a rule for writing references to information resources at arbitrary URIs, including those concerned with metadata (including licensing), provenance, Web site testing, validation, text processing, text annotation, and access control.

A standard way to refer to IR(u) is needed in a variety of circumstances:

when u is an indirect URI
when it is not known whether u is direct or indirect
when the cost of determining whether u is direct or indirect is judged to be too high
when it is desired not to impose on others the cost of determining whether u is indirect
to guard against u possibly becoming indirect in the future

Although direct URIs might still be used to refer to their information resources when they are known to be direct, the risks and costs of doing so might lead people to abandon them in favor of a common approach that works uniformly for direct and indirect URIs.

In any case, there are many design alternatives for referring to an information resource other than using its URI. For example, the Turtle term

    [ ir:onWebAt "http://example/eq018"^^xsd:anyURI ]

could be a new way to refer to IR('http://example/eq018'), which we formerly referred to in Turtle as '<http://example/eq018>'. [TBD: Reference Halpin and Presutti's closed access ESWC 2009 paper.] A local shorthand for use within a document or graph could be defined to the same effect:

    :about-eq018 ir:onWebAt "http://example/eq018"^^xsd:anyURI .

(Note that :about-eq018 could be either a 'hash' URI or a 303 URI.)
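Because the blank-node notation depends only on the URI string, it can be generated mechanically for any u without first determining whether u is direct or indirect. A minimal Python sketch follows; the function name ir_term is an assumption, and the ir: and xsd: prefixes are presumed to be declared in the enclosing Turtle document.

```python
def ir_term(uri):
    """Return a Turtle blank-node term referring to IR(uri).

    The URI is written as a typed literal, so characters that are special
    inside a Turtle string (backslash and double quote) are escaped.
    """
    escaped = uri.replace("\\", "\\\\").replace('"', '\\"')
    return f'[ ir:onWebAt "{escaped}"^^xsd:anyURI ]'


if __name__ == "__main__":
    print(ir_term("http://example/eq018"))
```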

Yet another possible replacement notation would be syntactic sugar:

    &<http://example/eq018>

which might be supported in a hypothetical new RDF serialization. (The ampersand is meant to suggest the address-of operator in the C programming language. A new serialization would of course have significant deployment cost.)

Alternatively, the referring document could just assert that a URI is direct, without checking whether it is or not:

    <http://example/eq018> ir:onWebAt "http://example/eq018"^^xsd:anyURI .

This would be an instance of 3.1 Colocate definition and use. However, this runs some interoperability risk as there may be other agents that interpret the same URI as indirect. ^[4]

Another design option would be a rule or protocol for providing a URI (other than u) to refer to IR(u), when one is available. One way to do this would be with a Link: HTTP response header [rfc5988]: if GET u or HEAD u yielded a response with a Link: header with an agreed link relation, the target of the link would be the URI naming IR(u). Using a Content-location: header has also been suggested. It would be necessary that the extra header be provided for all indirect URIs, since otherwise some of these information resources would lack URIs.
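A client-side sketch of the Link: variant might look as follows in Python. The link relation name "ir-uri" is purely hypothetical (no relation has been agreed), and the parser is deliberately minimal: it does not handle commas inside quoted parameter values.

```python
import re

# One link-value: '<target>' followed by its parameters, up to the next comma.
_LINK = re.compile(r'<([^>]*)>([^,]*)')
# The rel parameter, quoted or unquoted.
_REL = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;",]+))', re.IGNORECASE)


def link_targets(header_value, relation):
    """Return the targets in a Link: header value that carry the given relation."""
    wanted = relation.lower()
    targets = []
    for target, params in _LINK.findall(header_value):
        match = _REL.search(params)
        if match is None:
            continue
        # A rel value may list several space-separated relation types.
        rels = (match.group(1) or match.group(2) or "").lower().split()
        if wanted in rels:
            targets.append(target)
    return targets


if __name__ == "__main__":
    header = '<http://example/eq018.doc>; rel="ir-uri", <http://example/next>; rel=next'
    print(link_targets(header, "ir-uri"))
```

If the returned list is empty for an indirect URI u, the client has no URI with which to name IR(u), which is why the header would have to be provided for every indirect URI.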

It is not clear how difficult it would be to correctly deploy Link: or Content-Location: headers on hosting services.

6 Summary

[Jeni: "I think you could do with making more of (ie explaining in more detail up front) the criteria against which the various alternatives are judged. There are various criteria that crop up in the criticism sections that aren't necessarily reflected in the table here, such as the copy/paste factor, cachability (as I described above)"]

The following table summarizes some of the current and proposed definition discovery methods, evaluating each against a set of criteria, as explained in the key below.

Method                      webarch?  robust?  easy to deploy?  min round trips  sound?
Hash                        +         -        +                1                +
Absolute + 303              +         +        -                2                +
Absolute + discovery-rule   +         +        ?                1+ε              +
Absolute + new HTTP         +         +        -                1                +
Coerce                      +         +        +                1                -
Take at face value          -         +        +                1                +

webarch?: Is it compatible with Web architecture, i.e. does it avoid assigning a new, incompatible meaning to existing URIs?
robust?: Is the URI free of fragment identifiers that can get lost or misinterpreted?
easy to deploy?: Can a publisher with a file-upload-only hosting solution use this method?
min round trips: How many network round trips are needed to find a definition, assuming (a) the definition is not cached and (b) the /.well-known/host-meta cache misses with probability ε?
sound?: Is the method likely to respect deployed axioms and inference rules (i.e. is safe with respect to logical soundness)?
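The "1+ε" entry for the discovery-rule method can be read as an expected value: one round trip when the rule is already cached, two when it must be fetched first. A small Python sketch of that arithmetic, with an illustrative function name:

```python
def expected_round_trips(miss_probability):
    """Expected round trips to obtain a definition via a discovery rule.

    With probability (1 - miss_probability) the rule is cached and only the
    definition is fetched (1 trip); otherwise the rule file is fetched first
    and then the definition (2 trips).
    """
    if not 0.0 <= miss_probability <= 1.0:
        raise ValueError("miss_probability must be between 0 and 1")
    hit_probability = 1.0 - miss_probability
    return hit_probability * 1 + miss_probability * 2


if __name__ == "__main__":
    print(expected_round_trips(0.0))
    print(expected_round_trips(1.0))
```

The expression simplifies to 1 + ε, which is the figure given in the table.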

7 Glossary

This section defines terms that are used in this report. An attempt has been made to avoid gratuitous differences from the way these terms are used elsewhere, but in a few cases choice of terminology has been difficult and words with other meanings (such as "definition") are given technical definitions. These definitions are not being proposed for general adoption.

[Draft comment: All terminology choices are provisional; for most of them I am testing the waters to see how well the word works, and am prepared to change.]

absolute: A URI is absolute if it contains no hash '#' sign. This usage is a bit unintuitive but is used for consistency with RFC 3986 [rfc3986].
associated with: [Draft note: This is too sketchy. TBD.] "Association" of a representation with an information resource is by fiat according to each particular information resource. See [ir].
definition: A document or document part that provides information about the meaning of a URI or other kind of term. This term is not meant to be either rigorous or exclusive. The "information" could be provided in any human-readable or machine-readable language, or combination of languages. It needn't be successful, specific, or comprehensive in defining the term in the ordinary sense of "defining". Rather, the term as used here refers to the role it plays in discovery. We might more accurately say "putative definition". [Draft note: Alan R: Is a sound recording a possible definition?]
dereferenceable: A URI is dereferenceable if there is at least one representation that is authorized as the result of a retrieval operation. (This definition is derived from [rfc3986] section 1.2.2, which also applies 'dereference' to operations such as POST.) In particular, absolute http: URIs are dereferenceable if some HTTP method or equivalent is successful (yields a 2xx response). Some URIs belonging to some other URI schemes are also dereferenceable.
http: URI: A URI whose scheme (the part before the colon) is 'http' or 'https'.
information resource: Roughly speaking, something that is appropriate as the subject of metadata. See [ir].
IR(u): IR(u) is shorthand for the information resource on the Web at URI u. For example, if 'http://example/image23' is dereferenceable, then IR('http://example/image23') is the information resource on the Web at that URI.
metadata: Information about information, or about an information resource. In RDF, metadata might be written using vocabularies such as Dublin Core, FOAF, or CC REL.
on the Web at: When a URI is dereferenceable, "the information resource on the Web at a URI" (abbreviated IR(that URI); see IR(u) above) is the information resource whose associated representations are the ones obtained by dereferencing that URI (or more precisely, the ones that are authorized for dereferences of that URI). See [ir] for a rigorous definition.
refer: For the purposes of this report, reference is just one way to mean. There may be ways to mean other than to refer, but none are specified here.
representation: Content (an octet sequence) tagged with media type and perhaps other information meant to guide interpretation of the content. "Representation" is used as a term of art; these representations don't necessarily "represent" anything at all. Similar to "entity" in RFC 2616. [rfc2616] See [ir] for a treatment of representations and information resources.

8 Acknowledgments

David Booth, Michael Hausenblas, Nathan Rixham, and Alan Ruttenberg contributed to the creation of this report. Pat Hayes and Henry S. Thompson participated in discussions. Timothy Danford gave some helpful suggestions on a draft. Jeni Tennison and the rest of the TAG gave many helpful comments.

9 References

issue-57: Issue 57. W3C Technical Architecture Group, 2007-2011. (See http://www.w3.org/2001/tag/group/track/issues/57.)
rfc3986: T. Berners-Lee, R. Fielding, L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, IETF, 2005. (See http://www.ietf.org/rfc/rfc3986.txt.)
disambiguating: Sandro Hawke. Disambiguating RDF Identifiers. W3C, January 2003. (See http://www.w3.org/2002/12/rdf-identifiers/.)
webarch: Ian Jacobs and Norman Walsh, editors. Architecture of the World Wide Web, Volume One. W3C Recommendation, December 2004. (See http://www.w3.org/TR/webarch/.)
ir: Jonathan A. Rees, editor. Information resources and Web metadata. Editor's draft, W3C, 2011. (See http://www.w3.org/2001/tag/awwsw/ir/20110625/.)
rfc4395: T. Hansen, T. Hardie, and L. Masinter. Guidelines and Registration Procedures for New URI Schemes. RFC 4395, IETF, 2006. (See http://www.ietf.org/rfc/rfc4395.txt.)
rfc3406: L. Daigle, D.W. van Gulik, R. Iannella, and P. Faltstrom. Uniform Resource Names (URN) Namespace Definition Mechanisms. RFC 3406, IETF, 2002. (See http://www.ietf.org/html/rfc3406.txt.)
lsid: Life Sciences Identifiers Specification. Object Management Group, 2004. (See http://www.omg.org/cgi-bin/doc?dtc/04-05-01.pdf.)
rfc5988: M. Nottingham. Web linking. RFC 5988, IETF, 2010. (See http://www.ietf.org/rfc/rfc5988.txt.)
hostmeta: E. Hammer-Lahav. Web Host Metadata. Internet-draft, IETF, 2010. (See http://tools.ietf.org/html/draft-hammer-hostmeta-13.)
uriqa: Patrick Stickler. The URI Query Agent Protocol. Nokia, 2010. (See http://sw.nokia.com/uriqa/URIQA.html.)
halpin: Harry Halpin. Reversing HTTP Range 14 and SemWeb Cool URIs decision. Email to public-awwsw list, 2011. (See http://lists.w3.org/Archives/Public/public-awwsw/2011Jan/0021.html.)
degraauw: Marc de Graauw. The #referent convention. Blog post, 2007. (See http://www.marcdegraauw.com/2007/02/20/the-referent-convention/.)
tumarello: Giovanni Tumarello. http-range-14 303 issue, request for reopening the discussion. Email to www-tag list, 2007. (See http://lists.w3.org/Archives/Public/www-tag/2007Jul/0034.html.)
rfc2616: R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1. RFC 2616, IETF, 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)