This report has been developed by the
Earlier versions of this document have been reviewed by the
task group and the TAG but this version has not.
The content of this version is the sole responsibility of the
editor.
Publication of this draft
does not imply endorsement by the W3C Membership. This is
a draft document and may be updated, replaced, or obsoleted by
other documents at any time.
Please send comments on this
document to the publicly archived TAG mailing list
The specification governing Uniform Resource Identifiers
(URIs)
A few widely known methods are in use to help agents provide
and discover URI documentation,
including RDF fragment identifier resolution and the HTTP 303
'See Other'
redirect.
Difficulties in using these methods
have led to a search for new methods that
are easier to deploy, and perform better,
than the established ones.
However, some of the proposed methods introduce new problems, such
as incompatible changes to the way metadata is written.
This report
brings together in one place information on current and
proposed practices, with analysis of benefits and shortcomings
of each.
The purpose of this report is not to make recommendations but
rather to
explore the design space and
initiate a discussion that might lead to
consensus on the use of current and/or new methods.
$Id: issue57.xml,v 1.2 2012/01/30 20:58:13 jrees Exp $
In any kind of discourse it is very useful for an agent to be
able to provide documentation for a term, in such a way that other agents
can discover and use that documentation in order to make sense of
utterances that use that term, and to compose new utterances
that use it.
Suppose that Alice, in
communication with Bob, uses
the term "EQ 018" to mean
the Loma Prieta earthquake, as in "Alice was in the laboratory
during EQ 018". If Bob does
not know what "EQ 018" means, he will have to find out. He
might be able to ask Alice directly, although
this may be impossible, as Alice might be too busy, or
otherwise unavailable.
Lacking that option he does some research, consulting
a dictionary or similar resource (reference book, database,
search engine)
in order to obtain the
explanation of Alice's use of the term "EQ 018".
In this report, the terms to be documented are assumed to be
URIs. URIs can be used
to mean all sorts of things
in many different technical contexts. Contexts of
special interest to this report are
those processed by machine,
including the RDF and OWL family of languages. The question
may appear to
be limited to RDF and its derivatives, but to the
extent that there is supposed to be a single
meaning for each URI common to RDF and Web architecture
The nature of URI documentation need not concern us here - many forms
are familiar, including translation between
languages (e.g. providing an English or Spanish phrase equivalent to a
URI), descriptions (the URI refers to an entity possessing
some set of properties), explanation by example, axiomatic
method, and so on. Also
not of concern here are the many ways in which
meaning can fail as a result
of
URI documentation is typically carried in documents. No
assumptions are made about what else might be in such a
document; there could be additional related information,
documentation for other URIs, and so on. Nor is it important
here that URI documentation be delimited or set off from the other
information in the document. As in an encyclopedia, the
URI documentation part blurs into the other-information parts of the
document.
URI documentation
discovery methods
include, in addition to those already mentioned, network
protocols such as HTTP that involve the URI as a protocol element.
Henceforth, in a URI documentation discovery scenario, the URI whose URI
documentation is to be discovered will be called
the probe URI.
URI documentation discovery is similar to Web retrieval in that in
both cases one can start with a URI and end with a document.
The two must not be confused, however, since retrieval often
yields information that does
The reason we define URI documentation discovery methods is
interoperability: so that there is agreement on how each URI
is to be understood.
In principle, we only need consensus on methods, such as the ones
surveyed here, for URIs
that are to be shared widely. If
agents in one community never use the URI in communication with
agents in another community, then it is OK for the URI
to have
distinct senses in the two communities, and there is no
problem to be solved. Each community can use the URI in its
own way, and there will be no confusion.
The operative word here is "if". Isolation is fragile and
means lost opportunities for synergy and unintended reuse. All
the arguments in favor of a World Wide Web, which depends on the
global nature of the URI vocabulary, apply here.
This report presents discovery methods in current use,
reports some
criticisms of them, and describes some additional discovery methods that
have been proposed to address the criticisms.
No consensus on success criteria has emerged from the
discussion of this question. The following properties have
been articulated as desirable by various parties to the
discussion. Unfortunately they apparently form a mutually
inconsistent set.
It is not certain that all of these goals can be met
simultaneously.
Use cases need to be presented as being independent of any
particular solution to be used, in order that the solution space
can be explored without bias. This leads to some
frustrating vagueness in the following, but the vagueness is
intentional and necessary.
Alice wants to refer to a particular earthquake.
Alice "mints" a new URI (one that is not yet in use) with the
purpose of using that URI to refer to the earthquake. Alice
publishes a document containing documentation for the URI, i.e.
a document that
would lead a reader to understand that the URI refers to the
earthquake.
Bob then learns of Alice's URI and its documentation, and uses
the URI in a document
of his own.
Subsequently Carol encounters Bob's document. Wanting to
know what the URI means, she
is led somehow to Alice's published URI documentation, which she
reads. She is enlightened.
Any method for implementing this use case would need to explain:
what kind of URI Alice should use (syntactic constraints);
where and how Alice should publish the documentation so that it
can be found;
and how Carol might come to discover Alice's documentation, given
the URI.
Bob desires to refer to Chicago.
He finds a Web page
on the Web at 'http://example/about-chicago' (provided by,
say, Alice) that consists
of a description of Chicago, and wants to use it for the
purpose of referring to Chicago. He chooses
a URI and associates it with Alice's Web page
in such a way that Bob's URI will be understood as referring to
Chicago.
Carol encounters Bob's URI, is led to 'http://example/about-chicago'
and thence to Alice's description of
Chicago, and then somehow understands that Bob's URI is
meant to refer to Chicago.
Any method for implementing this use case would need to
explain: what are the syntactic constraints on the URI Bob
chooses; what
Bob needs to do to associate his URI with the document about
Chicago; and how Carol comes to discover and use that
association.
(This differs from the previous use case in that the
document about Chicago was
not written with the purpose of documenting Bob's URI. In fact
Bob's URI doesn't even occur in it. Rather than look
in the document for
URI documentation for Bob's URI, Carol must determine the
topic of the document and take the topic as the meaning of
Bob's URI.)
This section describes currently accepted methods for
providing and discovering URI documentation.
One way to lead someone encountering a URI to documentation
for the URI is to
make sure that the URI documentation occurs in
each document in which the URI occurs.
This makes the URI documentation easy to find, since anyone who
encounters the URI will already have it in hand.
The form of the URI in this case is arbitrary.
This method treats URIs similarly to blank nodes in RDF, which
have to stay close to their own documentation, since they
are scoped to a graph. An example of the application of
this approach would be the use of a
URI in an OWL ontology file that carries the URI documentation.
When using a URI, provide,
again in the document in which the URI occurs,
a recognizable
reference to a document that carries the URI documentation.
This is the approach taken by OWL; the document containing
the URI is related to the one from which the
URI documentation should be obtained via the owl:imports
relation.
It is possible to create a new URI scheme or URN
namespace equipped with its own URI documentation discovery regime.
A recent example is RFC 5870 for URIs documented as naming
geographic locations, where the RFC itself constitutes URI
documentation for all of its URIs. Another is
the URI documentation for the URI about:blank and other
about: URIs,
which is in progress as of this writing.
A "tdb:" (thing-described-by) URI scheme has also been
proposed,
[TBD: cite
Masinter]
as has "xri:" for
"extensible resource identifiers"
(n.b. xri: has been deprecated in favor of http: and Web Linking).
See
The most fully developed and widely implemented such design is
the 'lsid' URN namespace.
URIs beginning 'urn:lsid:' are called LSIDs.
For clients lacking an LSID protocol implementation,
HTTP/LSID gateways are available, suggesting the possible
applicability of the
With this method, the probe URI must be a 'hash URI', i.e. must contain a hash character '#'. The URI documentation is placed in the document on the Web at the stem (where stem URI = the pre-hash prefix of the URI).
For historical reasons the part of the URI following '#' is called the 'fragment identifier', even when it is null. We will call these 'local identifiers' in recognition of their uses beyond just references to document fragments.
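The stem/local-identifier split described above can be illustrated with a minimal sketch. The function name split_hash_uri is hypothetical and is introduced only for this illustration; it simply cuts the URI at the first '#'.

```python
from typing import Tuple


def split_hash_uri(uri: str) -> Tuple[str, str]:
    """Split a hash URI into (stem, local identifier).

    The stem is everything before the first '#'; the local identifier
    is everything after it and may be the empty string.
    """
    if "#" not in uri:
        raise ValueError(f"not a hash URI: {uri!r}")
    stem, _, local_identifier = uri.partition("#")
    return stem, local_identifier


# The document at the stem is where the URI documentation is expected.
stem, local_identifier = split_hash_uri("http://example/eq018#_")
```

A client following this method would retrieve the document at the stem and look in it for documentation of the full hash URI.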
The interpretation of a 'hash URI', say 'http://example/eq018#_',
depends (according to
Because of the dependence on media type, care must be
taken to ensure that content negotiation does not muddy the
meaning of the probe URI. Fortunately any of three
approaches may be used: (1) avoid content negotiation, (2)
make sure that all representations provide the same
documentation (following section 3.2.2 of
Similar considerations apply for competing use of local identifiers as script-defined or as document fragment identifiers: any potential conflicts must be either avoided or resolved.
A second caveat around hash URIs is that when a number of hash URIs are formed by combining a fixed namespace prefix (stem) with many different suffixes using hash as a connector, there must be a single underlying document at the stem URI that provides URI documentation for all of the URIs. This leads to a number of annoyances, including inefficiency (repeated retrieval of a large document is an unacceptable performance hit for the server, the network, and the client), analytics imprecision, and unavailability of HTTP methods such as DELETE specific to the particular URI.
The answer to this difficulty has been reported a number of times
(e.g.
use URIs that look like
where _ is a common suffix of your choice.
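As a rough sketch of this convention, and assuming the 'http://example/eq018#_' shape used as an example elsewhere in this report, the following shows how giving each term its own stem plus a fixed suffix yields one underlying document per term. The helper names mint_term_uri and stem_of are hypothetical.

```python
def mint_term_uri(namespace: str, name: str, suffix: str = "_") -> str:
    """Build a hash URI whose stem is specific to one term.

    Assumption for illustration: the namespace ends with '/', and the
    same suffix is reused for every term.
    """
    return f"{namespace}{name}#{suffix}"


def stem_of(uri: str) -> str:
    """Return the pre-hash prefix of a URI."""
    return uri.partition("#")[0]


eq018 = mint_term_uri("http://example/", "eq018")
eq019 = mint_term_uri("http://example/", "eq019")

# Each term now has a distinct stem, so each can be served, measured,
# and deleted independently of the others.
distinct_stems = stem_of(eq018) != stem_of(eq019)
```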
Rumor has it that some MVC-based web frameworks (Django?,
Sinatra?) are not
good about preserving local identifiers. This
needs to be verified.
(Desideratum:
A widely observed convention relating retrieval to meaning is the following:
In effect, a response to a retrieval request is equivalent,
according to
Convention 1, to URI documentation that says that the response
is an instance of the thing named by the URI. This in turn
implies (as explained in
Initially (around 2000) 'hash URIs' were advanced as the recommended method for URI documentation provision and discovery. In the 2002-2005 time period demand arose for a discovery method applicable to hashless URIs. This led to the invention of a new protocol for use in situations where 'hash URIs' are considered unacceptable.
In this approach, one mints an absolute hashless http: URI, puts documentation for it on the Web at a second URI, and then arranges for a GET request of the first (probe) URI to redirect, using a 303 'See Other' status code, to the second URI. The probe URI is not retrieval-enabled, and therefore does not name the resource at that URI according to Convention 1 (since there is none). The probe URI then gets its meaning by interpreting the document on the Web at the second URI, which presumably contains documentation for the first URI. The document may carry documentation for other URIs as well, so the referent of the URI is not necessarily the document's primary topic - it may be only one of many things "described by" the document. [Draft note: TBD: cite HTTPbis]
Alice chooses 'http://example/eq018' as the way she will refer
to a particular earthquake.
At 'http://example/about-eq018' she publishes text and/or RDF
that carries URI documentation for 'http://example/eq018',
explaining the URI's meaning by
providing details about the
earthquake (date, location).
For the URI 'http://example/eq018', which will not be
retrieval-enabled (since otherwise, it would, by
Convention 1, refer to the
resource on the Web at that URI
Those encountering 'http://example/eq018' will attempt a retrieval, but this will fail, with a 303 redirect delivered instead. The 303 redirect indicates that the document at 'http://example/about-eq018' provides documentation of the URI 'http://example/eq018'.
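The client-side reading of such responses might be sketched as follows. This is only an illustration under stated assumptions: the function name interpret_get_response and the three result labels are hypothetical, and the function works on an already-received status code and header mapping rather than performing any network access.

```python
from typing import Mapping, Optional, Tuple
from urllib.parse import urljoin


def interpret_get_response(
    probe_uri: str, status: int, headers: Mapping[str, str]
) -> Tuple[str, Optional[str]]:
    """Classify the response to a GET of the probe URI.

    Returns a (label, uri) pair:
      ("web-resource", probe_uri)   200: Convention 1 applies
      ("documented-at", target)     303: documentation is at the target
      ("undetermined", None)        anything else
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 200:
        return ("web-resource", probe_uri)
    if status == 303 and "location" in lowered:
        # Location may be relative; resolve it against the probe URI.
        return ("documented-at", urljoin(probe_uri, lowered["location"]))
    return ("undetermined", None)


result = interpret_get_response(
    "http://example/eq018", 303, {"Location": "http://example/about-eq018"}
)
```

In the 303 case the client would then retrieve the second URI and read the document found there as documentation for the probe URI.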
Another pattern is to use a 303 redirect to a document whose
primary topic is the intended referent, similar to the
Chicago use case (
Again, a number of objections to this approach have been raised:
Unfortunately, use of a redirect service makes one dependent on two service providers instead of one, making one's URI documentation more vulnerable than if only one provider were involved.
"Redirection has in fact very confusing side effects; as
we expect the
semantic web to work seamlessly with the web, it is very odd that a
semantic web uri cannot be copy pasted to a browser without seeing it
change to something that is not the same as before."
If issues around 'hash URIs' and 303 redirects render them unacceptable, it is worth considering alternatives. In this section we reconsider ways in which URI documentation discovery can be bypassed altogether. In the following section potential new discovery methods are considered.
URIs are just one kind of term that might be used to refer to something. If defining a URI is too difficult or costly, then perhaps one might do without. In RDF serializations such as Turtle, for example, we have blank node notation:
Here we have managed to refer to Chicago without defining a
new URI; we have simply referred indirectly using a URI that
refers to the resource on the Web at that URI
according to a generic method
(see
A concise alternative would be syntactic sugar:
which might be supported in a hypothetical new RDF serialization as a shorthand for the previous example. (The asterisk is meant to be suggestive of indirection in the C programming language.)
In the case of syntactic sugar, there would be adoption overhead in publishing new RDF serialization specifications and getting them implemented.
The idea here is that you don't need to document a URI if you are willing to use properties that are defined or understood as indirecting through documents. Instead, just use a URI that refers to the document on the Web at that URI, and use it as the subject of such properties.
Assume that each named document (i.e. document+name pair)
can have an associated entity, which we'll call its
"designated subject".
Suppose that Alice wants to record some information about an earthquake. She publishes URI documentation containing the following so that it's on the Web at the URI 'http://example/eq018':
Bob then comes along and writes the following metadata about OW@('http://example/eq018') in the usual way, i.e. using the URI to refer to that resource, based on what information is accessed via that URI:
Suppose that Carol encounters both bits of RDF (or either) and needs to make sense of them. She is aware that 'http://example/eq018' might be used in both kinds of statement - in metadata, with the intent that the metadata is about OW@('http://example/eq018'); and also in statements that relate to an earthquake.
Instead of defining eq:epicenter to be a property relating an earthquake to its epicenter, one documents eq:epicenter to be a property that relates a document to the epicenter of its designated subject. Then, as long as you have a URI for the IR, you don't need a URI for the earthquake. If property eq:epicenter has domain eq:Earthquake, then the members of eq:Earthquake are IRs whose designated subjects are earthquakes.
The nature of the designated subject is inferred from information found in the IR. For example, if the IR says that its eq:epicenter is E, then you can infer that the designated subject has epicenter E.
The overall effect when reading the RDF is that the documents, being ubiquitous, seem to disappear, and one focuses naturally on information about their designated subjects without being aware of the indirection.
All considerations that apply to the subject of a property also apply to the object, making the situation more complex in ways that we won't work out in detail here.
[via TimBL?] This pattern has some degree of uptake. Using the open graph protocol on Facebook, you can get a page about a movie. The RDF references <>, which is of class Movie. (<> is equivalent to a reference via the base URI, the one from which the page was retrieved, and therefore refers to a document.) The members of class Movie are documents whose designated subjects are movies. [is this message on topic?]
All rules presented in this section assume that the probe URI is a hashless http: URI.
For compatibility with clients that are not aware of new method(s) for hashless URIs, a complete discovery solution should grandfather discovery methods that are currently widely known, such as 303 redirects. A current method should be deployed when possible, redundantly. Lacking this, a 404 should be returned, and if the content of the 404 response can be controlled it should provide suitable information such as a link to the URI documentation. Agents would be faced with the problem of which method to attempt first, since if the new method doesn't yield URI documentation, a retrieval using the probe URI might have to be attempted (in hope of either success or a 303 'See Other'), resulting in one or two extra retrieval requests. It is the editor's belief that this problem is not insurmountable, but the details would have to be worked out.
The network round-trip (303 redirect) used to map the probe URI to the URI of the document that carries its URI documentation can be avoided if we know a general rule that maps the one kind of URI to the other, as such a rule can be applied on the client without server involvement.
The "well known URIs" specification
(There is nothing special about the string 'meta'; it could as easily be, say, 'about' or 'seealso'.)
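A client-side transformation of this kind could look like the sketch below. The exact mapping is an assumption made for illustration only: the probe URI's path is placed under a '/.well-known/meta' prefix on the same host. The function name documentation_uri and the default prefix are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def documentation_uri(probe_uri: str, prefix: str = "/.well-known/meta") -> str:
    """Map a hashless http(s) probe URI to a documentation URI.

    Assumed rule (illustrative only): keep scheme, host and query, and
    prepend the well-known prefix to the path. No server is contacted.
    """
    if "#" in probe_uri:
        raise ValueError("rule applies to hashless URIs only")
    parts = urlsplit(probe_uri)
    if parts.scheme not in ("http", "https"):
        raise ValueError("rule applies to http: and https: URIs only")
    return urlunsplit(
        (parts.scheme, parts.netloc, prefix + parts.path, parts.query, "")
    )


doc = documentation_uri("http://example/eq018")
```

Because the rule is applied entirely on the client, the only request needed is the retrieval of the computed documentation URI.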
Considering the transformation rule idea of the previous section, it is probably too much to hope for that a single rule could work uniformly for hosts whose documentation might be sought, but each individual host may have a rule that applies for URIs at that host.
To support site-specific rules,
a file containing such rules can be provided
When the mapping rule is cached, the number of round trips is one instead of two.
Although it would not be difficult to specify a new .well-known path and syntax for the documentation-rule document, it might be possible to use the link-template feature of the host-meta file. There are pros and cons for each approach.
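The effect of caching a per-host rule can be sketched as follows. Everything here is an assumption made for illustration: the rule is modelled as a template string with a '{uri}' placeholder (loosely in the spirit of a link template), the class name RuleCache is hypothetical, and the fetch_rule callable stands in for whatever request would actually obtain the rule from the host.

```python
from typing import Callable, Dict
from urllib.parse import quote, urlsplit


class RuleCache:
    """Cache one documentation-rule template per host."""

    def __init__(self, fetch_rule: Callable[[str], str]) -> None:
        self._fetch_rule = fetch_rule      # hypothetical: one round trip per call
        self._rules: Dict[str, str] = {}
        self.rule_fetches = 0              # counts round trips spent on rules

    def documentation_uri(self, probe_uri: str) -> str:
        host = urlsplit(probe_uri).netloc
        if host not in self._rules:
            self._rules[host] = self._fetch_rule(host)
            self.rule_fetches += 1
        # Substitute the percent-encoded probe URI into the template.
        return self._rules[host].replace("{uri}", quote(probe_uri, safe=""))


cache = RuleCache(lambda host: "http://example/about?uri={uri}")
first = cache.documentation_uri("http://example/eq018")
second = cache.documentation_uri("http://example/eq019")
```

After the first lookup for a host, later lookups reuse the cached template, so only the retrieval of the documentation itself remains.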
This approach is essentially the same as the ARK design,
[TBD: reference https://wiki.ucop.edu/display/Curation/ARK
or something better]
which uses as its global URI transformation appending a
'?' to the URI. The main differences are that the ARK
rule only works when the path begins 'ark:', and that the risk of
'squatting' on part of a domain owner's URI space (not all
'?'-ended URIs are for URI documentation discovery) is
somewhat higher than in the case of /.well-known/meta/,
which would be sanctioned by
The Link: HTTP header
To reduce the number of round trips relative to the 303 redirect, we might have HTTP requests that are somehow understood as signalling a request for URI documentation, as opposed to retrieval of an instance of a resource on the Web at the URI, with the documentation coming back in the HTTP response. Such a request could be distinguished from a retrieval or other request by its method, headers, and/or content.
The URIQA specification
In response to GET of a URI, a server might provide documentation for the URI directly in a non-200 response, as opposed to indirectly via a 303 redirect. (The URI documentation can't go in a successful GET response since that would mean that the URI refers to the resource on the Web at the URI.) Possibilities for HTTP response status codes that might signal this situation: 203 Non-Authoritative Information; a new 2xx status (maybe 209); a new 3xx status (maybe 309); or a variety of 4xx codes. Placing the URI documentation in the content of a redirect response (status code 301, 302, 303, and 307) is unsatisfactory as the content would not be displayed in a Web browser; the same situation might apply to any 3xx or 4xx response, making a 2xx status code the most attractive.
A range of discovery method designs involve having clients interpret parts of retrieval (HTTP GET/200 or equivalent) responses, or entire responses, as URI documentation. Depending on design details, any particular response might be treated as carrying URI documentation (or expected to do so), treated as an instance (per Convention 1), both (instance with embedded metadata), or neither.
The following illustration diagrams the case where all retrieval responses are treated as carrying URI documentation, i.e. every response is an instance of something that carries URI documentation and that is distinct from what the URI refers to.
These designs have in common that at most a single HTTP round trip is required, when discovery uses the HTTP protocol.
After surveying the design choices that have to be made, a few representative method designs are presented. The entire space of possibilities is too broad to cover here.
Designs in this space differ in important ways:
Regarding the last question, any method that conflicts with Convention 1 makes some URIs unavailable for expressing what the URIs mean according to Convention 1. There are many applications that need a method for writing a reference to the resource at an arbitrary retrieval-enabled hashless URI, including those concerned with metadata (including licensing), provenance, Web site testing, validation, text processing, text annotation, and access control. Therefore any complete discovery solution that includes a discovery method that preempts Convention 1 for any URI should include a way to write such references.
The workaround
One particular point in the design space is presented in
For discovery, do a GET requesting media type application/rdf+xml. If the result is application/rdf+xml, then assume no retrieval response is an instance of the referent (?), and assume the result carries URI documentation for the probe URI. To refer to the URI documentation, use the URI in the Content-Location: header of the response.
If there is no application/rdf+xml variant then assume the URI refers to what's on the Web at the probe URI.
When an instance is sought (application/rdf+xml not requested), and the result is application/rdf+xml, it is not clear [to the editor] how the result should be classified: as both instance and URI documentation, just an instance, or just URI documentation.
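The first two rules above might be sketched as follows, for the case where application/rdf+xml was requested. This is an illustration only: the function name classify_rdf_probe and the result labels are hypothetical, the function inspects response headers that have already been received, and the unclear case in which application/rdf+xml comes back without having been requested is deliberately not modelled.

```python
from typing import Mapping, Optional, Tuple

RDF_XML = "application/rdf+xml"

# Header a client would send when seeking URI documentation.
DISCOVERY_REQUEST_HEADERS = {"Accept": RDF_XML}


def classify_rdf_probe(
    probe_uri: str, response_headers: Mapping[str, str]
) -> Tuple[str, Optional[str]]:
    """Classify the response to a GET that requested application/rdf+xml.

    Returns ("documentation", uri_of_documentation_or_None) when the
    response is RDF/XML, otherwise ("web-resource", probe_uri).
    """
    lowered = {name.lower(): value for name, value in response_headers.items()}
    # Drop parameters such as charset before comparing media types.
    media_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if media_type == RDF_XML:
        return ("documentation", lowered.get("content-location"))
    return ("web-resource", probe_uri)


outcome = classify_rdf_probe(
    "http://example/eq018",
    {
        "Content-Type": "application/rdf+xml; charset=utf-8",
        "Content-Location": "http://example/eq018.rdf",
    },
)
```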
If one's domain of discourse mixes documents
with entities that might be their designated subjects,
then maintaining parallel properties
(see
For example, taking P = dc:creator as defined by the Dublin Core documentation, and Q = dc:creator as overloaded, the statement
could be taken to imply that P(<http://example/eq018>, "Alice") as long as it is agreed ahead of time that earthquakes don't have creators.
This manner of overloading can make correct recovery of P-relationships impossible when a designated subject is a document, so it's probably better to use a "tie breaking" rule such as
There may be better tie-breakers than this one; this is just for illustration.
All considerations that apply to the subject of a property also apply to the object, making the coercion rules that much more complex.
The following table summarizes some of the current and proposed URI documentation discovery methods, evaluating each against the desiderata stated in the introduction, as explained in the key below.
A complete discovery solution would combine methods in some way, conceivably resulting in an overall approach possessing more or fewer virtues than any of its individual constituent methods.
A table entry of '?' means that the answer depends on the details of the method design, while '~' means it depends on the interpretation of the desideratum statement (i.e. the vagueness of the desideratum statement makes it hard to say).
- | - | + | + | 0 | ~ | ~ | + | |
- | - | + | + | 1 | ~ | ~ | + | |
+ | - | ? | - | 1 | + | + | + | |
+ | + | + | + | 1 | - | + | + | |
+ | + | + | - | 2 | - | + | + | |
+ | + | + | + | 1 | + | + | + | |
+ | - | + | + | 1 | + | + | + | |
+ | - | + | + | 1+ε | + | + | + | |
+ | - | ~ | - | 1 | + | + | + | |
+ | ? | ~ | - | 2 | + | + | + | |
+ | + | ~ | - | 1 | + | + | + | |
+ | + | + | + | 1 | + | - | + | |
~ | + | + | + | 1 | + | + | - |
Refer to
This section defines terms that are used in this report. An attempt has been made to avoid gratuitous differences from the way these terms are used elsewhere, but in a few cases choice of terminology has been difficult and words with other meanings are given technical definitions. These definitions are not being proposed for general adoption.
[Draft comment: All terminology choices are provisional; for most of them I am testing the waters to see how well the word works, and am prepared to change.]