Resource discovery: limits of URIs

I'm flagging up an issue as a private individual rather than in my official capacity of Head of Information Resources Management in the European Parliament, although the issue I address has been considered in my professional work as well as my private research and work on XML implementation issues.

My concern is the mechanisms available to "translate" information on a uniquely identifiable artefact to an addressable URI. Please accept my apologies in advance if the issue is not appropriate for this list.

As an "ordinary user" I can "identify" or name a particular information artefact, a book, a document, etc. With a URL, I can address it. The URL will give me an address that usually combines details of an originating authority, a content identifier, sometimes a language version and an application format (MIME extension).

However, with the exception of the language version (which, depending on the server infrastructure, might be served according to the preferences I have set in the browser), the full URL cannot be deduced algorithmically from the content identifier. A few examples to demonstrate my concern more clearly:

- "bookmark rot": I mark a set of resources from a particular site, only to find a year later that all the references are rotten as the .htm extension has been replaced by .php throughout the site, although no single item of content has changed;
- I reference an item found via a WAP service, knowing that a more complete version of the same content is available in HTML on a parallel web site: the 'URLs' however are completely different despite referring to the same artefact;
- I copy a URL from a site, only to discover that the URL is not only assigned dynamically but is session-specific and sometimes personalised, and thus not reusable;
- I'm listening to a voice-synthesised web page that contains links to resources that are available in audio and text, but the hypertext link takes me to the text file.

In architectural terms, my concern is that more and more sites, in the absence of any clear mechanism for resolving addresses from identifiers, have increasingly complex interfaces with proprietary resolution mechanisms that render resource discovery practically impossible, except indirectly. A user should be able to indicate the minimum information that distinguishes a particular artefact uniquely (I'm not sure the URN does this, because it is still only a URI with a commitment to persistence) and not be bothered with whether it is the most recent version, which languages are available, or whether it is in pdf, html, xml or wml: the server should resolve all of this in a context-sensitive manner. The issue will become critical when XPointer starts to be used to identify resource fragments: in fact XPointer's potential weakness is precisely that the containing document may itself be poorly addressable.

My "ideal scenario" would be the replacement, in the hyperlink target data, of a URI - pointing as it does to a specific file - by a "UCI" (a "Uniform Content Identifier") that resolves to the specific components:
- a DNS entry or other service locator;
- on the server side, a URI appropriate to the client context, made up of the content identifier 'wrapped' with language, version, format and other context-specific data.
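
To make the two resolution steps concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration only: the "uci://" syntax, the host www.example.org, the content identifier "report-2001-42", the catalogue of variants and the function names are all hypothetical, and nothing like this is specified anywhere that I know of. It only shows the shape of the idea: the UCI carries a service locator and a content identifier, and the server wraps the identifier with language, format and version chosen from the client context.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Variant:
    """One concrete, addressable form of a piece of content."""
    language: str
    fmt: str
    version: int
    path: str


# Hypothetical server-side catalogue: content identifier -> available variants.
CATALOGUE = {
    "report-2001-42": [
        Variant("en", "html", 1, "/docs/en/report-2001-42.v1.html"),
        Variant("en", "html", 2, "/docs/en/report-2001-42.v2.html"),
        Variant("fr", "html", 2, "/docs/fr/report-2001-42.v2.html"),
        Variant("en", "wml", 2, "/wap/en/report-2001-42.v2.wml"),
        Variant("en", "pdf", 2, "/docs/en/report-2001-42.v2.pdf"),
    ],
}


def parse_uci(uci: str) -> Tuple[str, str]:
    """Split a (hypothetical) 'uci://authority/content-id' into its two parts."""
    prefix = "uci://"
    if not uci.startswith(prefix):
        raise ValueError("not a UCI: " + uci)
    authority, _, content_id = uci[len(prefix):].partition("/")
    if not authority or not content_id:
        raise ValueError("UCI needs an authority and a content identifier")
    return authority, content_id


def resolve(uci: str, languages: List[str], formats: List[str],
            version: Optional[int] = None) -> Optional[str]:
    """Return a URL suited to the client context, or None if nothing fits.

    'languages' and 'formats' are listed in order of client preference.
    If no version is requested, the most recent acceptable one is chosen.
    """
    authority, content_id = parse_uci(uci)
    variants = CATALOGUE.get(content_id, [])
    if version is not None:
        variants = [v for v in variants if v.version == version]
    candidates = [v for v in variants
                  if v.language in languages and v.fmt in formats]
    if not candidates:
        return None
    # Preferred language first, then preferred format, then newest version.
    best = min(candidates, key=lambda v: (languages.index(v.language),
                                          formats.index(v.fmt),
                                          -v.version))
    return "http://" + authority + best.path


if __name__ == "__main__":
    uci = "uci://www.example.org/report-2001-42"
    print(resolve(uci, ["fr", "en"], ["html"]))   # a desktop browser
    print(resolve(uci, ["en"], ["wml"]))          # a WAP client
```

With a scheme like this, the same bookmark would keep working for a WAP client and a desktop browser, and would survive a site-wide change of file extension, since only the catalogue on the server would change.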

If this sort of issue is handled elsewhere, I'd be happy to be pointed the way. However, I feel the issue goes beyond the scope of current W3C activity on addressing and is too "instance specific" to be in the realm of RDF or other semantic resource discovery issues. I believe the issue is analogous to HTTP language negotiation, and warrants similar treatment.
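
For readers less familiar with the analogy: in HTTP language negotiation the client sends a weighted list of preferences (the Accept-Language header with q-values) and the server picks the best variant it holds. The sketch below is a deliberately simplified illustration of that selection step in Python. It is not a complete implementation of the HTTP rules: wildcard ("*") and language-range prefix matching are omitted, and the function names are my own. The same weighted-preference pattern could, I assume, be applied to format and version as well as language.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse e.g. 'fr, en;q=0.8' into [(value, q), ...], highest q first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        value, _, params = item.partition(";")
        q = 1.0  # a missing q parameter means full preference
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((value.strip().lower(), q))
    # sorted() is stable, so the order in the header breaks ties between equal weights.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header: str, available: List[str]) -> Optional[str]:
    """Return the client's most preferred value that the server can supply."""
    for value, q in parse_accept(header):
        if q > 0 and value in available:
            return value
    return None


if __name__ == "__main__":
    # The client prefers English, then French, then German;
    # the server only holds German and French versions.
    print(negotiate("fr;q=0.8, en, de;q=0.5", ["de", "fr"]))
```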

Peter

Received on Sunday, 16 December 2001 21:13:54 UTC