SkosDev/IdentityCrisis

From W3C Wiki

Working Around the Identity Crisis

Let me try to sum up what I understand as the 'identity crisis' in as few words as possible:

If you use an HTTP URI as an identifier for something other than a web document (such as an abstract concept) then you can run into problems. The problems arise when there just happens to be a web document at the end of that same URI, which you find when you plug that URI into the address bar of your favourite browser.

The problem is that you can have one URI identifying two distinct things. Obviously a unique identifier isn't much good if it doesn't uniquely identify anything.

Let me exemplify the problem.

Let's say I think up a concept, a concept which is important to my work, and I give that concept the URI <http://foo.org/bananas>. Of course I then want to publish an RDF description of that concept, so that other people can use it in their semantic web applications. Such an RDF description might look something like:


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://foo.org/bananas">
    <skos:prefLabel>bananas</skos:prefLabel>
    <skos:definition>A soft yellow fruit.</skos:definition>
  </skos:Concept>

</rdf:RDF>


Now let's say I create an HTML document, which is all about bananas, and I put that document at the URL <http://foo.org/bananas>. I might want to also publish metadata about that document, which could look something like:


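<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">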

  <foaf:Document rdf:about="http://foo.org/bananas">
    <dc:creator>
      <foaf:Person>
        <foaf:name>Al Miles</foaf:name>
        <foaf:mbox rdf:resource="mailto:a.j.miles@rl.ac.uk"/>
      </foaf:Person>
    </dc:creator>
    <dc:description>A web page all about bananas.</dc:description>
  </foaf:Document>

</rdf:RDF>


If these two RDF descriptions then got merged, I would end up with an RDF description of a horrible chimera, something akin to what happened to Seth Brundle at the end of 'The Fly'. Nasty.
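To make the chimera concrete, here is a rough sketch of the merge. It assumes a deliberately simplified model in which an RDF graph is just a set of (subject, predicate, object) tuples and prefixed names are plain strings; the dc:creator blank node is left out for brevity, and the helper name types_of is made up for the illustration.

```python
# Simplified model (an assumption for this sketch): an RDF graph is a set
# of (subject, predicate, object) tuples, with prefixed names as strings.
URI = "http://foo.org/bananas"
RDF_TYPE = "rdf:type"

# Triples from the description of the abstract concept.
concept_graph = {
    (URI, RDF_TYPE, "skos:Concept"),
    (URI, "skos:prefLabel", "bananas"),
    (URI, "skos:definition", "A soft yellow fruit."),
}

# Triples from the metadata about the web page (dc:creator omitted).
document_graph = {
    (URI, RDF_TYPE, "foaf:Document"),
    (URI, "dc:description", "A web page all about bananas."),
}


def types_of(graph, subject):
    """Return the sorted rdf:type values asserted for a subject."""
    return sorted(o for s, p, o in graph if s == subject and p == RDF_TYPE)


# Merging two RDF graphs amounts to taking the union of their triples.
merged = concept_graph | document_graph

# The one URI is now typed as both a concept and a document.
print(types_of(merged, URI))  # ['foaf:Document', 'skos:Concept']
```

Nothing in the merge itself can tell which statements were meant for the concept and which for the page, because both sets hang off the same subject URI.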

The problem gets worse. Some people in the Semantic Web community actually go so far as to say that it is good practice to have an HTTP URI that is being used as an identifier for an abstract concept resolve to a human-readable representation of that concept (i.e. a web document). In other words, they seem to be actually encouraging duplicitous identification AND experiments in organic teleportation. What are they thinking?

There is a sound idea behind this practice. If you're going to use a URI as an identifier for an abstract concept, it's helpful if that URI also happens to resolve to a readable representation of that concept, so that other folks wanting to use that URI can find out what it is supposed to stand for. But if you do this, don't you run the risk of ending up like Brundlefly?

Well, I don't think so.

So you have an abstract concept that you want to describe. You give it an HTTP URI. Then you set up your web server so that this URI resolves *indirectly* to a representation of that concept.

For example, you give your concept the URI <http://foo.org/bananas> and then you put a web page describing your concept at <http://foo.org/bananas.html>. Most web servers these days can be configured to redirect you automatically from <http://foo.org/bananas> to <http://foo.org/bananas.html>. So you get what you want with regard to URI resolution, and you have different URIs for the abstract concept and for the web document. Beam me up, Scotty.

In other words, the central tenet here is that if your concept URI does happen to resolve to some sort of web document, it should do so via some URL redirection or URL rewriting at the server. Most modern web servers have quite powerful mechanisms for doing this sort of thing. This means that all of the web documents involved in resolution transactions will also have their own different, direct URLs (and hence URIs) by which they can be uniquely and unambiguously referred to. Crisis averted.
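The indirect resolution described above might be sketched along the following lines, using only Python's standard library. This is an illustration under assumptions, not a recommended server setup: the lookup tables, the resolve function and the ConceptHandler class are hypothetical names, the page content is held in memory, and the choice of a 303 See Other status is mine (the text above only asks for a redirect).

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping from concept URI paths to the documents describing them.
CONCEPT_TO_DOCUMENT = {"/bananas": "/bananas.html"}

# Hypothetical page content, kept in memory for the sake of the sketch.
DOCUMENTS = {
    "/bananas.html": b"<html><body><h1>Bananas</h1>"
                     b"<p>A soft yellow fruit.</p></body></html>",
}


def resolve(path):
    """Return (status, headers, body) for a GET request on `path`."""
    if path in CONCEPT_TO_DOCUMENT:
        # The concept URI never serves a document itself; it only points
        # at one. 303 See Other is an assumption made for this sketch.
        return 303, {"Location": CONCEPT_TO_DOCUMENT[path]}, b""
    if path in DOCUMENTS:
        # The document has its own, direct URL.
        return 200, {"Content-Type": "text/html"}, DOCUMENTS[path]
    return 404, {"Content-Type": "text/plain"}, b"Not found"


class ConceptHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around resolve()."""

    def do_GET(self):
        status, headers, body = resolve(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler can be passed to http.server.HTTPServer, for instance bound to ("localhost", 8000). A request for /bananas then yields only a pointer to /bananas.html, so the concept and the page keep separate URIs.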

I would actually go further and support the notion that it is best practice for anyone using an HTTP URI as an identifier for an abstract concept to set up their web server so that the URI resolves (indirectly, of course) to a *content-negotiable representation* of that concept. What this means is that, if a web client requests an HTML representation of the URI (via the 'Accept' header in the HTTP request) that's what they get, and if they request an RDF/XML representation of that URI, well they get that instead. This sort of thing is a bit harder to set up, but is also quite possible with modern web servers.
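As a rough sketch of how such negotiation could work, the function below picks a redirect target from the client's Accept header. Several things here are assumptions made for illustration: the /bananas.rdf path, the names REPRESENTATIONS, parse_accept and negotiate, the 303 status, and the fallback to HTML when nothing matches (a stricter server might answer 406 Not Acceptable instead). Partial wildcards such as text/* are not handled.

```python
# Hypothetical representations of the concept, keyed by media type.
REPRESENTATIONS = {
    "text/html": "/bananas.html",
    "application/rdf+xml": "/bananas.rdf",
}
DEFAULT_TYPE = "text/html"  # assumed fallback when nothing matches


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client gives no weight
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((media_type, q))
    # sorted() is stable, so equal weights keep the client's own order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate(accept_header):
    """Return (status, headers) redirecting to the best representation."""
    chosen = DEFAULT_TYPE
    for media_type, q in parse_accept(accept_header or ""):
        if q <= 0:
            continue
        if media_type in REPRESENTATIONS:
            chosen = media_type
            break
        if media_type == "*/*":
            break  # the client accepts anything; keep the default
    return 303, {"Location": REPRESENTATIONS[chosen], "Vary": "Accept"}
```

A browser sending "text/html,application/xhtml+xml,*/*;q=0.8" would be pointed at the HTML page, while a semantic web client sending "application/rdf+xml" would be pointed at the RDF/XML description. Either way the concept URI itself stays distinct from both documents.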

But as a baseline, I think it is quite reasonable to use an HTTP URI as an identifier for something other than a web document. If the URI doesn't resolve to anything at all, that's fine. If it does happen to resolve to some sort of helpful web document, well that's fine too. And if you do decide to do the latter, to avoid the unpleasant side-effects of leaving the teleport pod door open at the wrong moment, make sure that the resolution is indirect.

And they all lived happily ever after. Except for Seth Brundle. Well, he should have known better. Let that be a lesson to you ;)

By Al Miles.