HCLSIG BioRDF Subgroup/LSID URN URI

From W3C Wiki
HCLS Home Home Discussions

LSID URN/URI Notes

Contributed by Sean Martin, IBM Corp, Cambridge, MA, June 30 2006. Forwarded by Eric Neumann.

Hello All, On last week's BioRDF call, Eric Neumann asked me to post information here to help everyone follow the discussion of LSIDs (Life Science Identifiers) as URIs. So here goes.

Firstly let me point you to two articles I was co-author on, with the suggestion that you read them in the order listed.

Clark T, Martin S, Liefeld T. [1] Globally distributed object identification for biological knowledgebases. Brief Bioinform. 2004 Mar;5(1):59-70. PMID: 15153306

Martin S, Hohman MM, Liefeld T. [2] The impact of Life Science Identifier on informatics data. Drug Discov Today. 2005 Nov 15;10(22):1566-72. PMID: 16257380 (not open access)

Together these provide nearly the entire LSID story to date. They include the motivation for the creation of the LSID standard, an explanation of the syntax, a description of how the underlying protocol actually works, and a discussion of how LSID naming can be retroactively applied without much difficulty to information sources already online as well as new ones. The second article talks about early adopters of the specification and what they are actually doing with LSIDs. It tackles some of the common misconceptions and concerns, and concludes with a list of problems and some suggestions for improvements to the current specification.

For those who enjoy such activities, the full specification of the Life Science Identifier is publicly available for your reading pleasure from the Object Management Group [3], but if you have read the above two articles, you are already well enough armed to enter the debate. As I mentioned in an earlier posting, there is quite a useful description in Section 13.3, Page 26 (page 32 in the pdf file) of the spec. It walks through, in a relatively human-readable, step-by-step example, how the resolution protocol actually works to decouple the LSID name from the network location of the digital object named.

Next, assuming you have by now read the two articles listed above, let me try to add a little information which they do not cover around the issues concerning URLs as Life Science URIs that directly led to the creation of the LSID URN.

It is certainly true that the DNS system and the semantics of file system paths ensure that by using a URL as a URI you get an easy means to produce a globally unique name. The problems begin if you either want to do more than create just a name in the “ether” and actually use the URL to uniquely name existing binary data objects, or if you are tempted to do what is natural with a URL, which is to put it into a web browser and dereference it to something you or your program can look at.

Obviously a URL can do some measure of both of these things, and at first glance it might seem that they are perhaps even the same thing. But this is where it starts to get tricky. The root of the problem is that the URL contains more than just a name. It also contains the network location where the only copy of the named object can be found (the hostname or IP address) as well as the only means by which one may retrieve it (the protocol, usually http, https or ftp). The first question to ask yourself is this: when you are uniquely naming (in all of space and time!) a file or digital object which will be usefully copied far and wide, does it make sense to include as an integral part of that name the only protocol by which it can ever be accessed and the only place where one can find that copy? Furthermore, does it make sense to use as part of the name a DNS hostname which may easily be transferred to a new owner if the underlying DNS domain name changes hands? In a system where the resolution of the URI to a copy of the object it names has no layers of indirection, one becomes entirely reliant on the issuer of that name, or their successors in interest, to continue to provide service for it. One only has to have observed the web for a few years to understand how brittle it is, both in terms of objects moving or being taken offline (the dreaded 404 HTTP error) and in terms of a domain name being passed to a new concern that has a different set of objectives. Schemes like PURL [4] exist to combat both these problems to some extent. For a general discussion of how successful this is, I refer you to the DOI Handbook, section 3.10 [5], which details a number of arguments comparing the DOI scheme to PURL, many of which would apply equally to the LSID scheme.
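To make the point about a URL carrying more than a name concrete, here is a minimal Python sketch that splits a URL into the three things it bundles together. The URL itself and the dictionary keys (`protocol`, `location`, `local_name`) are assumptions chosen for illustration; only `urllib.parse.urlsplit` comes from the standard library.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the access protocol, the network location,
    and the part that is a name local to that one host."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,     # the only access method the name admits
        "location": parts.hostname,   # the only host that can serve it
        "local_name": parts.path,     # meaningful only on that host
    }


# Hypothetical URL used purely as an illustration.
example = url_components("http://data.example.org/images/gel-0042.png")
print(example)
```

If the host is renamed or the data moves to a service reached over another protocol, two of the three components change, and so does the "name".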

To add a little to those discussions and to make it more specific to Life Sciences, please consider the following. One problem that the PURL scheme does not overcome is the requirement that it should be possible for the named digital object to be available from multiple locations that copy the original, possibly long after the original is no longer available from the original source. In fact the LSID scheme goes one further, as it provides a standard method for getting not only a copy of the named object from multiple locations using multiple protocols, but also methods for retrieving and combining metadata from _multiple_ different sources, all keyed off the same URI. Another problem with PURLs is that they cannot be applied retroactively. You need to use PURL identifiers up front. This is a problem when you want to identify objects internally in private for a while, but then after a year or two would like to expose them externally without changing their names. Does it make sense to create an external redirection reference for every image your research produces? To be consistent you would then need to use and dereference that image via the PURL service every time you referred to it, unless you are happy to deal with the complexities of having both an internal and an external permanent name for the same object. This leads me to the potential issues of scale and reliability in the face of the extraordinarily large number of identifiers that will exist in the Life Sciences domain alone. Given the extremely successful, highly distributed nature of the WWW, when does it make sense to use a scheme which relies entirely on a single centralized redirection service for both registration and resolution?
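The idea of one name being served from several places can be illustrated with a small Python sketch. This is not the LSID resolution protocol, only an illustration of the decoupling it aims for; the identifier, the function `fetch_first_available`, and the two stand-in sources are all hypothetical.

```python
from typing import Callable, Iterable, Optional

Source = Callable[[str], Optional[bytes]]


def fetch_first_available(name: str, sources: Iterable[Source]) -> bytes:
    """Try each source in turn and return the first copy found.

    The name stays the same whichever source answers."""
    for source in sources:
        try:
            data = source(name)
        except OSError:
            # Source unreachable (ConnectionError is an OSError): try the next.
            continue
        if data is not None:
            return data
    raise LookupError(f"no source could supply {name!r}")


def offline_origin(name: str) -> Optional[bytes]:
    # Stand-in for the original publisher, which has gone offline.
    raise ConnectionError("original host is gone")


def mirror(name: str) -> Optional[bytes]:
    # Stand-in for a third-party mirror holding a copy.
    return b"copy of " + name.encode()


result = fetch_first_available(
    "urn:lsid:example.org:images:42", [offline_origin, mirror]
)
print(result)
```

Because the identifier says nothing about host or protocol, a new source (for example one using a different transport) can be added to the list without renaming anything.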

The notion that the name should be independent of the means used to get a copy of the object is also important. As more sophisticated transport protocols are introduced, they can easily be included as alternative or even primary protocols for access to the data. A community might, for example, want to take advantage of the relatively recently introduced BitTorrent [6] scheme, a P2P protocol optimized for quickly sharing large binary objects across a network.

Another serious concern regarding using URLs to name digital objects is the question of “what is actually named?” In the Life Science research process it is frequently necessary to reproduce the results of another group's experiments. To do this successfully one needs to be certain one is using exactly the same inputs used by the original experimenter. Similarly, when basing new research on the work of another it is important to know one is actually using the exact outputs of the earlier research. “In silico” experimentation requires absolute precision. Unfortunately, when it comes to URLs there is no way to know, simply by looking at the URL string, that what is served one day will be served the next. There is no social convention or technical contract to support the behavior that would be required. Indeed the URL concept has been so extremely successful precisely because it was allowed to conflate the original document access methods with remote procedure calls (RPCs) when the CGI interface [7] was first introduced in 1993. The introductions of XML-RPC and Web Services have cemented this confusion. One type of URL response may be happily cached, perhaps forever; the other type probably should not be. But to a program the two URLs look the same, and without recourse to an error-prone set of heuristics it is extremely difficult, perhaps impossible, to tell the difference programmatically. Given that what we are designing is meant to be a machine-readable web, it is vital to know which URLs behave the way we want and which do not. Can one programmatically tell the difference without actually accessing them? Can one programmatically tell the difference even after accessing them? A serious follow-up problem with using URLs as names is that they have no inherent versioning scheme, which makes it hard for machines (and people!) to know when revisions are made to previously named data.

The general URN [8] scheme was devised for naming resources, and there is a process for registering new URN schemes. As you will know by now if you have read the first listed article, there are also standards in place for dereferencing URNs which avoid the issues related to URLs discussed above. The Life Science Identifier (LSID) specification came about because a consortium of interested parties from the Life Science domain chose to take advantage of these pre-existing standards and specifications to create an identifier that met their needs. URNs are URIs.
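As a rough illustration of what an LSID looks like to a program, the sketch below splits one into its colon-separated parts, assuming the usual form `urn:lsid:authority:namespace:object` with an optional trailing revision. The class and function names are hypothetical, and the parser is deliberately simpler than the full rules in the specification; the second identifier is made up to show a revision.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID into its parts; raise ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    # The 'urn' and 'lsid' prefixes are treated as case-insensitive.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


pubmed = parse_lsid(
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:15153306"
)
# Hypothetical identifier carrying a revision component.
revised = parse_lsid("urn:lsid:example.org:proteins:P12345:2")
print(pubmed)
print(revised)
```

Note that nothing in the parsed name states a protocol or a host to contact, and that a revision, where present, is part of the name itself.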

As I mentioned at the end of the last conference call, the one-size-fits-all approach may be too limited, and it is my belief that we could well end up both needing and having to accommodate more types of standards-based URIs than we currently know about. Some will be URL based like Dublin Core [9], some will have a URL representation like DOIs, and some will be URN schemes like the LSID. Each will arise to serve a particular set of needs for a particular community and will have its own social and technical “contracts” that make more tractable the problems of making the data and concepts named both machine accessible and machine readable. The more successful of these will have more software written to support their specific peculiarities and features; some will persist and some will be a passing phase.

Lee Feigenbaum recently pointed out to me that the job of Semantic Web groups like ours should be to both expect and embrace this diversity in our decisions and recommendations, as it will help the Semantic Web grow in size and usefulness. One thing that might be done to accommodate this proliferation of URI types is to work towards a common interface to them through URL gateways, and to reach consensus on a “lowest common denominator” set of properties that one should expect from these gateways. This would mean that important common tools like web and data-web browsers and distributed SPARQL query engines would work across as wide a set of base information as possible. Both the DOI [10] and LSID [11] schemes have such gateways. For example, here [12] is an LSID-based URL link, using an LSID web gateway, to the RDF metadata of one of the articles I recommended at the start of this post: http://lsid.biopathways.org/resolver/urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:15153306
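The gateway link above follows a simple pattern: the gateway's resolver prefix followed by the LSID. A minimal Python sketch of that pattern is shown here; the function name `gateway_url` is hypothetical, the prefix is taken from the example link, and the sketch only builds the string without checking whether any gateway answers at that address.

```python
# Resolver prefix taken from the example link in the text above.
GATEWAY = "http://lsid.biopathways.org/resolver/"


def gateway_url(lsid: str, gateway: str = GATEWAY) -> str:
    """Build a dereferenceable URL by prefixing an LSID with a gateway."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return gateway.rstrip("/") + "/" + lsid


link = gateway_url(
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:15153306"
)
print(link)
```

The LSID stays the permanent name; the gateway prefix is a replaceable convenience, so a different gateway can be swapped in through the `gateway` argument without touching the identifier.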

No doubt I have left out much that I should have mentioned here, so I reserve the right to a follow-up post or two as I remember or people remind me about areas that require your consideration.

Kindest regards, Sean

-- Sean Martin IBM Corp Cambridge, MA
