HCLSIG BioRDF Subgroup/Meetings/2006-06-19 Conference Call

Conference Details

Date of Call: Monday June 19, 2006
Time of Call: 11:00am Eastern Time
Dial-In #: +1.617.761.6200 (Cambridge, MA)
Participant Access Code: 246733 ("BIORDF")
IRC Channel: irc.w3.org port 6665 channel #BioRDF (see W3C IRC page for details, or see Web IRC)
Duration: ~1 hour
Convener: Susie Stephens
Scribe:

Agenda

Task Overviews (Kei Cheung (http://esw.w3.org/topic/SenselabUsecase2), Alan Ruttenberg)
URI Discussion (Kathy Kwan will be participating)
AOB

Attendees

Olivier Bodenreider, Satya Sahoo, Chimezie Ogbuji, Andy Seaborne, Sean Martin, Don Doherty, John Barkley, Kathy Kwan, Andy Law, Bill Bug, Alan Ruttenberg, Scott Marshall, Eric Neumann, Kei Cheung, Helen Chen, Susie Stephens.

Task Overviews

Kei Cheung – Senselab Task

Converted SenseLab data from a custom XML format (EDSP) into RDF, and loaded it into the Oracle RDF Data Model. Loaded the SWAN RDF data set into the Oracle RDF Data Model. Performed queries that span both data sets. The queries helped to identify papers of interest on Alzheimer’s disease. An update of the SenseLab task is available at: http://esw.w3.org/topic/SenselabUsecase2

Alan Ruttenberg – Reagents Task

Has been focusing on improving skills for resolving names for genes and proteins. Looking into adding additional assay information (e.g. cells). Will have a new release in a few weeks. Has been working with Elizabeth Wu regarding the licensing issues.

URI Discussion

Alan R. – Provided a review of the email that he wrote on June 16, 2006 about the 3 different aspects of URIs that people commonly discuss [1].

1. The relationship between the use of a URI in a representation and what it dereferences to, if anything.

2. What a URI refers to.

3. Social aspects of URIs, i.e. those processes we go through to come to a shared use of URI.

Most things fall into the social category, as we mainly talk about things rather than concepts in the real world. This is an important area to focus on.

There are definitely challenges, e.g. versioning.

Sean M. – LSIDs are good for actual documents and services.

Bill B. – Alan’s overview was good. DOIs are used in publishing. This is a good example of another domain that has similar issues. The crossref project in particular may have similar goals (http://crossref.org/), although different technology is used to resolve the URLs. They have infrastructure for disambiguating entities, and for determining who can see which papers.

DOI addresses things that stay the same, but move. In the life sciences, things tend to change, but stay in the same place. DOI doesn’t handle this well. There are other examples in life sciences, e.g. GO, BioPAX, LSID.

Alan R. – We are focused on talking about technology, rather than meaning.

Bill B. – LSID is important.

Eric N. – The comparison to DOI is interesting, but there are some differences. For example, in life sciences the data tends to change but the location doesn’t, and versioning is important. In life sciences concept mapping is at the crux of the matter. For example, a protein record projects to a physical concept. If this becomes accepted, then a particular version of a document, or the particular instance used, could all point to a common URI.

Bill B. – This is not unique to the life sciences. There have been 30 years of work in this area. Concept mapping is at the crux of the issue. Versioning is also important. All annotations need reviewing if things change. Rigid concept maps can be a large resource drain.

The intelligence and law enforcement communities have done work in this area too, but not using Semantic Web technologies.

Sean M. – LSIDs took a lot of work at I3C. LSID today is a compromise. Weight was given to all of the areas that have been discussed today. For example, some people were focused on images and gene sequences needing versioning. Other people were interested in naming concepts. Technical people agree that the most efficient way to get data is from many different sources. Everything that people are talking about today was considered when LSID was developed. There isn’t one answer for everything. LSID will be good for a particular set of use cases.

TimBL doesn’t like LSID, as he believes that it will dilute HTTP URI. However, we weren’t able to address everything with HTTP URI. If we could have, we would have just used it. Different kinds of URIs need to be accepted. Around each kind of URI, we need to gather technical and social use cases. In some cases we will want versioning, sometimes we will want concepts, sometimes we’ll want metadata. A URI is very ambiguous and it’s not possible to programmatically recognize what it means.

Bill B. – Lots has changed since ’99. Has anything changed to make LSIDs more fitting?

Sean M. – People with strong use cases can bend it a little. OMG isn’t at the center of semantic stuff. LSID and semantic web have become coupled in people’s minds. The standard currently leaves metadata open. There has been too much debate regarding format, never mind content.

LSIDs can name concepts. Notion of concept of gene with metadata. Metadata points to gene in context. PDB names proteins using LSID. But when you dereference it, what do you want? For example, which version, a sequence or an image?
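
As a rough illustration of the versioning question raised here, the sketch below splits an LSID into its parts, assuming the usual URN form urn:lsid:authority:namespace:object[:revision]. The parse_lsid helper and the example.org identifiers are hypothetical and were not discussed on the call.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the LSID names no specific version


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into its parts; raise ValueError if it is malformed."""
    parts = text.split(":")
    # Expected fields: urn, lsid, authority, namespace, object, optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifiers: the same object with and without a revision.
versioned = parse_lsid("urn:lsid:example.org:proteins:P04637:2")
unversioned = parse_lsid("urn:lsid:example.org:proteins:P04637")
```

The unversioned form leaves open which revision a resolver should return, which is the ambiguity described above.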

Many URLs already exist. Can we supply guidelines? For example, that an entity will be stable for a couple of years?

Chimezie O. – How do we link RDF through concepts? It’s easier to link instance data to URI. I’ve posted some information along these lines on the Wiki.

Sean M. – Use cases are important, as are the technical specifications and the social contract. There are industry efforts that we can learn from, e.g. BioPAX, GO.

Alan R. – It’s a lot safer if we talk about a record. Actual genes in people’s bodies change over time, and differ between individuals. Uniprot and NCBI protein use similar ontologies, and OWL can help here. Versioning is also important as database records change with evolving knowledge.

Scott M. – If the ontology changes, then we need to version the OWL ontology. Genes can refer to a document that then refers to a protein. This approach may also apply to measurements, for example, Northern. Granularity of the URI is important, as people might want to refer to 15 or 20 different things.

Bill B. – This leads to rigid XML data.

Sean M. – That’s true. LSID access to data by linking to an ontology is less unusual than it was. The technicalities and marshalling issues are out of the way. LSIDs weren’t designed for just one database, they were designed for all databases. The more LSIDs are used, the more value they provide.

It’s not possible to warehouse all data, it’s important to be able to point to additional objects. It’s important to be able to access metadata, and to be able to distinguish between data and metadata.

Bill B. – Once terms have been disambiguated, we need to be able to dig down to figure out what data we want. Data sources are growing like wildfire.

Sean M. – Need a minimal ontology that data conforms to.

Bill B. – There’s a whole field of study that focuses on mediation, rather than warehousing. Lots of people are using the federated approach.

Eric N. – We possibly need to have an additional task that focuses on URIs. We need to go back to the organizations that manage the life sciences data with guidelines for the use of URIs.

Susie S. – The next call will focus on learning more about LSIDs, and finding out which use cases they work well with. We’ll also examine where they are able to offer benefits, if any, over HTTP GET.

[1] Alan Ruttenberg’s mail on URIs.

There was a discussion a few weeks ago about URIs that touched on various issues. This message is an attempt to untangle them, something I said I would write up as an action item in one of the HCLS conference calls. We'll be discussing URIs at the Monday BioRDF conference call.

As I read the discussion I partitioned it into three distinct issues:

1) The relationship between the use of a URI in a representation and what it dereferences to, if anything. The possibilities seem to be:

a) The identifier is not intended to be dereferenceable. In that case the info: scheme was suggested for the form of the URI, as that is explicitly not dereferenceable.

b) The URI is used primarily as a name. Insofar as we want to use names, it is important there be some stable URIs. Of course it doesn't hurt if the URI becomes dereferenceable at some point, and it would even be nice, so let's leave open that possibility (but see caveats in the discussion below).

c) Any URL we use needs to be able to be dereferenced to something.

d) Any URL we use needs to be able to be dereferenced to the thing it is (and not dereferenced if you can't do that). Its only meaning is what it dereferences to.

2) What a URI refers to. Some of this conversation was made in the form of a discussion about what reasonable arguments to owl:sameAs are - for example, should one say that http://www.expasy.org/uniprot/P04637 is the sameAs http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537?

Another part of the conversation talked in terms of whether the URI http://www.expasy.org/uniprot/P04637 should, for our purposes, refer to a database record or to a thing in the world - Human P53 proteins.

Of course these are two sides of the same coin - you would only say the two URIs above were sameAs if they referred to things in the world. As database entries, they are obviously different. There are different fields, they are maintained by different people, etc.

3) Something I will call the social aspect of URIs, for lack of a better term. By this I mean those processes we go through to come to a shared use of a URI. Under this category there is the ontology building, the strategies for connecting pieces of information generated by different groups. There was a bit in the conversations where people were arguing about whether using sameAs for mapping was pollution or a necessity, for instance. An important part of this in our context is how to define the use of URLs to things where there was not rigorous ontological engineering applied to create careful definitions, things like terminologies and entries in gene databases.
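
To make the sameAs concern in issue 2 concrete, here is a minimal sketch of what merging the two URIs under owl:sameAs would do to record-level statements. The ex:maintainedBy property, the group names, and the merge_same_as helper are hypothetical; only the two URIs come from the discussion.

```python
UNIPROT = "http://www.expasy.org/uniprot/P04637"
NCBI = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
        "efetch.fcgi?db=protein&id=NP_000537")
SAME_AS = "owl:sameAs"

# Hypothetical statements that are true of each database record separately.
triples = {
    (UNIPROT, "ex:maintainedBy", "ex:GroupA"),
    (NCBI, "ex:maintainedBy", "ex:GroupB"),
    (UNIPROT, SAME_AS, NCBI),
}


def merge_same_as(statements):
    """Group nodes linked by owl:sameAs and pool their property values."""
    parent = {}

    def find(node):
        # Follow parent links to the representative of the node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for s, p, o in statements:
        if p == SAME_AS:
            a, b = find(s), find(o)
            if a != b:
                # The lexically smallest URI becomes the representative.
                parent[max(a, b)] = min(a, b)

    merged = {}
    for s, p, o in statements:
        if p != SAME_AS:
            merged.setdefault(find(s), set()).add((p, find(o)))
    return merged


merged = merge_same_as(triples)
maintainers = {o for props in merged.values()
               for p, o in props if p == "ex:maintainedBy"}
```

After merging there is a single node with both maintainers, which is wrong if the URIs name database records and harmless only if both name the same thing in the world.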

---

I'll offer some of my own opinions on these issues now.

On the matter of what a URI dereferences to, I think it is more important to get the names in place quickly. I don't agree with the point of view that we should explicitly make them not dereferenceable, even though I'm not sure what should come back when we ask for what they point to yet. And I don't see support for there being a necessity that anything that looks like a URL have a server that returns something specific back. Here's a quote from RFC 3986,

> Although many URI schemes are named after protocols, this does not imply that use of these URIs will result in access to the resource via the named protocol. URIs are often used simply for the sake of identification.

It will be part of our social process to come to some understanding and agreement about what would be useful for us to have come back, if anything. Is it an RDF graph? A bunch of OWL definitions of things related to the gene? A representation of the asn record? A page of HTML? All of the above?
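
As a small illustration of the point that a URI can identify without implying access, the sketch below inspects only the scheme of a URI and never contacts a server. The set of "retrievable" schemes, the describe helper, and the info: and urn: example identifiers are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumption for illustration: schemes that are named after a retrieval protocol.
RETRIEVABLE_SCHEMES = {"http", "https", "ftp"}


def describe(uri: str) -> str:
    """Classify a URI by scheme alone; no network request is made."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme == "info":
        # The info: scheme is for identification only.
        return "name-only"
    if scheme in RETRIEVABLE_SCHEMES:
        # Still just a name; a server may or may not answer for it.
        return "may-dereference"
    return "scheme-specific"


kinds = {
    uri: describe(uri)
    for uri in (
        "info:example/123",                       # hypothetical identifier
        "http://www.expasy.org/uniprot/P04637",   # from the discussion above
        "urn:lsid:example.org:proteins:P04637",   # hypothetical LSID
    )
}
```

Even for the http case the classification says only that retrieval is possible, not what, if anything, should come back.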

On the question of what kind of concept an entrez gene URI refers to, I think that concept needs to be "databaseRecord". There are too many different concepts that it could mean if we want it to refer to something in the world - does it refer to the sequence of the gene? The typical gene? All mutations of it that are found in populations? The possible gene products?

Rather, we can use the URI to the database entry to start to build concepts by defining properties and using them in OWL class definitions in a variety of ways. In foaf and SKOS, for instance, there is a property isPrimarySubjectOf. The kind of equivalence we can have between http://www.expasy.org/uniprot/P04637 and http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537 is something like: the same something isPrimarySubjectOf http://www.expasy.org/uniprot/P04637 and http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537, where "something" is a blank node in RDF. Or in OWL:

Class(P53Gene complete restriction(isPrimarySubjectOf value(<http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537>)))

Class(P53Transcript partial intersectionOf(mRNA restriction (derivesFrom someValuesFrom(P53Gene))))

Which says that it is necessary and sufficient for x to be a P53Gene, for example, if someone has stated or it has been inferred that

Individual(x value(isPrimarySubjectOf <http://www.expasy.org/uniprot/P04637>))

and that a P53 transcript, among other things, is a mRNA that derivesFrom some P53Gene.
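
The difference between the "complete" and "partial" definitions can be sketched with a toy classifier over triples. This is only an illustration, not an OWL reasoner: it uses the NCBI URL from the Class definition as the restriction value, and the inferred_types helper and the ex: individuals are hypothetical.

```python
UNIPROT = "http://www.expasy.org/uniprot/P04637"
NCBI = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
        "efetch.fcgi?db=protein&id=NP_000537")

triples = {
    # The blank node "something" is the primary subject of both records.
    ("_:something", "isPrimarySubjectOf", UNIPROT),
    ("_:something", "isPrimarySubjectOf", NCBI),
    # Hypothetical individuals used to contrast the two kinds of definition.
    ("ex:t1", "rdf:type", "P53Transcript"),
    ("ex:m1", "rdf:type", "mRNA"),
    ("ex:m1", "derivesFrom", "_:something"),
}


def inferred_types(individual, statements):
    """Return asserted types plus those that follow from the two definitions."""
    types = {o for s, p, o in statements
             if s == individual and p == "rdf:type"}
    # "complete": having the value is sufficient to be classified as a P53Gene.
    if (individual, "isPrimarySubjectOf", NCBI) in statements:
        types.add("P53Gene")
    # "partial": conditions are only necessary, so a P53Transcript is an mRNA,
    # but an mRNA is never classified as a P53Transcript from its properties.
    if "P53Transcript" in types:
        types.add("mRNA")
    return types


gene_types = inferred_types("_:something", triples)
transcript_types = inferred_types("ex:t1", triples)
mrna_types = inferred_types("ex:m1", triples)
```

Here the blank node is recognized as a P53Gene, the asserted transcript is also an mRNA, and ex:m1 stays a plain mRNA even though it derives from the gene.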

(there will be more complicated definitions too :)

[sameAs, equivalentClass, equivalentProperty will be a necessity, I think, BTW]

As for the social process, I look forward to the discussion on Monday :)

Regards, Alan

http://www.w3.org/TR/uri-clarification/
Uniform Resource Identifier (URI): Generic Syntax - http://tools.ietf.org/html/3986
Relations in biomedical ontologies - http://genomebiology.com/2005/6/5/R46
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://en.wikipedia.org/wiki/URL