HCLSIG BioRDF Subgroup/Meetings/2006-07-17 Conference Call

Conference Details

Date of Call: Monday July 17, 2006
Time of Call: 11:00am Eastern Time
Dial-In #: +1.617.761.6200 (Cambridge, MA)
Participant Access Code: 246733 ("BIORDF")
IRC Channel: irc.w3.org port 6665 channel #BioRDF (see W3C IRC page for details, or see Web IRC)
Duration: ~1 hour
Convener: Susie Stephens
Scribe: Joanne Luciano and Susie Stephens

Participants: Tim Clark, Elizabeth Wu, Bill Bug, Chimezie Ogbuji, Olivier Bodenreider, Kerstin Forsberg, Scott Marshall, Brian Osborne, Alan Ruttenberg, John Barkley, Joanne Luciano, Sean Martin, Francois Belleau, Susie Stephens.
Apologies from Kei Cheung, Davide Zaccagnini

Introductions

Francois Belleau (FB) - Masters in Bioinformatics at Quebec – Focused on applying the Semantic Web to build useful knowledge bases. The knowledge base includes UniProt and NCBI data sources. The URL is to be publicly announced shortly, but we can have an early preview (http://bio2rdf.org/). Will have a PhD student for RDF design.

Kerstin Forsberg (KF) – Clinical Information, AstraZeneca – Focus on Semantic Web since late 90s. Wrote a paper on using RDF in corporate areas, and is now moving into life sciences.

Task Overviews

Brian Osborne (BO) - Reported on his task, which uses Semantic Web techniques to find small molecules that bind to proteins. (This is relevant to Francois's work on querying and integrating disparate data sets.) SMID is a small molecule interaction database. SMID is available in a relational representation.

Starting with a protein identifier or name, there is no easy way to find the small molecules that bind to specific proteins. No databases can currently do this, so people must manually integrate data sets in order to be able to perform queries of interest. To find a compound that binds to a receptor, you want to be able to use the protein ID or name as input, and get the IDs of the compounds as output.

The tricky part is querying SMID using SPARQL.

Approach 1: D2RQ. It performs object-relational mapping between relational databases and RDF. BO found this approach too difficult. The documentation was insufficient.

Approach 2: SquirrelRDF (http://jena.sourceforge.net/SquirrelRDF). It requires Jena 2.5 (it won't work with earlier versions). SquirrelRDF generates its own mapping file, and the database can then be queried with SPARQL.

D2RQ requires manual mapping, which perhaps an expert could do, but BO didn’t find it easy. His evaluation was from the non-expert's point of view, asking "Are these tools easy to use?" For D2RQ, the answer is no. For SquirrelRDF, the answer is yes.
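
The kind of mapping and lookup discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the one-table schema (`binding`), the column names, the `http://example.org/smid/` namespace and the `binds` predicate are all hypothetical, since SMID's actual schema was not described on the call, and the code uses neither D2RQ nor SquirrelRDF. It simply turns relational rows into subject/predicate/object triples and then answers "protein ID in, compound IDs out".

```python
import sqlite3

# Hypothetical namespace and predicate, chosen for illustration only.
BASE = "http://example.org/smid/"
BINDS = BASE + "binds"


def demo_connection():
    """Build a tiny in-memory stand-in for a relational interaction table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE binding (protein_id TEXT, compound_id TEXT)")
    conn.executemany(
        "INSERT INTO binding VALUES (?, ?)",
        [("P1", "C1"), ("P1", "C2"), ("P2", "C3")],
    )
    return conn


def rows_to_triples(conn):
    """Map each relational row to one (subject, predicate, object) triple."""
    query = (
        "SELECT protein_id, compound_id FROM binding "
        "ORDER BY protein_id, compound_id"
    )
    return [
        (BASE + "protein/" + protein_id, BINDS, BASE + "compound/" + compound_id)
        for protein_id, compound_id in conn.execute(query)
    ]


def compounds_for(triples, protein_id):
    """Return the IDs of compounds bound by the given protein.

    Roughly what this SPARQL pattern would ask of an RDF view of the data
    (prefixes and predicate are assumptions, as above):
        SELECT ?compound WHERE { <.../protein/P1> <.../binds> ?compound }
    """
    subject = BASE + "protein/" + protein_id
    return [
        obj.rsplit("/", 1)[1]
        for subj, pred, obj in triples
        if subj == subject and pred == BINDS
    ]


if __name__ == "__main__":
    triples = rows_to_triples(demo_connection())
    print(compounds_for(triples, "P1"))
```

In a real setup the mapping step is what a tool such as D2RQ or SquirrelRDF would take over, so that the lookup could be written directly in SPARQL against the relational database.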

SMID is not an ontology, which is an important distinction for people to consider with context and rule based systems.

Next tasks include developing a simple Java application, and updating the BioRDF Wiki.

Bill Bug (BB) - ChEBI (http://www.ebi.ac.uk/chebi/) is also an interesting data source. It is the only one that tries to create a subsumption hierarchy.

Brian Osborne (BO) – I have worked to convert ChEBI + Ligand into RDF. I will send an example of the mapping between the relational db and the RDF doc.

It’s not possible to make ChEBI available on OBO, as there is only a relational data source.

Alan Ruttenberg (AR) – ChEBI is also a taxonomy, rather than an ontology.

BO - Ligand is another interesting data source, but its lack of definitions is a limitation. It is difficult to resolve back to the root, as there are multiple parents. Some chemicals are resolvable to a single name, yet have multiple entities in the Ligand db.

Often it is the small molecules that will tie together the sub domains. They are therefore a crucial aspect, and known sources should be made pluggable.

FB - Ideally we want to be able to drag and drop knowledge between databases.

BioRDF Wiki

Susie Stephens (SS) – The BioRDF Wiki has been updated. The link to neuroscience data has been updated as suggested by Bill Bug. There is a FAQ document, which will help people to understand how to sign up and ask questions. A section has been added where people can make their data available, and a link has been created to the triple store that Scott Marshall is hosting. A section has also been added on URIs, so that people can add their thoughts regarding use cases, as well as the pros and cons of LSIDs.

HCLS Workshop at ISWC

SS – Informed participants that the call for presentations has now opened for the HCLS Workshop at ISWC (http://esw.w3.org/topic/HCLS/ISWC/Workshop). The meeting will consist of a mix of task updates and proposed presentations.

SWAN Overview

Tim Clark (TC) – Gave an overview of the SWAN project based upon the slides at: http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Meetings/2006-07-17_Conference_Call?action=AttachFile&do=get&target=SWAN.ppt

Scott Marshall (SM) – I would be interested in finding out more about the backend.

TC – We are using Jena and Postgres as the backend. For the pilot we used a memory-only store. The main issue hasn’t been with the RDF store, but how to integrate it with an existing system. We didn’t want to turn the whole system into a semantic web system. What’s the best system for linking semantic web and non-semantic web?

Elizabeth Wu (EW) – We want the members of the Alzforum to continue to have their existing login. We don’t want them to have a new login for the semantic web database. We are working on this now.

BB – There are many community sites like the Alzforum. However, most aren’t as large. For example, there are communities that focus on model organisms. Much of the data that these communities are interested in is outside of PubMed. Tools like the SWAN tool, which provide context-based data integration, would be of great value to them.

EW – For example, we have an extensive antibody database. However, we are also interested in NCBI resources, and linking a hypothesis to supporting or refuting evidence. We’d like to use the semantic web to link the data sources.

BB – Focus has been a great example as to how this can pay off. Microarray data is very critical. There are now huge XML schemas for microarray data. It would be nice to use the microarray data with FUGO, which is part of the OBO foundry. How do we deal with the RDF/XML impedance mismatch?

TC – We have dealt with this, but in the opposite direction. We are taking an RDF store, and wanting to render it nicely as XML. We wrote a note to Michael Miller about this, but he wasn’t very interested.

BB – A couple of years ago there was a fork between the communities that are developing large XML schemas and those that are developing ontologies. This concerns me.

TC – I have been working with the brain imaging community, e.g. Dave Kennedy and Bruce Rosen, so have encountered some of these issues. I’m not interested in getting into the semantics of each domain ontology. Someone does an experiment with microarrays or brain images. They can manage the data in such a way that it is compatible with ontologies. This would decrease the annotation burden. The interface between the discourse level of a source and the structure underneath is of interest. I want to leave domain ontologies to other people.

BB – Lots of data is formally expressed in XML, but the semantics and context are undefined. I want to access other experiments that are clearly related. I need all information about data acquisition. This is critical in order to be able to share data, and to include the pool of biological knowledge.

TC – Think of publications. They include sections such as materials and methods, but there isn’t a common structure for how they are represented. We’re trying to find a way to start with simple sentences as to what a publication means. We’re starting with the web related to a publication. It’d be wonderful if every time an antibody is listed in a materials and methods section, it were fully characterized and could be linked to an antibody ontology.

BB – The materials and methods section is one of the most critical.

TC – We’re trying to do a stepwise improvement. We’re trying to improve the couplings.

BB – One year from now more data will be available in RDF. There will still be the issue of matching strings, rather than semantics. But it will be sharable. I worry about getting the brain image schema to work with FUGO. We need to be able to share the coordinate space. We want to be able to perform topological queries against the Brain Atlas, and for that we will need to be able to share a coordinate system. This would enable us to share and pool data sets. We will need lots of detailed data about the information acquisition. We can’t tell if things are the same or not with an XML schema. We can’t algorithmically determine if we’re looking at the same thing or not.

SS – The next call will be on July 31 and will focus on whether BioRDF will support the LSID convention. Members of W3C’s Technical Architecture Group will be participating.