HCLSIG BioRDF Subgroup/Meetings/2006-04-24 Conference Call

Conference Details

Date of Call: Monday April 24, 2006
Time of Call: 11:00am Eastern Time
Dial-In #: +1.617.761.6200 (Cambridge, MA)
Participant Access Code: 246733 ("BIORDF")
IRC Channel: irc.w3.org port 6665 channel #BioRDF (see W3C IRC page for details)
Duration: ~1 hour
Convener: Susie Stephens
Scribe: Susie Stephens

Attendees: Olivier Bodenreider, Andy Seaborne, Brian Osborne, John Barkley, Susie Stephens, Alan Ruttenberg, Karen Skinner, Christophe Poulain, Brian Gilman

Apologies: Scott Marshall, Davide Zaccagnini, Kei Cheung.

1. Brian Osborne - Experiences using D2RQ for converting SMID into RDF

Was only able to focus on the mapping for a day or so. Has been talking to Kei Cheung and his postdocs to get some guidance with the tricky parts.

At its heart, D2RQ is a mapping function from relational tables to RDF, and it sits on top of Jena. D2RQ appears to suffer from some of the common open-source ailments, including poor documentation and a lack of example code.
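As a rough illustration of the kind of table-to-RDF mapping described above, the sketch below turns rows of a relational table into N-Triples using only Python's standard library. It is not D2RQ and does not use D2RQ's mapping language. The table `gene`, its columns, the sample rows, and the `http://example.org/` URIs are hypothetical stand-ins, not the schema being converted.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the minutes


def literal(value):
    """Serialize a value as an N-Triples plain literal with basic escaping."""
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    return '"' + text + '"'


def rows_to_ntriples(conn, table, key_column):
    """Yield one N-Triples line per non-NULL, non-key column value.

    Each row becomes a subject URI built from the key column, and each
    other column becomes a predicate. Table and column names are assumed
    to come from trusted configuration, since they are placed in the SQL.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key_column]}>"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            predicate = f"<{BASE}{table}#{column}>"
            yield f"{subject} {predicate} {literal(value)} ."


def demo():
    """Build a tiny in-memory table (made-up rows) and map it to triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT, note TEXT)")
    conn.execute("INSERT INTO gene VALUES (1, 'APP', NULL)")
    conn.execute("INSERT INTO gene VALUES (2, 'PSEN1', 'example row')")
    return list(rows_to_ntriples(conn, "gene", "id"))


if __name__ == "__main__":
    for line in demo():
        print(line)
```

A real mapping would also need to handle foreign keys (as links between resources) and typed literals, which this sketch leaves out.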

Plans to use the RDQL query language initially, and then see whether he can use SPARQL in place of RDQL to comply with W3C standards.

Expects to have a demo complete by May 15.

2. Olivier Bodenreider - Plans regarding OMIM and/or Entrez Gene

At the F2F in January the plan was to convert many of the NCBI data sources into RDF. This would involve working with lots of data and lots of data formats. The recent focus on neuroscience has called for a narrowing of resources.

It wouldn’t be realistic to convert subsets of all NCBI resources into RDF, but it should be possible to convert those most closely aligned with neuroscience.

He is more interested in the data content than in the technology. Will rely on people with more of a technology focus to help reach goals. Will be getting an intern from Georgia Tech for the summer who has a biology and computer science background.

Other people have proposed working on OMIM and Entrez Gene. These are possibly the most interesting resources and a good place to start. The only caveat is that these resources won’t provide the variety of formats that were planned. As BioRDF has also taken on the Text to RDF task, OMIM could also be used to extract RDF from unstructured text using NLP. It may also be interesting to look at GEO, as it has a data structure that is more similar to spreadsheets.

Would be happy to align objectives with BioRDF.

3. Discussion

There is a problem with RDF tools: most require the whole graph to be loaded into memory, which limits the ability to work with large graphs. One option would be a proxy server that intercepts queries to OMIM, takes the XML result, and converts it into RDF.
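The conversion step of such a proxy might look roughly like the sketch below, which reads an XML result one record at a time and emits N-Triples without building the whole graph in memory. The XML layout (`results`, `entry`, `title`, `status`), the sample records, and the `http://example.org/omim/` URIs are assumptions made for illustration; they are not OMIM's actual XML format, and the network side of the proxy is omitted.

```python
import io
import xml.etree.ElementTree as ET

BASE = "http://example.org/omim/"  # hypothetical namespace


def literal(value):
    """Serialize a string as an N-Triples plain literal with basic escaping."""
    text = value.replace("\\", "\\\\").replace('"', '\\"')
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    return '"' + text + '"'


def xml_to_ntriples(stream, record_tag="entry"):
    """Yield N-Triples lines, processing one record element at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != record_tag:
            continue
        subject = f"<{BASE}{elem.get('id')}>"
        for child in elem:
            if child.text and child.text.strip():
                predicate = f"<{BASE}field#{child.tag}>"
                yield f"{subject} {predicate} {literal(child.text.strip())} ."
        # Drop the record's children once converted to keep memory use low.
        elem.clear()


# Made-up sample result in the assumed layout.
SAMPLE = b"""<results>
  <entry id="000001"><title>First example record</title><status>live</status></entry>
  <entry id="000002"><title>Second example record</title></entry>
</results>"""


def demo():
    return list(xml_to_ntriples(io.BytesIO(SAMPLE)))


if __name__ == "__main__":
    for line in demo():
        print(line)
```

Because the function is a generator, a proxy could write each triple to the response as soon as it is produced instead of collecting them first.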

Oracle is able to work with large graphs due to its memory management capabilities.

An ontology could be used to represent the graph, so that the data is defined in terms of classes and relationships. This is what BioPAX does. The ontology then serves both as an ontology and as useful documentation of the graph’s content. XML could be loaded into Protégé and then queried using SPARQL.

It’d be nice if NCBI could provide an RDF service. You could retrieve data from a data source, have it converted into RDF by a proxy server, and then the data could be integrated with other data in an RDF browser.

Many data sources are made available as XML. It would be a classic use case for this data to be converted into RDF using GRDDL. It’d be best if the data provider could do this. PubMed would be a good data source for this work, as it has a relatively unproblematic schema.
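For context on what the data provider would need to do, GRDDL lets an XML document name its own transformation to RDF through a `transformation` attribute in the `http://www.w3.org/2003/g/data-view#` namespace on the root element. The sketch below covers only that discovery step. Applying the referenced XSLT is left out because Python's standard library has no XSLT processor, and the sample document and stylesheet URL are hypothetical.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def grddl_transformations(xml_text):
    """Return the transformation URIs declared on the document's root element."""
    root = ET.fromstring(xml_text)
    declared = root.get(f"{{{GRDDL_NS}}}transformation", "")
    # The attribute may hold several whitespace-separated URIs.
    return declared.split()


# Made-up document; the stylesheet URL is a placeholder.
SAMPLE = """<records xmlns:grddl="http://www.w3.org/2003/g/data-view#"
    grddl:transformation="http://example.org/records-to-rdf.xsl">
  <record id="1"/>
</records>"""


if __name__ == "__main__":
    print(grddl_transformations(SAMPLE))
```

A consumer would then fetch each listed stylesheet and run it over the document with an XSLT processor to obtain RDF/XML.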

It would be nice to use NLP techniques to get the medical content out of PubMed. It would be good to start with a subset of genes that are of particular interest.

The following data sets could provide a good starting point:

- (1) Rockefeller site: http://www.gensat.org/index.html

- (2) PubMed site: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gensat

- (3) NIH Neuroscience Microarray Consortium: http://arrayconsortium.tgen.org/np2/public/overview.jsp

- (4) Allen Mouse Brain Atlas: http://www.brainatlas.org/main.asp?section=updates

OMIM is a reasonable size for converting all of it into RDF. If you query on ‘neuro’ then you get 997 results.

There are many NLP approaches. Davide can generate parse trees in XML and RDF, GATE from Sheffield is good, and GENIA from Japan is also meant to be good.

Ruby on Rails looks like a good approach for building a thin client with an ActiveRDF backend. Will have an update next week.

No news from the Alzforum about posting the antibody data on the Web, although people can gain access to the data by emailing Alan. Something also needs fixing to enable the data to be viewed in Protégé.