W3C

Experiences with the conversion of SenseLab databases to RDF/OWL

W3C Interest Group Note 4 June 2008

This version:
http://www.w3.org/TR/2008/NOTE-hcls-senselab-20080604/
Latest version:
http://www.w3.org/TR/hcls-senselab/
Previous version:
http://www.w3.org/TR/2008/WD-hcls-senselab-20080404/
Editors:
Matthias Samwald, Yale Center for Medical Informatics / DERI Galway / Semantic Web Company <samwald@gmx.at>
Kei-Hoi Cheung, Yale Center for Medical Informatics <kei.cheung@yale.edu>
Contributors:
Alan Ruttenberg, Science Commons <alanruttenberg@gmail.com>
Huajun Chen, Yale Center for Medical Informatics / Zhejiang University <huajunsir@zju.edu.cn>

Abstract

One of the challenges facing Semantic Web for Health Care and Life Sciences is that of converting relational databases into Semantic Web format. The issues and the steps involved in such a conversion have not been well documented. To this end, we have created this document to describe the process of converting SenseLab databases into OWL. SenseLab is a collection of relational (Oracle) databases for neuroscientific research. The conversion of these databases into RDF/OWL format is an important step towards realizing the benefits of Semantic Web in integrative neuroscience research. This document describes how we represented some of the SenseLab databases in Resource Description Framework (RDF) and Web Ontology Language (OWL), and discusses the advantages and disadvantages of these representations. Our OWL representation is based on the reuse and extension of existing standard OWL ontologies developed in the biomedical ontology communities. The purpose of this document is to share our implementation experience with the community.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is an Interest Group Note of the Semantic Web in Health Care and Life Sciences Interest Group (HCLS), part of the W3C Semantic Web Activity. It is considered stable and expected to be published as an Interest Group Note in May 2008. This document serves as a companion to A Prototype Knowledge Base for the Life Sciences and describes the process for integrating new data into an existing biological database. We hope other groups who plan to convert their databases into RDF/OWL format will benefit from this document.

The document was produced by the Semantic Web in Health Care and Life Sciences Interest Group (HCLS), part of the W3C Semantic Web Activity (see charter). Comments may be sent to the publicly archived public-semweb-lifesci@w3.org mailing list. Feedback is encouraged, as is participation in the recently re-chartered HCLSIG. A list of changes since the last publication is available.

Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the disclosure obligations of the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information to public-semweb-lifesci@w3.org [public archive] in accordance with section 6 of the W3C Patent Policy.




Conversion process

Original data sources

The SenseLab databases can be accessed through a web interface at the SenseLab web site [SENSELAB-WEB]. SenseLab is divided into a number of specialised databases, of which we have converted three to Semantic Web formats. These databases are NeuronDB, BrainPharm and ModelDB. All databases are based on compartmental models of neurons. NeuronDB contains descriptions of anatomic locations, cell architecture and physiologic parameters of neuronal cells. The pilot BrainPharm database is intended to support research on drugs for the treatment of neurological disorders. It enhances the descriptions in a portion of NeuronDB with descriptions of the actions of pathological and pharmacological agents. ModelDB is a large repository of computational neuroscience models and simulations. The mathematical models in ModelDB are annotated with references to NeuronDB. Taken together, these databases allow the researcher to query information and to run simulations pertaining to the function of neurons in healthy and disease states. All databases contain extensive literature references and excerpts from texts that have been used to curate the database entries.

The databases are based on the "entity-attribute-value with classes and relationships" (EAV/CR) schema [EAV-CR]. The data can also be downloaded from the SenseLab Semantic Web development portal [SENSELAB-SW] as a database dump in Microsoft Access format and as text.

Initial RDF and OWL conversions

Motivation

Our motivation was to make the SenseLab databases available in RDF(S) [RDFS] (without OWL) and in OWL DL [OWL Overview]. The two versions were developed in parallel in order to compare the difference between the conversion processes and the outcomes. We wanted to explore the issues in mapping relational databases to RDF/OWL structure. In addition, we wanted to explore the possibility of automatic translation from EAV/CR to RDF.

Process

We developed a converter application in Java that queried the SenseLab database and wrote RDF/XML files. The conversion was fully automatic for the RDF version, but required some manual editing for the OWL version.
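As a rough illustration of why the RDF conversion could be automated, the following Python sketch maps entity-attribute-value rows directly onto triples. It is not the Java converter described above: the row layout, the `http://example.org/eav#` namespace, the function names and the sample rows are all hypothetical, and the real EAV/CR schema carries class and relationship metadata that this sketch ignores.

```python
# Minimal sketch, assuming a simplified (entity, attribute, value) row layout.
# The namespace and all identifiers below are hypothetical placeholders.
BASE = "http://example.org/eav#"


def escape_literal(value):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def eav_to_ntriples(rows, entity_ids):
    """Turn each EAV row into one N-Triples line.

    A value that names another known entity becomes a URI reference;
    any other value becomes a plain literal.
    """
    lines = []
    for entity, attribute, value in rows:
        subject = f"<{BASE}{entity}>"
        predicate = f"<{BASE}{attribute}>"
        if value in entity_ids:
            obj = f"<{BASE}{value}>"
        else:
            obj = '"' + escape_literal(str(value)) + '"'
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


# Made-up sample rows, for illustration only.
SAMPLE_ROWS = [
    ("entity_1", "has_label", "example label"),
    ("entity_1", "located_in", "entity_2"),
]

for line in eav_to_ntriples(SAMPLE_ROWS, {"entity_1", "entity_2"}):
    print(line)
```

A mechanical mapping of this kind mirrors the database structure one-to-one, which is convenient for plain RDF but carries the database's modelling decisions into the output unchanged.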

Outcome

These conversions were too tied to the original database structure, which resulted in inconsistent OWL ontologies. Some shortcomings of the first conversion to OWL were:

¹ Disjoint classes are used in OWL to assert that they have no members in common. Inferences from this can be used to flag any inconsistent models.
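The idea in the footnote can be sketched in a few lines of Python. This is only a toy check over explicitly asserted types; a real OWL DL reasoner such as Pellet [PELLET] also takes subclass hierarchies and property restrictions into account. The function name, class names and individuals are hypothetical.

```python
from itertools import combinations


def find_disjointness_violations(types_by_individual, disjoint_pairs):
    """Return (individual, class_a, class_b) for every individual that is
    asserted to belong to two classes declared disjoint with each other."""
    disjoint = {frozenset(pair) for pair in disjoint_pairs}
    violations = []
    for individual, types in sorted(types_by_individual.items()):
        for class_a, class_b in combinations(sorted(types), 2):
            if frozenset((class_a, class_b)) in disjoint:
                violations.append((individual, class_a, class_b))
    return violations


# Hypothetical assertions: "x" is typed with two classes declared disjoint.
asserted_types = {"x": {"Neuron", "Receptor"}, "y": {"Neuron"}}
declared_disjoint = [("Neuron", "Receptor")]

print(find_disjointness_violations(asserted_types, declared_disjoint))
```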

Revised OWL conversions

The revised OWL conversion was based on the first OWL conversions described above. The design of the revised SenseLab ontologies follows the "ontological realism" approach [SMITH-2004]. This means that the revised ontologies are focused on direct representations of physical objects and processes (e.g., neuronal cells, ionic currents), and not on their abstractions (e.g., concepts or database entries).

Motivation

Manually correcting the logical inconsistencies in the first version of the OWL ontology; making use of foundational ontologies (BFO, Relation Ontology) where possible; mapping the ontology to other neuroscience ontologies.

Process

An ontology containing basic class hierarchies and relations was manually created, based on the structure of existing SenseLab databases. This basic ontology could not be generated automatically from the database structure, because the result would not have been a logically consistent ontology. A domain expert inspected and manually edited the ontology with Protege 3.2 [PROTEGE] and TopBraid Composer [TOPBRAID]. The ontologies were built upon established foundational ontologies in order to maximize the interoperability with other existing and forthcoming biomedical Semantic Web resources. The foundational ontologies used were:

Based on this manually created basic ontology, the data from the SenseLab databases were then automatically converted to OWL using programs written in Java and Python. The automated export scripts extended the manually created basic ontology through the creation of subclasses, OWL property restrictions and individuals. The resulting ontologies show no clearly distinguishable divide between the 'schema' and 'data'.

The OWL export of NeuronDB was based on a transformation from the EAV/CR model of the SenseLab database to files in RDF/XML syntax by a Java program. The export from ModelDB and BrainPharm was based on a simple flat text file export of the databases. The text file exports were converted to RDF/XML files with a Python script.
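The text-file route can be sketched as follows. This is a minimal illustration and not the script used in the project: the tab-delimited column layout (`id`, `label`, `parent`), the `http://example.org/flat-export#` namespace and the sample row are assumptions. It only emits subclasses with labels; property restrictions and individuals would be generated along the same lines.

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
BASE = "http://example.org/flat-export#"  # hypothetical namespace

# Register readable prefixes so the serialised XML uses rdf:, rdfs: and owl:.
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("owl", OWL)):
    ET.register_namespace(prefix, uri)


def rows_to_rdfxml(text):
    """Convert tab-delimited rows (id, label, parent) into RDF/XML,
    emitting one owl:Class per row as a subclass of its parent."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        cls = ET.SubElement(
            root, f"{{{OWL}}}Class", {f"{{{RDF}}}about": BASE + row["id"]}
        )
        label = ET.SubElement(cls, f"{{{RDFS}}}label")
        label.text = row["label"]
        ET.SubElement(
            cls,
            f"{{{RDFS}}}subClassOf",
            {f"{{{RDF}}}resource": BASE + row["parent"]},
        )
    return ET.tostring(root, encoding="unicode")


# Made-up sample export with a header line and a single record.
SAMPLE = "id\tlabel\tparent\nModel_1\tExample model\tModel\n"
print(rows_to_rdfxml(SAMPLE))
```

Building the document through an XML library rather than string concatenation keeps the output well-formed even when labels contain characters such as `&` or `<`.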

For mappings to external bioinformatics databases that did not yet offer stable URIs for reference on the Semantic Web, we used the URI scheme for database record identifiers established by Science Commons [SC-URI]. URIs for database records could simply be generated by concatenating the record identifier to a predefined namespace. For example, the Entrez Gene record with ID '3579' was identified by the URI http://purl.org/commons/record/ncbi_gene/3579, the UniProt record 'P46663' was identified by http://purl.org/commons/record/uniprotkb/P46663 and the PubMed record with ID '11160518' was identified by http://purl.org/commons/record/pmid/11160518. The database entries were connected to the ontological representations of real-world entities through relations such as has_nucleotide_sequence_described_by. For example, the gene of the Dopamine Receptor D1 (DRD1) is defined through a reference to NCBI record 1812, which contains a description of the sequence of this specific gene:

<http://purl.org/ycmi/senselab/neuron_ontology.owl#DRD1_Gene> owl:equivalentClass _:property_restriction1 .
_:property_restriction1 owl:onProperty senselab:has_nucleotide_sequence_described_by .
_:property_restriction1 owl:hasValue <http://purl.org/commons/record/ncbi_gene/1812> .
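The concatenation rule for record URIs can be sketched in Python as below. The namespace and the three database keys are taken from the examples in the text; the function name is hypothetical, and percent-encoding the identifier is an added precaution of this sketch, not something the URI scheme is stated to require.

```python
from urllib.parse import quote

# Namespace taken from the example URIs in the text.
RECORD_NAMESPACE = "http://purl.org/commons/record/"


def record_uri(database, record_id):
    """Build a record URI by appending the record identifier to the
    namespace of its database (e.g. 'ncbi_gene', 'uniprotkb', 'pmid')."""
    # Percent-encoding is an assumption of this sketch; the identifiers
    # used in the text contain no characters that would be affected.
    return f"{RECORD_NAMESPACE}{database}/{quote(str(record_id), safe='')}"


print(record_uri("ncbi_gene", "3579"))
print(record_uri("uniprotkb", "P46663"))
print(record_uri("pmid", "11160518"))
```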

Mappings were made to the following ontologies:

The mappings were made with the following cross-ontology relations: owl:equivalentClass, rdfs:subClassOf and the "has part" relation from the OBO relation ontology.

Ontology import hierarchy

Figure 1: Import hierarchy of OWL ontologies. Ontologies printed in bold have been created by the SenseLab team; other ontologies have been created by other groups. The arrows point from the imported ontology to the importing ontology, e.g., the NeuronDB Ontology imports the Relation Ontology. Import statements are transitive, e.g., the ModelDB Ontology imports both the NeuronDB Ontology and the Relation Ontology.

Examples of ontology mappings

Figure 2: Examples of relations ('mappings') spanning between classes from the NeuronDB ontology (in the middle) and classes from external ontologies.

Terse rdfs:labels were replaced by more descriptive ones that could be better understood without knowledge of the context. For example, the rdfs:label "Ded" was changed to "Distal part of equivalent dendrite (Ded)". Note that, in this case, the original label was also preserved (in brackets), because it might still be useful for people who do know the context.

The ontology development was moved to a Subversion (SVN) system on a central web server. Before that, during most of the development, the ontologies were simply edited on the client side and periodically uploaded via FTP. This led to problems when more than one person was working on the ontologies at the same time. It was also impossible for users to access previous versions of an ontology, since only the most recent version was available on the web site.

The namespaces / ontology locations were changed to PURL-based URIs. For example, the URI http://neuroweb.med.yale.edu/senselab/neuron_ontology.owl#Dopamine was changed to http://purl.org/ycmi/senselab/neuron_ontology.owl#Dopamine ('ycmi' stands for 'Yale Center for Medical Informatics'). PURL-based URIs are easier to maintain when server configurations change or (in the worst case) the original server is unavailable and the ontologies need to be served from a different location. The increased stability of PURLs encourages the re-use of entities in ontologies developed by other groups -- which is a key factor in the creation of a coherent Semantic Web.
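The namespace change amounts to a prefix substitution on every URI, which could be sketched as follows. The two namespaces are taken from the example above; the function name is hypothetical, and the text does not say how the rewrite was actually carried out.

```python
# Old and new namespaces, taken from the example in the text.
OLD_NAMESPACE = "http://neuroweb.med.yale.edu/senselab/"
NEW_NAMESPACE = "http://purl.org/ycmi/senselab/"


def to_purl(uri):
    """Rewrite a URI from the old server-based namespace to the PURL-based
    namespace; URIs from other namespaces are returned unchanged."""
    if uri.startswith(OLD_NAMESPACE):
        return NEW_NAMESPACE + uri[len(OLD_NAMESPACE):]
    return uri


print(to_purl("http://neuroweb.med.yale.edu/senselab/neuron_ontology.owl#Dopamine"))
```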

A SPARQL endpoint for the SenseLab ontologies was set up using the open source version of the OpenLink Virtuoso server [VIRTUOSO]. A SPARQL endpoint is a service that allows clients to query an RDF store with the SPARQL query language through simple HTTP GET requests. The ontologies were loaded into the triple store of the server to make them accessible to SPARQL queries. Each ontology file was put into a separate labeled graph; the label of each graph was identical to the URL of the ontology file. For example, the ontology located at http://purl.org/ycmi/senselab/neuron_ontology.owl was loaded into a graph labeled http://purl.org/ycmi/senselab/neuron_ontology.owl. Loading each ontology into a separate graph makes it possible to restrict SPARQL queries to certain graphs and hence, certain ontologies. This has the advantage that queries can be more selective and can be executed with better performance.
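A query restricted to a single graph can be sent as an HTTP GET request whose `query` parameter carries the SPARQL text, as defined by the SPARQL protocol. The Python sketch below only builds such a URL and does not contact any server. The endpoint address is the one given in the Outcome section and may no longer be reachable; the query itself and the function names are illustrative assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://hcls.deri.ie/sparql"  # endpoint named in the Outcome section
NEURON_GRAPH = "http://purl.org/ycmi/senselab/neuron_ontology.owl"


def build_query_url(endpoint, graph_uri, limit=10):
    """Build a GET URL for a SPARQL query that reads from one graph only."""
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?class ?label\n"
        f"FROM <{graph_uri}>\n"
        "WHERE { ?class rdfs:label ?label }\n"
        f"LIMIT {int(limit)}"
    )
    # urlencode percent-encodes the query text for use in the URL.
    return endpoint + "?" + urlencode({"query": query})


def extract_query(url):
    """Decode the SPARQL text back out of a URL built by build_query_url."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_query_url(ENDPOINT, NEURON_GRAPH, limit=5)
print(url)
print(extract_query(url))
```

The `FROM` clause is what confines the query to one ontology; replacing the graph URI selects a different ontology without changing the rest of the query.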

Outcome

The final products of the project are accessible at http://neuroweb.med.yale.edu/senselab/. An SVN repository can be accessed through a web interface at http://neuroweb.med.yale.edu/svn/trunk/ontology/senselab/. The SPARQL endpoint can be accessed at http://hcls.deri.ie/sparql. The SenseLab OWL ontologies are mentioned as an example of the application of OBO ontologies in the article The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration [OBO-ARTICLE].

Advantages

We experienced the following advantages from using RDF/OWL:

Disadvantages

We experienced the following problems while using RDF/OWL:

Future directions and plans

The SenseLab ontologies will be further integrated with other neuroscientific and biomedical ontologies. User-friendly applications will be developed to query a multitude of interrelated ontologies in a scientifically meaningful way. To this end, we have implemented a prototype Web application called 'Entrez Neuron' that allows the user to query data across multiple sources based on keywords. The user can browse the query results and retrieve more detailed information about neurons based on a 'brain-anatomy/neuron' view. A paper describing this application was published at the WWW/HCLS2008 workshop. Currently, we are expanding this application to include more views and features.

Suggestions based on our experiences

Based on our experiences we can make the following suggestions for other projects that have similar goals:

Conclusion

We experienced clear benefits from using Semantic Web technologies for the integration of SenseLab data with other neuroscientific data in a consistent, flexible and decentralised manner. The main obstacle in our work was the lack of mature and scalable open source software for editing the complex, expressive ontologies we were dealing with. Since the quality of these tools is rapidly improving, this may cease to be an issue in the near future. The detailed analysis of the experiences with the SenseLab ontologies and other complex biomedical ontologies may help drive the improvement of current ontology editors.

References

[EAV-CR]
L. Marenco, N. Tosches, C. Crasto, G. Shepherd, P.L. Miller and P.M. Nadkarni, Achieving evolvable Web-database bioscience applications using the EAV/CR framework: recent advances, J Am Med Inform Assoc. (2003) 10(5):444-53
[SENSELAB-WEB]
SenseLab database, http://senselab.med.yale.edu/
[SENSELAB-SW]
SenseLab Semantic Web Development, http://neuroweb.med.yale.edu/senselab/
[PROTEGE]
The Protege Ontology Editor and Knowledge Acquisition System, http://protege.stanford.edu/
[TOPBRAID]
TopBraid Composer, http://www.topbraidcomposer.org/
[RO]
Relation Ontology, http://www.obofoundry.org/ro/
[OBO]
The Open Biomedical Ontologies, http://obofoundry.org/
[BFO]
Basic Formal Ontology (BFO), http://www.ifomis.uni-saarland.de/bfo/
[SC-URI]
Explanation of HCLS and Science Commons URIs, http://sw.neurocommons.org/2007/uri-explanation.html
[BAMS]
The Brain Architecture Management System, http://brancusi.usc.edu/bkms/
[SAO]
CCDB Subcellular Anatomy Ontology, http://ccdb.ucsd.edu/CCDBWebSite/sao.html
[CARO]
Common Anatomy Reference Ontology, http://www.obofoundry.org/cgi-bin/detail.cgi?id=caro
[BIRNLEX]
BIRNLex Ontology Documentation, http://fireball.drexelmed.edu/birnlex/OWLdocs/
[GO]
Gene Ontology, http://geneontology.org/
[OBI]
Ontology of Biomedical Investigation, http://obi.sourceforge.net/
[VIRTUOSO]
OpenLink Universal Integration Middleware - Virtuoso Product Family, http://virtuoso.openlinksw.com/
[OBO-ARTICLE]
The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration, Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters et al., Nature Biotechnology 25, 1251 - 1255, 2007, http://dx.doi.org/10.1038/nbt1346
[DOL]
DOLCE Ontology, http://www.loa-cnr.it/DOLCE.html
[RDF-VALID]
RDF Validator, http://www.w3.org/RDF/Validator/
[PELLET]
The PELLET Open Source OWL DL Reasoner, http://pellet.owldl.org/
[SMITH-2004]
Beyond Concepts: Ontology as Reality Representation, Barry Smith, in A. Varzi, L. Vieu, eds., Proceedings of FOIS (IOS Press, Amsterdam, 2004) 319-330. http://ontology.buffalo.edu/bfo/BeyondConcepts.pdf
[KB]
A Prototype Knowledge Base for the Life Sciences, http://www.w3.org/TR/2008/NOTE-hcls-kb-20080604/
[N3]
Primer: Getting into RDF and Semantic Web using N3, http://www.w3.org/2000/10/swap/Primer
[OWL Overview]
OWL Web Ontology Language Overview, Deborah L. McGuinness and Frank van Harmelen, Editors, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-owl-features-20040210/ . Latest version available at http://www.w3.org/TR/owl-features/
[RDFS]
RDF Vocabulary Description Language 1.0: RDF Schema, Dan Brickley and R.V. Guha, Editors, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-rdf-schema-20040210/ . Latest version available at http://www.w3.org/TR/rdf-schema/

Acknowledgements (Informative)

Thanks to Huajun Chen and Ernest Lim who contributed to the SenseLab conversion. Thanks to Gordon Shepherd, Perry Miller, Luis Marenco and Tom Morse for their input, suggestions and support. Thanks to Susie Stephens for her detailed suggestions for improving this document. Thanks to Alan Ruttenberg for his technical suggestions during the conversion process. Thanks to Eric Prud'hommeaux for technical advice and assistance on the creation of this document.