HCLS/Senselab Conversion

Experiences with the conversion of SenseLab databases to RDF/OWL

SenseLab is a collection of databases for neuroscientific research. This document describes how we converted some of the SenseLab databases to the Resource Description Framework (RDF) and the Web Ontology Language (OWL), the considerations we made and the advantages and potential disadvantages of the conversion to RDF/OWL we identified. We will also try to give suggestions to database administrators who consider doing a conversion to RDF/OWL.

The SenseLab ontologies and related material can be downloaded from [1].

Contributors to this page (content of this page might not represent the opinion of all the people in this list):

  • Matthias Samwald
  • Huajun Chen
  • Ernest Lim
  • Kei-Hoi Cheung

Conversion process (different attempts)

Original data sources

The SenseLab databases can be accessed through a web interface at [2]. The databases are based on the "entity-attribute-value with classes and relationships" (EAV/CR) schema (L. Marenco, N. Tosches, C. Crasto, G. Shepherd, P.L. Miller and P.M. Nadkarni, Achieving evolvable Web-database bioscience applications using the EAV/CR framework: recent advances, J Am Med Inform Assoc. (2003) 10(5):444-53). The data can also be downloaded as a database dump in MDB format and as text [3].

First RDF and OWL conversions

The first automated RDF and OWL conversions were mainly done by Ernest Lim and Huajun Chen, respectively.

  • Motivation: Making SenseLab databases available in pure RDF and in OWL DL. The two versions were developed in parallel in order to compare the conversion processes and their outcomes.
  • Process: The ontologies were automatically converted.
  • Outcome: These conversions reflected a lot of the local database structure, which resulted in inconsistent OWL ontologies. Some shortcomings of the first conversion to OWL were:
    • 'Part of' relations were wrongly represented as subclass relations. (This seems to be one of the most common mistakes in ontology development!)
    • Disjointness axioms between classes were missing, which made it hard to find inconsistencies and data entry errors.
    • After disjoints were introduced, we found some previously unidentified inconsistencies with the help of OWL reasoners: some classes (e.g. 'GABA') were subclasses of both 'neurotransmitter' and 'receptor', which was wrong. This was an artefact caused by the automated conversion -- both GABA transmitters and GABA receptors were simply labeled with 'GABA' in the source database. The conversion algorithm generated URIs based on these labels, so they were represented with an identical URI (neuron_ontology:GABA). This grave mistake would not have been noticed without the use of OWL reasoning.
    • Some of the labels of entities were very terse and not understandable outside the user interface of the original database. For example, "Ded" was the label of the "distal part of the dendrite".
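
The GABA collision described above can be reproduced in a few lines. The following is a minimal sketch, not the actual conversion code: the namespace, the function names and the toy rows are assumptions made for illustration, and the disjointness check is a drastically simplified stand-in for what an OWL reasoner does.

```python
from collections import defaultdict

NS = "http://example.org/neuron_ontology#"  # hypothetical namespace

def uri_from_label(label):
    """Naive URI minting: derive the URI from the label alone."""
    return NS + label.strip().replace(" ", "_")

def uri_from_label_and_kind(label, kind):
    """Safer variant: make the kind of entity part of the URI."""
    return NS + f"{label.strip()}_{kind}".replace(" ", "_")

# Toy source rows (label, parent class); not taken from the real database
rows = [
    ("GABA", "neurotransmitter"),
    ("GABA", "receptor"),
    ("Dopamine", "neurotransmitter"),
]

# Pairs of classes that are declared disjoint
DISJOINT = {frozenset({"neurotransmitter", "receptor"})}

def find_clashes(rows, mint):
    """Return URIs that end up as subclasses of two disjoint classes."""
    parents = defaultdict(set)
    for label, parent in rows:
        parents[mint(label, parent)].add(parent)
    return sorted(
        uri for uri, found in parents.items()
        if any(pair <= found for pair in DISJOINT)
    )

naive_clashes = find_clashes(rows, lambda label, kind: uri_from_label(label))
fixed_clashes = find_clashes(rows, uri_from_label_and_kind)
print(naive_clashes)  # the URI shared by both GABA entries is flagged
print(fixed_clashes)  # empty once the URIs are disambiguated
```

Without the disjointness declaration the shared URI would pass unnoticed, which mirrors our experience with the first conversion.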

Revised OWL conversions

The revised OWL conversion was done by Matthias Samwald and Huajun Chen, based on the first OWL conversion. The design of the SenseLab ontologies follows the view of ontological realism (B. Smith, Beyond Concepts: Ontology as Reality Representation, in A. Varzi, L. Vieu, eds., Proceedings of FOIS (IOS Press, Amsterdam, 2004) 319-330). This means that the ontologies are focused on direct representations of physical objects and processes (e.g., neuronal cells, ionic currents) in reality, and not on their abstractions (e.g., concepts or database entries).

  • Motivation:
    • Manually correcting the logical inconsistencies in the first version of the OWL ontology
    • Making use of foundational ontologies (BFO, Relation Ontology) where possible
    • Mapping the ontology to other neuroscience ontologies
  • Process:
    • An ontology containing basic class hierarchies and relations was manually created, based on the structure of existing SenseLab databases. This basic ontology could not be created from the database structure in an automated process because this would not have resulted in a logically consistent ontology. It was inspected and edited manually by a domain expert using Protégé 3.2 and TopBraid Composer. The ontologies were built upon established foundational ontologies in order to maximize interoperability with other existing and forthcoming biomedical Semantic Web resources. These ontologies were
      • the Relation Ontology [4] from the Open Biomedical Ontologies repository (OBO [5]), which defines basic relations such as 'part of', 'participant of' or 'contained in'.
      • the Basic Formal Ontology (BFO [6]), which defines basic classes such as 'process', 'object', 'quality' or 'function'.
  • Based on this manually created basic ontology, the data from the SenseLab databases were then automatically converted to OWL using programs written in Java and Python. The automated export scripts extended the manually created basic ontology through the creation of subclasses, OWL property restrictions and individuals. The resulting ontologies show no clearly distinguishable divide between a 'schema' and 'data'.

The OWL export of NeuronDB was based on a transformation from the EAV/CR model of the SenseLab database [20] to RDF/XML by a Java program. The export from ModelDB and BrainPharm was based on a simple flat text file export of the databases. The text file exports were converted to RDF/XML files with a Python script.
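
To give an impression of what such an export step involves, here is a minimal sketch that turns rows of a flat text export into subclass axioms with OWL property restrictions, written out as Turtle text. It is not the script we used: the tab-separated column layout, the `ex:` namespace and all class and property names are assumptions chosen for illustration.

```python
import csv
import io

PREFIXES = (
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix ex: <http://example.org/ontology#> .\n"  # hypothetical namespace
)

def subclass_with_restriction(cls, parent, prop, filler):
    """Turtle for: cls is a subclass of parent and of (prop some filler)."""
    return (
        f"{cls} rdfs:subClassOf {parent} ,\n"
        f"    [ a owl:Restriction ;\n"
        f"      owl:onProperty {prop} ;\n"
        f"      owl:someValuesFrom {filler} ] .\n"
    )

def rows_to_turtle(text):
    """Convert tab-separated rows (class, parent, property, filler) to Turtle."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    blocks = [
        subclass_with_restriction(*(f"ex:{value}" for value in row))
        for row in reader
        if row  # skip empty lines
    ]
    return PREFIXES + "\n" + "".join(blocks)

# Made-up example row, assuming four columns per line
FLAT_EXPORT = "Purkinje_cell\tNeuron\thas_part\tDendrite\n"
turtle_text = rows_to_turtle(FLAT_EXPORT)
print(turtle_text)
```

Note that the single restriction expands into several triples around a blank node, which is part of the reason why generated OWL class descriptions are harder to produce and query than plain RDF statements.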

For mappings to external bioinformatics databases that do not yet offer stable URIs for reference on the Semantic Web, we used the URI scheme for database record identifiers established by Science Commons [24]. URIs for database records could simply be generated by concatenating the record identifier to a predefined namespace. For example, the Entrez Gene record with ID '3579' was identified by the URI 'http://purl.org/commons/record/ncbi_gene/3579', the UniProt record 'P46663' was identified by '[7]' and the PubMed record with ID '11160518' was identified by '[8]'. The database entries were connected to the ontological representations of real-world entities through relations such as "has_peptide_sequence_described_by". Example:

example:bradykinin_receptor example:has_peptide_sequence_described_by <http://purl.org/commons/record/uniprotkb/P46663> .
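
The concatenation scheme can be sketched as follows. The function names are hypothetical; the namespace and the two database keys ('ncbi_gene', 'uniprotkb') are the ones that appear in the examples above, and other database keys are deliberately not guessed here.

```python
RECORD_BASE = "http://purl.org/commons/record/"

def record_uri(database, record_id):
    """Build a record URI by appending the record identifier to the namespace."""
    if not database or not record_id:
        raise ValueError("database key and record identifier must be non-empty")
    if any(ch.isspace() for ch in database + record_id):
        raise ValueError("blanks are not allowed in record URIs")
    return f"{RECORD_BASE}{database}/{record_id}"

def sequence_statement(subject, uniprot_id):
    """Return one triple linking an entity to its UniProt record (Turtle-like)."""
    uri = record_uri("uniprotkb", uniprot_id)
    return f"{subject} example:has_peptide_sequence_described_by <{uri}> ."

print(record_uri("ncbi_gene", "3579"))
print(sequence_statement("example:bradykinin_receptor", "P46663"))
```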

  • Mappings were made to the following ontologies:
    • the BAMS ontology (created by John Barkley, National Institute of Standards and Technology, USA) which was derived from the Brain Architecture Management System
    • the Subcellular Anatomy Ontology (SAO) created by the Cell Centered Database project. [9]
    • the BirnLex ontology developed by members of the Biomedical Informatics Research Network
    • the Common Anatomy Reference Ontology (CARO). [10]
    • the Gene Ontology. [11]
    • the Ontology of Biomedical Investigation (OBI). [12]
 The mappings were made with the following cross-ontology relations: owl:equivalentClass, rdfs:subClassOf and the "has part" relation from the OBO Relation Ontology.
 Figure: The arrows point from the imported ontology to the importing ontology, e.g., the NeuronDB Ontology imports the Relation Ontology. Import statements are transitive, e.g., the ModelDB Ontology imports both the NeuronDB ontology and the Relation ontology. [13]
 Figure: Relations ('mappings') spanning between classes from the NeuronDB ontology (in the middle) and classes from external ontologies. [14]
  • Terse rdfs:labels were replaced by more descriptive ones that could be better understood without context. For example, the rdfs:label "Ded" was changed to "Distal part of equivalent dendrite (Ded)". Note that, in this case, the original label was also preserved (in brackets), because it might still be useful for people that do know about the context.
  • While we used the namespace neuroweb.med.yale.edu in the beginning, we switched to the namespace purl.org/ycmi/senselab.
  • The ontology development was moved to a Subversion (SVN) system on a central webserver. Before that, during most of the development, the ontologies were simply developed on the client side and periodically uploaded via FTP. This led to problems when more than one person was working on the ontologies at a time, and it was impossible for users to access previous versions of an ontology, since only the most recent version was available on the website.
  • The namespaces / ontology locations were changed to PURL-based URIs. For example, the URI http://neuroweb.med.yale.edu/neuron_ontology.owl#Dopamine was changed to http://purl.org/ycmi/neuron_ontology.owl#Dopamine. PURL-based URIs are easier to maintain when server configurations change or (in the worst case) the original server is unavailable and the ontologies need to be served from a different location. The increased stability of PURLs encourages the re-use of entities in ontologies developed by other groups -- which is a key factor in the creation of a coherent Semantic Web.
  • A SPARQL endpoint was set up using an instance of the open-source version of the Virtuoso server. The ontologies were loaded into the triplestore of the server to make them accessible to SPARQL queries. Each ontology file was put into a separate named graph whose name was identical to the URL of the ontology file. For example, the ontology http://purl.org/ycmi/senselab/neuron_ontology.owl was loaded into a graph named http://purl.org/ycmi/senselab/neuron_ontology.owl. Loading each ontology into a separate graph makes it possible to restrict SPARQL queries to certain graphs and hence to certain ontologies. Queries can thus be more selective and can be executed with better performance.
  • Outcome: See [15]. The SVN can be accessed through a web interface at [16]. The SPARQL endpoint can be accessed at [17]. The SenseLab OWL ontologies are mentioned as a primary example for the application of OBO ontologies in The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration by Barry Smith et al. [18].
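
Restricting a query to the graph of a single ontology, as described for the SPARQL endpoint above, might look like the following sketch. It only builds the query text and a request URL; the endpoint address `http://example.org/sparql` is a placeholder and not the address of our endpoint, and the function names are made up for this example.

```python
from urllib.parse import urlencode

def labels_in_graph_query(graph_uri, limit=10):
    """SPARQL query that lists labelled entities from one named graph only."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?label\n"
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?entity rdfs:label ?label }} }}\n"
        f"LIMIT {int(limit)}"
    )

def request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})

query = labels_in_graph_query("http://purl.org/ycmi/senselab/neuron_ontology.owl")
print(query)
print(request_url("http://example.org/sparql", query))  # placeholder endpoint
```

Because the graph name equals the URL of the ontology file, a user who knows where an ontology lives also knows which graph to query.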

Second (fully automated) RDF conversions

The 'second' RDF conversion has been developed by Ernest Lim. It transforms the contents of the SenseLab databases to RDF in a generic, fully automated way.

  • Motivation: This second, fully automated RDF conversion is not meant to be used as the primary distribution. Rather, it is meant to serve as an intermediate layer between the source database and future OWL ontologies. This lowers the entry barrier for others who want to participate in the conversion of the SenseLab databases, since they are no longer required to understand the underlying EAV/CR model or to access the database directly.
  • Process:
  • Outcome:

Advantages

  • The use of OWL significantly eased the integration of SenseLab data with ontologies developed by other projects. OWL-based data integration does not require the development and maintenance of central mediators, reducing development and maintenance costs. The ontology integration can be accomplished by creating meaningful relations between entities in distributed ontologies.
  • Ontologies can be modularized; dependencies between ontologies can be made explicit through 'owl:imports' statements. This makes distributed development of ontology modules feasible and encourages the re-use of selected ontology modules by other groups.
  • Good OWL ontologies are self-descriptive because every entity can be annotated with text.
  • Reasoners can be used to identify errors and real (i.e., conscious) contradictions in submitted data sets. You might find more errors than you expected!
  • Ontologies can be used to directly represent biological reality, without requiring intermediate abstractions (such as database tables or documents).

Disadvantages

  • The open-source ontology editors used for this project, especially Protégé 3.x, were relatively unreliable. A lot of time was wasted working around software bugs. Future versions of these editors (e.g. Protégé 4) or currently available commercial ontology editors (e.g. TopBraid Composer) might be preferable.
  • Descriptions of OWL classes and their relations (i.e., OWL property restrictions) result in very complex and unintuitive RDF graphs. This makes it hard to generate them automatically, or to write SPARQL queries.
  • Current reasoners can still have performance problems when checking / classifying complex OWL ontologies.
  • The RDF/XML serialisation of RDF is not very easy to work with. It is often a source of errors.

Future perspectives and plans

  • Future OWL conversions are planned to be based on the RDF intermediate (the practicality of this conversion procedure has not yet been tested).
  • The SenseLab ontologies will be further integrated with other neuroscientific and biomedical ontologies.

Suggestions (based on our experiences)

  • Try to create consistent OWL DL ontologies. Pure RDF(S) without OWL constructs is not really that much simpler than OWL DL: you end up creating too many properties because pure RDF(S) does not support property restrictions.
  • Try to re-use entities and properties from existing ontologies where possible.
    • If you do not want to import another ontology in its entirety (e.g. because it would be too large, too buggy or would introduce unnecessary constructs), you can 'copy & paste' portions of the ontology into your own.
  • Try to base your ontology on a foundational ontology like BFO, OBO Relation Ontology or DOLCE.
  • Where possible, give clear, understandable rdfs:labels to each entity and property in the ontology. Try to formulate labels in a way that makes them understandable without too much additional context (e.g. a certain user interface).
  • Where possible, give concise rdfs:comments.
  • Make a habit out of running your ontology through the RDF validator periodically, especially when you create RDF/XML with scripts that you wrote yourself.
  • It seems that the RDF validator does not report an error when URIs contain blanks. Blanks in URIs are problematic for many Semantic Web applications, though (and it is not clear to us at the moment whether they are legal or not). Try to make sure that your URIs do not contain blanks.
  • Check the consistency of your OWL ontology periodically. We used the Pellet reasoner, which seems to be the best choice at the moment.
  • Use purl.org URIs for your ontologies. You can easily register a sub-domain at purl.org.
  • If you write a program that generates RDF/OWL, do NOT try to write RDF/XML source code directly. RDF/XML is relatively complicated and messy, and it is very easy to produce syntactic or even semantic errors because of that. So if you write a program that generates RDF,
    • use an RDF or OWL API for writing triples, or
    • generate your RDF in the much simpler Turtle syntax instead of RDF/XML. You can save the resulting RDF in Turtle format to a text file. If you need RDF/XML for another application, you can convert the Turtle to RDF/XML in a second step.
  • By default, Protégé displays the abbreviated URI (QName) of classes, properties and instances in the hierarchy on the left. Many ontologies use meaningless alphanumeric codes as URIs, which makes it hard to navigate through the ontology. However, you can configure Protégé so that it displays the values of rdfs:label instead of the URI, making such ontologies much easier to navigate. A guide for configuring Protégé in such a way can be found in the Protege Wiki.
    • Please note that you can only display rdfs:labels with a certain language tag (e.g. "en" for English). So, for example, when you have ontologies where some labels are tagged with 'en', and some labels are not tagged or are tagged with another language, you can only see some of the labels. Currently there is no workaround for such a situation (besides modifying the language tags in the ontology, of course).
  • When using Protégé or Swoop, be prepared to run into bugs (we found plenty of them).
    • Bugs in Protégé:
      • Protégé has a problem with URIs that cannot be split into a namespace and a valid local name (and therefore cannot be abbreviated as a QName), e.g. "http://example.org/123456". In such cases, it creates a new namespace for each URI.
      • Some errors associated with ontology import hierarchies (e.g. namespaces / ontology URIs got mixed up)
      • Newly created annotation properties are not only typed as 'owl:AnnotationProperty', but also as 'owl:DatatypeProperty' or 'owl:ObjectProperty'. This is not allowed in OWL DL!
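
Two of the suggestions above, generating Turtle instead of RDF/XML and keeping blanks out of URIs, can be combined in a small helper. This is a minimal sketch using only the Python standard library; the helper names and the example URI are assumptions, and a full RDF API would handle many more cases (prefixes, datatypes, other illegal characters).

```python
import re

_WHITESPACE = re.compile(r"\s")

def iri(uri):
    """Wrap a URI in angle brackets, refusing URIs that contain blanks."""
    if _WHITESPACE.search(uri):
        raise ValueError(f"URI contains whitespace: {uri!r}")
    return f"<{uri}>"

def literal(text, lang=None):
    """Quote a string as a Turtle literal with an optional language tag."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"' + (f"@{lang}" if lang else "")

def triple(subject, predicate, obj):
    """Format one Turtle statement from already-serialised terms."""
    return f"{subject} {predicate} {obj} ."

RDFS_LABEL = iri("http://www.w3.org/2000/01/rdf-schema#label")

statements = [
    triple(
        iri("http://example.org/onto#Ded"),  # hypothetical entity URI
        RDFS_LABEL,
        literal("Distal part of equivalent dendrite (Ded)", lang="en"),
    ),
]
turtle_text = "\n".join(statements) + "\n"
print(turtle_text)
```

Writing one statement per line keeps the output easy to inspect and to diff, and tagging every label with the same language avoids the label display problem in Protégé mentioned above.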

A separate page for suggestions will be created here: [19]