Semantic Bioinformatics

Introduction

Although bioinformatics (or computational biology) covers any computational process that increases our capability to process biological data, the current focus is far more on creating data than on analysing and learning from it. Because biological data is among the most complex available, the algorithms involved in this computational effort are often very complex and still yield poor results. So far, therefore, computers mostly create and aggregate data so that biologists can manually analyse and curate it and feed the results back into the system for the next cycle.

Another problem in bioinformatics is the lack of well-defined standards. Every database or institute has its own unique identifiers, indexing scheme, data format and tools to deal with them all. There is already a big effort to integrate this data, but the amount of work required with current data models is huge, and because data models change quite often, maintaining the code base is not feasible.

One way of solving this problem is to centralize all data in a single database (such as EMBL or UniProtKB), but that still requires all other groups to produce their data in the same format, which does not fit all models. We end up with either a lack of information (everyone using the same fixed format) or an integration nightmare (different formats).

Finally, even when we reach the stage where all data is automatically integrated in real time, we still need to understand this data and do what is most important in bioinformatics: reasoning. Reasoning can of course be done over any kind of data format (flat files, databases, XML), but for all of them we still need to explain the meaning of assertions, and for that we need ontologies and a statement mechanism to connect them. So, instead of reinventing the wheel, we can use RDF or RDF-based ontology languages like OWL, which can solve most (if not all) of those problems by providing a defined semantics to interpret the data in a traceable way.
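As a rough illustration of what "a statement mechanism with defined semantics" buys us, the sketch below stores statements as plain (subject, predicate, object) tuples and applies the RDFS subclass rule: a resource typed with a class also belongs to every superclass of that class. It deliberately avoids any RDF library. The `http://example.org/bio/` namespace, the class names and the helper functions are all hypothetical; only the `rdf:type` and `rdfs:subClassOf` URIs are real.

```python
# Real vocabulary URIs from the RDF and RDFS specifications.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical namespace and resources, for illustration only.
EX = "http://example.org/bio/"

# Each statement is a (subject, predicate, object) triple.
triples = {
    (EX + "protein1", RDF_TYPE, EX + "Kinase"),
    (EX + "Kinase", RDFS_SUBCLASS_OF, EX + "Enzyme"),
    (EX + "Enzyme", RDFS_SUBCLASS_OF, EX + "Protein"),
}


def superclasses(cls, graph):
    """Return every class reachable from cls via rdfs:subClassOf."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if s == current and p == RDFS_SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def inferred_types(resource, graph):
    """Asserted rdf:type classes of resource plus all their superclasses."""
    asserted = {o for s, p, o in graph if s == resource and p == RDF_TYPE}
    result = set(asserted)
    for cls in asserted:
        result |= superclasses(cls, graph)
    return result


if __name__ == "__main__":
    for cls in sorted(inferred_types(EX + "protein1", triples)):
        print(cls)
```

Only one type is asserted for `protein1`, yet the ontology statements let a program conclude that it is also an enzyme and a protein, and each conclusion can be traced back to the triples that produced it.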

RDF in Bioinformatics

Support for RDF in application-oriented, high-throughput bioinformatics is still limited (due to its novelty). Many databases provide their data in RDF, and ontologies are also available in OWL format, but little is done with this data at a wider 'inter-group' level. Even so, you can already download UniProtKB and its taxonomy information, and get OBO ontologies such as GO in OWL.

The biggest impact RDF can have in bioinformatics, though, is to help integrate all data formats and standardise existing ontologies.

If unique identifiers are converted to URI references, ontologies are expressed in OWL, and data is annotated via these RDF-based resources, then integrating them is a matter of merging and aligning the ontologies (in the case of OWL Full, using 'owl:sameAs' statements). Once the data has been integrated, we can use the added benefit that RDF brings to reasoning: context embeddedness.
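A minimal sketch of that merging step follows, assuming the data is already available as (subject, predicate, object) triples whose identifiers are URIs. It groups URIs linked by `owl:sameAs` with a small union-find structure and rewrites every subject and object to one canonical URI per group. The two database namespaces, the predicates and the function name `merge_same_as` are hypothetical; a real pipeline would more likely use an RDF store or an OWL reasoner.

```python
# Real OWL property URI; everything under example.org is hypothetical.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DB_A = "http://example.org/dbA/"
DB_B = "http://example.org/dbB/"
EX = "http://example.org/vocab/"


def merge_same_as(triples):
    """Rewrite subjects and objects to one canonical URI per owl:sameAs group."""
    parent = {}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Keep the lexicographically smaller URI so the result is deterministic.
                keep, drop = sorted((root_s, root_o))
                parent[drop] = keep

    # Drop the sameAs links themselves and rewrite everything else.
    return {
        (find(s), p, find(o))
        for s, p, o in triples
        if p != OWL_SAME_AS
    }


# Two sources describing the same protein under different identifiers.
source_a = {(DB_A + "prot1", EX + "name", "kinase X")}
source_b = {(DB_B + "X_0001", EX + "locatedIn", EX + "nucleus")}
links = {(DB_A + "prot1", OWL_SAME_AS, DB_B + "X_0001")}

merged = merge_same_as(source_a | source_b | links)

if __name__ == "__main__":
    for triple in sorted(merged):
        print(triple)
```

After merging, both facts hang off a single URI, so a query about the protein's name and its location no longer needs to know which database each statement came from.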

Projects

Below is a list of known projects related to bioinformatics, or tools that bioinformaticians are using in their projects. Some tools may already be listed on other pages such as SemanticWebTools or CommercialProducts, but they appear here because they are also being used in a bioinformatics context.

If you know of more projects, or work on one, please add it to the list.

Ontologies & Data

UniProtKB: The protein database
OBO ontologies: Biological ontologies
OBO Download Matrix: Downloads of OBO ontologies in all formats
OBI: Ontology for Biomedical Investigations
BioPAX: Biological pathway ontology

Projects, Groups & Research

The new UniProtKB website: Exports UniProt entries to RDF
SBML: Uses RDF to annotate XML fields
Neurocommons: Knowledge management platform for biological research
HCLS: Health Care and Life Sciences Interest Group

Storage engines & Tools

SDS: Uses OWL and SPARQL to query all data sources: RDBMS, Excel, web services and more (includes a data federation engine and distributed query optimization)
Virtuoso: Relational database with SPARQL support
Sesame: A Java interface with SPARQL and SeRQL support
Jena: Another Java API
Oracle 11g: Includes beta support for RDF
YARS: Java storage with "N3 query language"
Protege-owl: An ontology- and knowledgebase-editor for OWL
Bio2RDF: A set of integrated RDFisers for the major bioinformatics databases
RDFScape: Visualize, query and reason on ontologies within Cytoscape