
Life Sciences Data


Use case

A large portion of the linked data cloud consists of datasets from the life sciences domain (more than 50 and counting), covering entities such as genes, diseases, drugs, pathways, and more. This should allow us to answer complex, structured queries that cannot be answered by a single data source alone, and hopefully lead to the discovery of new knowledge and opportunities to develop new medical treatments.
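To make this concrete, below is a minimal sketch of the kind of cross-dataset query meant here, written in Python with SPARQLWrapper. The endpoint URL and the ex: vocabulary terms are illustrative placeholders, not the terms the real datasets use; the point is only that the query spans entities contributed by several different sources.

  from SPARQLWrapper import SPARQLWrapper, JSON

  # Hypothetical endpoint exposing the combined life sciences data
  sparql = SPARQLWrapper("http://example.org/sparql")
  sparql.setQuery("""
  PREFIX ex: <http://example.org/vocab/>
  SELECT ?gene ?disease ?drug WHERE {
    ?gene ex:associatedWith ?disease .   # e.g. from a gene-disease dataset
    ?gene ex:encodes        ?protein .   # e.g. from a gene/protein dataset
    ?drug ex:targets        ?protein .   # e.g. from a drug dataset
  }
  LIMIT 10""")
  sparql.setReturnFormat(JSON)
  for row in sparql.query().convert()["results"]["bindings"]:
      print(row["gene"]["value"], row["disease"]["value"], row["drug"]["value"])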

There are two typical approaches:

  1. the warehouse approach: load the different datasets (assuming they are available as RDF dumps) into a single RDF triple store and fire queries against the merged dataset
  2. distributed query processing: queries are split into subqueries, which are fired against the different SPARQL endpoints and combined afterwards by the federation layer (see the sketch after this list)
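As referenced in approach 2, here is a sketch of what such a federated query could look like. The SERVICE keyword is standard SPARQL 1.1 federation; the endpoint URLs and vocabulary terms are again hypothetical placeholders.

  from SPARQLWrapper import SPARQLWrapper, JSON

  FEDERATED_QUERY = """
  PREFIX ex: <http://example.org/vocab/>
  SELECT ?gene ?drug WHERE {
    SERVICE <http://genes.example.org/sparql> {   # subquery against endpoint 1
      ?gene ex:encodes ?protein .
    }
    SERVICE <http://drugs.example.org/sparql> {   # subquery against endpoint 2
      ?drug ex:targets ?protein .
    }
  }"""

  # Any SPARQL 1.1 endpoint that supports federation can act as the entry point
  sparql = SPARQLWrapper("http://localhost:3030/ds/query")
  sparql.setQuery(FEDERATED_QUERY)
  sparql.setReturnFormat(JSON)
  results = sparql.query().convert()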

Both have pros and cons.

  1. Warehouse
    • - depends on dumps being available
    • - very expensive to load the different dumps: it can take hours or days (see the loading sketch after this list)
    • - if data sources change frequently, it is very hard to keep the warehouse up to date
    • - imports lots of data that is not necessarily relevant for your use case
    • + the fastest query times, since little network communication is needed and queries can be highly optimised
  2. Federated SPARQL queries
    • - query responses take much, much longer, most of the time timing out on more complex queries [1]
    • + no setup required
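To illustrate the loading cost mentioned above, here is a minimal sketch of loading a single dump via the SPARQL 1.1 Graph Store HTTP Protocol. The store URL, graph name and dump file are hypothetical, and real dumps are often so large that dedicated bulk loaders are used instead of plain HTTP requests like this.

  import requests

  # Hypothetical graph store endpoint (SPARQL 1.1 Graph Store HTTP Protocol)
  STORE = "http://localhost:3030/ds/data"

  with open("diseasome.nt", "rb") as dump:   # hypothetical dump file
      resp = requests.post(
          STORE,
          params={"graph": "http://example.org/graphs/diseasome"},
          data=dump,
          headers={"Content-Type": "application/n-triples"},
      )
  resp.raise_for_status()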

There is an additional caveat with these datasets, though. Within these datasets (as they are), vocabularies and ontologies are rarely reused, and no mappings have been made available by the respective data owners. [2] In other words, we have syntactically interoperable data, but no semantic interoperability yet.

This means we have to do the alignment ourselves: a) with a warehouse, we can use RDFS and OWL constructs to add whatever alignment is needed; still some work to do, but feasible (see the sketch below); b) in the federated case we need to do the same, but it is more complicated.
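As a sketch of option a), the snippet below uses rdflib to assert owl:equivalentClass and owl:equivalentProperty bridges between two purely hypothetical vocabularies. With an OWL-aware store or reasoner on top, queries written against one vocabulary can then also match data expressed in the other.

  from rdflib import Graph, URIRef
  from rdflib.namespace import OWL

  g = Graph()
  g.parse("warehouse_dump.nt", format="nt")   # hypothetical merged data

  # Hypothetical terms from two datasets that denote the same thing
  g.add((URIRef("http://vocabA.example.org/Gene"),
         OWL.equivalentClass,
         URIRef("http://vocabB.example.org/gene")))
  g.add((URIRef("http://vocabA.example.org/associatedWith"),
         OWL.equivalentProperty,
         URIRef("http://vocabB.example.org/linkedTo")))

  g.serialize("warehouse_with_mappings.nt", format="nt")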

To be able to define the mappings, you need to query all the SPARQL endpoints involved. And once the mapping is established, you need to implement a query rewriter that replaces the concepts, properties, and relations you are after with the terminology used by the specific endpoint to which the subquery is fired.
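A naive sketch of that rewriting step, with a hypothetical mapping table; a real rewriter would operate on the parsed query algebra rather than on raw strings, but the idea is the same.

  # Hypothetical table mapping our canonical terms to each endpoint's terms
  ENDPOINT_MAPPINGS = {
      "http://genes.example.org/sparql": {
          "http://example.org/vocab/associatedWith":
              "http://vocabB.example.org/linkedTo",
      },
  }

  def rewrite_for_endpoint(query, endpoint):
      # Substitute canonical URIs with the endpoint's local terminology
      for canonical, local in ENDPOINT_MAPPINGS.get(endpoint, {}).items():
          query = query.replace(canonical, local)
      return query

  subquery = ("SELECT ?g ?d WHERE { "
              "?g <http://example.org/vocab/associatedWith> ?d }")
  print(rewrite_for_endpoint(subquery, "http://genes.example.org/sparql"))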

All of this means that, for the time being and according to our evaluation, the warehouse approach makes the most sense. The burden of the initial load is a pain one can live with. However, keeping your warehouse in sync with the data sources currently means reloading the updated dumps, which is not a bright outlook.

This could be solved if we had a, preferably standardised, protocol for tracking changes, in which a server publishes the changes and a client pulls them in.
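As a rough sketch of the consuming side of such a protocol, along the lines of what SDShare aims at: assume the server publishes an Atom feed of changed fragments, and the client periodically pulls the feed and re-fetches only what changed since its last run. The feed URL and the update routine here are hypothetical.

  import time
  import feedparser
  import requests
  from rdflib import Graph

  FRAGMENTS_FEED = "http://example.org/sdshare/fragments"   # hypothetical feed
  warehouse = Graph()

  def apply_fragment(rdf_bytes):
      # Hypothetical update step: a real client would first remove the old
      # statements about the changed resource before adding the new ones
      warehouse.parse(data=rdf_bytes, format="xml")

  def poll_once(last_seen):
      feed = feedparser.parse(FRAGMENTS_FEED)
      for entry in feed.entries:
          # Atom timestamps are ISO 8601, so string comparison orders them
          if last_seen is not None and entry.updated <= last_seen:
              continue   # change already applied in an earlier run
          fragment = requests.get(entry.link,
                                  headers={"Accept": "application/rdf+xml"})
          apply_fragment(fragment.content)
      return max((e.updated for e in feed.entries), default=last_seen)

  last_seen = None
  while True:
      last_seen = poll_once(last_seen)
      time.sleep(60)   # pull changes once a minute instead of reloading dumps

The client only transfers the fragments that actually changed, which is what makes this cheap compared with reloading whole dumps.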

[1] An Evaluation of Approaches to Federated Query Processing over Linked Data, Peter Haase, Tobias Mathäß, Michael Ziller

[2] Cataloguing and Linking Life Sciences LOD Cloud, Ali Hasnain, Ronan Fox, Stefan Decker, Helena F. Deus

Contributed by Paul Hermans.

Solution

TBW