W3C

Case Study: FeDeRate for Drug Research

① SPASQL Example · ② SPASQL-MySQL · ③ SPASQL XTech Paper · ③ XTech Slides · ⑤ SPASQL CVS Tree

CVS Version:
$Id: Overview.html,v 1.14 2011/03/28 12:56:54 eric Exp $
Author:
Eric Prud'hommeaux, W3C

Abstract

This document describes the use of FeDeRate to perform a cross-orginanizational database join.

Status of this Document

This documents experiments by the author. It is not endorsed by the W3C Team or Membership. It is hoped that the work described here will be pertinent to the life sciences work persued by W3C.

Introduction

FeDeRate is a tool for fitting arbitrary data stores into the semantic web. The core is a unifier that dispatches RDF queries and assertions to a variety of database interfaces. This case study focuses on the use of the SqlDB [SqlDB] interface, which maps RDF queries to SQL queries and expresses the returned knowledge as RDF. FeDeRate will cull the results of a query spread across several relationa stores using SqlDB to translate to SQL and from the relation answers.

Problem Statement (Pharmacology Perspective)

A corpus of microarray experiments has revealed certain chemicals to increase the expressivity. We wish to limit to those involved in the "apoptosis" pathway. We further restrict to those chemicals which affect proteins either very much or very little. Given that set of chemicals, we look for chemicals with similar side-chains that have a minimum toxicity in mice.

Query

The SPARQL query unifies the data from several databases. The query engine connects to an SqlDB each time a named graph with the base URI is loaded. The remaining URI is interpreted as a form-url-encoded series of constructor parameters. In the following example, The databases are configured by pointers to property files. The QName db:MicroArray.prop expands to SqlDB base URI + '?' + properties=..%2Ftest%2FMicroArray.prop. The latter term is the URL-encoded form of the relative path to properties file ../test/MicroArray.prop.

# Assign convenient prefixes to common for later syntactic shorthand.
PREFIX db: <http://www.w3.org/2003/01/21-RDF-RDB-access/ns#SqlDB?properties=..%2Ftest%2F>
PREFIX ma: <http://med.example/ma#>
PREFIX cs: <http://med.example/cs#>
PREFIX up: <http://med.example/up#>
PREFIX sa: <http://med.example/sa#>
PREFIX mt: <http://med.example/mt#>

SELECT ?name ?chemical ?motif ?saProt ?kd50 ?like ?ld

FROM NAMED db:MicroArray.prop
FROM NAMED db:Uniprot.rdf
FROM NAMED db:ScreeningAssay.prop
FROM NAMED db:ChemStructure.prop
FROM NAMED db:MouseToxicity.prop

WHERE
{
# Get a name and a chemical from the (SQL) MicroArray database.
GRAPH db:MicroArray.prop {
         ?g	ma:name		?name .
	 ?g	ma:expression	"up" .
	 ?g	ma:experiment	?kinase .
	 ?kinase ma:against	?agin .
	 ?agin	cs:chemical	?chemical }

# The uniprot data (in RDF) has motif and pathway information.
GRAPH db:Uniprot.rdf {
         ?p	ma:name		?name .		# bound to ?ma.ma:name
	 ?p	up:motif	?motif .
	 ?p	up:pathway	"apoptosis" }

# Use the (SQL) Kinase databaes to limit to the interesting chemicals.
GRAPH db:ScreeningAssay.prop {
         ?a	sa:name		"KinaseAssay" .
	 ?a	cs:chemical	?chemical .	# bound to ?ma.cs:chemical
	 ?a	sa:upname	?saProt .
	 ?a	ma:kd50		?kd50
         FILTER (?kd50 >= .7 || ?kd50 < .2) }

# This (SQL) chemical database indexes like sidechains.
GRAPH db:ChemStructure.prop {
         ?c	cs:chemical	?chemical .	# bound to ?ma.cs:chemical, ?sa.cs:chemical
	 ?c	cs:structure	"asdfasdf" .
	 ?c	cs:sidechain	?side .
	 ?c2	cs:sidechain	?side .
	 ?c2	cs:chemical	?like }

# Limit by toxicity in the (SQL) MouseToxicity experiments.
GRAPH db:MouseToxicity.prop {
         ?t	cs:chemical	?like .		# bound to ?cs.cs:sidechain
	 ?t	mt:toxicity	?ld
         FILTER (?ld < .35) }
}

The Process

FeDeRate dispatches each of the sub-queries in turn, in the order they are expressed in the query (see Re-writing Queries for rewriting alternatives). The first is dispatched to SqlDB, which uses configuration information in the properties file to compile the sub-query into an SQL query:

SELECT STRAIGHT_JOIN
       MicroArray_0.id AS g_id,
       MicroArray_0.name AS name_name,
Kinase_0.id AS kinase_id,
       Chemicals_0.id AS agin_id,
       Chemicals_0.chemical AS chemical_chemical

FROM MicroArray AS MicroArray_0
     INNER JOIN Kinase AS Kinase_0 ON MicroArray_0.experiment=Kinase_0.id
     INNER JOIN Chemicals AS Chemicals_0 ON Kinase_0.against=Chemicals_0.id
WHERE MicroArray_0.expression="up"
GROUP BY g_id,name_name,kinase_id,agin_id,chemical_chemical

SqlDB records the variables associated with each select and executes the query. Additional fields are selected and returned. These are for creating a URI to identify each tuple in the solution. Primary or unique keys are used to compose a URI that can be decomposed to extract the tuple in question. For example, the URI identifies the tuple in the Chemicals table with the id 3001.

g_idname_namekinase_idagin_idchemical_chemical
3001rq23 2001 1001lhrh-agonist
... ... .... ......

Iterating Across an Intermediate Result Set

Results from this query are used to constrain subsequent queries. A new query is created for each result, using the variable bindings from that result (see Multi-Binding Queries for thoughts on making this more efficient). The next term in the example requires local unification (it reads the data from an RDF file). The section after that (screening assay data) is handled by SqlDB, which generates a set of queries like:

SELECT STRAIGHT_JOIN
       ScreeningAssay_0.id AS a_id,
       ScreeningAssay_0.chemical AS chemical_chemical,
ScreeningAssay_0.name AS saProt_name,
ScreeningAssay_0.kd50 AS kd50_kd50

FROM ScreeningAssay AS ScreeningAssay_0
WHERE ScreeningAssay_0.name="KinaseAssay" 
  AND  ((ScreeningAssay_0.kd50>=".7") OR (ScreeningAssay_0.kd50<".2")) 
  AND  ScreeningAssay_0.chemical="lhrh-agonist"
GROUP BY a_id,chemical_chemical,saProt_name,kd50_kd50

This process is repeated for each subgraph in the query.

As we saw above with the expression="up" constraint, literal values in the graph are incorporated into the SQL constraints. The same is true of scalar constraints interspersed in the graph, For instance, ?kd50 >= .7 || ?kd50 < .2 is expressed in SQL as ((ScreeningAssay_0.kd50>=".7") OR (ScreeningAssay_0.kd50<".2")).

Optimizations

Currently SqlDB optimizes for subgraph query efficiency by constructing a single SQL query for the subgraph, moving the unification as close to the data as possible. A few areas for optimization remain.

SPARQLfed extensions to the SPARQL language allow a query to communicate all of the current relevent bindings.

Multi-binding Queries

FeDeRate keeps variable bindings in a structure called a ResultSet. SqlDB currently traverses the set of solutions in a ResultSet and generates a SQL query for each, constraining the query with the variable bindings found in that solution. This will be optimized in the future by again shifting the unification to the database by creating more complex queries with the compile constraints from a set of solutions. These constraints can be embedded in the query so that the results can be traversed and associated with the solution or solutions that they match and extend.

Re-writing Queries

A query planner with intimate knowledge of the data will be able to intelligently order the queries so that they start with a compromise between the smallest set of results and the least expensive to query. This query planning can re-order the sub-queries in the query. The Algae [Algae] query language is usefull for this as the product of the optimization may again be expressed in Algae, allowing us to completely separate that step from query execution.

References

[SqlDB] Optimized RDF Access to Relational Databases, Eric Prud'hommeaux (See http://www.w3.org/2004/04/30-RDF-RDB-access/ .)
[Algae] Algae RDF Query Language, Eric Prud'hommeaux (See http://www.w3.org/2004/05/06-Algae/ .)

$Date: 2011/03/28 12:56:54 $