W3C

Federated SPARQL

CVS Version:
$Id: Overview.html,v 1.1 2007/05/26 18:37:07 eric Exp $
Author:
Eric Prud'hommeaux, W3C

Abstract

This document describes a SPARQL extension to optimize federated queries.

Status of this Document

This document describes experiments by the author. It is not endorsed by the W3C Team or Membership. It is hoped that the work described here will be pertinent to the life sciences work pursued by W3C.

Introduction

SPARQL queries are not confined by data source boundaries. Queries over distributed data often entail querying one source and using the acquired information to constrain queries of the next source. Without extension, SPARQL engines express this acquired knowledge by rewriting the federated query with bindings produced by earlier queries. This requires the query to be issued repeatedly, once for each putative solution. SPARQLfed bundles an intermediate result set with a SPARQL query, allowing the remote engine to locally join its data against the current constraints.

Use Case: FeDeRate for Drug Research

As an example, consider the five data sources listed in Case Study: FeDeRate for Drug Research. The initial GRAPH query

# Get a name and a chemical from the (SQL) MicroArray database.
GRAPH db:MicroArray.prop {
         ?g      ma:name         ?name .
         ?g      ma:expression   "up" .
         ?g      ma:experiment   ?kinase .
         ?kinase ma:against      ?agin .
         ?agin   cs:chemical     ?chemical }

is dispatched on the MicroArray database, producing an intermediate result set:

g    name    kinase    agin    chemical
g1   name1   kinase1   agin1   chemical1
g2   name2   kinase2   agin2   chemical2
g3   name3   kinase2   agin2   chemical3

The next GRAPH query

# The uniprot data (in RDF) has motif and pathway information.
GRAPH db:Uniprot.rdf {
         ?p	ma:name		?name .		# bound to ?ma.ma:name
	 ?p	up:motif	?motif .
	 ?p	up:pathway	"apoptosis" }

is constrained by the variable ?name (though it could easily be constrained by more of the variables introduced in the previous query). The SPARQL engine may either

  1. dispatch the next query unconstrained by the current results, or
  2. issue the query three times, with each of (name1, name2, name3) substituted for ?name:
    # The uniprot data (in RDF) has motif and pathway information.
    GRAPH db:Uniprot.rdf {
             ?p	ma:name		"name1" .		# bound to ?ma.ma:name
    	 ?p	up:motif	?motif .
    	 ?p	up:pathway	"apoptosis" }
    

Both approaches are inefficient, and the former can be disastrously avaricious, retrieving all of the remote data in a database.
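As a rough illustration of option 2, the following Python sketch rewrites the Uniprot query once per distinct value of ?name in the intermediate result set. The function names (as_literal, substitute, rewrite_per_binding) and the list-of-dicts representation of the result set are assumptions made for this example. They are not part of FeDeRate or of any SPARQL library, and the sketch handles plain string literals only.

```python
import re

# Intermediate result set from the MicroArray query, one dict per solution.
INTERMEDIATE = [
    {"g": "g1", "name": "name1", "kinase": "kinase1", "agin": "agin1", "chemical": "chemical1"},
    {"g": "g2", "name": "name2", "kinase": "kinase2", "agin": "agin2", "chemical": "chemical2"},
    {"g": "g3", "name": "name3", "kinase": "kinase2", "agin": "agin2", "chemical": "chemical3"},
]

# The next GRAPH query, still containing the unbound variable ?name.
UNIPROT_QUERY = '''GRAPH db:Uniprot.rdf {
    ?p ma:name    ?name .
    ?p up:motif   ?motif .
    ?p up:pathway "apoptosis" }'''


def as_literal(value):
    """Render a Python string as a double-quoted plain literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def substitute(query, var, value):
    """Replace every occurrence of ?var in the query text with a literal."""
    pattern = re.compile(r"\?" + re.escape(var) + r"\b")
    literal = as_literal(value)
    # A function replacement avoids backslash processing of the literal.
    return pattern.sub(lambda _match: literal, query)


def rewrite_per_binding(query, var, solutions):
    """Return one rewritten query per distinct value of var, in first-seen order."""
    values = []
    for solution in solutions:
        if solution[var] not in values:
            values.append(solution[var])
    return [substitute(query, var, value) for value in values]


queries = rewrite_per_binding(UNIPROT_QUERY, "name", INTERMEDIATE)
```

Each of the three strings in queries would be a separate request to the remote source, so the number of round trips grows with the size of the intermediate result set.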

SPARQLfed Extension

The SPARQLfed extension modifies the SPARQL grammar, adding a BindingClause to the WhereClause:

WhereClause   ::= ("WHERE")? GroupGraphPattern (BindingClause)?
BindingClause ::= "BINDINGS" (Var)+ "{" (Binding)* "}"
Binding       ::= "(" (VarOrTerm)+ ")"
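To make the grammar concrete, here is a minimal Python sketch of how a dispatching engine might serialize an intermediate result set into a BindingClause. The function names (format_term, binding_clause) and the list-of-dicts result representation are hypothetical, and the sketch assumes every term is a plain string literal; IRIs, typed literals and unbound values are not handled.

```python
# Intermediate result set from the MicroArray query, one dict per solution.
INTERMEDIATE = [
    {"g": "g1", "name": "name1", "kinase": "kinase1", "agin": "agin1", "chemical": "chemical1"},
    {"g": "g2", "name": "name2", "kinase": "kinase2", "agin": "agin2", "chemical": "chemical2"},
    {"g": "g3", "name": "name3", "kinase": "kinase2", "agin": "agin2", "chemical": "chemical3"},
]


def format_term(value):
    """Render a Python string as a double-quoted plain literal (assumption: literals only)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def binding_clause(variables, solutions):
    """Build: "BINDINGS" (Var)+ "{" (Binding)* "}" for the chosen variables."""
    rows = []
    for solution in solutions:
        row = tuple(solution[var] for var in variables)
        if row not in rows:  # send each distinct tuple only once
            rows.append(row)
    header = "BINDINGS " + " ".join("?" + var for var in variables)
    body = "\n".join(
        "  (" + " ".join(format_term(term) for term in row) + ")" for row in rows
    )
    return header + " {\n" + body + " }"


clause = binding_clause(["name"], INTERMEDIATE)
print(clause)
```

Projecting onto more variables, for example binding_clause(["kinase", "agin"], INTERMEDIATE), yields one parenthesized tuple per distinct combination, so the two rows that share kinase2 and agin2 collapse into a single Binding.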

The BindingClause enables the query engine to dispatch the above federation in one query:

# The uniprot data (in RDF) has motif and pathway information.
GRAPH db:Uniprot.rdf {
         ?p	ma:name		?name .		# bound to ?ma.ma:name
	 ?p	up:motif	?motif .
	 ?p	up:pathway	"apoptosis" }
 BINDINGS ?name { 
  ("name1")
  ("name2")
  ("name3") }
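On the receiving side, the remote engine can join its own solutions against the supplied bindings. The Python sketch below shows one way this might work, as a simple hash-based filter. It is not taken from FeDeRate or SPASQL: the function name join_with_bindings is hypothetical, the local solutions are invented for illustration, and the sketch assumes every binding variable is bound in each row and in each local solution.

```python
# Hypothetical solutions the Uniprot source found for its graph pattern
# before applying any constraint; the values are invented for illustration.
LOCAL_SOLUTIONS = [
    {"p": "p1", "name": "name1", "motif": "motifA"},
    {"p": "p2", "name": "name7", "motif": "motifB"},
    {"p": "p3", "name": "name3", "motif": "motifC"},
]

# The parsed BINDINGS clause from the query above: variables and rows.
BINDING_VARS = ["name"]
BINDING_ROWS = [("name1",), ("name2",), ("name3",)]


def join_with_bindings(solutions, variables, rows):
    """Keep the local solutions whose values for `variables` match some row."""
    allowed = set(rows)  # hash lookup instead of scanning every row
    return [
        solution
        for solution in solutions
        if tuple(solution[var] for var in variables) in allowed
    ]


constrained = join_with_bindings(LOCAL_SOLUTIONS, BINDING_VARS, BINDING_ROWS)
```

Here constrained keeps the solutions for p1 and p3 and drops p2, whose name does not appear in the bindings. A real engine would more likely push the allowed values into its index lookups than filter after the fact; the filter only shows the intended result.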

Implementations

SPARQLfed has been implemented in FeDeRate, and an implementation is underway in the SPASQL MySQL port.


$Date: 2007/05/26 18:37:07 $