Feature:BasicFederatedQuery

From SPARQL Working Group
Revision as of 14:09, 4 August 2009 by SSchenk (Talk | contribs)



Feature: Basic Federated Query

Federated query is the ability to take a query and provide solutions based on information from many different sources.

A building block is the ability for one query to issue a query against another SPARQL endpoint during query execution.


Feature description

Federated query is the ability to take a query and provide solutions based on information from many different sources. It is a hard problem in its most general form and is the subject of continuing research.

A building block is the ability for one query to issue a query against another SPARQL endpoint during query execution.

Different approaches are possible. One is to assume top-to-bottom order, so that if a variable is bound above the invocation of the remote source, it is passed as a parameter. This is, however, a departure from SPARQL's declarative semantics and scope rules. It would be better if the federated subquery left the implementation all latitude to decide on join order, which would be more in the non-procedural spirit of SPARQL.
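The two approaches can be contrasted with a small sketch. It assumes that in-memory lists of solution mappings stand in for the local store and the remote endpoint; the names `remote_eval`, `bound_join`, `independent_join` and the sample data are hypothetical and not taken from any SPARQL implementation. For this data both strategies yield the same solutions, which is the property that lets a declarative formulation leave the join order to the implementation.

```python
# Toy "remote endpoint": a list of solution mappings (variable -> value).
REMOTE = [
    {"title": "Dune", "author": "Frank Herbert"},
    {"title": "Emma", "author": "Jane Austen"},
    {"title": "Ulysses", "author": "James Joyce"},
]

# Toy local solutions for the pattern that precedes the remote invocation.
LOCAL = [{"title": "Dune"}, {"title": "Emma"}]


def compatible(a, b):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())


def remote_eval(bindings=None):
    """Return the remote solutions compatible with the pre-bound variables."""
    bindings = bindings or {}
    return [row for row in REMOTE if compatible(row, bindings)]


def bound_join(local):
    """Top-to-bottom: one remote call per local solution, bindings passed in."""
    return [{**sol, **row} for sol in local for row in remote_eval(sol)]


def independent_join(local):
    """Order-free: evaluate the remote pattern once, then join locally."""
    remote = remote_eval()
    return [{**sol, **row} for row in remote for sol in local
            if compatible(sol, row)]


def as_set(solutions):
    """Order-insensitive view of a sequence of solution mappings."""
    return {tuple(sorted(s.items())) for s in solutions}
```

`bound_join` issues one remote call per local solution, while `independent_join` issues a single call and joins afterwards; which is cheaper depends on the data, not on the query text.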

Ideally, federating queries should be transparent. With SPARQL, this comes at a high cost, especially if any of multiple endpoints may in principle match any of the triple patterns in a query. This means that information about the colocation of data within one endpoint is hard to express. For this purpose, labeling a subquery in a SPARQL query with "SERVICE" means that the subquery should be sent as a whole, without attempts to divide it into smaller fragments, and that two SERVICE group patterns or subqueries should not be merged into a single request. Nevertheless, these rules are not absolute, because an optimizer may prove that some pattern is empty or produces a subset of the result of another subquery.
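As a rough illustration of sending a SERVICE pattern as a whole, the sketch below turns one group pattern into a single SPARQL Protocol GET request using the protocol's `query` parameter. The helper name `service_request_url` and the choice to wrap the pattern in `SELECT *` are assumptions made for this example; an implementation may project only the variables it needs, or use POST instead.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def service_request_url(endpoint, group_pattern, prefixes=None):
    """Build one GET URL that ships a SERVICE group pattern unsplit.

    Assumes the endpoint IRI carries no query string of its own.
    """
    prologue = "".join(
        f"PREFIX {name}: <{iri}>\n" for name, iri in (prefixes or {}).items()
    )
    # The whole pattern goes into a single query; it is never fragmented.
    query = f"{prologue}SELECT * WHERE {{ {group_pattern} }}"
    return f"{endpoint}?{urlencode({'query': query})}"


url = service_request_url(
    "http://sparql.org/books",
    "?s dc:title ?title . ?s dc:creator ?a",
    {"dc": "http://purl.org/dc/elements/1.1/"},
)

# Decoding the URL recovers the query text that the remote endpoint would see.
sent_query = parse_qs(urlsplit(url).query)["query"][0]
```

Two SERVICE patterns would each go through this helper separately, giving two requests even when they name the same endpoint.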

Open Questions

  • Allow SERVICE without argument?
  • Allow nesting of SERVICE?
  • Need to preserve order?

Example

Look in the local database of my books and find the author of each one, as given by a web-accessible SPARQL query service.

PREFIX : <http://example/>
PREFIX  dc:     <http://purl.org/dc/elements/1.1/>
SELECT ?a
FROM <mybooks.rdf>
{
  ?b dc:title ?title .
  SERVICE <http://sparql.org/books>
    { ?s dc:title ?title . ?s dc:creator ?a }
}

Return the people whom Orri knows at both semanticweb.com and myopenlink.com:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
select ?contact1 where {
  SERVICE <http://www.semanticweb.com/sparql>
    { select ?contact1 where { ?me foaf:nick "Orri" . ?me foaf:knows ?f . ?f foaf:name ?contact1 } }
  SERVICE <http://www.myopenlink.com/sparql>
    { select ?contact2 where { ?me foaf:nick "Orri" . ?me foaf:knows ?f . ?f foaf:name ?contact2 } }
  filter (?contact1 = ?contact2)
}

or, as an equivalent query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
select ?contact where {
  SERVICE <http://www.semanticweb.com/sparql>
    { ?me1 foaf:nick "Orri" ; foaf:knows ?f1 . ?f1 foaf:name ?contact }
  SERVICE <http://www.myopenlink.com/sparql>
    { ?me2 foaf:nick "Orri" ; foaf:knows ?f2 . ?f2 foaf:name ?contact }
}

It is clear from the use case that it makes little sense to look for contacts of Orri's IRI at myopenlink.com in the repository at semanticweb.com if both sites are known to assign different person URIs. Without the explicit partitioning of the patterns, it would be difficult to infer this, so a single group graph pattern like

select ?contact
FROM SERVICE <http://www.semanticweb.com/sparql>
FROM SERVICE <http://www.myopenlink.com/sparql>
where {
    ?me1 foaf:nick "Orri" ; foaf:knows ?f1 . ?f1 foaf:name ?contact .
    ?me2 foaf:nick "Orri" ; foaf:knows ?f2 . ?f2 foaf:name ?contact .
 }

is definitely less efficient.

The query optimizer would have all latitude to determine which site has fewer contacts and put it first in a loop join, to retrieve both in parallel and do a hash intersection, or to choose any other possible execution plan. The plans' relative merit depends entirely on the expected cardinalities and access latencies of the sites.
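The trade-off between these plans can be sketched with a deliberately crude cost model. Everything here is an assumption made for illustration: the function names, the `per_row` transfer cost, and the formulas themselves (one round trip per outer row for a loop join, two parallel bulk fetches for a hash intersection). A real optimizer would use its own statistics.

```python
def hash_intersection(left, right):
    """Fetch both sides, hash the smaller one, probe it with the larger."""
    small, large = (left, right) if len(left) <= len(right) else (right, left)
    seen = set(small)
    return sorted({name for name in large if name in seen})


def loop_join(outer, probe):
    """Walk the outer site's contacts and ask the other site about each."""
    return sorted({name for name in outer if probe(name)})


def choose_plan(card_a, card_b, latency_a, latency_b, per_row=0.001):
    """Pick the cheapest plan from estimated cardinalities and latencies."""
    # Loop join: fetch the outer side, then one round trip per outer row.
    loop_a = latency_a + card_a * per_row + card_a * latency_b
    loop_b = latency_b + card_b * per_row + card_b * latency_a
    # Hash intersection: both bulk fetches run in parallel.
    hashed = max(latency_a + card_a * per_row, latency_b + card_b * per_row)
    costs = {
        "loop join, A outer": loop_a,
        "loop join, B outer": loop_b,
        "hash intersection": hashed,
    }
    return min(costs, key=costs.get)
```

Under this model a loop join only pays off when one site has very few contacts and the other has very many; with comparable cardinalities the per-row round trips dominate and the parallel fetch is cheaper.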

Existing Implementation(s)

  • ARQ.
  • RDF::Query.
  • Virtuoso contains internals of the SPARQL compiler that allow parts of the original query to be printed as SPARQL requests for remote web service endpoints. The complete implementation is suspended, because SPARQL protocol extensions are required for reasonable scalability.
  • DistributedSPARQL allows certain graphs to be defined as remote. GRAPH <remotegraph> {...} is then evaluated as a remote subquery.

Existing Specification / Documentation

SPARQL_Federation in Virtuoso.

Basic Federated Query in ARQ.

Compatibility

This has no effect on existing valid SPARQL/2008 queries.

Links to postponed Issues

Service Description

Related Features

SPARQL Protocol changes for parameters.

SPARQL Protocol Query Parameters.

Champions

Use cases

See example.

References

Querying Distributed RDF Data Sources with SPARQL.