Named Graphs in Semantic Aggregators

This document uses Algae to illustrate the query language requirements of a semantic aggregator (a "semantic Google"). I expect these requirements to be pretty much a superset of what people need when they query other aggregation databases.

Many RDQL derivatives offered access to quads. The implementations usually filled the fourth column with the source of the triple. This document explores the practicality of addressing such data with SPARQL. Hopefully it answers a few questions, or at least guides us to productive questions, but I make no promises.

Algae Named Graphs

All graph operations in Algae take an optional graph name as a parameter.

assert foo:graph1 (
  foo:node1 foo:p1 foo:node2 .
  foo:node2 foo:p2 foo:node3 )
read <file:./bar.rdf> foo:graph1 (inputLang="rdf")
ask foo:graph1 (
  ?n1 foo:p1 ?n2 .
  ?n2 ?p     ?n3 )

In addition, queries allow constraints on provenance URIs. The provenance URI associated with any statement has, so far, been an http: or file: URL where the document could be retrieved. The provenance is identified by the statement property attribution (%ATTRIB in the query below).

ask (
  ?n1 foo:p1 ?n2 {%ATTRIB == <http://example.com/bar.rdf>} .
  ?n2 ?p     ?n3 )

The provenance for any assertion and the database where that assertion was found are entirely orthogonal. The data in bar.rdf could be replicated to any database and the above query would work. Annotea uses provenance constraints for trust and document management.
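As a rough illustration of this orthogonality, the Python sketch below models a database as a plain set of quads whose fourth column is the provenance. The names Quad, db_a, db_b and objects_of are hypothetical and are not part of Algae; the sketch only shows that the same provenance-constrained match gives the same answers in whichever database holds a replica of the data.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    s: str
    p: str
    o: str
    attrib: str  # provenance: the document the statement came from


BAR = "http://example.com/bar.rdf"

# Two independent (hypothetical) databases. db_b holds a replica of the
# statements from bar.rdf plus one statement from another document.
db_a = {
    Quad("foo:node1", "foo:p1", "foo:node2", BAR),
    Quad("foo:node2", "foo:p2", "foo:node3", BAR),
}
db_b = set(db_a) | {Quad("foo:node1", "foo:p1", "foo:node9", "file:./other.rdf")}


def objects_of(db, predicate, attrib=None):
    """Objects of `predicate`, optionally limited to statements from `attrib`."""
    return {q.o for q in db
            if q.p == predicate and (attrib is None or q.attrib == attrib)}


# The provenance constraint is independent of the database being asked.
print(objects_of(db_a, "foo:p1", BAR))
print(objects_of(db_b, "foo:p1", BAR))
```

Both calls print the same set, even though an unconstrained match on db_b would also pick up the statement from other.rdf.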

SPARQL offers a GRAPH operator that constrains either the provenance or the queried database. However, SPARQL graph names are tied to provenance by the FROM and FROM NAMED clauses. The database may provide named graphs any way it chooses, but if it maintains FROM compatibility with other SPARQL implementations, it must:

If the database merges graphs by default, it will have to take queries that have only FROM NAMED and turn those into constraints on the fourth column.

SELECT ?o
  FROM NAMED foo:graph1
 WHERE { GRAPH foo:graph1 { ?s ?p ?o } } 

turns into an Algae query:

ask ( ?s ?p ?o {%ATTRIB = foo:graph1} )
select (?o)

Some oddness arises in the following dance:

SELECT ?o
 WHERE { ?s esoteric:predicate ?o } 

returns no graph matches (perhaps because the predicate is so esoteric). The user resorts to specifically importing the relevant data:

SELECT ?o
  FROM NAMED esoteric:graph1
 WHERE { ?s esoteric:predicate ?o } 

This imports the data into the database, and makes a quad-constrained query to get only the data that (just) came from esoteric:graph1. The user then queries again:

SELECT ?o
 WHERE { ?s esoteric:predicate ?o } 

and finds answers, meaning that the aggregator actually imported that data into the default graph as well as the named graph esoteric:graph1. Perhaps the query service could implement FROM NAMED (with no corresponding FROM) by reading, querying, and deleting those triples. This seems a little counter to the philosophy of an aggregator.
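The two behaviours can be compared in a toy model. The Python sketch below assumes an aggregator whose default graph is simply every quad it has loaded; the Aggregator class, its methods, and the REMOTE table standing in for the web are all hypothetical. It replays the dance above, then the read, query, delete alternative.

```python
# Hypothetical stand-in for dereferencing a graph name on the web.
REMOTE = {"esoteric:graph1": [("ex:a", "esoteric:predicate", "ex:b")]}


class Aggregator:
    """Toy quad store whose default graph is the merge of everything loaded."""

    def __init__(self):
        self.quads = set()  # (subject, predicate, object, source)

    def load(self, source):
        """Import a graph; return only the quads that were not already held."""
        added = {(s, p, o, source) for s, p, o in REMOTE[source]}
        new = added - self.quads
        self.quads |= added
        return new

    def select_o(self, predicate, source=None):
        """Objects of `predicate`; `source` constrains the fourth column."""
        return sorted(o for s, p, o, src in self.quads
                      if p == predicate and source in (None, src))

    def select_o_transient(self, predicate, source):
        """FROM NAMED as read, query, delete: nothing is left behind."""
        new = self.load(source)
        try:
            return self.select_o(predicate, source)
        finally:
            self.quads -= new


# The dance: query, import through FROM NAMED, query again.
agg = Aggregator()
before = agg.select_o("esoteric:predicate")
agg.load("esoteric:graph1")
named = agg.select_o("esoteric:predicate", "esoteric:graph1")
after = agg.select_o("esoteric:predicate")  # the import leaked into the default graph

# The alternative: the import does not outlive the query.
clean = Aggregator()
transient = clean.select_o_transient("esoteric:predicate", "esoteric:graph1")
after_transient = clean.select_o("esoteric:predicate")

print(before, named, after, transient, after_transient)
```

In this model the transient variant answers the FROM NAMED query without changing later answers, at the cost of throwing away data that an aggregator would normally want to keep.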

Expressivity Mismatch

The obvious expressivity mismatch is in replicated data. A query can apply the same provenance constraint to two different databases and change only the database name. Named graphs have a scheme for nesting named graphs. The canonical example is "Bob said 'Joan said "The moon is made of green cheese"'." I haven't worked out whether this can be used to express provenance-constrained queries in a database. It seems likely.

Requirements on SPARQL

In order to model SPARQL named graphs in a conventional quad store, every triple in a named graph must also be allowed to appear in the default graph. In effect, the system must be able to ignore the differences between FROM and FROM NAMED.


Eric Prud'hommeaux, W3C Team Contact for the RDF Data Access Working Group <eric@w3.org>
$Date: 2005/06/02 07:39:17 $