SPARQL: SOURCE is suboptimal from Tim Berners-Lee on 2004-11-22 (public-rdf-dawg-comments@w3.org from November 2004)

From: Tim Berners-Lee <timbl@w3.org>
Date: Mon, 22 Nov 2004 12:15:39 -0500
To: public-rdf-dawg-comments@w3.org
Message-Id: <21AA7C98-3CAA-11D9-B85E-000A9580D8C0@w3.org>
Reading the draft of 2004-10-13

The current specification of SOURCE assumes a particular sort of 
application, which will not necessarily be more common than any other. 
As a result, SPARQL as a query language lacks the flexibility to do the 
general job of giving or querying metadata about the source of 
information.

SOURCE and FROM are muddled, and bite off part of a general question 
without solving it in general.

Behind the SOURCE feature is the implicit notion that the database 
being queried is a conjunction of graphs each corresponding to web 
resources.  The concept of the graph itself is not surfaced, but the 
URI of the graph is the thing bound to. Meanwhile, servers have the 
option of ignoring that structure and ignoring the binding of the 
SOURCE variable.   This seems to me fuzzy.

In fact, the database being queried may be generated in many ways, in 
particular a triple may have arisen from a combination of triples in 
different databases.

Random example 0:

foo.rdf:       mary  foaf:phone   1234.
bar.rdf:	    mary   owl:sameAs   maryJ

query includes

		 SOURCE  ?s  {  maryJ   foaf:phone   ?y }.

The natural result is to bind ?s to a bnode describing the virtual 
graph which was formed:


	<foo.rdf>    log:semantics    ?f.
	<bar.rdf>    log:semantics    ?g.
	( ?f ?g )    log:conjunction  ?h.
	?h           owldl:closure    ?s.
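
To make the steps above concrete, here is a minimal sketch in Python. It is 
not how any SPARQL service is specified to work. The triples-as-tuples 
model and the function names are assumptions made for illustration, and 
same_as_closure is only a small stand-in for owldl:closure that handles 
owl:sameAs substitution and nothing else.

```python
# Hypothetical in-memory model: a graph is a set of (subject, predicate, object) tuples.
FOO = {("mary", "foaf:phone", "1234")}    # contents of foo.rdf
BAR = {("mary", "owl:sameAs", "maryJ")}   # contents of bar.rdf


def conjunction(*graphs):
    """Merge several graphs into one (the log:conjunction step)."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def same_as_closure(graph):
    """Partial stand-in for owldl:closure: apply owl:sameAs substitution only."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        pairs = {(s, o) for s, p, o in closed if p == "owl:sameAs"}
        pairs |= {(o, s) for s, o in pairs}  # owl:sameAs is symmetric
        for a, b in pairs:
            for s, p, o in list(closed):
                new = (b if s == a else s, p, b if o == a else o)
                if new not in closed:
                    closed.add(new)
                    changed = True
    return closed


def phones(graph, who):
    """Answer the pattern { who foaf:phone ?y } against one graph."""
    return sorted(o for s, p, o in graph if s == who and p == "foaf:phone")


virtual = same_as_closure(conjunction(FOO, BAR))

# Neither document answers the query alone; only the derived graph does.
print(phones(FOO, "maryJ"))
print(phones(BAR, "maryJ"))
print(phones(virtual, "maryJ"))

# What ?s would describe: how the virtual graph was built, not a single URI.
derivation = ("owldl:closure", ("log:conjunction", ["foo.rdf", "bar.rdf"]))
```

The answer exists only in the derived graph, so no single document URI is 
an honest value for ?s; the derivation description is.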

There are a lot of combinations possible here of course, and many 
complex things which will happen in the future.

That sort of graph could be returned in the query.  It could also be 
sent with the query to describe what has to be done.  If you like, it 
is a clear RDF expression of the sort of thing which will otherwise get 
relegated to more and more complex non-RDF syntax or server command 
line out of band forms.


There is an assumption, in the SOURCE feature, that when multiple 
graphs exist, then  they are all believed.  This is IMHO a major and 
quite unnecessary flaw.  Many systems will need to be distrustful of 
most data.  So I'd like to be able to use the SOURCE feature, which 
overlaps with the FROM feature, so that *either* one is talking about 
explicitly mentioned resources as the source to be queried, *OR* there 
is a default knowledge base for the service.

When both are used, then the default KB can be a meta-kb which allows 
the kbs being processed to be constrained and defined.

The feature of returning NULL but continuing should be dropped. The 
whole idea of having things continue when a requested feature wasn't 
implemented is, I think, asking for interoperability problems.

One way to clean it up is to require that a SOURCE variable be bound 
elsewhere.  This would mean that the set of resources which are queried 
becomes explicit.
Otherwise we have added two implicit things to the SPARQL service -- 
the implicit set of sources and the implicit kb.

Random Example 1:

SELECT ?x ,...
WHERE
             ?y   roogle:search    "Mary".
             SOURCE ?y      {  ?x  firstName "Mary" ...

So the default KB is defined for this server to know about  
roogle:search which relates documents which contain strings to those 
strings.

Random Example 2:


SELECT ?x, ...
WHERE
       ?x    rdf:type QualifiedIndividual.
       ?x    address:countrycode "fr".
       ...
       ?x    foaf:personalProfile  ?p.
       SOURCE ?p       { ?x diet:preference  ?z   }
      ...

Here the main database is trusted. The mass of FOAF out there isn't. 
Just for one item, the query tests the person's personal profile to see 
whether they declare themselves a vegetarian.  The bulk of the query is 
on a trusted database, and by default only that database is trusted.  
This is an application where the idea that all known graphs are trusted 
by default breaks.
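
A rough Python sketch of that evaluation strategy follows, purely as 
illustration. The people, URIs, and the helper names objects and 
diet_preferences are all made up. The point is only that ?p is bound from 
the trusted KB first, and the untrusted web is consulted for exactly that 
document and exactly that pattern.

```python
# Hypothetical trusted default KB: a set of (subject, predicate, object) tuples.
TRUSTED = {
    ("alice", "rdf:type", "QualifiedIndividual"),
    ("alice", "address:countrycode", "fr"),
    ("alice", "foaf:personalProfile", "http://example.org/alice/foaf"),
    ("bob", "rdf:type", "QualifiedIndividual"),
    ("bob", "address:countrycode", "fr"),
    ("bob", "foaf:personalProfile", "http://example.org/bob/foaf"),
}

# Hypothetical untrusted documents, keyed by URI.
WEB = {
    "http://example.org/alice/foaf": {("alice", "diet:preference", "vegetarian")},
    "http://example.org/bob/foaf": {("bob", "diet:preference", "omnivore")},
    # A third party making claims about bob and about itself.
    "http://example.org/mallory/foaf": {
        ("bob", "diet:preference", "vegetarian"),
        ("mallory", "rdf:type", "QualifiedIndividual"),
    },
}


def objects(graph, subject, predicate):
    """All ?o such that (subject, predicate, ?o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def diet_preferences(trusted, web):
    """Evaluate the Random Example 2 pattern with SOURCE ?p bound from the trusted KB."""
    results = []
    people = sorted(
        s for s, p, o in trusted
        if p == "rdf:type" and o == "QualifiedIndividual"
    )
    for x in people:
        if "fr" not in objects(trusted, x, "address:countrycode"):
            continue
        # ?p comes only from the trusted KB, so the set of sources is explicit.
        for profile in sorted(objects(trusted, x, "foaf:personalProfile")):
            source = web.get(profile, set())
            for z in sorted(objects(source, x, "diet:preference")):
                results.append((x, z))
    return results


print(diet_preferences(TRUSTED, WEB))
```

In this sketch the third-party document is never read, so neither its 
claim about bob nor its self-declared qualification affects the result.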

Conclusion:

The current specification of SOURCE assumes a particular sort of 
application, which will not necessarily be more common than any other. 
As a result, SPARQL as a query language lacks the flexibility to do the 
general job of giving or querying metadata about the source of 
information.

A better solution is to use RDF graphs for the metadata in the query 
and/or in the returned information.
Received on Monday, 22 November 2004 17:15:43 UTC