major technical: runaway queries from Fred Zemke on 2006-01-12 (public-rdf-dawg-comments@w3.org from January 2006)

From: Fred Zemke <fred.zemke@oracle.com>
Date: Thu, 12 Jan 2006 13:37:34 -0800
To: public-rdf-dawg-comments@w3.org
Message-ID: <43C6CC1E.6040504@oracle.com>
The set of RDF terms is unbounded, including all RDF literals, whether
or not they appear in the graph.  In addition, the set of IRIs
may be an unbounded set (or perhaps you intend only those IRIs that
appear in the graph; this should be clarified).

Given that the set of RDF terms is unbounded,
what is the result if a variable in the SELECT list is completely
unconstrained by the WHERE clause?  For example
SELECT ?x WHERE { FILTER (?x = ?x) }
It seems that the answer to this query must consist of all IRIs (possibly
just all IRIs in the default graph), all RDF literals and all
blank nodes in the default graph. 

It's probably an undecidable problem
to characterize all such runaway queries (equivalent to the halting
problem?).  Perhaps we can expect that the user will not
be interested in formulating such queries.  However, they may still
arise accidentally, for example, through typos, as in this query:
SELECT ?name WHERE { [] v:isnamed ?nam }.  And, from an
implementation standpoint, it would be desirable if the definition of
a solution could be restricted so that all solutions in some sense
begin with triples in the graph, to avoid the theoretical need to enumerate
all conceivable IRIs and RDF literals.

Provisionally I propose the concept of "known RDF term" consisting of
all IRIs in the graph, all RDF literals in the graph, and all blank
nodes in the graph.  Using this notion, a pattern solution would be
a mapping of variables and SPARQL blank nodes to known RDF terms.

However, this would introduce a different problem.  Consider the
query: tell me the price plus tax of the items for sale in the graph.
Currently this can be written
SELECT ?z WHERE { [] v:price ?x . [] v:tax ?y . FILTER ( ?z = ?x + ?y ) }
If we restrict the range of variables to just values already in
the graph, it is quite possible that there will be a price plus tax
that is not a value in the graph. 

To solve this, I see three possible directions:

1. a LET capability as in XQuery.  The preceding query might then be
written SELECT ?z WHERE { [] v:price ?x . [] v:tax ?y . LET ?z = ?x + ?y }

2. An implicit LET capability that allows values to become known
RDF terms, using the FILTER clause as in the example above,
with adequate syntactic definitions to make this precise and sound.
Specifying such syntactic requirements is messy, but basically
it might be that there is an equality predicate with a single variable
on the left hand side, the equality predicate is the sole contents of
the FILTER clause or is only conjoined
using && to other predicates, and the variable on the left hand side
appears in only one such predicate.  Additional rules regarding
placement relative to UNION and OPTIONAL may also be necessary.

3. permit expressions in the SELECT list.  In that case the query can
be formulated SELECT ?x + ?y WHERE { [] v:price ?x . [] v:tax ?y }

Of the three alternatives, I think users will find the third the
most convenient.  My second choice would be a LET clause.  I think
implicit LET may be too complicated to specify correctly.

Finally, coming back to the problem of runaway queries caused by typos,
this could be ameliorated by either a warning or perhaps an outright
error if the query result involves an unbound variable.

Fred Zemke
Received on Thursday, 12 January 2006 21:37:41 UTC