Re: Comments on last-call SPARQL draft 20050721, sections 3 onwards [OK?]

Graham Klyne wrote:
> [Apologies for being late with these, but I'm hoping better late than 
> never...]
> 
> Reviewing:
>    http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050721/
> 
> Overview:  I find the specification (or what I think it says) to be 
> generally sound and sensible, but I see a number of areas where the 
> explanations seem less clear than they might be.
> 
> I think this will be a very important specification for a range of RDF 
> users and developers, so I think making it as clear as possible is a 
> goal worth pursuing.
> 
> ...
> 
> General, definitions:
> 
> I am finding the "Definitions" given in the text are less helpful than I 
> feel they should be.  I discern two main reasons for this:
> 
> (a) although couched in a kind of formal language, they don't seem to be 
> constructed with the rigour I would associate with such language.  The 
> definitions seem to be incomplete and/or ambiguous (or open to different 
> interpretation), so the expected benefit of formality is not being 
> realized.  In the notes below, I pick out some problems I have identified.
> 
> (b) it's not easy to find definitions.  My (printed) copy of the 
> document contains no collected list of definitions, even though the 
> table of contents and change log indicate this should be present.  (ToC 
> has this between the references and the change log.)

There is a link to the collected definitions:
http://www.w3.org/2001/sw/DataAccess/rq23/defns.html

and a live transformation:
http://www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fwww.w3.org%2F2001%2Fsw%2FDataAccess%2Frq23%2Fdefns.xsl&xmlfile=http%3A%2F%2Fwww.w3.org%2F2001%2Fsw%2FDataAccess%2Frq23%2F&auth=proxy&transform=Submit

> 
> (If I had the time, I'd like to try coding up the formal definitions in 
> Haskell, which I think would quickly flush out any problems, but I don't 
> see me having time in the next month.)
> 
> ...
> 
> General, presentation of concepts:
> 
> I have the feeling that this document has been drafted by people who 
> have experience of constructing query implementations (I know Eric and 
> Andy have), and that some of the important concepts and ideas are made 
> implicitly rather than explicitly, and hence that some of the ideas are 
> not fully explained for a person approaching this topic afresh.  I have 
> tried to point out cases where I see them, but having myself implemented 
> RDF query systems I may easily have overlooked others.
> 
> An example of this might be section 8.3 (restriction by bound 
> variables): I think I understand what is being described based on my own 
> past experience, but I can't tell if I would otherwise be able to do so.
> 
> (I appreciate this comment doesn't readily admit a specific response, 
> and I don't expect one but, by mentioning it, maybe I can help sensitize 
> people to some possible issues.)
> 
> ...
> 
> General, prefixes in IRI results:
> 
> I think there is an awkward tension between theoretical requirements for 
> correct application functioning, and practical usability issues, in the way 
> that IRIs are returned in query results.  In theory, all that is needed 
> is the IRI, but most SWeb applications I have seen go to some lengths to 
> preserve the prefixes used in the original data so that human-readable 
> qname values can be reconstructed.
> 
> As far as I can tell (also looking at 
> http://www.w3.org/TR/2005/WD-rdf-sparql-XMLres-20050801/), there is no 
> provision for returning prefixes.  I think that practical considerations 
> suggest that there should be an optional mechanism for query processors 
> to return prefix information with variable binding results.

You're right - there is no provision for returning prefixes.  This was not
identified as a requirement during the requirements phase of the working
group.  Prefixes are a feature of the serialization.

Any query processor is free to provide this functionality and a parser can
extract them locally.
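
As a rough illustration of handling this on the client side, here is a minimal
Python sketch.  The prefix table and the to_qname helper are hypothetical (they
are not part of SPARQL or of the results format); the application is assumed to
keep its own prefix map.

```python
# Hypothetical client-side prefix table; query results carry only full IRIs.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def to_qname(iri, prefixes=PREFIXES):
    """Abbreviate an IRI using the longest matching namespace, if any.

    Assumption: the local part is not checked for qname validity.
    """
    best = None
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace) and len(iri) > len(namespace):
            if best is None or len(namespace) > len(best[1]):
                best = (prefix, namespace)
    if best is None:
        return "<" + iri + ">"  # no known prefix: keep the full IRI
    return best[0] + ":" + iri[len(best[1]):]

print(to_qname("http://xmlns.com/foaf/0.1/name"))
print(to_qname("http://example.org/other"))
```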

> 
> ...
> 
> Section 3.1, "Matching integers" (and nearby) [editorial]:
> 
> "The pattern in the following query has a solution :x ..." is not 
> explicit that it refers to a solution when matched against the preceding 
> data.  An immediate fix would be to add "in the above RDF data" after 
> ":x", but maybe a more comprehensive approach would be to add a brief 
> paragraph, just after the sample data, along the lines of:
> 
> [[
> This RDF data is the target for query examples in the following sections.
> ]]
> 

Added the second suggestion.

> ...
> 
> Section 3.3, Boolean [editorial nit]:
> 
> It is my understanding (and my dictionary agrees) that "Boolean" in 
> prose text should be capitalized, being named after Boole.
> 

It has entered common usage in computer science / IT, so I prefer to treat it
like "integer".

cf.
http://www.w3.org/TR/xmlschema-2/
http://en.wikipedia.org/wiki/Boolean


> ...
> 
> Section 3.3, definition [editorial]:
> 
> I found the last part of this definition was hard to follow.  I suggest 
> something like: "For value constraint C, a solution S matches C if S(C) 
> is true, where S(C) is the Boolean-valued expression obtained by 
> substitution of variables mentioned in C."

Changed to:
"""
For value constraint C, a solution S matches C if S(C) is true, where S(C) is
the boolean-valued expression obtained by substitution of the variables
mentioned in C.
"""

> 
> ...
> 
> Section 3.3, error conditions [functionality query]:
> 
> Has the full impact of the stated handling of errors been considered in 
> depth?  While I think this is probably OK, I have a niggling concern 
> that there may be some classes of errors that may prove difficult to 
> catch in this way.  For sure, I think that an "error condition" that is 
> caused by unanticipated values in the target graph should be handled as 
> described here, but in other cases, when the error is clearly in the way 
> the query has been constructed, it would be acceptable to simply return 
> a failure.  For example, a regex filter containing an invalid regex.
> 
> My concern here, I think, is that it is not clear how broadly the term 
> "error condition" should be interpreted.

Error handling is covered in detail in section 11 and has been fleshed out
since the LC document of July/August - many of the situations are directly a
consequence of using the definitions from XQuery/XPath Functions and Operators
(e.g. a regex filter containing an invalid regex).



> 
> ...
> 
> Section 4: 1st bullet, "Basic Graph Patterns" [editorial]
> 
> I think this should be cross-referenced to section 2.5.
> 
> I note the phrase is hyperlinked (or assume so, as it is underlined), 
> but as I am reviewing a paper copy of the document, I have no idea where 
> the hyperlink actually leads.

There is such a link - I added ", where a set of triple patterns must all
match" for symmetry with the other bullets.

> 
> ...
> 
> Section 4: 2nd bullet, "Group Pattern" [unclear]
> 
> I found the phrasing "must all match" was insufficient.  Suggest 
> something like: "where each of a set of graph patterns must match using 
> the same variable substitution".
> 

Added

> ...
> 
> Section 4, general [editorial]:
> 
> There seems to be a deal of overlap between this section and section 2.5, 
> with maybe some muddling of the concepts (notably "basic graph pattern" 
> and "Group graph pattern" seem to be somewhat tangled).  For 
> specification purposes, I think it would be easier to treat a "basic 
> graph pattern" as a group of "triple patterns".

The document introduces simple queries then builds on this.  Basic graph
patterns are the building blocks for queries, matched by simple entailment.
Now, single triple patterns are not defined as matching separately.

> 
> Thus, I think that merging sections 2.5 and 4 could create a simpler, 
> easier to follow description with less scope for misinterpretation.
> 
> It seems strange that the start of section 4 contains a bulleted list of 
> topics that are described in sections 2.5, 4, 5 and 6.  So I would 
> expand my previous suggestion to suggest a single section covering all of 
> these, starting with the list of various patterns described.  A 
> preceding section could deal with matching of single triples, literals, 
> bnodes, etc.

The document tries to start with simple queries and build from there to make
it accessible to people coming to SPARQL for the first time.

> 
> ...
> 
> Section 4.1, "For any solution ..." [editorial]:
> 
> I found this paragraph was potentially confusing, being an example of 
> the muddle I allude to in the preceding comment.
> 
> ...
> 
> Section 5.1, para 1 [query correctness]:
> 
> "... OPTIONAL keyword applied to a graph pattern."  Should this be "... 
> applied to a group pattern"?  I ask this because section 4.1 indicates 
> braces as introducing a group pattern.
> 

In the definitions, OPTIONAL can be applied to any pattern - the syntax
introduces the notion that it is always a group.

> ...
> 
> Section 5.1, example [incomplete spec]:
> 
> What happens if the triple
>    _:a  foaf:mbox <mailto:alice@work.example> .
> is added to the example data?
> 
> I think this should lead to two solutions that bind "name" to "Alice", 
> but that's not clear to me from the description here.
> 

Good idea - added.

> ...
> 
> Section 5.4, formal definition [error?]:
> 
> I think this formal definition may be wrong or incomplete.
> 
> Preamble:  it refers to a "S is a solution", but I see no definition of 
> solution. (Section 2.4 has "Pattern Solution" and "Query Solution".  I'm 
> guessing the latter is meant.)

added "pattern" to each unqualified "solution"

> 
> Consider the example data:
> [[
> _:a  rdf:type        foaf:Person .
> _:a  foaf:name       "Alice" .
> _:a  foaf:mbox       <mailto:alice@work.example> .
> ]]
> 
> and the query pattern from section 5.1:
> [[
> WHERE  { ?x foaf:name  ?name .
>           OPTIONAL { ?x  foaf:mbox  ?mbox }
>         }
> ]]
> This is an instance of OPT(A,B), where:
> A = { ?x foaf:name ?name }
> B = { ?x foaf:mbox ?mbox }
> 
> The substitution:
>    [ x/_:a, name/"Alice", mbox/<mailto:alice@work.example> ]
> is a solution for both A and B, hence is a solution for OPT(A,B).
> 
> But also consider the substitution:
>    [ x/_:a, name/"Alice", mbox/<mailto:alice@home.example> ]
> This is a solution for A but is not a solution for A and B, hence 
> according to the definition given it is a solution for OPT(A,B)
> 
> This means that the solution set should include:
>    [ x/_:a, name/"Alice", mbox/<mailto:alice@work.example> ]
>    [ x/_:a, name/"Alice", mbox/<mailto:alice@home.example> ]
> and any other possible substitution for mbox, which is clearly not what 
> is intended.


It is true that a whole range of spurious values for mbox meet the "doesn't
match" requirement for an OPTIONAL.  However, all solutions must be composed of
terms actually in the graph (this is a new part of the definition of pattern
solution).  A solution matches a number of parts of the graph pattern, so it
isn't possible to add a "don't match" criterion, as it would invalidate
matching elsewhere in the query pattern.
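
To make the intended reading concrete, here is a minimal Python sketch of one
way to evaluate OPT(A,B) against your example data: each solution for A is
extended by B where B also matches, and kept unchanged otherwise.  It is an
illustration only, not the text of the definition.  The representation (triples
as tuples, variables as strings starting with "?") and the helper names are my
own assumptions.

```python
# Graph: a set of (subject, predicate, object) tuples; variables start with "?".
GRAPH = {
    ("_:a", "rdf:type", "foaf:Person"),
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:mbox", "mailto:alice@work.example"),
}

def match_triple(pattern, graph, binding):
    """Yield extensions of `binding` that make `pattern` a triple in `graph`."""
    for triple in graph:
        new = dict(binding)
        ok = True
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                # Bind the variable, or check it agrees with an earlier binding.
                if new.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        if ok:
            yield new

def opt(a, b, graph):
    """OPT(A,B): each solution of A, extended by B where B also matches."""
    results = []
    for sol in match_triple(a, graph, {}):
        extended = list(match_triple(b, graph, sol))
        results.extend(extended if extended else [sol])
    return results

A = ("?x", "foaf:name", "?name")
B = ("?x", "foaf:mbox", "?mbox")
solutions = opt(A, B, GRAPH)
print(solutions)
```

Every value bound here is taken from a triple in the graph, so a substitution
such as mbox/<mailto:alice@home.example> never arises in this sketch.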

> 
> ...
> 
> Section 5.5, 1st para [editorial]:
> 
> I think this is confusing, or not making sense, as the inner optional 
> pattern is (syntactically) a part of the optional outer pattern.  Thus 
> it might be expected that a match of the outer pattern must also match 
> the inner pattern.
> 
> Suggest:
> [[
> Optional patterns can occur inside any group graph pattern, including a 
> group graph pattern which itself is optional, forming a nested pattern. 
> Any non-optional part of the outer optional graph pattern must be 
> matched if any variable bindings from the nested optional pattern are 
> returned.  Thus, for a nested optional pattern OPT(A,OPT(B,C)), B and 
> possibly C are matched only when A is matched.
> ]]
> 

Changed to:

"""
Optional patterns can occur inside any group graph pattern, including a group
graph pattern which itself is optional, forming a nested pattern. Any
non-optional part of the outer optional graph pattern must be matched if any
solution is given involving matching the nested optional pattern are returned.
"""
and the example already illustrates this.


> ...
> 
> Section 6 [editorial]:
> 
> I think it might be helpful to include a test case that shows that:
> 
>    OPT(A,B)
> and
>    UNION(A,{A B})
> 
> are *not* equivalent.
> 
I have added the test case:

http://www.w3.org/2001/sw/DataAccess/tests/data/Optional/q-opt-3.rq

Section 6.2 does say:
"""
Query results involving a pattern containing GP1 and GP2 will include separate
solutions for each match where GP1 and GP2 give rise to different sets of
bindings.
"""
and the separate solutions mean that they aren't equivalent.
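
A small Python sketch of the difference, under my own assumptions (triples as
tuples, variables as strings starting with "?", hypothetical helper names; it
is not the test case itself).  With data where both A and B match, OPT(A,B)
gives one solution but UNION(A,{A B}) gives two.

```python
GRAPH = {
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:mbox", "mailto:alice@work.example"),
}

def match(patterns, graph, binding=None):
    """Return all bindings under which every triple pattern is in the graph."""
    bindings = [dict(binding or {})]
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for triple in graph:
                new = dict(b)
                if all(
                    (new.setdefault(p, t) == t) if p.startswith("?") else p == t
                    for p, t in zip(pattern, triple)
                ):
                    next_bindings.append(new)
        bindings = next_bindings
    return bindings

def opt(a, b, graph):
    """OPT(A,B): solutions of A, extended by B where B also matches."""
    out = []
    for sol in match(a, graph):
        ext = match(b, graph, sol)
        out.extend(ext if ext else [sol])
    return out

def union(branches, graph):
    """UNION: the solutions of every branch, kept as separate solutions."""
    out = []
    for branch in branches:
        out.extend(match(branch, graph))
    return out

A = [("?x", "foaf:name", "?name")]
B = [("?x", "foaf:mbox", "?mbox")]

opt_solutions = opt(A, B, GRAPH)
union_solutions = union([A, A + B], GRAPH)
print(len(opt_solutions), len(union_solutions))
```

The extra UNION solution is the one that binds only x and name, which OPT
drops here because B also matches.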

      > Section 6.2 [incomplete spec]:
      >
      > The formal definition does not explain what are the results from
      > matching a union pattern.

The text says
"""
A union graph pattern matches a graph G with solution S if there is some GPi
such that GPi matches G with solution S.
"""

      > Section 7, "Definition of RDF Dataset Graph Pattern" [incomplete spec]:
      >
      > This definition doesn't actually tell me what a "RDF Dataset Graph
      > Pattern" is.

changed
"of GRAPH(g, P)"
to
"of RDF Dataset Graph Pattern GRAPH(g, P)"



> ...
> 
> Section 7, [clarification]:
> 
>  From reading this, I think that the pattern
>     GRAPH ?g { (pat) }
> does not match if (pat) is matched only in the default graph.  Is this 
> what is intended?  I think a brief explanation and test case would be in 
> order here.

Correct - see for example test case:
http://www.w3.org/2001/sw/DataAccess/tests/data/source-named/untrusted-graph-q1.rq
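
As a rough sketch of that behaviour in Python (the dataset layout, the graph
IRI and the helper names are assumptions for illustration, not taken from the
test case): GRAPH ?g only ever ranges over the named graphs, so a match found
only in the default graph gives no solution.

```python
# Hypothetical dataset: one default graph plus named graphs keyed by IRI.
DATASET = {
    "default": {("_:a", "foaf:name", "Alice")},
    "named": {
        "http://example.org/g1": {("_:b", "foaf:name", "Bob")},
    },
}

def match_triple(pattern, graph):
    """Return the bindings that make `pattern` equal to a triple in `graph`."""
    solutions = []
    for triple in graph:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if binding.setdefault(p, t) != t:
                    break
            elif p != t:
                break
        else:
            # No mismatch found: the whole triple matched.
            solutions.append(binding)
    return solutions

def graph_var(var, pattern, dataset):
    """GRAPH ?g { pattern }: try each named graph, never the default graph."""
    results = []
    for iri, graph in dataset["named"].items():
        for sol in match_triple(pattern, graph):
            # If the pattern itself bound ?g, it must agree with the graph IRI.
            if sol.get(var, iri) == iri:
                results.append({**sol, var: iri})
    return results

solutions = graph_var("?g", ("?x", "foaf:name", "?name"), DATASET)
print(solutions)
```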

> 
> ...
> 
> Section 7.1 [superfluous content]:
> 
> I think the following text is superfluous and serves no useful purpose 
> over the examples given.
> [[
> Two useful arrangements are:
> 
>      * to have information in the default graph that includes provenance 
> information about the named graphs
>      * to include the information in the named graphs in the default 
> graph as well.
> ]]
> 
> Suggest: remove this.

In discussions, we found that these two cases were the commonly wanted
situations so mentioning that in the document was thought to be helpful.

> 
> ...
> 
> Section 7.1, example 2 [clarification]:
> 
> I'm not sure what is meant by "contain the same information as before". 
>   I think it should be "contain the same triples as before".
> 

Changed s/information/triples/

> ...
> 
> Section 8, general [clarification]:
> 
> Following section 7, I'm not sure if this section adds anything other 
> than explanatory content.  If there is any additional normative content 
> here, I think it should be highlighted.  If it is purely explanatory, 
> then I think it would better be subsection(s) of sect 7, and the text 
> tweaked to show that it follows from what has been specified (e.g. a 
> subsection headed "Examples of Dataset Queries").
> 

It tries to split out what an RDF dataset is (with definitions), from
querying it.

> ...
> 
> Section 9, general [grumble]:
> 
> I still feel (as I mentioned once previously) that the FROM clauses 
> don't really belong in the query language, but in the protocol.  I think 
> of a query as being something like a regex that stands alone, 
> independently of the target data to which it is applied.
> 
> That said, I feel that the specification given is sufficiently flexible 
> that it doesn't force implementations to do anything onerous (and might 
> even be ignored if the Dataset is assembled by other means), so I won't 
> complain too loudly.
> 
> ...
> 
> Section 9.2, example [clarification?]:
> 
> Is the intent here that the default graph is empty, or unspecified?
> 

The example does not use the default graph (neither in the set nor in the query).

> ...
> 
> Section 10.1, "The effect of applying ..." [clarification]:
> 
> I feel the text here only implicitly indicates that more than one 
> solution sequence modifier can be applied.  Also, I think a cross 
> reference to the syntax terms showing how multiple modifiers may be 
> included would be helpful.
> 
> Does this paragraph refer to the order in which the modifiers are given 
> above in the document text, or in the query itself?

Changed to:

"applied in the order given by the list."

> 
> ...
> 
> Section 10.1, "Order by", ordering of IRIs [clarification?]:
> 
> I'm wondering if anything needs to be said about the ordering of IRIs 
> that use different combining forms (cf section 2.1, and my comment in a 
> previous message).  It seems life would be easier, in theory at least, 
> if IRI ordering were based on a normalized form, so that, e.g., 
> different combining forms don't lead to effectively equivalent IRIs 
> having different ordering.
> 
> I see this is an awkward topic, and I don't feel I know the right answer.
> 
> ...
> 
> Section 10.2, References, result format [clarification]:
> 
> Is http://www.w3.org/TR/rdf-sparql-XMLres/ missing from the normative 
> references?  It is linked directly from section 10.2, and appears in the 
> informative references.  Maybe it should be normative because it is 
> needed to fully implement a processor for the SPARQL query 
> specification, per section 10.2.
> 
> Hmmm... Reading more closely the text in 10.2 ("Result sets can be 
> accessed..."), I think there is some confusion (maybe on my part) about 
> what the spec is describing:  a query language? a query protocol? a 
> query API?


A SPARQL processor may be used for local queries:
        http://www.w3.org/TR/rdf-dawg-uc/#r3.5
and in that case the XML results format is not needed.


> 
> Taking a cue from the specification title, I'd say the current 
> specification has it about right, but that the text in section 10.2 
> should maybe be a little bit more explicit about what is not being 
> described;  e.g. replace the 2 paras from "Results can be thought of as 
> ..." with:
> 
> [[
> This specification does not define exactly how such results are 
> returned, as a query may be used in different contexts (e.g. query 
> protocol, query API) for which different forms are appropriate.

I changed the last para to:

"""
Result sets can be accessed by the local API but also can be serialized into
either XML or an RDF graph. An XML format is described in SPARQL Query Results
XML Format [ref], and this gives:
"""

> 
> Results can be thought of as a table with one row per query solution, 
> and a column for each variable in the query. Some cells may be empty 
> because a variable is not bound in that particular solution.
> 
> The SPARQL Query Results XML Format [ref] form of the above result set 
> gives:
> ]]
> 
> ...
> 
> Section 10.3, para 2 [clarification]:
> 
> The reference to "a warning may be generated" has me wondering how it is 
> expected such a warning might be returned.  Does the protocol spec have 
> a means to return (a) warning(s) along with query results?

This comment is also in your comments on the protocol.

http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Sep/0062
http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Sep/0065

> 
> ...
> 
> Section 10.4, general [grumble]:
> 
> I don't see why DESCRIBE is included in this specification, since what 
> it returns it's so vague as to defy any prospect of interoperability.
> 
> I think it would be better to provide an extensibility mechanism for 
> additional result formats, which could be used by applications wishing 
> to use DESCRIBE functionality without having to hopelessly overload a 
> single query language element.  If a common resource description format 
> should be developed in future, it could then become standardized 
> extension to the query language.
> 
> ...
> 
> Section 11, para 1 [editorial]:
> 
> I found the introductory description of value testing was awkward and 
> convoluted, with its focus on "effective Boolean values", and the 
> subsequent and separate discussion of type errors.  The whole area of 
> handling type errors (which I think is a slight misnomer, as in SPARQL 
> terms they're not really errors, just mismatches with predicted 
> results) seems to add an unnecessary layer of complexity to the 
> description of filters.

"effective Boolean values" comes from XQuery/XPath Functions and Operators.
There is a link to the reference for this in the document.

> 
> I think it would be easier to follow a description formulated in terms 
> of "satisfying" a value test, and go on to explain when expressions 
> containing type mismatches may or may not be satisfied.  Maybe, 
> introduce an "undefined" value that propagates through expressions in a 
> predicted fashion, which I think would lead to a simpler and more 
> complete explanation of how type mismatches affect query results.

There is now a truth table to explain what happens with errors and the
interaction with || and &&.
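
As an informal sketch of how I read that table (the ERROR marker and the
function names are assumptions for illustration, not spec terms): an error on
one side of || is masked when the other side is true, and an error on one side
of && is masked when the other side is false.

```python
class _Error:
    """Stand-in for a filter evaluation error (assumed representation)."""
    def __repr__(self):
        return "error"

ERROR = _Error()

def logical_or(left, right):
    """||: true if either side is true, even if the other side is an error."""
    if left is True or right is True:
        return True
    if left is ERROR or right is ERROR:
        return ERROR
    return False

def logical_and(left, right):
    """&&: false if either side is false, even if the other side is an error."""
    if left is False or right is False:
        return False
    if left is ERROR or right is ERROR:
        return ERROR
    return True

def keeps_solution(value):
    """A FILTER keeps a solution only when the constraint is true."""
    return value is True

for a in (True, False, ERROR):
    for b in (True, False, ERROR):
        print(a, b, logical_or(a, b), logical_and(a, b))
```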

> 
> (Also: section 11.2)
> 
> ...
> 
> Section 11.1 [typo]:
> 
> s/constituant/constituent/

No longer in the document.

> 
> ...
> 
> Sect 11.2.3.1, "known to have the same value" [clarification]:

Due to reorganization this is now "11.4.10 RDFterm-equal"
http://www.w3.org/2001/sw/DataAccess/rq23/#func-RDFterm-equal

> 
> The discussion of sop:RDFterm-Equal seems to have some ambiguity, since 
> it depends upon how much the query processor knows about the datatypes 
> concerned.  What happens if a datatype is used that the query processor 
> doesn't know how to test for equivalent values?

Yes - true - a query processor may know that the roman numeral 'X' is the
same as 10.  Following from additional semantic entailments a processor may
know that two terms are, in fact, value equal.

> 
> Reading this, I'm reminded of the introduction of D-interpretations 
> (datatyped interpretations) in the RDF semantics specification.
> 
> ...

The editor's draft has a short section that says that additional matching may
occur with D-entailment, as with any other additional semantic conditions.
http://www.w3.org/2001/sw/DataAccess/rq23/#matchDEntail

It links to:
http://www.w3.org/TR/rdf-mt/#defDinterp

> 
> Section 11.2.3.6, sop:regex, general [implementation concern]:
> 
> I've mixed feelings about inclusion of this function, as it seems to 
> place a non-trivial complication to SPARQL processor implementations in 
> environments that don't already include a conforming REGEX 
> functionality.  Is this really essential to a significant majority of 
> applications?

All I can say is that from my experience of what users want, with RDQL and
SPARQL, it is the single most used piece of FILTER functionality.

> 
> ...
> 
> That's about it.  I hope it helps, and apologies again for being late 
> with my comments.
> 
> #g
> 

If this message addresses the comments raised, please let us know.  (If you
respond with [CLOSED] in the subject line it will allow the issue tracking
scripts to close this issue.)

Thank you for your time, this detailed review and considered comments,

	Andy

Received on Wednesday, 4 January 2006 13:28:02 UTC