Comments on last-call SPARQL draft 20050721, sections 3 onwards

[Apologies for being late with these, but I'm hoping better late than 
never...]

Reviewing:
   http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050721/

Overview:  I find the specification (or what I think it says) to be 
generally sound and sensible, but I see a number of areas where the 
explanations seem less clear than they might be.

I think this will be a very important specification for a range of RDF 
users and developers, so I think making it as clear as possible is a 
goal worth pursuing.

...

General, definitions:

I am finding the "Definitions" given in the text are less helpful than I 
feel they should be.  I discern two main reasons for this:

(a) although couched in a kind of formal language, they don't seem to be 
constructed with the rigour I would associate with such language.  The 
definitions seem to be incomplete and/or ambiguous (or open to different 
interpretation), so the expected benefit of formality is not being 
realized.  In the notes below, I pick out some problems I have identified.

(b) it's not easy to find definitions.  My (printed) copy of the 
document contains no collected list of definitions, even though the 
table of contents and change log indicate this should be present.  (ToC 
has this between the references and the change log.)

(If I had the time, I'd like to try coding up the formal definitions in 
Haskell, which I think would quickly flush out any problems, but I don't 
see me having time in the next month.)

...

General, presentation of concepts:

I have the feeling that this document has been drafted by people who 
have experience of constructing query implementations (I know Eric and 
Andy have), and that some of the important concepts and ideas are 
conveyed implicitly rather than explicitly, and hence are not fully 
explained for a person approaching this topic afresh.  I have tried to 
point out such cases where I see them, but having myself implemented 
RDF query systems I may easily have overlooked others.

An example of this might be section 8.3 (restriction by bound 
variables): I think I understand what is being described based on my own 
past experience, but I can't tell if I would otherwise be able to do so.

(I appreciate this comment doesn't readily admit a specific response, 
and I don't expect one, but by mentioning it maybe I can help sensitize 
people to some possible issues.)

...

General, prefixes in IRI results:

I think there is an awkward tension between theoretical requirements for 
correct application functioning and practical usability issues in the 
way that IRIs are returned in query results.  In theory, all that is 
needed is the IRI, but most SWeb applications I have seen go to some 
lengths to preserve the prefixes used in the original data so that 
human-readable qname values can be reconstructed.

As far as I can tell (also looking at 
http://www.w3.org/TR/2005/WD-rdf-sparql-XMLres-20050801/), there is no 
provision for returning prefixes.  I think that practical considerations 
suggest that there should be an optional mechanism for query processors 
to return prefix information with variable binding results.
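
Purely by way of illustration (this is my own sketch, not anything in 
the current drafts): an optional element along these lines in the result 
document header might be enough, where the <prefix> element is 
hypothetical:

[[
<head>
   <variable name="x"/>
   <variable name="name"/>
   <!-- hypothetical element, not in the current result format: -->
   <prefix name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</head>
]]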

...

Section 3.1, "Matching integers" (and nearby) [editorial]:

"The pattern in the following query has a solution :x ..." is not 
explicit that it refers to a solution when matched against the preceding 
data.  An immediate fix would be to add "in the above RDF data" after 
":x", but maybe a more comprehensive approach would be to add a brief 
paragraph, just after the sample data, along the lines of:

[[
This RDF data is the target for query examples in the following sections.
]]

...

Section 3.3, Boolean [editorial nit]:

It is my understanding (and my dictionary agrees) that "Boolean" in 
prose text should be capitalized, being named after Boole.

...

Section 3.3, definition [editorial]:

I found the last part of this definition hard to follow.  I suggest 
something like: "For value constraint C, a solution S matches C if S(C) 
is true, where S(C) is the Boolean-valued expression obtained by 
substituting the values bound in S for the variables mentioned in C."
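
By way of illustration (my own example, not from the document): if C is 
the constraint (?age > 20) and a solution S binds ?age to 23, then S(C) 
is (23 > 20), which is true, so S matches C.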

...

Section 3.3, error conditions [functionality query]:

Has the full impact of the stated handling of errors been considered in 
depth?  While I think this is probably OK, I have a niggling concern 
that some classes of errors may prove difficult to catch in this way. 
Certainly, an "error condition" that is caused by unanticipated values 
in the target graph should be handled as described here, but in other 
cases, when the error is clearly in the way the query has been 
constructed, it would be acceptable to simply return a failure.  For 
example, a regex filter containing an invalid regex.
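
Concretely (my own example, using the foaf vocabulary from elsewhere in 
the document), the constraint in the following query can never be 
evaluated successfully, whatever the target data:

[[
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE  { ?x foaf:name ?name .
         FILTER regex(?name, "(unclosed") }
]]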

My concern here, I think, is that it is not clear how broadly the term 
"error condition" should be interpreted.

...

Section 4: 1st bullet, "Basic Graph Patterns" [editorial]

I think this should be cross-referenced to section 2.5.

I note the phrase is hyperlinked (or assume so, as it is underlined), 
but as I am reviewing a paper copy of the document, I have no idea where 
the hyperlink actually leads.

...

Section 4: 2nd bullet, "Group Pattern" [unclear]

I found the phrasing "must all match" was insufficient.  Suggest 
something like: "where each of a set of graph patterns must match using 
the same variable substitution".
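
For example (my own illustration, using the foaf vocabulary from the 
document's examples), in the group pattern

[[
{ ?x foaf:name ?name .
  ?x foaf:mbox ?mbox }
]]

both triple patterns must be matched with the same binding for ?x for 
the group as a whole to match.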

...

Section 4, general [editorial]:

There seems to be a good deal of overlap between this section and 
section 2.5, with maybe some muddling of the concepts (notably "basic 
graph pattern" and "group graph pattern" seem to be somewhat tangled). 
For specification purposes, I think it would be easier to treat a "basic 
graph pattern" as a group of "triple patterns".

Thus, I think that merging sections 2.5 and 4 could create a simpler, 
easier-to-follow description with less scope for misinterpretation.

It seems strange that the start of section 4 contains a bulleted list of 
topics that are described in sections 2.5, 4, 5 and 6.  So I would 
expand my previous suggestion to propose a single section covering all 
of these, starting with the list of the various patterns described.  A 
preceding section could deal with matching of single triples, literals, 
bnodes, etc.

...

Section 4.1, "For any solution ..." [editorial]:

I found this paragraph potentially confusing, being an example of the 
muddle I allude to in the preceding comment.

...

Section 5.1, para 1 [query correctness]:

"... OPTIONAL keyword applied to a graph pattern."  Should this be "... 
applied to a group pattern"?  I ask this because section 4.1 indicates 
braces as introducing a group pattern.

...

Section 5.1, example [incomplete spec]:

What happens if the triple
   _:a  foaf:mbox <mailto:alice@work.example> .
is added to the example data?

I think this should lead to two solutions that bind "name" to "Alice", 
but that's not clear to me from the description here.
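
That is, I would expect the results for the section 5.1 query to include 
two rows for Alice, one per mbox value, along these lines (my own 
sketch; the other row carrying whatever mbox the existing data already 
gives for Alice):

[[
name      mbox
"Alice"   <mailto:alice@work.example>
"Alice"   (Alice's existing mbox from the original data)
]]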

...

Section 5.4, formal definition [error?]:

I think this formal definition may be wrong or incomplete.

Preamble:  it refers to "S is a solution", but I see no definition of 
"solution".  (Section 2.4 has "Pattern Solution" and "Query Solution". 
I'm guessing the latter is meant.)

Consider the example data:
[[
_:a  rdf:type        foaf:Person .
_:a  foaf:name       "Alice" .
_:a  foaf:mbox       <mailto:alice@work.example> .
]]

and the query pattern from section 5.1:
[[
WHERE  { ?x foaf:name  ?name .
          OPTIONAL { ?x  foaf:mbox  ?mbox }
        }
]]
This is an instance of OPT(A,B), where:
A = { ?x foaf:name ?name }
B = { ?x foaf:mbox ?mbox }

The substitution:
   [ x/_:a, name/"Alice", mbox/<mailto:alice@work.example> ]
is a solution for both A and B, hence is a solution for OPT(A,B).

But also consider the substitution:
   [ x/_:a, name/"Alice", mbox/<mailto:alice@home.example> ]
This is a solution for A but is not a solution for A and B; hence, 
according to the definition given, it is a solution for OPT(A,B).

This means that the solution set should include:
   [ x/_:a, name/"Alice", mbox/<mailto:alice@work.example> ]
   [ x/_:a, name/"Alice", mbox/<mailto:alice@home.example> ]
and any other possible substitution for mbox, which is clearly not what 
is intended.

...

Section 5.5, 1st para [editorial]:

I think this is confusing, or does not quite make sense, as the inner 
optional pattern is (syntactically) a part of the optional outer 
pattern.  Thus it might be expected that a match of the outer pattern 
must also match the inner pattern.

Suggest:
[[
Optional patterns can occur inside any group graph pattern, including a 
group graph pattern which itself is optional, forming a nested pattern. 
Any non-optional part of the outer optional graph pattern must be 
matched if any variable bindings from the nested optional pattern are 
returned.  Thus, for a nested optional pattern OPT(A,OPT(B,C)), B and 
possibly C are matched only when A is matched.
]]
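
In concrete syntax, I am thinking of something like the following (my 
own example; foaf:homepage playing the role of C):

[[
{ ?x foaf:name ?name .
  OPTIONAL { ?x foaf:mbox ?mbox .
             OPTIONAL { ?x foaf:homepage ?hpage } } }
]]

Here A is the foaf:name triple pattern, B the foaf:mbox pattern and C 
the foaf:homepage pattern;  a binding for ?hpage should be returned only 
when the foaf:mbox pattern (and the foaf:name pattern) also matches.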

...

Section 6 [editorial]:

I think it might be helpful to include a test case that shows that:

   OPT(A,B)
and
   UNION(A,{A B})

are *not* equivalent.
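
For example (my own sketch, as I understand the intended semantics): 
given the data

[[
_:a  foaf:name  "Alice" .
_:a  foaf:mbox  <mailto:alice@work.example> .
]]

with A = { ?x foaf:name ?name } and B = { ?x foaf:mbox ?mbox }, I would 
expect OPT(A,B) to yield just the single solution

    [ x/_:a, name/"Alice", mbox/<mailto:alice@work.example> ]

whereas UNION(A,{A B}) would additionally yield

    [ x/_:a, name/"Alice" ]

with mbox unbound.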

...

Section 6.2 [incomplete spec]:

The formal definition does not explain what the results of matching a 
union pattern are.

...

Section 7, "Definition of RDF Dataset Graph Pattern" [incomplete spec]:

This definition doesn't actually tell me what an "RDF Dataset Graph 
Pattern" is.

...

Section 7, [clarification]:

From reading this, I think that the pattern
    GRAPH ?g { (pat) }
does not match if (pat) is matched only in the default graph.  Is this 
what is intended?  I think a brief explanation and test case would be in 
order here.
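
For example (my own sketch): given a dataset whose default graph contains

[[
_:a  foaf:name  "Alice" .
]]

and which has no named graphs, I would expect the pattern

[[
GRAPH ?g { ?x foaf:name ?name }
]]

to yield no solutions.  Confirming (or correcting) that reading in the 
text, with a corresponding test case, would help.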

...

Section 7.1 [superfluous content]:

I think the following text is superfluous and serves no useful purpose 
over the examples given.
[[
Two useful arrangements are:

     * to have information in the default graph that includes provenance 
information about the named graphs
     * to include the information in the named graphs in the default 
graph as well.
]]

Suggest: remove this.

...

Section 7.1, example 2 [clarification]:

I'm not sure what is meant by "contain the same information as before". 
I think it should be "contain the same triples as before".

...

Section 8, general [clarification]:

Following section 7, I'm not sure if this section adds anything other 
than explanatory content.  If there is any additional normative content 
here, I think it should be highlighted.  If it is purely explanatory, 
then I think it would be better as subsection(s) of section 7, with the 
text tweaked to show that it follows from what has been specified (e.g. 
a subsection headed "Examples of Dataset Queries").

...

Section 9, general [grumble]:

I still feel (as I mentioned once previously) that the FROM clauses 
don't really belong in the query language, but in the protocol.  I think 
of a query as being something like a regex that stands alone, 
independently of the target data to which it is applied.

That said, I feel that the specification given is sufficiently flexible 
that it doesn't force implementations to do anything onerous (and might 
even be ignored if the Dataset is assembled by other means), so I won't 
complain too loudly.

...

Section 9.2, example [clarification?]:

Is the intent here that the default graph is empty, or unspecified?

...

Section 10.1, "The effect of applying ..." [clarification]:

I feel the text here only implicitly indicates that more than one 
solution sequence modifier can be applied.  Also, I think a 
cross-reference to the syntax terms showing how multiple modifiers may 
be included would be helpful.

Does this paragraph refer to the order in which the modifiers are given 
above in the document text, or in the query itself?
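
For instance (my own example), I take it that a query such as

[[
SELECT ?name
WHERE  { ?x foaf:name ?name }
ORDER BY ?name
LIMIT 10
OFFSET 10
]]

is permitted, and that it means: order the solutions by ?name, skip the 
first 10, and return at most the next 10.  An explicit statement, with a 
cross-reference to the relevant grammar productions, would settle this.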

...

Section 10.1, "Order by", ordering of IRIs [clarification?]:

I'm wondering if anything needs to be said about the ordering of IRIs 
that use different combining forms (cf section 2.1, and my comment in a 
previous message).  It seems life would be easier, in theory at least, 
if IRI ordering were based on a normalized form, so that, e.g., 
different combining forms don't lead to effectively equivalent IRIs 
having different ordering.  (For example, an IRI containing the single 
code point U+00E9 ("é") and an otherwise identical IRI using U+0065 
followed by the combining acute U+0301 display identically, but a simple 
code-point ordering could place them at different positions in the 
results.)

I see this is an awkward topic, and I don't feel I know the right answer.

...

Section 10.2, References, result format [clarification]:

Is http://www.w3.org/TR/rdf-sparql-XMLres/ missing from the normative 
references?  It is linked directly from section 10.2, and appears in the 
informative references.  Maybe it should be normative because it is 
needed to fully implement a processor for the SPARQL query 
specification, per section 10.2.

Hmmm... Reading more closely the text in 10.2 ("Result sets can be 
accessed..."), I think there is some confusion (maybe on my part) about 
what the spec is describing:  a query language? a query protocol? a 
query API?

Taking a cue from the specification title, I'd say the current 
specification has it about right, but that the text in section 10.2 
should maybe be a little bit more explicit about what is not being 
described;  e.g. replace the 2 paras from "Results can be thought of as 
..." with:

[[
This specification does not define exactly how such results are 
returned, as a query may be used in different contexts (e.g. query 
protocol, query API) for which different forms are appropriate.

Results can be thought of as a table with one row per query solution, 
and a column for each variable in the query. Some cells may be empty 
because a variable is not bound in that particular solution.

The SPARQL Query Results XML Format [ref] form of the above result set 
gives:
]]

...

Section 10.3, para 2 [clarification]:

The reference to "a warning may be generated" has me wondering how it is 
expected such a warning might be returned.  Does the protocol spec have 
a means to return (a) warning(s) along with query results?

...

Section 10.4, general [grumble]:

I don't see why DESCRIBE is included in this specification, since what 
it returns is so vague as to defy any prospect of interoperability.

I think it would be better to provide an extensibility mechanism for 
additional result formats, which could be used by applications wishing 
to use DESCRIBE functionality without having to hopelessly overload a 
single query language element.  If a common resource description format 
should be developed in future, it could then become a standardized 
extension to the query language.

...

Section 11, para 1 [editorial]:

I found the introductory description of value testing awkward and 
convoluted, with its focus on "effective Boolean values" and the 
subsequent, separate discussion of type errors.  The whole area of 
handling type errors (which I think is a slight misnomer, as in SPARQL 
terms they're not really errors, just mismatches with expected 
results) seems to add an unnecessary layer of complexity to the 
description of filters.

I think it would be easier to follow a description formulated in terms 
of "satisfying" a value test, going on to explain when expressions 
containing type mismatches may or may not be satisfied.  Maybe 
introduce an "undefined" value that propagates through expressions in a 
predictable fashion, which I think would lead to a simpler and more 
complete explanation of how type mismatches affect query results.

(Also: section 11.2)
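
For example (my own illustration):  given a solution that binds ?v to 
the plain literal "abc", a constraint such as

[[
FILTER ( ?v > 5 )
]]

involves a type mismatch;  under the formulation I suggest, that 
solution simply fails to satisfy the constraint, rather than raising an 
"error" that then has to be accounted for separately.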

...

Section 11.1 [typo]:

s/constituant/constituent/

...

Sect 11.2.3.1, "known to have the same value" [clarification]:

The discussion of sop:RDFterm-Equal seems somewhat ambiguous, since it 
depends upon how much the query processor knows about the datatypes 
concerned.  What happens if a datatype is used that the query processor 
doesn't know how to test for equivalent values?
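
For example (my own sketch, using a hypothetical datatype 
ex:temperature):  if ?r1 is bound to "25.0"^^ex:temperature and ?r2 to 
"+25.0"^^ex:temperature, what should

[[
FILTER ( ?r1 = ?r2 )
]]

yield for a processor that does not understand ex:temperature?  A 
processor that does understand the datatype might recognize the values 
as equal.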

Reading this, I'm reminded of the introduction of D-interpretations 
(datatyped interpretations) in the RDF semantics specification.

...

Section 11.2.3.6, sop:regex, general [implementation concern]:

I have mixed feelings about the inclusion of this function, as it seems 
to impose a non-trivial complication on SPARQL processor implementations 
in environments that don't already include conforming REGEX 
functionality.  Is this really essential to a significant majority of 
applications?

...

That's about it.  I hope it helps, and apologies again for being late 
with my comments.

#g

-- 
Graham Klyne
For email:
http://www.ninebynine.org/#Contact
