comments on Section 1 and Section 2 of SPARQL Query Language for RDF from Peter F. Patel-Schneider on 2006-02-22 (public-rdf-dawg-comments@w3.org from February 2006)

From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
Date: Wed, 22 Feb 2006 18:56:54 -0500 (EST)
To: public-rdf-dawg-comments@w3.org
Message-Id: <20060222.185654.133907622.pfps@research.bell-labs.com>

Comments on Section 1 and Section 2 of

	SPARQL Query Language for RDF
	W3C Working Draft 20 February 2006
	http://www.w3.org/TR/2006/WD-rdf-sparql-query-20060220/


These are personal comments, from me, an interested expert.  They may not
reflect the views of any institution with which I am associated.


In general I found the first two sections of the document *very* hard to
understand.  The mixing of definitions, explanation, information, etc. confused
me over and over again.  I strongly suggest an organization something like:

  Introduction (informative)
  Formal development (normative)
    Underlying notions (normative)
    Patterns and matching (normative)
  SPARQL syntax (normative)
  Informal narrative (informative)
  Examples (informative)

I also found that things that didn't need to be explained were explained, and
things that did need to be explained were not explained.  A major example of
the latter is the role of the scoping graph.  Examples showing why E-matching
is defined the way it is would be particularly useful.


Because of the problems I see in Section 2, I do not feel that I can adequately
understand the remainder of the document.  

Because of these problems I do not feel that this document should be advanced
to the next stage in the W3C recommendation process without going through
another last-call stage.  (This could however be performed by terminating the
current last call, quickly fixing the document, and starting another last
call.)



Specific comments follow:

Section 1.

	An RDF graph is a set of triples; each triple consists of a
	<em>subject</em>, a <em>predicate</em> and an <em>object</em>. This is
	defined in RDF Concepts and Abstract Syntax.

C1.1: An unqualified "this" cannot be used at the beginning of the second sentence.

	The RDF graph may be virtual, in that it is not fully materialized,

C1.2: Defining virtual in terms of another term that is not itself defined is not
very useful.

	only doing the work needed for each query to execute.

C1.3: Who is doing what work here?

	SPARQL is a query language for getting information from such RDF
	graphs. 

C1.4: Surely a more formal tone is called for here.

	It provides facilities to:
	- extract information in the form of URIs, blank nodes, plain and typed
	literals.
	- extract RDF subgraphs.
	- construct new RDF graphs based on information in the queried graphs.

C1.5: I don't recognize the intent of SPARQL in any of these options.

	As a data access language, it is suitable for both local and remote
	use. 

C1.6: The "it" is rather too far from its referent.

	The companion SPARQL Protocol for RDF document describes the remote
	access protocol.

C1.7: What about the "local" access protocol?  Is there one?  If so, where is it?  If
not, why is there not one?

	<!-- Commented Document Outline -->

C1.8: There appear to be significant commented-out portions of the document.  Do
such parts of the document have any import?  If so, then they probably should
not be commented-out.  If not, then the commented-out portions should be
removed.


Section 2.

C2.15: In general, Section 2 switches modes much too much.  Which parts of
Section 2 are tutorial?  Which are definitional?  Which are explanatory?

	The SPARQL query language is based on matching graph patterns.

C2.1: What is a "matching graph pattern"?  I do not believe that it is defined
in the remainder of the document.  (Yes, yes, I know that the problem is
actually that the sentence itself is poorly constructed.)

	The simplest graph pattern is the triple pattern, which is like an RDF
	triple, but with the possibility of a variable instead of an RDF term
	in the subject, predicate or object positions.

C2.4: This should probably be stated more precisely, using, at least, "and/or".

	Combining triple gives a basic graph pattern, where an exact match to a
	graph is needed to fulfill a pattern.

C2.2: Probably "triple" should be "triples".

C2.3: I do not believe that this matches the intent of SPARQL queries.

	The example below shows a SPARQL query to find the title of a book from
	the information in the given RDF graph.

C2.5: The use of "the given" here is not helpful.  I feel that it would be better
to use an indefinite article instead.


	The terms delimited by "<>" are IRI references [...].  They stand for
	IRIs, either directly, or relative to a base IRI.

C2.6: What is a term?  Which terms?  What does "stand for" mean here?  What
role does the base IRI play in this "stand for" relationship?

C2.7: The rules for IRIs are not adequately specified in Section 2.1.1.  Are
the two abbreviated mechanisms enclosed in "<>"?  Can a prefix expand to a
relative IRI?
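
To make the question in C2.7 concrete, here is a minimal Python sketch of one
possible reading, in which a prefixed name is expanded and the result is then
resolved against the base IRI.  The function name, the base IRI, and the "bk"
prefix are hypothetical and chosen only for illustration; whether the draft
intends this behaviour is exactly what is unclear.

```python
from urllib.parse import urljoin

def expand(prefixed_name, prefixes, base):
    """Expand 'prefix:local', then resolve the result against the base IRI.

    Assumption: resolution against the base applies after prefix expansion.
    The draft does not say whether this second step is intended.
    """
    prefix, local = prefixed_name.split(":", 1)
    return urljoin(base, prefixes[prefix] + local)

BASE = "http://example.org/base/"                    # hypothetical base IRI
absolute = {"dc": "http://purl.org/dc/elements/1.1/"}
relative = {"bk": "books/"}                          # hypothetical relative prefix

print(expand("dc:title", absolute, BASE))
print(expand("bk:title", relative, BASE))
```

Under this reading an absolute prefix is unaffected by the base IRI, while a
relative prefix silently depends on it.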

	optional datatype IRI or prefixed name (introduced by ^^)

C2.8: Can this be a relative IRI?  Is it expanded using the rules of
Section 2.1.1?

	Variables in SPARQL queries have global scope; it is the same variable
	everywhere in the query that the same name is used

C2.9:  Wrong number agreement.

	Blank nodes are indicated by either the form _:a or use of [ ].

C2.10: Is _:a the *only* blank node allowed?  If not, which parts of these bits
of syntax can vary, and how?

	Triple Patterns are written as a list of subject, predicate, object; 

C2.11: The examples of triple patterns don't seem to be written this way.

	The following examples express the same query: 
	[several examples]
	Prefixes are syntactic: the prefix name does not affect the query, nor
	do prefix names in queries need to be the same prefixes as used in a
	serialization of the data. The following query is equivalent to the
	previous examples and will give the same results when applied to the
	same data:
	[one example]

C2.12: The first group of examples appears to exhibit more internal variability
than the single example adds.  Why, then, is the single example broken out?  Is
there something that I am missing here?


	The data format used in this document is

C2.13: What is the "data"?

C2.16: Section 2.1 claims to be about "Writing a Simple Query", but doesn't
seem to provide any guidance on this topic.

	2.2 Initial Definitions

C2.14: There appear to have been quite a number of definitions already.  How,
then, can this be an "initial" set of definitions?

	A query variable is a member of the set V where V is infinite and
	disjoint.

C2.20:  What is V?  Perhaps you mean V to be some arbitrary, but fixed set.

	Definition: Graph Pattern
	A Graph Pattern is one of:
	Basic Graph Pattern
	Group Graph Pattern
	Value Constraints
	Optional Graph Pattern
	Union Graph Pattern
	RDF Dataset Graph Pattern

C2.15: Are these all part of simple queries?  If not, what is this doing in
Section 2?  Ditto for the definition for SPARQL Query.

	Definition: SPARQL Query
	A SPARQL query is a tuple (GP, DS, SM, R) where:

C2.16: What, then, are the things in Section 2.1 that contain the SELECT
keyword?

	The following triple pattern has a subject variable (the variable
	book), a predicate dc:title and an object variable (the variable
	title).

	 ?book dc:title ?title .

C2.17: dc:title does not appear to be valid as any second element of a triple
pattern.

	Definition: Triple Pattern
	A triple pattern is member of the set:
	(RDF-T union V) x (I union V) x (RDF-T union V)

C2.18:  How is the syntax above (?book dc:title ?title .) mapped into this set?
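
The gap behind C2.18 can be sketched in Python.  The classes below are
illustrative stand-ins for the sets I, RDF-T and V, not anything defined by the
draft, and the dc namespace IRI is assumed to be the usual Dublin Core one.
The surface form only becomes a member of the set once the prefixed name has
been turned into an IRI by some step that the definition does not mention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    lexical: str

@dataclass(frozen=True)
class Var:
    name: str

RDF_T = (IRI, BlankNode, Literal)   # stand-in for the set of RDF terms

def is_triple_pattern(s, p, o):
    """Membership test for (RDF-T union V) x (I union V) x (RDF-T union V)."""
    return (isinstance(s, RDF_T + (Var,))
            and isinstance(p, (IRI, Var))
            and isinstance(o, RDF_T + (Var,)))

DC = "http://purl.org/dc/elements/1.1/"   # assumed expansion of the dc: prefix

# The surface syntax, with the prefixed name left as an uninterpreted string.
print(is_triple_pattern(Var("book"), "dc:title", Var("title")))
# The same pattern after an (unspecified) expansion step.
print(is_triple_pattern(Var("book"), IRI(DC + "title"), Var("title")))
# A literal subject is accepted, as the draft itself notes.
print(is_triple_pattern(Literal("x"), IRI(DC + "title"), Var("title")))
```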

	This definition of Triple Pattern includes literal subjects.
	[...]
	This definition also allows blank nodes in the predicate position.

C2.19:  The referent is too far away for this construction.

	Definition: Pattern Solution
	A variable solution is a substitution function from a subset of V, the
	set of variables, to the set of RDF terms, RDF-T.  
	A pattern solution, S, is a variable substitution whose domain includes
	all the variables in V and whose range is a subset of the set of RDF
	terms.  
	The result of replacing every member v of V in a graph pattern P by
	S(v) is written S(P).  
	If v is not in the domain of S then S(v) is defined to be v.

C2.21: I thought that V was the set of variables.  Why then write "all the
variables in V"?

C2.22: Given that the domain of S is all the variables in V, i.e., all the
variables, then what use is the last sentence of the above definition?
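
A small Python sketch of the substitution shows why C2.22 matters.  The
representation (triples as tuples of strings, variables written with a leading
"?") is an assumption made only for this example.  The "S(v) is defined to be
v" clause has an effect only when the domain of S is a proper subset of V,
which the preceding sentence of the definition appears to rule out.

```python
def apply_solution(solution, pattern):
    """Return S(P): replace each variable v by S(v), leaving v unchanged
    when v is not in the domain of S."""
    return [tuple(solution.get(term, term) for term in triple)
            for triple in pattern]

pattern = [("?book", "dc:title", "?title")]

# A substitution defined on only one of the two variables.
partial = {"?title": '"SPARQL Tutorial"'}
print(apply_solution(partial, pattern))

# A substitution defined on both variables: the fallback clause is never used.
total = {"?book": "ex:book1", "?title": '"SPARQL Tutorial"'}
print(apply_solution(total, pattern))
```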

	has a single triple pattern as the query pattern

C2.23:  What is the "query pattern" of a query?  Perhaps you mean the graph
pattern of the query?

	An E-entailment regime is a binary relation between subsets of RDF
	graphs.

C2.24: Perhaps you mean "between sets of RDF graphs"?

	Definition: Scoping Graph
	The Scoping Graph G' for RDF graph G, is an RDF Graph that is
	graph-equivalent to G

C2.25: FATAL: There can be many RDF graphs that are graph-equivalent to a
particular RDF graph.  Therefore the Scoping Graph is not adequately defined.
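
The point of C2.25 can be illustrated with a brute-force Python check of graph
equivalence, that is, the existence of a bijection on blank nodes that maps one
graph onto the other.  The string encoding of terms (blank nodes written with
a leading "_:") is an assumption for this sketch.  Every candidate below is
graph-equivalent to G, so any of them could serve as "the" scoping graph.

```python
from itertools import permutations

def is_bnode(term):
    return term.startswith("_:")

def bnodes(graph):
    return sorted({t for triple in graph for t in triple if is_bnode(t)})

def graph_equivalent(g1, g2):
    """Brute force: try every bijection from the blank nodes of g1 to those of g2."""
    b1, b2 = bnodes(g1), bnodes(g2)
    if len(b1) != len(b2) or len(g1) != len(g2):
        return False
    for perm in permutations(b2):
        rename = dict(zip(b1, perm))
        renamed = {tuple(rename.get(t, t) for t in triple) for triple in g1}
        if renamed == set(g2):
            return True
    return False

g = {("_:a", "ex:p", "_:b")}
candidates = [
    {("_:a", "ex:p", "_:b")},   # G itself
    {("_:x", "ex:p", "_:y")},   # fresh labels
    {("_:b", "ex:p", "_:a")},   # the same labels, swapped
]
print([graph_equivalent(g, c) for c in candidates])
```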

	The scoping graph makes the graph to be matched independent of the
	chosen blank node names.

C2.25a: Which chosen blank node names?  Why should this matter at all?  Aren't
the blank node names simply a notational convenience?

C2.25b: This needs to be proven.

	Definition: Basic Graph Pattern E-matching
	Given an entailment regime E, a basic graph pattern BGP, and RDF graph
	G, with scoping graph G', then BGP E-matches with pattern solution S on
	graph G with respect to scoping set B if:
        - BGP' is a basic graph pattern that is graph-equivalent to BGP
        - G' and BGP' do not share any blank node labels.
        - (G' union S(BGP')) is a well-formed RDF graph for E-entailment
        - G E-entails (G' union S(BGP'))
        - The RDF terms introduced by S all occur in B.

C2.26: Some of the elements of the bulleted list are missing punctuation.

C2.27: FATAL: The status of B is not adequately provided.  Is B a parameter of
E-matching or is it somehow determined by the other parameters?  

	These definitions allow for future extensions to SPARQL.

C2.28:  Which definitions?

	This document defines SPARQL for simple entailment and the scoping set
	B is the set of all RDF terms in G'.

C2.29:  SPARQL for simple entailment?  Probably you mean something like "This
document only defines the simple entailment version of SPARQL".

C2.30:  The second half of this sentence does not make any sense.  Perhaps you
mean something like "The simple entailment version of SPARQL (hereafter
SPARQL) is based on BGP E-matching where the entailment regime (E) is always
simple entailment and the scoping set (B) is always the set of RDF terms in
G'."

C2.31: FATAL: This still leaves SPARQL matching with the following parameters:
  1/ the graph pattern BGP
  2/ the RDF graph G
  3/ the scoping graph G' (which is not adequately defined)
  The problem with G' needs to be addressed.

	A pattern solution can then be defined as follows: to match a basic
	graph pattern under simple entailment, it is possible to proceed by
	finding a mapping from blank nodes and variables in the basic graph
	pattern to terms in the graph being matched; a pattern solution is then
	a mapping restricted to just the variables, possibly with blank nodes
	renamed. Moreover, a uniqueness property guarantees the
	interoperability between SPARQL systems: given a graph and a basic
	graph pattern, the set of all the pattern solutions is unique up to
	blank node renaming.

C2.32: Where is G' in this operation?

C2.33: It seems to me that SPARQL simple matching is entirely deterministic.
Given BGP, G, and G', the set of pattern solutions that make BGP match G with
scope G' is fixed.  I then don't understand the "unique up to blank node
renaming" above.

C2.34: If I am missing something here, and there indeed is something to be
shown, then it has to be proven.
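
The procedure quoted above can be sketched in Python as a naive matcher.  This
is one reading of the informal description, not the draft's definition: it
ignores G' and the scoping set B altogether, and the string encoding of terms
("?" for variables, "_:" for blank nodes) is an assumption.  For a fixed
labelled graph the set of solutions is fully determined; it changes only when
the blank nodes of the graph itself are relabelled.

```python
def is_open(term):
    """Variables and pattern blank nodes may both be mapped to graph terms."""
    return term.startswith("?") or term.startswith("_:")

def unify(pattern_triple, graph_triple, bindings):
    """Extend bindings so that pattern_triple maps onto graph_triple, or return None."""
    new = dict(bindings)
    for p_term, g_term in zip(pattern_triple, graph_triple):
        if is_open(p_term):
            if new.setdefault(p_term, g_term) != g_term:
                return None
        elif p_term != g_term:
            return None
    return new

def match_bgp(bgp, graph):
    """Return the set of solutions, each restricted to the variables."""
    partial = [{}]
    for pattern_triple in bgp:
        partial = [new
                   for bindings in partial
                   for graph_triple in graph
                   if (new := unify(pattern_triple, graph_triple, bindings)) is not None]
    return {frozenset((k, v) for k, v in b.items() if k.startswith("?"))
            for b in partial}

bgp = [("?x", "foaf:name", "?name")]
g1 = [("_:a", "foaf:name", '"Alice"'), ("_:b", "foaf:name", '"Bob"')]
g2 = [("_:p", "foaf:name", '"Alice"'), ("_:q", "foaf:name", '"Bob"')]  # g1 relabelled

for solution in sorted(match_bgp(bgp, g1), key=sorted):
    print(sorted(solution))
```

Running the matcher on g2 yields the same solutions with _:p and _:q in place
of _:a and _:b, which may be all that "unique up to blank node renaming" is
meant to say.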

	There is a blank node [..] in this dataset, identified by _:a.

C2.34:  What is "dataset"?

C2.35:  Are there not two blank nodes in this dataset?

	In the SPARQL syntax, Basic Graph Patterns are sequences of triple
	patterns mixed with value constraints.

C2.36:  Why not say something like "value constraints can be mixed in sequences
of triple patterns.  The triple patterns form a BGP."?

	The results of a query is

C2.37: Why not "The result"?

C2.39: I believe that it would be very useful to show the four matches
generated by the basic query pattern in Section 2.6 (as well as the two matches
for the BGP in Section 2.5.3).

	Blank nodes in the results of a query are identical to those occurring
	in the dataset graphs

C2.38: This is very misleading.  SPARQL matching does indeed restrict the bnodes
in query results to be bnodes from the RDF graph, but not in a useful way.  For
example,
  ?x ex:a ex:b .
matches against
  _:a ex:a _:b .
with two results for ?x, at least as far as I can determine.

C2.39: I believe that there are four matches for the BGP in Section 2.7.  Why
are only two results given?
Received on Wednesday, 22 February 2006 23:57:06 UTC