CommentResponse:DB-5

From SPARQL Working Group
Revision as of 22:45, 19 December 2011 by Pgearon (Talk | contribs)


Hello David,

 Regarding
 http://www.w3.org/TR/2011/WD-sparql11-update-20110512/
 It's great to see these documents in Last Call!


Thank you for your comments. We have addressed your concerns below.

 Comments:
 
 1. Please either add capability for virtual graphs or keep the COPY, ADD
 and MOVE shortcuts, to enable standard SPARQL to be used more
 efficiently as a rules language and in data production pipelines.  COPY,
 ADD and MOVE operations cost almost nothing to implement, and they help
 with efficiency.  By "virtual graph" I mean a graph that consists of the
 merge of a particular set of named graphs -- a very important capability
 for efficient data production pipelines.

The features of COPY, ADD and MOVE were considered "At Risk" until the working group was confident that they could be implemented without undue difficulty. Now that we have some reports of successful implementation, the "At Risk" designation has been removed.
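As a rough illustration of why these shortcuts are cheap to implement, the following minimal sketch models a graph store as a Python dictionary from graph names to sets of triples. The model, the function names and the example graph names are assumptions made for this illustration only; they are not taken from the specification text, and error handling (for example SILENT, or a missing source graph) is left out.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
GraphStore = Dict[str, Set[Triple]]  # hypothetical model: graph name -> triples


def add(store: GraphStore, src: str, dst: str) -> None:
    """ADD: insert the triples of src into dst; dst keeps its existing triples."""
    store.setdefault(dst, set()).update(store[src])


def copy(store: GraphStore, src: str, dst: str) -> None:
    """COPY: replace the contents of dst with the triples of src; src is unchanged."""
    if src == dst:
        return  # copying a graph onto itself has no effect
    store[dst] = set(store[src])


def move(store: GraphStore, src: str, dst: str) -> None:
    """MOVE: like COPY, but src is removed afterwards."""
    if src == dst:
        return  # moving a graph onto itself has no effect
    store[dst] = set(store[src])
    del store[src]


if __name__ == "__main__":
    store: GraphStore = {"a": {("s", "p", "o")}, "b": {("x", "y", "z")}}
    add(store, "a", "b")   # b now holds both triples
    move(store, "a", "c")  # a is gone, c holds a's triple
    print(sorted(store))
```

Each operation reduces to a handful of set operations in this model, which matches the observation that the shortcuts add little implementation cost.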

The group feels that adding a feature like "virtual graphs" at this late stage of publication is not possible.

 2. This paragraph in sec 3.1.3 is a bit confusing:
 [[
 That is, the GroupGraphPattern in the WHERE clause will be matched
 against the dataset described by explicit USING or USING NAMED clauses,
 if specified, and against the graph store otherwise. Any graph name
 specified in a WITH clause will - for evaluating the WHERE clause -
 refer to the default graph to be used in the absence of USING or USING
 NAMED clauses. In the presence of one or more graphs referred to in
 USING clauses, the default graph will be the merge of these graphs,
 meaning that the graph in a WITH clause will be ignored while evaluating
 the WHERE clause. If there is no USING clause, but there is one or more
 USING NAMED clauses, then the dataset will include an empty graph for
 the default graph.
 ]]
 In particular, the sentence "Any graph name specified in a WITH clause
 will - for evaluating the WHERE clause - refer to the default graph to
 be used in the absence of USING or USING NAMED clauses." seems odd.  The
 graph specified in the WITH clause will refer to the *default* graph?  I
 would think it would be used *instead* of the default graph.  Isn't that
 the point of WITH?  Perhaps the term "default graph" is being used in an
 unusual way in this paragraph, to mean "the graph that will used in the
 absence of USING or USING NAMED"?  I think it would be misleading to
 call that a "default graph".  Normally the term "default graph" refers
 to the unnamed slot in a Graph Store, per the first paragraph in section
 2.  I think it would be best to use the term only in that way.

Unfortunately, the term "default graph" has two accepted meanings. The first is the graph that may be referred to without a name in a graph store (not necessarily an unnamed graph), while the second is the graph that a SPARQL WHERE clause is matched against when no GRAPH block has been specified. By default, these two are the same graph, but the latter can be changed to the merge of all graphs listed in FROM clauses in a query (USING clauses in an update), or by specifying a default-graph-uri parameter in the SPARQL protocol.

We have changed the text to the following to clarify the use of WITH:

"That is, the GroupGraphPattern in the WHERE clause will be matched against the dataset described by explicit USING or USING NAMED clauses, if specified, and against the default graph provided by the Graph Store otherwise.

The WITH clause provides a convenience for when an operation primarily refers to a single graph. If a graph name is specified in a WITH clause, then - for the purposes of evaluating the WHERE clause - this will define a dataset containing a default graph with the specified name, but only in the absence of USING or USING NAMED clauses. In the presence of one or more graphs referred to in USING clauses and/or USING NAMED clauses, the WITH clause will be ignored while evaluating the WHERE clause."
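The rule described in the revised text can be sketched as a small decision function. This is only an illustration under assumptions: the function name, the parameter names and the STORE_DEFAULT marker are invented here, and the sketch covers only the choice of default graph for the WHERE clause, not the named graphs of the dataset.

```python
from typing import List, Optional, Sequence

# Hypothetical marker standing for the default graph provided by the Graph Store.
STORE_DEFAULT = "<default graph of the Graph Store>"


def where_default_graph(
    with_graph: Optional[str],
    using: Sequence[str] = (),
    using_named: Sequence[str] = (),
) -> List[str]:
    """Return the graphs whose merge forms the default graph for the WHERE clause."""
    if using or using_named:
        # Any USING or USING NAMED clause overrides WITH. The default graph is
        # the merge of the USING graphs, which is empty if only USING NAMED
        # clauses are present.
        return list(using)
    if with_graph is not None:
        # No USING clauses of either kind: the WITH graph acts as the default graph.
        return [with_graph]
    # Neither WITH nor USING: fall back to the Graph Store's default graph.
    return [STORE_DEFAULT]


if __name__ == "__main__":
    print(where_default_graph("http://example/g1"))
    print(where_default_graph("http://example/g1", using=["http://example/g2"]))
    print(where_default_graph("http://example/g1", using_named=["http://example/g3"]))
```

Note that the WITH graph still applies to the DELETE and INSERT templates in every case; the function above only describes what the WHERE clause is matched against.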

 Part of the confusion may be related to the ambiguous use of the term
 "dataset".  For example, consider the sentence: "That is, the
 GroupGraphPattern in the WHERE clause will be matched against the
 dataset described by . . . ".  When I read this, I took the term
 "dataset" to mean:
 http://en.wikipedia.org/wiki/Data_set
 However, I am wondering if you actually meant "RDF Dataset" as defined
 here:
 http://www.w3.org/TR/sparql11-query/#rdfDataset
 If you meant the former, I suggest using the term "set of data", to
 avoid ambiguity.  If you meant the latter, I suggest using the term "RDF
 Dataset", and perhaps linking it to its definition.
 Also, I notice that:
 
 - There are many occurrences of the unqualified word "dataset".  I
 suggest checking them all, to see if they should be "RDF Dataset".

Existing documentation from SPARQL 1.0 already uses the term "dataset" as an abbreviation for "RDF dataset", so we do not feel that it is necessary to use the complete term on every occasion. However, we have expanded the term each time that a paragraph first uses it. Although a link to "Querying the Dataset" was already present in the preceding paragraph, we have also added the requested link.


 - Capitalization of the terms "RDF Dataset" and "Graph Store" is
 inconsistent -- sometimes written "RDF dataset" or "graph store".  It
 would help if it were consistently capitalized, as it helps the reader
 know that you are intending a specially defined term.

"RDF dataset" was capitalized consistently in the prose; however, it has been updated to use a capital "D" to help the reader realize that it is a formal term. The abbreviated term "dataset" remains unchanged. "Graph Store" has been updated.

 If I have understood the intent, it sounds like there are two sets of
 data involved in a DELETE/INSERT operation: one set is used in
 evaluating the WHERE clause, and the other is the target graph of the
 DELETE/INSERT, i.e., the graph that will be modified by the operation.
 If so, I think it would be helpful to state this up front, and make up a
 term for each of these sets, such as: "the set of data for the WHERE
 clause" and "the target graph".  Hmm, maybe the SPARQL 1.1 Query spec
 uses the term "active graph" for the former?
 http://www.w3.org/TR/sparql11-query/#rdfDataset 
 In any case, it would be helpful to define specific terms for these, and
 use them consistently.

The terms "RDF dataset" and "dataset" are now used in this text only for the data that the WHERE clause will be matched against. DELETE and INSERT may each refer to multiple graphs, making a term like "target graph" difficult to manage. We hope that the changes made to this section address some of the confusion raised here.

 Also, it may be clearer to reword this paragraph as a decision tree,
 since the logic that is being described is a bit complex for
 unstructured English prose: 
 
  If ___ then ___ . Otherwise, if ___ then ___ . Otherwise ___ .

The purpose of this section of text is to provide a description in prose. We hope that the changes have made the text clearer.

 3. In searching for the definition of the backslash "\" symbol in
 section 4.2, it looks like it is supposed to be set difference, but I do
 not see it listed in either of these tables of standard mathematical or
 logic symbols:
 http://en.wikipedia.org/wiki/List_of_mathematical_symbols 
 http://en.wikipedia.org/wiki/Table_of_logic_symbols
 However, I now see that that is because it is using a different unicode
 character, so a browser search did not find it:
 http://en.wikipedia.org/wiki/List_of_mathematical_symbols
 I suggest adding a brief note of clarification to section 4.2 stating
 that the backslash symbol ("\") indicates set difference.  Personally, I
 prefer the minus sign ("-") for set difference, though my tastes may be
 biased toward certain programming languages.

The character "\" has been replaced with the word "minus", and text has been provided to explain that this refers to "set difference".
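For readers who prefer a programming-language view, set difference can be shown with Python sets. The triples below are made up for this illustration, and treating a graph as a plain set of triples is a simplification of the formal model.

```python
# A graph modelled as a set of (subject, predicate, object) triples.
graph = {("s1", "p", "o"), ("s2", "p", "o")}

# Triples named by a delete; ("s3", "p", "o") is not in the graph.
to_delete = {("s2", "p", "o"), ("s3", "p", "o")}

# "graph minus to_delete": every triple of graph that is not in to_delete.
result = graph - to_delete
print(result)
```

Triples that are listed for deletion but absent from the graph simply have no effect on the result.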

 4. The difference between "USING" and "USING NAMED" is not explained,
 except in passing: "This describes a dataset in a manner similar to FROM
 and FROM NAMED clauses in the SPARQL1.1 Query Language."

We have replaced the phrase: "in a manner similar to FROM and FROM NAMED" with: "in the same way as FROM and FROM NAMED" and have provided a direct link to http://www.w3.org/TR/sparql11-query/#specifyingDataset

 5. As written, this in sec 3.1:
 http://www.w3.org/TR/sparql11-update/#graphUpdate
 [[
 Graph update operations change existing graphs in the Graph Store but do
 not explicitly delete nor create them. Non-empty inserts into
 non-existing graphs will, however, implicitly create those graphs, i.e.,
 an implementation *should* create graphs that do not exist before
 triples were inserted into them (there may be implementations providing
 an update service over a fixed set of graphs which in such case *must*
 return with failure for update requests that would create an unallowed
 graph), and *may* remove graphs that are left empty after triples are
 removed from them.
 ]]
 seems to say that an implementation that operates over a *variable*
 (non-fixed) set of graphs still has the option of not automatically
 creating graphs that do not exist.  
 
 I suggest rewording the above portion as:
 [[
 Graph update operations change existing graphs in the Graph Store but do
 not explicitly delete nor create them. Non-empty inserts into
 non-existing graphs will normally implicitly create those graphs, i.e.,
 an implementation fulfilling an update request *should* silently and
 automatically create graphs that do not exist before triples are
 inserted into them, and *must* return with failure if it fails to do so
 for any reason.  (For example, the implementation may have insufficient
 resources, or an implementation may only provide an update service over
 a fixed set of graphs.)  An implementation *may* remove graphs that are
 left empty after triples are removed from them.
 ]]

Done, with minor changes:

"Graph update operations change existing graphs in the Graph Store but do not explicitly delete nor create them. Non-empty inserts into non-existing graphs will, however, implicitly create those graphs, i.e., an implementation fulfilling an update request should silently and automatically create graphs that do not exist before triples are inserted into them, and must return with failure if it fails to do so for any reason. (For example, the implementation may have insufficient resources, or an implementation may only provide an update service over a fixed set of graphs and the implicitly created graph is not within this fixed set). An implementation may remove graphs that are left empty after triples are removed from them."
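The behaviour described in this paragraph can be sketched as follows. This is a simplified model under assumptions, not an implementation of any particular store: the class, the exception and the allowed_graphs parameter are hypothetical names, and this sketch chooses to remove empty graphs, which the text only permits ("may").

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class GraphCreationError(Exception):
    """Raised when an update would need a graph that the store cannot create."""


class GraphStore:
    def __init__(self, allowed_graphs: Optional[Iterable[str]] = None) -> None:
        # None: graphs may be created freely. Otherwise the set of graphs is fixed.
        self.allowed = None if allowed_graphs is None else set(allowed_graphs)
        self.graphs: Dict[str, Set[Triple]] = {}

    def insert(self, graph: str, triples: Iterable[Triple]) -> None:
        new_triples = set(triples)
        if not new_triples:
            return  # an empty insert never creates a graph
        if graph not in self.graphs:
            if self.allowed is not None and graph not in self.allowed:
                # The "must return with failure" case for a fixed set of graphs.
                raise GraphCreationError(graph)
            self.graphs[graph] = set()  # silent, implicit creation
        self.graphs[graph] |= new_triples

    def delete(self, graph: str, triples: Iterable[Triple]) -> None:
        if graph not in self.graphs:
            return
        self.graphs[graph] -= set(triples)
        if not self.graphs[graph]:
            del self.graphs[graph]  # optional removal of a graph left empty


if __name__ == "__main__":
    store = GraphStore(allowed_graphs={"http://example/g1"})
    store.insert("http://example/g1", [("s", "p", "o")])  # creates g1 implicitly
    try:
        store.insert("http://example/g2", [("s", "p", "o")])
    except GraphCreationError as err:
        print("update failed for graph", err)
```

A store constructed without allowed_graphs creates any graph on the first non-empty insert, which corresponds to the "should" case in the text.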

 6. Similarly, I suggest rewording the following in section 3.1.1:
 http://www.w3.org/TR/sparql11-update/#insertData 
 [[
 If no graph is described in the QuadData, then the default graph is
 presumed. If data is inserted into a graph that does not exist in the
 graph store, it *should* be created (there may be implementations
 providing an update service over a fixed set of graphs which in such
 case *must* return with failure for update requests that insert data
 into an unallowed graph).
 ]]
 to:
 [[
 If no graph is described in the QuadData, then the default graph is
 presumed.  If data is inserted into a graph that does not exist in the
 graph store, the update service SHOULD create that graph.  The service
 MUST return with failure if it fails to do so for any reason.
 ]]

Done, with minor modification. The text now reads as:

"The information how a graph store is accessed is defined in the protocol and graph store protocol specs. A graph store is accessible by either an update service (cf. protocol) or via the graph store protocol (cf. graph store protocol). In either case the graph store is hidden behind the service, making it accessible via the URI of a SPARQL update service or via a URI that responds to the graph store protocol."


 7. And similarly in section 3.1.3 I suggest changing:
 http://www.w3.org/TR/sparql11-update/#deleteInsert 
 [[
 If an operation tries to insert into a graph that does not exist, then
 the update service *should* create that graph.  The service MUST return
 with failure if it fails to do so for any reason.  If no data is to be
 inserted, then no graph will be created, even if applying the operation
 to a different dataset would result in data being inserted.
 ]]
 to: 
 [[
 If an operation tries to insert into a graph that does not exist, then
 that graph should be created; again, there may be implementations
 providing an update service over a fixed set of graphs which in such
 case must return with failure for update requests that would create an
 unallowed graph. If no data is to be inserted, then no graph will be
 created, even if applying the operation to a different dataset would
 result in data being inserted.
 ]]

Done.

 8. How is the URI of a Graph Store indicated?  The concept of a Graph
 Store is central to the SPARQL 1.1 Update spec, and hence one should be
 able to use a URI to refer to a particular Graph Store, but the spec
 doesn't say how this is done.
 
 The SPARQL 1.1 Service Description spec contains no sd:GraphStore
 class.  
 
 The SPARQL 1.1 Graph Store HTTP Protocol spec sometimes mentions a Graph
 Store, but does not make clear how the intended Graph Store is
 identified.  It does say: "A compliant implementation of this
 specification SHOULD accept HTTP requests directed at its Graph Store".
 But what if a service hosts multiple Graph Stores?  
 
 According to
 http://www.w3.org/TR/sparql11-update/#graphStore 
 a Graph Store "is a mutable container of RDF graphs managed by a single
 service" which "contains one (unnamed) slot holding a default graph and
 zero or more named slots holding named graphs".
 
 Language in section 2.1
 http://www.w3.org/TR/sparql11-update/#graphStoreQueryServices
 "There is no presumption that the graph store managed by an update
 service . . . " suggests that an update service can only have *one*
 Graph Store, but: (a) I do not see this stated explicitly anywhere; (b)
 it would be useful for an update service to be able to have more than
 one Graph Store; and (b) what is the point of defining the notion of an
 "update service" if it is one-to-one with a Graph Store?  AFAICT, doing
 so just adds an unnecessarily layer and confusion.
 
 The SPARQL 1.1 Service Description spec does define the notion of an
 sd:DataSet, which is close to the notion of a Graph Store, but (if I
 understand the definition of Graph Store in
 http://www.w3.org/TR/sparql11-update/#graphStore )
 a Graph Store is mutable, whereas an sd:DataSet is not.

Graph stores are referred to by URI, but beyond this the implementation is free to choose. This has been left unspecified intentionally to allow each implementation to specify the details individually.

TODO: Greg? Chimezie?

 The reason one would want to have an update service that contains more
 than one Graph Store is that it would allow operations on collections of
 graphs to be performed efficiently.  For example, an RDF data pipeline
 may need to generate one collection of graphs from another, all within
 the same update service.  In other words, the content of one Graph Store
 is generated from the content of another Graph Store.  This is important
 because for efficiency, it is helpful to be able to subdivide large
 graphs into collections of smaller graphs.  An example might be a
 collection of 200,000 patient graphs.  There may be *multiple*
 collections of these patient graphs, A, B and C, where collection C is
 derived from collection B which is derived from collection A in a
 pipeline.  Since each patient graph within each of these collections is
 relatively independent, it is far more efficient when one in A is
 updated to only update the corresponding graphs in B and C, rather than
 regenerating the entire B and C collections.  It would be very
 convenient if each of these collections could be stored in a
 sd:GraphStore (presuming such a class is defined) within the same update
 service so that appropriate update operations could be selectively
 performed on them, with the assurance (for efficiency) that they are
 within the same update service.  
 
 Oddly, there is a distinction between a Graph Store (which is mutable)
 and an RDF Dataset (which is not), but there is no corresponding
 distinction made with graphs.  They are treated as mutable in the SPARQL
 1.1 Update spec: they can be the subject of an INSERT or DELETE
 operation.
 
 Actually, in reading the definition of RDF Dataset
 http://www.w3.org/TR/sparql11-query/#rdfDataset
 I do not see anything that would prevent it from changing over time.
 Certainly an RDF Dataset contains a particular set of graphs at the
 moment when it is queried, but I see no prohibition against that same
 RDF Dataset containing a different set of graphs at a different time.
 Hence, it looks to me like the notion of Graph Store could be dropped in
 favor of using the term "RDF Datastore" universally throughout both the
 Query and Update documents.  I think this would make more sense than
 using two different terms: both queries and updates would operate on RDF
 Datasets.  

While queries operate on a dataset that is defined as a merge of multiple graphs, any updates must necessarily modify a single graph at a time. So it is not possible to state that updates operate on RDF Datasets.

While a single INSERT or DELETE template may refer to multiple graphs, the triples being specified always belong to individual graphs. So there is no way to remove the same triples from graphs <foo> and <bar> with a single pattern in a template; instead, each graph must be mentioned explicitly in that template, i.e.:

DELETE { GRAPH <foo> { ... } GRAPH <bar> { ... }} ...
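To make the per-graph repetition concrete, the sketch below generates such an update as text. It is an illustration under assumptions: the function name and the example IRIs are hypothetical, and it uses the DELETE DATA form with concrete triples rather than the templated DELETE ... WHERE form shown above.

```python
from typing import Sequence


def build_delete_data(triples: str, graphs: Sequence[str]) -> str:
    """Build a DELETE DATA request that repeats `triples` once per named graph."""
    # One GRAPH block per target graph; the same triples appear in each block.
    blocks = " ".join(f"GRAPH <{g}> {{ {triples} }}" for g in graphs)
    return f"DELETE DATA {{ {blocks} }}"


if __name__ == "__main__":
    request = build_delete_data(
        "<http://example/s> <http://example/p> <http://example/o> .",
        ["http://example/foo", "http://example/bar"],
    )
    print(request)
```

The generated request names both graphs explicitly, because a single block cannot address two graphs at once.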


 9. Typo: s/needs not be authoritative/need not be authoritative/

Done.


We would be grateful if you would acknowledge that your comment has been answered by sending a reply to this mailing list.

Paul, on behalf of the SPARQL WG