Re: Comments about the semantics of property paths from jorge perez on 2011-01-25 (public-rdf-dawg-comments@w3.org from January 2011)

From: jorge perez <jorge.perez.rojas@gmail.com>
Date: Tue, 25 Jan 2011 15:55:08 -0300
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: public-rdf-dawg-comments@w3.org
Message-ID: <AANLkTimZF+sefGJOFoV7XarT=ft6-=mZtAambJYKtgzv@mail.gmail.com>
Hello Andy,

Thanks for your email. Yes, my comments have been answered. An
additional comment is below.

On Tue, Jan 25, 2011 at 1:08 PM, Andy Seaborne
<andy.seaborne@epimorphics.com> wrote:
> Hi Jorge,
>
> XPath is designed for XML processing where XML nodes and values are treated
> in different ways. XPath evaluation returns distinct XML nodes, but
> duplicate values. One evaluation of an XPath expression can't mix XML nodes
> and values - see the numbered list in [1]. "XQuery 1.0 and XPath 2.0
> Functions and Operators" has an operation fn:distinct-values to make values
> unique in a sequence [2].
>
> An RDF graph does not have this distinction of nodes and values. Graph nodes
> (vertexes) are IRIs, blank nodes or literals with no separation. Repetition
> of literals is significant, consider SUM applied to a purchase order where
> two items have the same price, so multiple paths to the same endpoint do
> matter.

Aggregation is actually another reason of why multiple paths to the
same endpoint *do not* have to be considered.

Consider a network of friends, and assume that you want to obtain the
SUM of the age of all your network (friends of your friends). Then a
very natural way to do this is with the query (simplified syntax)

SUM (?A)
:me (:friend)+/:age ?A

The query is navigating to all the friends of my friends, then to the
age value of every one, and then taking the SUM. Isn't this natural?
But, consider the following data

:me :friend :f1
:me :friend :f2
:f1 :friend :f2
:f1 :age 20
:f2 :age 20

I would expect 40 as the result of the above query, but the expression

:me (:fiend)+/:age ?A

returns

?A
20 (for the path :me->:f1)
20 (for the path :me->:f2)
20 (for the path :me->:f1->:f2)

and thus, the answer of the SUM would be 60. How do you explain the
result of this query to a user? Notice that using DISTINCT does not
solve the problem, since with DISTINCT you would obtain 20 as the SUM
which is also wrong.

Is there a way to correctly answer the above query with the current
design of property paths?

Thanks,
- jorge

>
> SPARQL property paths do not apply uniqueness to property paths and the
> property path expression is, where appropriate, the same the expansion in
> terms of triple patterns. It is not a matter of efficiency because the
> answers concerning duplicate literal values would be rather unexpected if
> only distinct values were returned.
>
> This leaves the ArbitraryLengthPath operation for the use of "+" in paths.
> This traverses cycles once by terminating the search on encountering an edge
> already traversed for that evaluation of ArbitraryLengthPath. In an earlier
> design, cycle termination was by detecting visiting nodes but the WG
> considers the edge traversal a better choice. The new design is one more
> step of evaluation on a cycle than the first design and leaves better
> prospects for future standardization.
>
> SPARQL has the keyword DISTINCT so an application can choose between
> duplicates and no duplicates. A query engine can exploit this if it chooses
> to; use with sub-queries mean that solution modifiers can be applied to
> specific parts of the query such as a path.
>
> An implementation is free to implement evaluation in anyway it chooses
> proved it results in the same answers. The WG felt that using an algorithm
> was the most helpful way to specify the feature, especially to implementers.
>
> Property paths have been implemented in a number of systems (see [3] for a
> partial list) and found to be useful.
>
> We would be grateful if you would acknowledge that your comment has been
> answered by sending a reply to this mailing list.
>
> Andy
> On behalf of the SPARQL working group.
>
> [1] http://www.w3.org/TR/xpath20/#id-path-expressions
> [2] http://www.w3.org/TR/xpath-functions/#func-distinct-values
> [3] http://esw.w3.org/SPARQL/Extensions/Paths
>
> On 15/12/10 18:34, jorge perez wrote:
>>
>> Hello Andy,
>>
>> Thank you very much for your response and for considering my comments,
>> and sorry for the late reply.
>>
>> There is a couple of comments that you have not answered.
>>
>> ""
>> As a separate but very important issue, notice that the XPath language
>> does not consider duplicate paths when evaluating expressions (XPath
>> is evaluated in the "there exists" way that I mentioned before). Thus,
>> counting paths in SPARQL would be somewhat in contradiction with
>> previously proposed path languages considered by the W3C.
>> ""
>>
>> I think that if this W3C Recommendation is in discordance with a
>> previous Recommendation about a similar topic, then DAWG should have
>> strong reasons for that, and make them clear in the specification. The
>> specification should also advice the reader about this issue.
>>
>> Besides that comment, you have said nothing about efficiency of
>> evaluation. Notice that this not related to a particular way of
>> implementing the language. It is about the huge efficiency impact that
>> any implementation will suffer in practice. You have not acknowledge
>> that in your response. Have you consider this as an issue?
>>
>> Another comment that is not covered by your response is whether there
>> exists a use case that demand counting different paths. In your
>> response, it seems that the reason for counting paths is to make
>> easier the job of the implementors (by reusing algebra operators).
>> Opposite to what the group think, I think that not counting paths
>> gives the implementor more freedom since paths could be implemented in
>> several different ways, being just one of them by reusing algebra
>> operators. Can you please clarify whether there are use cases about
>> this? This would help a lot.
>>
>> If you respond to the comments above I can consider my comments answered.
>>
>> I have a couple of additional words. Please do not consider them as a
>> formal objection to the process, but just as my opinion.
>>
>> I still strongly disagree with your design decisions about property
>> paths. In particular, I insist that it is a mistake to define the
>> semantics in the presence of cycles in a non-standard way and by
>> forcing a particular algorithm to evaluate them. In your response you
>> say that there can be corner cases, but it is not only a problem of
>> corner cases. From my point of view it will become a problem of
>> adoption of the standard. In this point I think that the group should
>> not neglect that there is a lot of related (theoretical and practical)
>> work in this area that have handled cycles in a completely different
>> way.
>>
>> To conclude, I do think that the property-paths material in the
>> current specification is far from being mature. Considering that the
>> group is in a tight schedule, I think that it would be better to not
>> include property paths in this round of standardization, than
>> including them in their current form.
>>
>> Thank you very much for considering my comments.
>> - jorge
>>
>> On Thu, Dec 2, 2010 at 8:20 AM, Andy Seaborne
>> <andy.seaborne@epimorphics.com>  wrote:
>>>
>>> Jorge,
>>>
>>> Thank very much for your comments.
>>>
>>> The working group considered a number of factors in designing the
>>> property
>>> path features. In addition to the points you raise, the WG also included
>>> consideration that, while this working group is not adding a path
>>> datatype
>>> (needed to inquire about any path matched later in the query), nor the
>>> specific case of access to path length, the WG should leave open as many
>>> possibilities here for future work. Another factor in the design is the
>>> relationship of some property path expressions to triple pattern forms.
>>>
>>> Although not specifying returning the path length of a match, nor
>>> specifying
>>> returning the matched path itself, the WG felt that, on balance, the
>>> design
>>> in the working draft gave maximum scope for any later standardization
>>> work.
>>> The issue of path length particularly was considered as a feature for
>>> this
>>> round of work but, when considered against all the other work items the
>>> WG
>>> has taken on, it didn't make the final list of work items. This lead to
>>> the
>>> conclusion that counting path possibilities, not a "there exists"
>>> condition,
>>> was the better choice for this round of standardization. Adding access
>>> the
>>> the path matched is better served if all paths are considered.
>>>
>>> Another consideration was the relationship of property paths and existing
>>> queries using triple patterns.
>>>
>>> { ?x :p{2} ?y }
>>>
>>> and
>>>
>>> { ?x :p ?Z . ?Z :p ?y }, with ?Z projected away.
>>>
>>> The WG decided to make these equivalent, including in terms of numbers of
>>> solutions. This gives the semantics of many path forms in terms of SPARQL
>>> graph pattern operators. This was felt to be intuitive and to utilize the
>>> capabilities of query engines: rather that requiring yet another
>>> mechanism,
>>> the equivalence means that join-technology (for example) can be used to
>>> solve the pattern.
>>>
>>> This then leaves the issue of cycles in the "+" operator. The design is
>>> one
>>> in which the cycles in "+" operator are handled by traversing a directed
>>> edge (triple in the data) once. This will be explained in the final
>>> version
>>> of the query specification - there is a placeholder for it in the current
>>> editors working draft. The current working draft has been clarified to
>>> use
>>> "multiset-union" for the union in the ArbitraryLengthPath definition.
>>>
>>> This overall design is a tradeoff of implementation, future
>>> possibilities,
>>> and equivalence of patterns on graphs. The WG is aware that there can be
>>> corner cases can arise where different intuitions are not compatible. On
>>> balance, the WG feels that the current design is most suitable for this
>>> round of standardization.
>>>
>>> Again, that you for your helpful comments.
>>>
>>> We would be grateful if you would acknowledge that your comment has been
>>> answered by sending a reply to this mailing list.
>>>
>>> Andy
>>> on behalf of the SPARQL Working Group.
>>>
>
Received on Tuesday, 25 January 2011 18:55:42 UTC