Re: Proposed definition of ExprMultiSet

On 7 Mar 2010, at 22:57, Andy Seaborne wrote:

> Overall - we seem to have the start of a possible design and so this  
> message is about details.
>
> On 07/03/2010 9:33 PM, Steve Harris wrote:
>> On 7 Mar 2010, at 17:42, Andy Seaborne wrote:
>>
>>> ISSUE-53
>>>
>>> I propose the following to define ExprMultiSet:
>>>
>>> -------
>>>
>>> Let Ω be a partition.
>>>
>>> ExprMultiSet(Ω) =
>>> { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
>>> UNION
>>> { e | μ in Ω such that eval(μ(expr)) is undefined }
>>>
>>> where "e" is some symbol that is distinct from all RDF terms.
>>>
>>> card[x]:
>>> if DISTINCT:
>>> card[x] = 1 if there exists μ in Ω such that x = eval(μ(expr))
>>> card[x] = 0 otherwise
>>> else
>>> card[x] = count of μ in Ω such that x = eval(μ(expr))
>>
>> I find the reuse of the term ExprMultiset as a function very  
>> confusing,
>> but I think I understand the proposal.
>
> It's just trying to write the ExprMultiset based on Ω for which  
> there is no notation.  I suppose it should involve μ.  It's only about
> whether you like to write definitions with free terms or not.
>
> "ExprMultiset based on Ω, expr = ... "

I believe that was handled in the definition of Aggregation()  
previously, but possibly there's some term missing.
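
For what it's worth, here's a tiny worked instance of the definition
as I read it, just to check we mean the same thing: take expr = ?x and
Ω = { μ1, μ2, μ3 } with μ1(?x) = 1, μ2(?x) = 1, and ?x left unbound in
μ3. Then eval(?x, μ3) is undefined, so ExprMultiSet(Ω) = { 1, 1, e },
with card[1] = 2 (or 1 under DISTINCT) and a single "e" recording the
failed evaluation.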

>> The current draft is not as clear as it should be but:
>> AGGREGATE(ExprMultiset) on Ω results in Aggregation(GroupClause,
>> ExprMultiset, AGGREGATE, Ω)
>>
>> So, by my understanding the end result of this proposal is:
>>
>> Aggregation(GroupClause, ExprMultiset, func, Ω) =
>> { merge(k, func(S)) | (k, Ω') in Partition(GroupClause, Ω) }
>>
>> where
>> S = { eval(exp,μ') | exp in ExprMultiset, μ' in Ω' such that
>> eval(exp,μ') is defined }
>> UNION
>> { e | exp in ExprMultiset, μ' in Ω' such that eval(exp,μ') is  
>> undefined }
>>
>> But perhaps I've missed the point?
>
> Don't think so.
>
> Minor: The merge needs to involve a variable for func(S) but that's  
> a separate issue and wasn't in the published version either.
> func(S) is a value, not a binding.

I think that's what I was trying to achieve with k, but it's
definitely the bit I'm fuzziest on; it's been a while since I was
hacking on that expression.
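
To make that merge point concrete (my reading only, with the result
variable being whatever the AS clause introduces): with GROUP BY ?x,
one key might be k = { ?x -> :x1 } and func(S) = 3, so the output row
has to be something like merge(k, { ?C -> 3 }) rather than
merge(k, 3), since a bare value isn't a binding to merge.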

> Minor: I wrote "eval(μ(expr))" when I remembered to make the change
> from eval(expr,μ), because you'd used it earlier.

I was just taking your eval(expr,μ) because its meaning seemed a bit
clearer to me. I was never happy with the μ(expr) notation.

>>> --------
>>>
>>> "e" just records error evaluations.
>>>
>>> This is the most flexible definition. An alternative is
>>>
>>> ExprMultiset(Ω) =
>>> { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
>>>
>>> which is hard-coding dropping errors and unbounds during evaluation.
>>> But the aggregate can't know there were some errors.
>>
>> Right. Do we have a usecase where this is important? I don't remember
>> offhand whether SQL passes NULLs to aggregates, other than  
>> COUNT(*), but
>> I think it doesn't.
>
> It gives an account of COUNT(*) because it is the size of the MultiSet
>  = number of rows
>  = number of values + number of errors
>
> SQL can treat nulls and errors differently in a way we can't.

Well, if we have "e", I suspect it's possible for some definitions of  
the functions.
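
For example, reusing the { 1, 1, e } multiset from above: number of
values = 2, number of errors = 1, so COUNT(*) = 3, while an aggregate
defined to skip e sees just { 1, 1 }. That's roughly SQL's COUNT(col)
vs COUNT(*) split, though with a single "e" we still can't tell an
unbound apart from a genuine evaluation error.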

>>> Another possibility is a yes/no flag indicating that an error was
>>> seen.
>>> But this might as well be the count of errors, which is equivalent  
>>> to
>>> the flexible definition given.
>>
>> Yes, somewhat. It complicates the definition of many of the  
>> aggregates
>> to some degree, but that's not a huge burden.
>>
>>> By the way, this is in no way a recipe for implementation.  
>>> Aggregation
>>> can be done over all groups in parallel during query execution.
>>>
>>>
>>>
>>> For the last publication, it was noted
>>>
>>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html
>>>
>>> Unbound and error are the same. The current design so far has it  
>>> that
>>> any error means that the multiset is invalid and that group is not
>>> considered.
>>
>> Right, this would tie us to a particular definition of COUNT(*),  
>> where
>> unbounds and errors are both counted. I don't have any reason to  
>> prefer
>> one definition over another.
>
> Sorry - don't understand.
>
> COUNT(*) is the number of rows in the partition, no change there.

Indeed. The question is whether error rows are in the partition; cf.
the discussion below on whether the row should have unbound values or
be removed.

> That can be explained in the main defns as number of values + number  
> of errors  because that is the total size of the multiset.
>
>>> We didn't have time to propose a solid design to address ISSUE-53 -
>>> the potential design at the time of publication was that any error
>>> when calculating the ExprMultiset from a partition meant that
>>>
>>> SUM of {1, 2, unbound} is an error.
>>> COUNT of {1, 2, unbound} is an error.
>>>
>>> I don't think that is a useful form for COUNT(?x). It does seem to
>>> mean that COUNT(?x) is either COUNT(*) or error; it can't be  
>>> anything
>>> else.
>>
>> This is assuming that we don't take something like your second
>> definition, I think.
>
> Yes - and the main flexible definition works. I'm referring to the  
> version hinted at in the last publication at this point.  I hadn't  
> realised that COUNT(?x) didn't work in the way it was going at the  
> time of publication.
>
>>> COUNT(?x) can not be zero because zero arises when there are no ?x  
>>> but
>>> there are solutions in the partition. If there are no solutions in  
>>> the
>>> partition then there is no group key and no grouping happens.
>>>
>>> For each aggregate we can decide what happens about unbounds and  
>>> errors.
>>>
>>> I would like to see:
>>>
>>> COUNT(*) = size of multiset.
>>> COUNT(DISTINCT *) = size of set after removing any e (i.e. skip  
>>> undefs).
>>
>> I find the punning of * (or DISTINCT) here a bit unnatural.
>
> I'm confused. I wasn't meaning any punning.  I'm just using the  
> SPARQL syntax as proposed.

It's the (DISTINCT *) combination that I find uncomfortable; I can't
see any particular reason why it would mean what you suggest, or, in
fact, anything at all. SQL's reuse of the * token in this place is a
little odd, but I think it's quite ingrained in users.
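
Spelling out my reading on the { 1, 1, e } multiset again: COUNT(*)
would be 3 (the whole multiset) and COUNT(DISTINCT *) would be 1
(drop e, then drop duplicates), with a plain COUNT(?x) presumably
sitting at 2. It's that second reading, where * suddenly stands for
the values rather than the rows, that I find hard to motivate.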

> ... COUNT examples ...
>
>>>
>>> -- Query 3:
>>>
>>> Change line 3 to:
>>> SELECT ?x (sum(?v) AS ?C)
>>>
>>> -----------
>>> | x | C |
>>> ===========
>>> | :x1 | 3 |
>>> | :x2 | 9 |
>>> | :x3 | 5 |
>>> | :x4 | 0 |
>>> -----------
>>>
>>> The :x4 row is zero because there were no valid numbers to add  
>>> together.
>>
>> Arguably SUM({}) is an error, cf. MIN({}). I can live with 0 though.
>
> Happy with error as well, especially if an error causes an unbound
> rather than removal of the row with that key.
>
>>
>> I think the above all match what I would expect, but...
>>
>>> -- Different query OPTIONAL part - now has ?p
>>>
>>> 1 PREFIX : <http://example/>
>>> 2
>>> 3 SELECT ?x (sum(?v) AS ?C)
>>> 4 WHERE
>>> 5 { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
>>> 6 OPTIONAL
>>> 7 { ?x ?any ?v}
>>> 8 }
>>> 9 GROUP BY ?x
>>> 10 ORDER BY str(?x)
>>>
>>> -----------
>>> | x | C |
>>> ===========
>>> | :x1 | 3 |
>>> | :x2 | 9 |
>>> | :x3 | 5 |
>>> | :x4 | 0 |
>>> -----------
>>>
>>> The cases where ?v is "Z2" and "x" have been skipped.
>>
>> For this one I would expect:
>>
>> -----------
>> | x | C |
>> ===========
>> | :x1 | 3 |
>> | :x2 | 9 |
>> -----------
>
> You're right - there are errors in the eval of fn:numeric-add.
>
> We need to define what happens if agg(..) is an error.
>
> An alternative is,
>
> -----------
> | x | C |
> ===========
> | :x1 | 3 |
> | :x2 | 9 |
> | :x3 |   |
> | :x4 |   |
> -----------
>
> which retains the group row (to distinguish from no key).

Yes, I wrote that first, then changed it. I think that the version  
with the removed rows is closer to the rest of SPARQL, as error rows  
are removed in (sub-)queries. It's also possible to prevent this, in  
many cases at least, with creative use of COALESCE, but I know you're  
not a fan.

> On reflection, that's my preferred design.

I'm still on the fence; I can see upsides to both. It would be nice if
there were a way to at least force the removal of rows which errored;
I guess wrapping it in a subquery and doing FILTER(?C) would do the
job. Still seems a little at odds with the rest of SPARQL though.
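
Roughly what I have in mind, untested, and assuming errored groups
come back with ?C unbound as in your alternative table:

PREFIX : <http://example/>

SELECT ?x ?C
WHERE
{
  { SELECT ?x (sum(?v) AS ?C)
    WHERE
    { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
      OPTIONAL
      { ?x ?any ?v }
    }
    GROUP BY ?x
  }
  FILTER(BOUND(?C))  # a bare FILTER(?C) would also drop a genuine zero sum
}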

>> I would expect the 3,9,5,0 result from
>> SELECT ?x (sum(xsd:decimal(?v)) AS ?C)
>> or, more explicitly
>> SELECT ?x (sum(COALESCE(xsd:decimal(?v), 0)) AS ?C)
>
> This results in 3.0, 9.0, 5.0 (decimals) and 0^^xsd:integer.

Indeed. Me being sloppy.
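
For the record, I suppose the consistently-typed version would be
something like (untested):

SELECT ?x (sum(COALESCE(xsd:decimal(?v), 0.0)) AS ?C)

which should give 3.0, 9.0, 5.0 and 0.0, all decimals.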

>> But, I can see an argument that RDF data has a tendency to be  
>> scruffy,
>
> :-)
>
>> so maybe users would expect this? However, it seems dangerous/ 
>> misleading.
>
> Agreed - let's record this as a decision to be made separately from  
> the overall aggregate design.

+1

- Steve

-- 
Steve Harris, Garlik Limited
2 Sheen Road, Richmond, TW9 1AE, UK
+44 20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  
9AD
