5183
2007-10-12 17:25:17 +0000
[FO] Effect of type promotion in fn:distinct-values
2009-10-16 20:47:47 +0000
1
1
1
Unclassified
XPath / XQuery / XSLT
Functions and Operators 1.0
Recommendation
PC
Windows XP
CLOSED
FIXED
http://www.w3.org/TR/xpath-functions/#func-distinct-values
P5
normal
---
1
zongaro
mike
public-qt-comments
oldest_to_newest
17161
0
zongaro
2007-10-12 17:25:17 +0000
Consider the following expression. The values of type xs:float and xs:double both compare equal to the value of type xs:decimal, due to the loss of precision in type promotion, but the float and double values are not equal to one another. What is the correct result? Or is it implementation-dependent whether the result is 1 or 2?
count(distinct-values(
(xs:float('1.0'),
xs:decimal('1.0000000000100000000001'),
xs:double('1.00000000001'))))
See also the related XSLT 2.0 bug 2916.[1]
[1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=2916
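The non-transitivity described in this report can be reproduced outside an XQuery processor. The following is a minimal Python sketch, not part of the original report. It assumes the XPath 2.0 promotion order (decimal to float to double) and models each value as a (type, value) pair. The names `to_float32`, `eq`, `FL`, `DE` and `DO` are hypothetical helpers introduced only for illustration.

```python
import struct
from decimal import Decimal

# Promotion order assumed from XPath 2.0: decimal -> float -> double.
RANK = {"decimal": 0, "float": 1, "double": 2}

def to_float32(x):
    """Round x to IEEE single precision (going through a double first)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def eq(a, b):
    """Model 'eq' on (type, value) pairs: promote both operands, then compare."""
    target = max(a[0], b[0], key=RANK.get)
    if target == "decimal":
        return a[1] == b[1]
    convert = to_float32 if target == "float" else float
    return convert(a[1]) == convert(b[1])

FL = ("float", to_float32("1.0"))
DE = ("decimal", Decimal("1.0000000000100000000001"))
DO = ("double", float("1.00000000001"))

print(eq(FL, DE))  # True: the decimal rounds to 1.0 as a float
print(eq(DE, DO))  # True: the decimal rounds to the same double
print(eq(FL, DO))  # False: 1.0 vs 1.00000000001 compared as doubles
```

Under these assumptions the decimal value is equal to both of the others, while the float and double values differ, which is exactly the situation the question is about.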
17162
1
mike
2007-10-12 18:09:55 +0000
Valid point. The analogy with min() and max() would suggest converting all the values to the least common supertype. That's the solution I would recommend if we can satisfy ourselves that it's implementable with acceptable performance.
17236
2
mike
2007-10-16 14:41:50 +0000
This is a tricky problem. The rule I suggested in comment #1, of converting to the least common supertype, suffers the disadvantage that when you process a sequence in order, you cannot always tell whether a value is a "new value" (distinct from all previous values) immediately.
I believe Saxon converts all numeric values to double unconditionally before deciding whether the value is unique. That's not very satisfactory either.
The only rule I can really think of is (a) take the sequence in some implementation-defined order, (b) for each value, if the value is equal to some previous value (that has been deemed distinct), discard it, otherwise include it in the result of the function. The effect of this rule is that not only is the order of the result implementation-defined (as now), but in pathological cases different implementations may find a different number of distinct values.
I think that's effectively the following, excluding pathological cases such as NaN:
declare function local:distinct-values($arg) {
let $s := unordered($arg)
return
if (exists($s))
then ($s[1], local:distinct-values(remove($s,1)[. ne $s[1]]))
else ()
};
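To illustrate how the rule in steps (a) and (b) can yield different counts, here is a minimal Python sketch, not taken from any implementation mentioned in this thread. It assumes the XPath 2.0 promotion order (decimal to float to double) and models values as (type, value) pairs. `to_float32`, `eq` and `distinct_values` are hypothetical names.

```python
import struct
from decimal import Decimal

RANK = {"decimal": 0, "float": 1, "double": 2}  # assumed promotion order

def to_float32(x):
    """Round x to IEEE single precision (going through a double first)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def eq(a, b):
    """Model 'eq' on (type, value) pairs after numeric promotion."""
    target = max(a[0], b[0], key=RANK.get)
    if target == "decimal":
        return a[1] == b[1]
    convert = to_float32 if target == "float" else float
    return convert(a[1]) == convert(b[1])

def distinct_values(seq):
    """Keep an item only if it is not equal to any item already kept."""
    kept = []
    for item in seq:
        if not any(eq(item, k) for k in kept):
            kept.append(item)
    return kept

FL = ("float", 1.0)
DE = ("decimal", Decimal("1.0000000000100000000001"))
DO = ("double", float("1.00000000001"))

print(len(distinct_values([FL, DE, DO])))  # 2: De is dropped, Do is kept
print(len(distinct_values([DE, FL, DO])))  # 1: both Fl and Do equal De
```

The only difference between the two calls is the processing order, which the rule leaves implementation-defined.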
17356
3
john.snelson
2007-10-23 15:31:41 +0000
Both XQilla and Berkeley DB XML follow the algorithm Michael describes in comment #2.
17377
4
mike
2007-10-24 08:13:11 +0000
I would like to propose a solution along the following lines (detailed wording to follow later).
1. The function partitions the items in the atomized input into a number of groups such that:
a. Within a group, every pair of items in the group are mutually equal (that is, A eq B, or A and B are both NaN)
b. Given two distinct groups, there is at least one pair of values chosen one from each group such that the two values are unequal (A ne B unless one is NaN)
c. Note that this does not guarantee that there is no pair of values that are equal to each other but assigned to different groups, because of the transitivity issue
d. Note also that in the general case there may be more than one possible partitioning that meets these rules.
2. The function then selects one item from each group, chosen arbitrarily, except [discuss?] that the item that is chosen from one group must not be equal to the item that is chosen from any other group.
I think this can be implemented by an algorithm that processes the items in input order, that makes an immediate decision for each item whether to include it in the result or not, and that retains in memory (a) the items that have been returned in the output, and (b) for each value that has been returned in the output, at most one value of each primitive data type that has not been returned itself, but is equal to a value that has been returned.
For xsl:for-each-group, given that we guarantee the order of groups and the order of items within a group, we could be a bit more prescriptive: we could prescribe an algorithm that processes the items in order and allocates each one to an existing group if it is equal to every item in that group, and that starts a new group otherwise. This algorithm (I believe!) meets the rules for distinct values given above, and gives a more predictable result.
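The ordered grouping algorithm suggested for xsl:for-each-group can be sketched as follows. This is an illustrative Python model only, assuming the XPath 2.0 promotion order (decimal to float to double). `to_float32`, `eq` and `group_in_order` are hypothetical names, and the sample values are those from the original report.

```python
import struct
from decimal import Decimal

RANK = {"decimal": 0, "float": 1, "double": 2}  # assumed promotion order

def to_float32(x):
    """Round x to IEEE single precision (going through a double first)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def eq(a, b):
    """Model 'eq' on (type, value) pairs after numeric promotion."""
    target = max(a[0], b[0], key=RANK.get)
    if target == "decimal":
        return a[1] == b[1]
    convert = to_float32 if target == "float" else float
    return convert(a[1]) == convert(b[1])

def group_in_order(seq):
    """Add each item to the first group whose members all equal it, else start a new group."""
    groups = []
    for item in seq:
        for group in groups:
            if all(eq(item, member) for member in group):
                group.append(item)
                break
        else:
            groups.append([item])
    return groups

FL = ("float", 1.0)
DE = ("decimal", Decimal("1.0000000000100000000001"))
DO = ("double", float("1.00000000001"))

print(len(group_in_order([FL, DE, DO])))  # 2 groups: {Fl, De}, {Do}
print(len(group_in_order([DE, DO, FL])))  # 2 groups: {De, Do}, {Fl}
```

For this particular input the number of groups is the same for every ordering tried here, although the membership of the groups depends on the input order.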
17513
5
zongaro
2007-10-30 14:47:14 +0000
Michael, in your proposed solution in comment 4, does the implementation of the function have to choose the partition into groups in the first step in such a way that it is actually able to select exactly one item from each group in the second step? Going back to my initial example,
fn:distinct-values(
(xs:float('1.0'),
xs:decimal('1.0000000000100000000001'),
xs:double( '1.00000000001')))
For convenience, I'll refer to the float value as Fl, the decimal value as De and the double value as Do. So we have these possible partitions into groups:
1. {Fl}, {De}, {Do}
2. {Fl,De}, {Do}
3. {Fl}, {De, Do}
If the implementation doesn't have to be able to select precisely one item from each group, it could select (Fl,Do) or (De) in the first case, which means it's still implementation-dependent whether there is one item or two items in the result.
Since that doesn't solve the problem, I'll assume you meant that it must be able to select precisely one item from each group. In that way, with either the second or third partition, the number of items in the result will be two.
17516
6
mike
2007-10-30 15:08:43 +0000
>So we have these possible partitions into groups:
>1. {Fl}, {De}, {Do}
>2. {Fl,De}, {Do}
>3. {Fl}, {De, Do}
I don't think solution (1) satisfies rule 1(b): the groups {De} and {Do} don't contain a pair of values that are unequal. So I think (2) and (3) are the only possible partitionings. I tried to devise the rules on the basis that it would always be possible to select one value from each group in the second step; I can't easily prove that I have succeeded.
Mike
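The partition rules 1(a) and 1(b) from comment 4 can be checked mechanically against the three candidate partitions. The following Python sketch is illustrative only. It assumes the XPath 2.0 promotion order (decimal to float to double), and `to_float32`, `eq`, `mutually_equal`, `distinguishable` and `valid_partition` are hypothetical names.

```python
import struct
from decimal import Decimal
from itertools import combinations

RANK = {"decimal": 0, "float": 1, "double": 2}  # assumed promotion order

def to_float32(x):
    """Round x to IEEE single precision (going through a double first)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def eq(a, b):
    """Model 'eq' on (type, value) pairs after numeric promotion."""
    target = max(a[0], b[0], key=RANK.get)
    if target == "decimal":
        return a[1] == b[1]
    convert = to_float32 if target == "float" else float
    return convert(a[1]) == convert(b[1])

def mutually_equal(group):
    """Rule 1(a): every pair of items within the group is equal."""
    return all(eq(a, b) for a, b in combinations(group, 2))

def distinguishable(g1, g2):
    """Rule 1(b): some pair, one item from each group, is unequal."""
    return any(not eq(a, b) for a in g1 for b in g2)

def valid_partition(groups):
    return (all(mutually_equal(g) for g in groups)
            and all(distinguishable(g1, g2)
                    for g1, g2 in combinations(groups, 2)))

FL = ("float", 1.0)
DE = ("decimal", Decimal("1.0000000000100000000001"))
DO = ("double", float("1.00000000001"))

print(valid_partition([[FL], [DE], [DO]]))  # False: fails rule 1(b)
print(valid_partition([[FL, DE], [DO]]))    # True
print(valid_partition([[FL], [DE, DO]]))    # True
```

In this model only partitions (2) and (3) pass both rules, in line with the conclusion above.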
17524
7
zongaro
2007-10-30 19:24:45 +0000
My apologies for that blunder!
17879
8
mike
2007-11-28 18:21:31 +0000
After sitting on this for a while, I propose to resolve this by adding the paragraph:
If the input sequence contains values of different numeric types that differ from each other by small amounts, then the eq operator is not transitive, because of rounding effects occurring during type promotion. In the situation where the input contains three values A, B, and C such that A eq B, B eq C, but A ne C, then the number of items in the result of the function (as well as the choice of which items are returned) is implementation dependent, subject only to the constraints that (a) no two items in the result sequence compare equal to each other, and (b) every input item that does not appear in the result sequence compares equal to some item that does appear in the result sequence.
For example, this arises when computing
distinct-values(
(xs:float('1.0'),
xs:decimal('1.0000000000100000000001'),
xs:double('1.00000000001')))
because the values of type xs:float and xs:double both compare equal to the value of type xs:decimal but not equal to each other.
For xsl:for-each-group, we need to add similar wording. I will raise a separate bug report on this.
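Constraints (a) and (b) in the proposed paragraph can be expressed as a small checker that accepts or rejects a candidate result for a given input. This is a minimal Python sketch under the assumption of the XPath 2.0 promotion order (decimal to float to double). `to_float32`, `eq` and `satisfies_constraints` are hypothetical names, and NaN is not modelled.

```python
import struct
from decimal import Decimal
from itertools import combinations

RANK = {"decimal": 0, "float": 1, "double": 2}  # assumed promotion order

def to_float32(x):
    """Round x to IEEE single precision (going through a double first)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def eq(a, b):
    """Model 'eq' on (type, value) pairs after numeric promotion."""
    target = max(a[0], b[0], key=RANK.get)
    if target == "decimal":
        return a[1] == b[1]
    convert = to_float32 if target == "float" else float
    return convert(a[1]) == convert(b[1])

def satisfies_constraints(inputs, result):
    """Check a candidate result against constraints (a) and (b)."""
    drawn_from_input = all(r in inputs for r in result)
    # (a) no two items in the result compare equal to each other
    no_equal_pair = all(not eq(a, b) for a, b in combinations(result, 2))
    # (b) every omitted input item equals some item in the result
    omitted_covered = all(any(eq(x, r) for r in result)
                          for x in inputs if x not in result)
    return drawn_from_input and no_equal_pair and omitted_covered

FL = ("float", 1.0)
DE = ("decimal", Decimal("1.0000000000100000000001"))
DO = ("double", float("1.00000000001"))
INPUT = [FL, DE, DO]

print(satisfies_constraints(INPUT, [DE]))      # True: one item
print(satisfies_constraints(INPUT, [FL, DO]))  # True: two items
print(satisfies_constraints(INPUT, [FL, DE]))  # False: violates (a)
print(satisfies_constraints(INPUT, [FL]))      # False: Do is not covered, violates (b)
```

Both a one-item and a two-item result are accepted for the example input, which is why the wording makes the number of items implementation dependent.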
18774
9
mike
2008-02-05 17:56:57 +0000
Discussed on 5 Feb 2008, agreed to accept the proposal in comment #8
23746
10
mike
2009-02-14 19:06:55 +0000
The resolution in comment #8 has now (after an inordinate delay) been copied into draft Erratum E44.
28418
11
mike
2009-10-16 20:47:47 +0000
Marking this as closed, since the erratum has been issued and applied to the master text for both the 1.02e and 1.1 documents.