5183 2007-10-12 17:25:17 +0000 [FO] Effect of type promotion in fn:distinct-values 2009-10-16 20:47:47 +0000 1 1 1 Unclassified XPath / XQuery / XSLT Functions and Operators 1.0 Recommendation PC Windows XP CLOSED FIXED http://www.w3.org/TR/xpath-functions/#func-distinct-values P5 normal --- 1 zongaro mike public-qt-comments oldest_to_newest 17161 0 zongaro 2007-10-12 17:25:17 +0000 Consider the following expression. The values of type xs:float and xs:double both compare equal to the value of type xs:decimal, due to the loss of precision in type promotion, but the float and double values are not equal to one another. What is the correct result? Or is it implementation-dependent whether the result is 1 or 2? count(distinct-values( (xs:float('1.0'), xs:decimal('1.0000000000100000000001', xs:double( '1.00000000001'))) See also the related XSLT 2.0 bug 2916.[1] [1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=2916 17162 1 mike 2007-10-12 18:09:55 +0000 Valid point. The analogy with min() and max() would suggest converting all the values to the least common supertype. That's the solution I would recommend if we can satisfy ourselves that it's implementable with acceptable performance. 17236 2 mike 2007-10-16 14:41:50 +0000 This is a tricky problem. The rule I suggested in comment #1, of converting to the least common supertype, suffers the disadvantage that when you process a sequence in order, you cannot always tell whether a value is a "new value" (distinct from all previous values) immediately. I believe Saxon converts all numeric values to double unconditionally before deciding whether the value is unique. That's not very satisfactory either. The only rule I can really think of is (a) take the sequence in some implementation-defined order, (b) for each value, if the value is equal to some previous value (that has been deemed distinct), discard it, otherwise include it in the result of the function. The effect of this rule is that not only is the order of the result implementation-defined (as now), but in pathological cases different implementations may find a different number of distinct values. I think that's effectively teh following, excluding pathologicals such as NaN: declare function distinct-values($arg) { let $s := unordered($arg) return if (exists($s)) then ($s[1], distinct-values(remove($s,1)[. ne $s[1]])) else () } 17356 3 john.snelson 2007-10-23 15:31:41 +0000 Both XQilla and Berkeley DB XML follow the algorithm Michael describes in comment #2. 17377 4 mike 2007-10-24 08:13:11 +0000 I would like to propose a solution along the following lines (detailed wording to follow later). 1. The function partitions the items in the atomized input into a number of groups such that: a. Within a group, every pair of items in the group are mutually equal (that is, A eq B, or A and B are both NaN) b. Given two distinct groups, there is at least one pair of values chosen one from each group such that the two values are unequal (A ne B unless one is NaN) c. Note that this does not guarantee that there is no pair of values that are equal to each other but assigned to different groups, because of the transitivity issue d. Note also that in the general case there may be more than one possible partitioning that meets these rules. 2. The function then selects one item from each group, chosen arbitrarily, except [discuss?] that the item that is chosen from one group must not be equal to the item that is chosen from any other group. I think this can be implemented by an algorithm that processes the items in input order, that makes an immediate decision for each item whether to include it in the result or not, and that retains in memory (a) the items that have been returned in the output, and (b) for each value that has been returned in the output, at most one value of each primitive data type that has not been returned itself, but is equal to a value that has been returned. For xsl:for-each-group, given that we guarantee the order of groups and the order of items within a group, we could be a bit more prescriptive: we could prescribe an algorithm that processes the items in order and allocates each one to an existing group if it is equal to every item in that group, and that starts a new group otherwise. This algorithm (I believe!) meets the rules for distinct values given above, and gives a more predictable result. 17513 5 zongaro 2007-10-30 14:47:14 +0000 Michael, in your proposed solution in comment 4, does the implementation of the function have to choose the partition into groups in the first step in such a way that it is actually able to select exactly one item from each group in the second step? Going back to my initial example, fn:distinct-values( (xs:float('1.0'), xs:decimal('1.0000000000100000000001', xs:double( '1.00000000001'))) For convenience, I'll refer to the float value as Fl, the decimal value as De and the double value as Do. So we have these possible partitions into groups: 1. {Fl}, {De}, {Do} 2. {Fl,De}, {Do} 3. {Fl}, {De, Do} If the implementation doesn't have to be able to select precisely one item from each group, it could select (Fl,Do) or (De) in the first case, which means it's still implementation-dependent whether there is one item or two items in the result. Since that doesn't solve the problem, I'll assume you meant that it must be able to select precisely one item from each group. In that way, with either the second or third partition, the number of items in the result will be two. 17516 6 mike 2007-10-30 15:08:43 +0000 >So we have these possible partitions into groups: 1. {Fl}, {De}, {Do} 2. {Fl,De}, {Do} 3. {Fl}, {De, Do} I don't think solution (1) satisfies rule 2: De and Do don't contain a pair of values that are unequal. So I think (2) and (3) are the only possible partitionings. I tried to devise the rules on the basis that it would always be possible to select one value from each group in the second step; I can't easily prove that I have succeeded. Mike 17524 7 zongaro 2007-10-30 19:24:45 +0000 My apologies for that blunder! 17879 8 mike 2007-11-28 18:21:31 +0000 After sitting on this for a while, I propose to resolve this by adding the paragraph: If the input sequence contains values of different numeric types that differ from each other by small amounts, then the eq operator is not transitive, because of rounding effects occurring during type promotion. In the situation where the input contains three values A, B, and C such that A eq B, B eq C, but A ne C, then the number of items in the result of the function (as well as the choice of which items are returned) is implementation dependent, subject only to the constraints that (a) no two items in the result sequence compare equal to each other, and (b) every input item that does not appear in the result sequence compares equal to some item that does appear in the result sequence. For example, this arises when computing distinct-values( (xs:float('1.0'), xs:decimal('1.0000000000100000000001', xs:double( '1.00000000001')) because the values of type xs:float and xs:double both compare equal to the value of type xs:decimal but not equal to each other. For xsl:for-each-group, we need to add similar wording. I will raise a separate bug report on this. 18774 9 mike 2008-02-05 17:56:57 +0000 Discussed on 5 Feb 2008, agreed to accept the proposal in comment #8 23746 10 mike 2009-02-14 19:06:55 +0000 The resolution in comment #8 has now (after an inordinate delay) been copied into draft Erratum E44. 28418 11 mike 2009-10-16 20:47:47 +0000 Marking this as closed, since the erratum has been issued and applied to the master text for both the 1.02e and 1.1 documents.