This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 5183 - [FO] Effect of type promotion in fn:distinct-values
Summary: [FO] Effect of type promotion in fn:distinct-values
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 1.0 (show other bugs)
Version: Recommendation
Hardware: PC Windows XP
: P5 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL: http://www.w3.org/TR/xpath-functions/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-12 17:25 UTC by Henry Zongaro
Modified: 2009-10-16 20:47 UTC (History)
0 users

See Also:


Attachments

Description Henry Zongaro 2007-10-12 17:25:17 UTC
Consider the following expression.  The values of type xs:float and xs:double both compare equal to the value of type xs:decimal, due to the loss of precision in type promotion, but the float and double values are not equal to one another.  What is the correct result?  Or is it implementation-dependent whether the result is 1 or 2?

count(distinct-values(
         (xs:float('1.0'),
          xs:decimal('1.0000000000100000000001',
          xs:double( '1.00000000001')))


See also the related XSLT 2.0 bug 2916.[1]

[1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=2916
Comment 1 Michael Kay 2007-10-12 18:09:55 UTC
Valid point. The analogy with min() and max() would suggest converting all the values to the least common supertype. That's the solution I would recommend if we can satisfy ourselves that it's implementable with acceptable performance.
Comment 2 Michael Kay 2007-10-16 14:41:50 UTC
This is a tricky problem. The rule I suggested in comment #1, of converting to the least common supertype, suffers the disadvantage that when you process a sequence in order, you cannot always tell whether a value is a "new value" (distinct from all previous values) immediately.

I believe Saxon converts all numeric values to double unconditionally before deciding whether the value is unique. That's not very satisfactory either. 

The only rule I can really think of is (a) take the sequence in some implementation-defined order, (b) for each value, if the value is equal to some previous value (that has been deemed distinct), discard it, otherwise include it in the result of the function. The effect of this rule is that not only is the order of the result implementation-defined (as now), but in pathological cases different implementations may find a different number of distinct values.

I think that's effectively teh following, excluding pathologicals such as NaN:

declare function distinct-values($arg) {
  let $s := unordered($arg)
  return
    if (exists($s))
    then ($s[1], distinct-values(remove($s,1)[. ne $s[1]]))
    else ()
}


     

Comment 3 John Snelson 2007-10-23 15:31:41 UTC
Both XQilla and Berkeley DB XML follow the algorithm Michael describes in comment #2.
Comment 4 Michael Kay 2007-10-24 08:13:11 UTC
I would like to propose a solution along the following lines (detailed wording to follow later).

1. The function partitions the items in the atomized input into a number of groups such that:

  a. Within a group, every pair of items in the group are mutually equal (that is, A eq B, or A and B are both NaN)

  b. Given two distinct groups, there is at least one pair of values chosen one from each group such that the two values are unequal (A ne B unless one is NaN)

  c. Note that this does not guarantee that there is no pair of values that are equal to each other but assigned to different groups, because of the transitivity issue

  d. Note also that in the general case there may be more than one possible partitioning that meets these rules.

2. The function then selects one item from each group, chosen arbitrarily, except [discuss?] that the item that is chosen from one group must not be equal to the item that is chosen from any other group.

I think this can be implemented by an algorithm that processes the items in input order, that makes an immediate decision for each item whether to include it in the result or not, and that retains in memory (a) the items that have been returned in the output, and (b) for each value that has been returned in the output, at most one value of each primitive data type that has not been returned itself, but is equal to a value that has been returned.

For xsl:for-each-group, given that we guarantee the order of groups and the order of items within a group, we could be a bit more prescriptive: we could prescribe an algorithm that processes the items in order and allocates each one to an existing group if it is equal to every item in that group, and that starts a new group otherwise. This algorithm (I believe!) meets the rules for distinct values given above, and gives a more predictable result.
Comment 5 Henry Zongaro 2007-10-30 14:47:14 UTC
Michael, in your proposed solution in comment 4, does the implementation of the function have to choose the partition into groups in the first step in such a way that it is actually able to select exactly one item from each group in the second step?  Going back to my initial example,

fn:distinct-values(
     (xs:float('1.0'),
      xs:decimal('1.0000000000100000000001',
      xs:double( '1.00000000001')))

For convenience, I'll refer to the float value as Fl, the decimal value as De and the double value as Do.  So we have these possible partitions into groups:
1. {Fl}, {De}, {Do}
2. {Fl,De}, {Do}
3. {Fl}, {De, Do}

If the implementation doesn't have to be able to select precisely one item from each group, it could select (Fl,Do) or (De) in the first case, which means it's still implementation-dependent whether there is one item or two items in the result.

Since that doesn't solve the problem, I'll assume you meant that it must be able to select precisely one item from each group.  In that way, with either the second or third partition, the number of items in the result will be two.
Comment 6 Michael Kay 2007-10-30 15:08:43 UTC
>So we have these possible partitions into groups:
1. {Fl}, {De}, {Do}
2. {Fl,De}, {Do}
3. {Fl}, {De, Do}

I don't think solution (1) satisfies rule 2: De and Do don't contain a pair of values that are unequal. So I think (2) and (3) are the only possible partitionings. I tried to devise the rules on the basis that it would always be possible to select one value from each group in the second step; I can't easily prove that I have succeeded.

Mike
Comment 7 Henry Zongaro 2007-10-30 19:24:45 UTC
My apologies for that blunder!
Comment 8 Michael Kay 2007-11-28 18:21:31 UTC
After sitting on this for a while, I propose to resolve this by adding the paragraph:

If the input sequence contains values of different numeric types that differ from each other by small amounts, then the eq operator is not transitive, because of rounding effects occurring during type promotion. In the situation where the input contains three values A, B, and C such that A eq B, B eq C, but A ne C, then the number of items in the result of the function (as well as the choice of which items are returned) is implementation dependent, subject only to the constraints that (a) no two items in the result sequence compare equal to each other, and (b) every input item that does not appear in the result sequence compares equal to some item that does appear in the result sequence. 

For example, this arises when computing 

      distinct-values(
         (xs:float('1.0'),
          xs:decimal('1.0000000000100000000001',
          xs:double( '1.00000000001'))

because the values of type xs:float and xs:double both compare equal to the value of type xs:decimal but not equal to each other.

For xsl:for-each-group, we need to add similar wording. I will raise a separate bug report on this. 
Comment 9 Michael Kay 2008-02-05 17:56:57 UTC
Discussed on 5 Feb 2008, agreed to accept the proposal in comment #8
Comment 10 Michael Kay 2009-02-14 19:06:55 UTC
The resolution in comment #8 has now (after an inordinate delay) been copied into draft Erratum E44. 
Comment 11 Michael Kay 2009-10-16 20:47:47 UTC
Marking this as closed, since the erratum has been issued and applied to the master text for both the 1.02e and 1.1 documents.