28795 – [F+O] 3.1 Non-transitive equality for numerics in maps

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 28795 - [F+O] 3.1 Non-transitive equality for numerics in maps

Summary: [F+O] 3.1 Non-transitive equality for numerics in maps

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 3.1 (show other bugs)
Version:	Last Call drafts
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-06-11 15:32 UTC by Michael Kay
Modified:	2016-12-16 19:55 UTC (History)
CC List:	3 users (show)

See Also:

Attachments

Description Michael Kay 2015-06-11 15:32:45 UTC

Comment 1 Michael Kay 2015-06-11 15:54:30 UTC

Bug #28632 originally concerned timezones in maps, but it morphed into a discussion about handling numerics where the "eq" operator is non-transitive as a result of numeric promotion. I want to push back on the decision in that area.

Any off-the-shelf implementation of maps is going to assume transitive equality. Handling the semantics we decided upon, where remove() can remove two entries that are both equal to the supplied key but not equal to each other, requires considerable complexity, and straying from the algorithms and data structure text books. This is complex, expensive, and error prone, and I don't think it is justified by any usability benefit.

There is an alternative, which is to define key matching so that two numeric values are equal only if they are mathematically equal, with no rounding or approximation. Our problems with transitivity arise because "promotion" from decimal to double involves rounding. For the definition of "same key" we should say that two numerics of different type compare equal only if they are exactly equal. Note that every float and double is exactly equal to some decimal if you have enough digits of precision available, so in effect this means reversing the order of promotion to float->double->decimal and using an implementation of decimal with enough digits so conversion from a double is lossless.

Is the problem for maps worse than the existing problem for distinct-values and for grouping? Those cases too are likely to be supported by some off-the-shelf map implementation. In my case the additional challenge is partly because an efficient implementation of maps where updates don't destroy the original is already a lot more complex than an implementation with in-situ update, which is all you need for distinct-values and grouping. Those functions also don't need a remove() operation.

(If we define a new equality relation then we could also consider using it, of course, for distinct-values() and grouping. I'm not proposing that at the moment - it has compatibility implications, though I find it impossible to believe anyone is actually relying on the current non-transitive behaviour.)

Comment 2 Michael Kay 2015-06-13 19:37:01 UTC

Here's a suggested definition of the same-key relation:

The same-key relation is defined as follows. Two atomic values A and B are the same key if the expression (A eq B) is true, with the following exceptions and caveats:

* If evaluation of the expression (A eq B) fails with a dynamic error or type error, then A and B are not the same key.

* If A and B are both instances of one of the types xs:string, xs:untypedAtomic, or xs:anyURI, then they are the same key if codepoint-compare(A, B) eq 0.

* If A and B are both instances of one of the types xs:dateTime, xs:date, xs:time, xs:gYear, xs:gYearMonth, xs:gMonth, xs:bMonthDay, or xs:gDay, then:

  ** if they are instances of different primitive types, then they are not the same key
  ** if one of the values has a timezone and the other does not, then they are not the same key
  ** if both have a timezone, they are the same key if and only if (A eq B)
  ** if neither has a timezone, they are the same key if and only if (A eq B) NOTE: although the rules for comparison in this case make reference to the implicit timezone in the dynamic context, the result is the same regardless of the implicit timezone.

* If A and B are both instances of one of the types xs:float, xs:double, or xs:decimal (which includes xs:integer), then [subject to bug 28795]:

  ** If A is the xs:double or xs:float value NaN, and B  is the xs:double or xs:float value NaN, then A and B are equal.  
  ** Positive infinity equals positive infinity, and negative infinity equals negative infinity.
  ** Except for NaN and positive or negative infinity, the two values A and B are equal if and only if they represent the same mathematical quantity. Effectively both A and B are converted to instances of xs:decimal using an implementation of xs:decimal whose value space provides sufficient precision and scale to contain the value space of xs:double (and therefore xs:float) precisely; the values are then compared within this value space. 
  **  It follows that positive and negative zero are equal.

NOTE:
These rules are chosen (and differ from the rules for the comparison operator “eq”) to achieve three aims:
** Comparing two atomic values never raises an error
** Equality is commutative and transitive
** Equality comparison is context-free (the result does not depend on anything in the static or dynamic context)

Comment 3 Josh Spiegel 2015-06-15 17:17:44 UTC

I agree that the specifications should not make it difficult (or impossible) for implementations to use off-the-shelf components.  Do you think this proposal could be implemented by a hash table?

> If we define a new equality relation then we could also consider using it, of course, for distinct-values() and grouping. I'm not proposing that at the moment - it has compatibility implications, though I find it impossible to believe anyone is actually relying on the current non-transitive behaviour.

XQTS is relying on it.  For example, I think the result of fn-distinct-values-1 would change.  I am not in favor of backward incompatible changes to distinct-values() or group by. 

> Is the problem for maps worse than the existing problem for distinct-values and for grouping? 

Here are the relevant lines from distinct-values:

"In the situation where the input contains three values A, B, and C such that A eq B, B eq C, but A ne C, then the number of items in the result of the function (as well as the choice of which items are returned) is ·implementation-dependent·, subject only to the constraints that (a) no two items in the result sequence compare equal to each other, and (b) every input item that does not appear in the result sequence compares equal to some item that does appear in the result sequence."

I think it is worse for maps because of the associated values.  When merging maps, I think the intuition is that the last value in the input sequence makes it to the result.  This gets awkward when A, B, and C are the keys of map entries.

Comment 4 Jonathan Robie 2015-07-15 18:29:37 UTC

same-key will be defined in F&O, and referred to from XQuery/XPath. Constraints will be defined in the XDM.

Changes are required in:

- XDM
- F&O
- XPath
- XQuery

Comment 5 Jonathan Robie 2015-07-15 18:33:30 UTC

In F&O, this will be op:same-key()

Comment 6 Michael Kay 2015-07-16 19:30:05 UTC

In the data model, change the text in 17.1 starting with the definition of "same key" and ending with the Note as follows:

[Definition] Within a map, no two entries have the same key. Two atomic values K1 and K2 are the same key for this purpose if the relation op:same-key(K1, K2) holds. This relationship is defined in [F+O].

Note:
The definition of op:same-key is chosen with the following objectives:
(a) the relationship is transitive (this applies particularly to the way numeric values of different types are compared).
(b) the relationship has no dependencies on the static or dynamic context. (For example, it does not depend on collations or on the implicit timezone).
(c) the function is error-free: any two atomic values can be compared, giving a true or false result.