29722 – [FO31] fn:sort, array:sort

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 29722 - [FO31] fn:sort, array:sort

Summary: [FO31] fn:sort, array:sort

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 3.1 (show other bugs)
Version:	Candidate Recommendation
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-07-07 15:29 UTC by Tim Mills
Modified:	2016-12-16 19:55 UTC (History)
CC List:	2 users (show)

See Also:

Attachments

Description Tim Mills 2016-07-07 15:29:50 UTC

These are marked as context-independent, but depend on deep-equal which depends on collations, and implicit timezone.

Given the text

"In addition there are cases where this function may be more flexible than the built-in sorting capability for XQuery or XSLT, for example when the sort key or collation is chosen dynamically, or when the sort key is a sequence of items rather than a single item." 

I'm surprised that the collation can't be specified.

Comment 1 Michael Kay 2016-07-09 08:22:36 UTC

I think we decided that the requirement for collation-based sorting could be achieved using the collation-key() function, and there are examples showing how to do this.

Comment 2 Michael Kay 2016-07-10 09:33:06 UTC

I agree that the handling of collations in these functions isn't very satisfactory. As currently specified, because they invoke deep-equal#2, the default collation kicks in automatically for string-valued keys; including the case where the key is actually the output of a call to collation-key(), which isn't very satisfactory.

Two possible ways we could go:

(a) add a collation argument to the functions, and specify the result in terms of a call to deep-equal() supplying that collation.

(b) define the call on deep-equal to use codepoint collation, and rely on use of collation-key() if the user wants to do collation-based sorting.

Comment 3 Michael Kay 2016-07-20 10:23:30 UTC

ACTION A-650-01: Action on Mike Kay to add a collation argument to fn:sort and array:sort (see bug 29722).  

I will interpret this as an action to produce a proposal, since I don't think the solution is entirely self-evident.

fn:sort currently has two signatures:

fn:sort($input as item()*) as item()*
fn:sort($input as item()*, $key as function(item()) as xs:anyAtomicType*) as item()*

Adding a collation argument to both clearly doesn't work, so I propose to add it only to fn:sort#2. This is OK, because fn:sort(x) is equivalent to fn:sort(x, data#1), which means that fn:sort#1 is really just a convenience function.

So I propose to add a third signature:

fn:sort($input as item()*, $key as function(item(), $collation as xs:string) as xs:anyAtomicType*) as item()*

The 1- and 2- argument forms are now simple shortcuts: the second argument defaults to data#1 and the third argument defaults to the default collation in the static context. It's slightly awkward that the function isn't the last argument as it usually is in higher-order functions, but we can't have everything.

We need to add to the spec:

The collation used by this function is determined according to the rules in 5.3.5 Choosing a collation.

And then the detail becomes:

The result of the function is obtained as follows:

For each item in the sequence $input, the function supplied as $key is evaluated with that item as its argument. The resulting values are the sort keys of the items in the input sequence.

The result sequence contains the same items as the input sequence $input, but generally in a different order.

The order of items in the result is such that, given two items $A and $B:

Let $collation be the explicit or implicit collation in use.

If (fn:deep-equal($key($A), $key($B), $collation), then the relative order of $A and $B in the output is the same as their relative order in the input (that is, the sort is stable)

Otherwise, if (deep-less-than($key($A), $key($B), $collation), then $A precedes $B in the output. The function deep-less-than is defined as the boolean result of the expression:

if (fn:empty($A)) then fn:exists($B)
else if (fn:deep-equal($A[1], $B[1], $collation)) then deep-less-than(fn:tail($A), fn:tail($B), $collation)
else if ($A[1] ne $A[1] (:that is, $A[1] is NaN:)) then fn:true()
else if (is-string($A[1]) and is-string($B[1]) then fn:compare($A[1], $B[1], $collation) < 0
else $A[1] lt $B[1]

where the function is-string($X) returns true if and only if $X is an instance of xs:string, xs:anyURI, or xs:untypedAtomic.

This ordering of sequences is referred to by mathematicians as "lexicographic ordering".


The changes for array:sort are virtually identical.

Comment 4 Michael Kay 2016-07-26 19:58:36 UTC

The WG agreed to modify the proposal so as to have three function signatures:

fn:sort($input)

fn:sort($input, $collation as xs:string?)

fn:sort($input, $collation as xs:string?, $key as function(*))

where omitting the collation argument (or supplying an empty sequence) would invoke the default collation.

Comment 5 Michael Kay 2016-07-26 20:10:39 UTC

Test cases have been updated to use the new signatures (but there are no serious tests for an actual collation yet).

Comment 6 Michael Kay 2016-07-26 22:44:07 UTC

The spec has been updated.

Comment 7 Botond Biró 2016-09-08 12:11:37 UTC

(In reply to Michael Kay from comment #6)
> The spec has been updated.

As of 6th of September the array:sort#2 function signature seems to be broken:
array:sort($array as item()*, $collation as xs:string?) as item()*
The type of the first argument and the return type should be array(*)
Should I open a separate bug for this?

Additionally I would suggest some changes in the signatures of the sort functions (in case they are not carved in stone yet) - namely to move back the $key in the 2nd position and add the $collation as last argument with the occurrence exactly one.

fn:sort($input) // same as sort#3($input, fn:data#1, default-collation())

fn:sort($input, $key as function(*)) // same as sort#3($input, $key, default-collation())

fn:sort($input, $key as function(*), $collation as xs:string) //note the occurrence on collation type is exactly one

1. In most (if not all) other functions accepting a collation string, it is added as last in the argument list (ex. fn:compare, fn:contains-token, etc.)
Having them similarly in the sort functions would be a bit more consistent.
2. with the signatures suggested above one could avoid incompatibilities(err:XPTY0004) with expressions relying on previous spec implementations
3. I think that specifying a different key function is more likely than switching only to a different collation - but maybe this is my personal preference/use case only

If you think it's worth considering, please reopen the bug.

Thank you,

Comment 8 Michael Kay 2016-09-08 17:31:53 UTC

Reopened as requested. Thanks for pointing out the typo in the signatures.

As regards the order of arguments, I think it's a subjective call: we're not going to please everyone. 

When we have to make a decision to improve usability or retain compatibility with previous drafts at WD status, improving usability typically wins: implementors and users of WD specs are expected to be aware of the risk of things changing. So this is likely to be decided on usability arguments.

Comment 9 Michael Kay 2016-09-12 09:11:42 UTC

Let's try to summarize the arguments in favour of different options.

(A) Although it's not absolutely essential, I think we should stick with the convention that when we have several signatures for a function, the function with arity N is "compatible" with the function with arity N+1 in that it's essentially equivalent to calling the arity N+1 function with the last argument taking a default value.

If we stick with this principle, then 

(i) if we have a function sort(sequence, collation, comparison-function) then we can also have an arity=2 function sort(sequence, collation) (with comparison-function defaulting to data#1) 

(ii) if we have a function sort(sequence, comparison-function, collation) then we can also have an arity=2 function sort(sequence, comparison-function) (with collation defaulting to the default collation from the static context).

(B) In all existing functions that take a collation, the collation argument comes last and can be omitted, with a well defined default.

(C) In all existing functions that take a function argument, the function argument comes last; in some cases it is mandatory and in others it can be omitted.

The reason we put the function argument last in existing functions was primarily because of the readability of calls like

fn:filter($x, function ($i) {
  if (x) 
  then typeswitch (x/y)
       case a:
         something
       case b:
         something else
  else zimbo
})

Basically, putting further arguments after the function argument in such cases is like putting a lengthy parenthetical remark in English text: it requires the reader to reset a mental stack and remember where they were.

And I think that's the primary argument for doing fn:sort this way.

The counter-argument is probably that the collation is more likely to be defaulted than the comparison function. While this may be true, I think that allowing the collation to be supplied as "()" is acceptable.

So I'm inclined to defend the way we currently have it. Certainly, any arguments for the two approaches are finely balanced and I don't think there's a convincing case for change.

Comment 10 Andrew Coleman 2016-09-14 12:20:47 UTC

The Working Group discussed this bug on 2016-09-13.  There was some sympathy with the request to change the order of the arguments, but was unanimous in the opinion that this change should not be made at this late stage of the process.  

Action A-653-03 was raised to fix the typo.

Comment 11 Botond Biró 2016-09-14 12:48:00 UTC

Thank you for the consideration and for sharing the reasoning. The fact that not only the collations but also the functinons are positioned last in the param list was a point I missed - shows the subjective bias for supportive arguments :)

Comment 12 Michael Kay 2016-09-20 21:25:03 UTC

The typo in the function signature of array:sort() has been fixed.