This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 29355 - Modernize sequence filtering
Summary: Modernize sequence filtering
Status: NEW
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Requirements for Future Versions
Version: Working drafts
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-03 21:22 UTC by Benito van der Zander
Modified: 2016-12-07 18:38 UTC
CC List: 2 users

See Also:


Attachments

Description Benito van der Zander 2016-01-03 21:22:49 UTC
The syntax to get a subsequence in XQuery looks extremely dated.  
If you compare it to languages that have recently become popular, you see that almost all of them support a slice syntax to get a subsequence: sequence[from:to] or sequence[from:to:step] (e.g. array[2:5]) as in Python/Go/jq, or sequence[from..to] as in Perl/D/Ruby/Swift/Rust.

A more modern sequence filtering would:

1. Allow negative numbers

$sequence[-1] to get the last value

$sequence[-2] to get the one before that ...

In general, $sequence[$i] to get $sequence[last() + $i + 1] for $i < 0

2. Allow ranges

$sequence[ $list-of-numbers ]

as abbreviation for something like

$sequence[ position() = ($list-of-numbers ! (if (. > 0) then . else if (. < 0) then last() + . + 1 else error((), "..")) ) ]

Then you can at least do $sequence[2 to 4]


3. Add a "by" operator that keeps only every i-th element. E.g. "by 2" to get every second element, as in (("a", "b", "c", "d") by 2) returning ("a", "c")


With 2. and 3. you can write $sequence[2 to 10 by 3] to get the (2,5,8)-th elements.

Except for 1., it should not break any existing scripts.
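
For readers more familiar with Python, the three proposals above can be modelled roughly as follows. This is only an illustrative sketch of the intended semantics, not an implementation of anything in XPath/XQuery; the helper names `resolve`, `select`, and `by` are made up for this example, and positions are 1-based as in XPath.

```python
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def resolve(index: int, length: int) -> int:
    """Map a 1-based, possibly negative index to a positive 1-based position."""
    if index > 0:
        return index
    if index < 0:
        return length + index + 1  # mirrors last() + $i + 1
    raise ValueError("position 0 is not allowed")


def select(seq: Sequence[T], positions: Iterable[int]) -> List[T]:
    """Model of $sequence[$list-of-numbers]: keep items whose position is listed.

    Like position() = (...), the result stays in sequence order and
    positions outside the sequence simply match nothing.
    """
    wanted = {resolve(p, len(seq)) for p in positions}
    return [item for pos, item in enumerate(seq, start=1) if pos in wanted]


def by(seq: Sequence[T], step: int) -> List[T]:
    """Model of the proposed `by` operator: every step-th item, starting with the first."""
    return list(seq[::step])


letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
print(select(letters, [-1]))                 # proposal 1: $sequence[-1]
print(by(["a", "b", "c", "d"], 2))           # proposal 3: (...) by 2
print(select(letters, by(range(2, 11), 3)))  # proposals 2 + 3: $sequence[2 to 10 by 3]
```

The set-based lookup in `select` is a deliberate simplification: it reflects the "position() = ..." reading, under which duplicates and ordering in the list of numbers do not matter.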
Comment 1 Abel Braaksma 2016-01-07 15:30:27 UTC
> $sequence[-1] to get the last value
This would introduce a backwards-compatibility issue. It is currently legal syntax, and code and examples exist that exploit the fact that zero and negative values return the empty sequence.

> Then you can at least do $sequence[2 to 4]
This I'd like, but I believe with XPath 2.0 there was a strong reason to disallow sequences of more-than-one in predicates (FORG0006, EBV not defined for sequences > 1).

Note: you can already do the following, which gives what you want and is shorter than your alternative syntax:

$sequence[position() = 2 to 4]

> $sequence[2 to 10 by 3]
This would change the RangeExpr production (or, as you suggest as operator, the infix operator productions). If it doesn't cause any EBNF conflicts (and the WG agrees) this is viable, I think. I agree that it adds clarity to a certain class of use-cases.

Given the previous rewrite, this can be rewritten as:

$sequence[position() = (2 to 10)[position() mod 3 = 1]]
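
As a quick sanity check of the rewrite above, the inner expression (2 to 10)[position() mod 3 = 1] can be modelled in Python. This is a sketch only; `positions_by` is a hypothetical helper name, and the condition is written as (pos - 1) % step == 0, which is equivalent to "position() mod step = 1" for any step greater than 1.

```python
from typing import List


def positions_by(first: int, last: int, step: int) -> List[int]:
    """Model of (first to last)[position() mod step = 1]."""
    candidates = range(first, last + 1)  # first to last, inclusive
    return [
        value
        for pos, value in enumerate(candidates, start=1)  # pos is 1-based
        if (pos - 1) % step == 0
    ]


print(positions_by(2, 10, 3))
print(positions_by(3, 10, 3))
```

Because the modulus is applied to positions inside the range rather than to positions in $sequence, the same pattern works regardless of where the range starts.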
Comment 2 Benito van der Zander 2016-01-07 16:01:22 UTC
>This would introduce a backwards-compatibility issue. It is currently legal syntax, and code and examples exist that exploit the fact that zero and negative values return the empty sequence.

Yeah that is sad :(


But it could be done for arrays. 
It is surprisingly inconsistent that they allow ?(2 to 4), but raise errors on invalid indices

>This I'd like, but I believe with XPath 2.0 there was a strong reason to disallow sequences of more-than-one in predicates 

But is that reason still true? Especially with those arrays

I do not see an issue, if more-than-one sequences are restricted to type integer+

>Note: you can already do the following, which gives what you want and is shorter than your alternative syntax: $sequence[position() = 2 to 4]

For positive indices
Comment 3 Abel Braaksma 2016-01-10 14:58:48 UTC
FYI, by mail, David Carlisle sent this answer to public-qt-comments@w3.org:

On 03/01/2016 21:22, bugzilla@jessica.w3.org wrote:
> With 2. and 3. you can write $sequence[2 to 10 by 3] to get the 
> (2,5,8)-th elements.

The existing xpath 2 syntax

$sequence[position() le 10][position() mod 3 =2]

seems simpler (but more general) and doesn't require new syntax.

David
Comment 4 Benito van der Zander 2016-01-10 16:13:52 UTC
Because you are used to this syntax. I just got a mail from someone who wrote something similar as ! (if (op:numeric-mod(position(),3)=2) then . else '')

And because it starts with 2. What would you do for $sequence[3 to 10 by 3] ?


$sequence[position() = 3 to 10][position() mod 3 =1] ?

That is not simpler
Comment 5 Michael Kay 2016-01-10 22:11:18 UTC
The decision not to allow SEQ[2 to 5] was made about 10 years ago so my recollection of the exact rationale may be faulty. 

The biggest problem with A[B] is its overloading as both a filter expression and a subscripting expression. To make it work as a filter expression, B must be evaluated once for each item in A, with that item as the context item. That rule can't depend on the type of B, because we can't assume static typing. Hence the convoluted rule whereby A[3] is interpreted as the filter expression A[position() = 3].

If you extend this kind of logic to handle predicates with a type other than a single integer, then things get increasingly complicated, because you have to define how things like A[. to 15] should be interpreted: evaluating the predicate once for each item in A no longer works.

It's very intuitive what A[2 to 5] should mean, but defining the semantics in a way that covers all edge cases and still remains comprehensible (and implementable) for more complex cases proves quite difficult.
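
The rule described above (evaluate B once per item, then treat a numeric result as a position test) can be modelled in a few lines of Python. This is a heavily simplified sketch, not the specification's algorithm: `filter_seq` is a hypothetical name, the predicate receives the item, its position, and last() as plain arguments, and only boolean and single-number results are handled. Anything else raises an error, standing in for the FORG0006 case mentioned in comment 1.

```python
from typing import Any, Callable, List, Sequence


def filter_seq(
    items: Sequence[Any],
    predicate: Callable[[Any, int, int], Any],
) -> List[Any]:
    """Evaluate predicate(item, position, last) once per item, as A[B] does."""
    last = len(items)
    kept = []
    for position, item in enumerate(items, start=1):
        value = predicate(item, position, last)
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            keep = value
        elif isinstance(value, (int, float)):
            keep = value == position  # A[3] is read as A[position() = 3]
        else:
            # Stand-in for "effective boolean value not defined"
            raise TypeError("no effective boolean value for this predicate result")
        if keep:
            kept.append(item)
    return kept


data = [10, 20, 30, 40]
print(filter_seq(data, lambda item, pos, last: 3))          # positional: third item
print(filter_seq(data, lambda item, pos, last: item > 15))  # filtering
print(filter_seq(data, lambda item, pos, last: -1))         # no position equals -1
```

Passing a predicate that returns a list such as [2, 3, 4] raises the TypeError in this model, which is the behaviour the proposal would have to redefine.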
Comment 6 Benito van der Zander 2016-01-10 22:52:31 UTC
>The decision not to allow SEQ[2 to 5] was made about 10 years ago so my recollection of the exact rationale may be faulty. 

That is why it is time to modernize it. 10 years ago it was probably a great syntax


> because you have to define how things like A[. to 15] should be interpreted: evaluating the predicate once for each item in A no longer works.

As A[position() = . to 15]  ?



If B returns a sequence anyway, it becomes position() = B, and then it is also clear how non-integers in the sequence are handled.
Comment 7 Benito van der Zander 2016-01-11 18:55:25 UTC
Or perhaps the problem was a conflict with A[child] to check for the existence of children.

If you had an XPath/XQuery preprocessor and a standard interpreter, you could change every occurrence of A[B] to  A[let $b := B return if (count($b) le 1 or head($b) instance of node()) then $b else position() = $b ] which should give a reasonable, well-defined modernized behaviour.
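
The rewrite rule above can be modelled in Python to see how it would behave. This is a rough sketch under several assumptions: `Node` is a hypothetical stand-in for an XML node, `modern_filter` and `as_list` are made-up names, predicate results are represented as Python values or lists, and the "legacy" branch only approximates the effective-boolean-value rules.

```python
from typing import Any, Callable, List, Sequence


class Node:
    """Hypothetical stand-in for an XML node."""


def as_list(value: Any) -> List[Any]:
    """Treat a single value as a sequence of length one."""
    return list(value) if isinstance(value, (list, tuple, range)) else [value]


def modern_filter(
    items: Sequence[Any],
    predicate: Callable[[Any, int], Any],
) -> List[Any]:
    """Model of A[let $b := B return if (...) then $b else position() = $b]."""
    kept = []
    for position, item in enumerate(items, start=1):
        b = as_list(predicate(item, position))
        if len(b) <= 1 or isinstance(b[0], Node):
            # Existing behaviour (approximated)
            if not b:
                keep = False                  # empty sequence is false
            elif isinstance(b[0], Node):
                keep = True                   # node existence test, as in A[child]
            elif isinstance(b[0], bool):
                keep = b[0]
            elif isinstance(b[0], (int, float)):
                keep = b[0] == position       # A[3] means A[position() = 3]
            else:
                keep = bool(b[0])             # e.g. non-empty string
        else:
            keep = position in b              # new behaviour: position() = $b
        if keep:
            kept.append(item)
    return kept


letters = list("abcdef")
print(modern_filter(letters, lambda item, pos: range(2, 5)))  # like A[2 to 4]
print(modern_filter(letters, lambda item, pos: 3))            # like A[3]
print(modern_filter(letters, lambda item, pos: [Node(), Node()]))  # node test
```

One simplification to keep in mind: `position in b` silently ignores non-numeric members, whereas a real general comparison against mixed types could raise a type error.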
Comment 8 Benito van der Zander 2016-12-07 18:29:36 UTC
And the filter predicate is even an Expr not an ExprSingle.

So you can write $sequence[1,2,3] and it is valid syntax that just does not evaluate to anything. Yet it looks as if it might return the first three elements.
Comment 9 Michael Kay 2016-12-07 18:38:41 UTC
But $sequence[@x, @y, @z] makes perfect sense, it selects those items in the sequence that have at least one of the attributes x, y, and z.

I think overloading a[b] to do both filtering and positional selection was a bad idea, but given that it was done, it's hard to identify changes that would make things better without breaking compatibility.