This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
I had some trouble finding out what fn:replace("abcd", "a(?:b(c))", "$1") is supposed to return, because of the normal parentheses nested within non-capturing parentheses: does it return "cd" or "d"? According to the spec, the correct answer is "cd", which probably makes sense (though an argument against it is: if you are *inside* non-capturing parentheses, you are not capturing anymore, so capturing parentheses here do not capture). While I don't think the spec is wrong here (assuming my interpretation is correct), I would like to ask the WG to consider a line of clarification. Proposal: in 5.6.1 Regular expression syntax, 3rd bullet, add a Note along these lines: "If parenthesized sub-expressions are encountered nested within non-capturing parentheses, these will act as capturing parentheses and numbering continues as explained in the paragraphs above."
Hmm, reading more, I see that we say in the same section: "The sub-expressions are numbered according to the position of the opening parenthesis in left-to-right order within the top-level regular expression:" but searching for "top-level regular expression" does not yield any results. What is it? Does this include or exclude nested expressions? Since each paren can start a new regex, this does not seem to make sense. Should it perhaps mean "when not inside a character class"?
At the teleconference on 2016-02-02, the WG agreed that the expression should return "cd" and that clarification text and an example should be added to the spec.
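The agreed result can be reproduced in any regex engine that supports non-capturing groups. Below is a minimal sketch in Python. It assumes Python's `re` module as a stand-in for the XPath regex dialect (it is not the dialect defined by the spec), and the group reference is written `\1` rather than `$1`.

```python
import re

# Python analogue of fn:replace("abcd", "a(?:b(c))", "$1").
# Assumption: Python's `re` stands in for the XPath regex dialect;
# the replacement reference is written \1 instead of $1.
pattern = re.compile(r"a(?:b(c))")

# Only the inner (c) is a capturing group, so it is numbered 1
# even though it is nested inside the non-capturing (?:...).
group_count = pattern.groups

# The match "abc" is replaced by group 1 ("c"); the trailing "d" is untouched.
result = pattern.sub(r"\1", "abcd")

print(group_count, result)
```

If the nested parentheses did not capture, group 1 would not exist and the other candidate answer, "d", would be the only sensible outcome.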
This issue was fixed with new text added to the spec, see: https://lists.w3.org/Archives/Public/public-xsl-query/2016Feb/0017.html. However, this leaves one other location where the ambiguity is still in place, namely the section on back-references: <quote> A back-reference is an additional kind of atom. The construct \N where N is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the resulting number NN is such that the back-reference is preceded by NN or more unescaped opening parentheses. </quote> "NN or more unescaped opening parentheses" is too broad, for the same reasons that gave rise to this bug: it also counts the parentheses of non-capturing groups. I suggest pointing here to the section above, something like "NN or more opening [#LINK capturing parentheses]", where the link points to the section on "Sub-expressions (groups)".
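One reading of the quoted rule for multi-digit back-references can be sketched in Python as follows. The function name `split_backref` and its parameters are hypothetical. The count of preceding groups is passed in by the caller, on the assumption that it counts capturing parentheses only, as suggested above. The digit run is assumed to be non-empty.

```python
def split_backref(digits: str, groups_before: int) -> tuple[int, str]:
    """Split the digit run after a backslash into (group number, literal digits).

    `groups_before` is the number of capturing left parentheses that precede
    the back-reference (an input here; computing it is out of scope).
    """
    # The first digit is always recognized as part of the back-reference.
    number = int(digits[0])
    used = 1
    # Further digits are absorbed only while the resulting number still
    # refers to a group that has already been opened.
    while used < len(digits):
        candidate = number * 10 + int(digits[used])
        if candidate > groups_before:
            break
        number = candidate
        used += 1
    return number, digits[used:]


# With a single preceding group, "\11" is group 1 followed by a literal "1".
one_group = split_backref("11", 1)
# With eleven preceding groups, the same text is a reference to group 11.
eleven_groups = split_backref("11", 11)
print(one_group, eleven_groups)
```

The outcome depends entirely on how `groups_before` is counted, which is why the choice between "unescaped opening parentheses" and "capturing parentheses" matters here.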
Proposed resolution:

(0) Replace "subexpression" by "sub-expression" throughout. (The two spellings are currently used interchangeably.)

(a) In the paragraph that starts "Sub-expressions (groups) within the regular expression are recognized." add at the end:

    "A left parenthesis is recognized as a capturing left parenthesis provided it is not followed by "?:" (see below), is not within a character group (square brackets), and is not escaped with a backslash. The sub-expression enclosed by a capturing left parenthesis and its matching right parenthesis is referred to as a capturing sub-expression."

(b) Change the sentence

    The presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be counted by operations that number the groups within a regular expression, for example the fn:replace function.

to

    The presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be treated as a capturing left parenthesis, which means it is not counted by constructs that number the groups within a regular expression, such as back-references and the fn:replace function.

(c) In the paragraph starting "Back-references are allowed", replace the phrase "unescaped opening parentheses" by "capturing left parentheses".

(d) In the paragraph starting "A back-reference matches", replace the phrase "unescaped left parentheses" by "capturing left parenthesis".

(e) In the specification of fn:replace and fn:analyze-string, replace "parenthesized sub-expression" (several times) by "capturing sub-expression".
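The definition proposed in (a) is mechanical enough to sketch in code. The Python function below is an illustration only: the name `capturing_paren_positions` is hypothetical, and character groups are tracked with a simple bracket depth counter, which is a simplification of the real grammar. It reports the offset of every left parenthesis that is not escaped, not inside square brackets, and not immediately followed by "?:".

```python
def capturing_paren_positions(regex: str) -> list[int]:
    """Return the offsets of capturing left parentheses, in numbering order."""
    positions = []
    class_depth = 0  # > 0 while inside a character group [...]
    i = 0
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":
            # A backslash escapes the next character, so skip both.
            # An escaped backslash therefore does not escape what follows it.
            i += 2
            continue
        if ch == "[":
            class_depth += 1
        elif ch == "]" and class_depth > 0:
            class_depth -= 1
        elif ch == "(" and class_depth == 0:
            # "(?:" opens a non-capturing group and is not counted.
            if not regex.startswith("?:", i + 1):
                positions.append(i)
        i += 1
    return positions


# The inner "(" at offset 5 is the only capturing left parenthesis.
nested = capturing_paren_positions("a(?:b(c))")
print(nested)
```

The sub-expression opened at the N-th offset in the returned list is the one that a reference to group N would address under this definition.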
In the sentence "A left parenthesis is recognized as a capturing left parenthesis provided it is not followed by "?:"", the wording "not followed by" should read "not immediately followed by".
(In reply to Michael Kay from comment #4) > Proposed resolution: I think the proposed resolution is sound and fixes the remaining issue. I agree with Liam's comment. In the same vein you might consider "is not escaped by a preceding backslash that is not itself escaped", but if we try to be too precise there may be many other places that need updating, and I believe the consensus on this section is to allow some leniency as long as the leniency does not introduce ambiguities.
At the meeting on 2016-03-08, the WG agreed to adopt the proposal in comment 4.
The changes have been applied.