This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 29415 - [FO31] Regex: capturing parentheses inside non-capturing parentheses
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 3.1
Version: Candidate Recommendation
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2016-02-01 18:38 UTC by Abel Braaksma
Modified: 2016-03-22 09:31 UTC

Description Abel Braaksma 2016-02-01 18:38:38 UTC
I had some trouble finding out what fn:replace("abcd", "a(?:b(c))", "$1") was supposed to return, because of the normal parens nested within non-capturing parens. I.e., does it return "cd" or "d"?

According to the spec, the correct answer is "cd", which probably makes sense (though an argument against it is: if you are *inside* non-capturing parentheses, you are not capturing anymore, so capturing parentheses here do not capture).

While I don't think the spec is wrong here (assuming my interpretation is correct), I would like to ask the WG to consider a line of clarification here.

Proposal: in 5.6.1 Regular expression syntax, 3rd bullet, add a Note along these lines:

   "If parenthesized sub-expressions are encountered nested within non-capturing 
   parentheses, these will act as capturing parentheses and numbering continues 
   as explained in the paragraphs above."
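
For illustration only, the case in question can be tried in a dialect with the same (?:...) syntax. This is a minimal sketch using Python's re module as a stand-in; Python is not the XPath regex dialect, and it writes the replacement reference as \1 rather than $1, so it only shows the behaviour being asked about and does not settle what the spec requires.

```python
import re

# Assumption: Python's re is used purely as a stand-in for the XPath dialect.
pattern = r"a(?:b(c))"

# The outer (?:...) does not capture, but the inner (c) still does,
# so it is group 1 and the match "abc" is replaced by "c".
result = re.sub(pattern, r"\1", "abcd")
group_count = re.compile(pattern).groups

print(result)       # cd
print(group_count)  # 1
```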
Comment 1 Abel Braaksma 2016-02-01 19:14:23 UTC
Hmm, reading more, I see that we say in the same section:

    "The sub-expressions are numbered according to the position of the opening 
    parenthesis in left-to-right order within the top-level regular expression:"

but searching for "top-level regular expression" does not yield any results. What is it? Does this include or exclude nested expressions? Since each paren can start a new regex, this does not seem to make sense.

Should it perhaps mean "when not inside a character class"?
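
As a rough illustration of numbering by the position of the opening parenthesis, and of a parenthesis inside a character class not counting, here is a small sketch. It again assumes Python's re module as a stand-in for the XPath dialect; the patterns are made up for this example.

```python
import re

# Groups are numbered by the position of their opening parenthesis,
# left to right, regardless of nesting; (?:...) is skipped.
nested = re.match(r"(a(b))(?:c(d))", "abcd")
groups = nested.groups()
print(groups)  # ('ab', 'b', 'd')

# A "(" inside a character class is a literal and opens no group.
in_class = re.compile(r"[(](a)")
class_group_count = in_class.groups
print(class_group_count)  # 1
```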
Comment 2 Andrew Coleman 2016-02-04 15:08:41 UTC
At the teleconference on 2016-02-02, the WG agreed that the expression should return "cd" and that clarification text and an example should be added to the spec.
Comment 3 Abel Braaksma 2016-02-09 20:36:47 UTC
This issue was fixed with new text added to the spec; see: https://lists.w3.org/Archives/Public/public-xsl-query/2016Feb/0017.html

However, this leaves one other location where the ambiguity is still in place, i.e. in the section on back-references:

<quote>
A back-reference is an additional kind of atom. The construct \N where N is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the resulting number NN is such that the back-reference is preceded by NN or more unescaped opening parentheses.
</quote>

"NN or more unescaped opening parentheses" is too broad for the same reasons as what originated this bug. 

I suggest pointing to the section above here, with something like "NN or more opening [#LINK capturing parentheses]", where the link points to the section on "Sub-expressions (groups)".
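
To make the multi-digit rule concrete, here is a minimal sketch of one reading of the quoted paragraph. The function name and its parameters are hypothetical, and the count passed in is assumed to be the number of capturing parentheses that precede the back-reference, which is exactly the point in dispute.

```python
from typing import Tuple


def backref_number(digits: str, groups_before: int) -> Tuple[int, str]:
    """Split the digits after a backslash into (group number, literal rest).

    Hypothetical helper: `digits` is the digit run following the backslash,
    `groups_before` is the assumed count of preceding capturing groups.
    """
    # The first digit is always part of the back-reference.
    number = int(digits[0])
    i = 1
    # Further digits are absorbed only while enough groups precede.
    while i < len(digits):
        candidate = number * 10 + int(digits[i])
        if candidate > groups_before:
            break
        number = candidate
        i += 1
    return number, digits[i:]


print(backref_number("10", 9))    # (1, '0')
print(backref_number("10", 10))   # (10, '')
print(backref_number("123", 12))  # (12, '3')
```

With nine groups, \10 reads as \1 followed by a literal 0; counting a non-capturing parenthesis as a tenth would change that to group 10, which is why the wording matters.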
Comment 4 Michael Kay 2016-03-08 20:49:18 UTC
Proposed resolution:

(0) Replace "subexpression" by "sub-expression" throughout. (The two spellings are currently used interchangeably)

(a) In the paragraph that starts "Sub-expressions (groups) within the regular expression are recognized." add at the end "A left parenthesis is recognized as a capturing left parenthesis provided it is not followed by "?:" (see below), is not within a character group (square brackets), and is not escaped with a backslash. The sub-expression enclosed by a capturing left parenthesis and its matching right parenthesis is referred to as a capturing sub-expression."

(b) Change the sentence

The presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be counted by operations that number the groups within a regular expression, for example the fn:replace function.

to

The presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be treated as a capturing left parenthesis, which means it is not counted by constructs that number the groups within a regular expression, such as back-references and the fn:replace function.

(c) In the paragraph starting "Back-references are allowed", replace the phrase "unescaped opening parentheses" by "capturing left parentheses".

(d) In the paragraph starting "A back-reference matches", replace the phrase "unescaped left parentheses" by "capturing left parenthesis".

(e) In the specification of fn:replace and fn:analyze-string, replace "parenthesized sub-expression" (several times) by "capturing sub-expression".
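
The definition in (a) can be sketched as a simple scan. This is an illustrative reading only, not text from the spec: the function name is hypothetical, and it assumes that an unescaped "[" always opens a character group (nested for subtraction) and that a backslash escapes exactly the next character.

```python
def count_capturing_left_parens(regex: str) -> int:
    """Count left parentheses that are capturing under the rule in (a)."""
    count = 0
    bracket_depth = 0  # > 0 while inside a character group
    i = 0
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "[":
            bracket_depth += 1
        elif ch == "]" and bracket_depth > 0:
            bracket_depth -= 1
        elif ch == "(" and bracket_depth == 0:
            # Capturing unless immediately followed by "?:".
            if not regex.startswith("?:", i + 1):
                count += 1
        i += 1
    return count


print(count_capturing_left_parens("a(?:b(c))"))       # 1
print(count_capturing_left_parens("(a(b))(?:c(d))"))  # 3
print(count_capturing_left_parens(r"\((a)[(]"))       # 1
```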
Comment 5 Liam R E Quin 2016-03-08 21:56:24 UTC
A left parenthesis is recognized as a capturing left parenthesis provided it is not followed by "?:"

should read "not immediately followed by"
Comment 6 Abel Braaksma 2016-03-09 11:55:01 UTC
(In reply to Michael Kay from comment #4)
> Proposed resolution:
I think the proposed resolution is sound and fixes the remaining issue. I agree with Liam's comment. In the same vein you might consider "is not escaped by a preceding backslash that is not itself escaped", but if we are trying to be too precise there may be many other places that need updating, and I believe the consensus on this section is to allow some leniency as long as the leniency does not introduce ambiguities.
Comment 7 Andrew Coleman 2016-03-11 14:04:34 UTC
At the meeting on 2016-03-08, the WG agreed to adopt the proposal in comment 4.
Comment 8 Michael Kay 2016-03-21 17:57:09 UTC
The changes have been applied.