<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>29415</bug_id>
          
          <creation_ts>2016-02-01 18:38:38 +0000</creation_ts>
          <short_desc>[FO31] Regex: capturing parentheses inside non-capturing parentheses</short_desc>
          <delta_ts>2016-03-22 09:31:02 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>Functions and Operators 3.1</component>
          <version>Candidate Recommendation</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows NT</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Abel Braaksma">abel.braaksma</reporter>
          <assigned_to name="Michael Kay">mike</assigned_to>
          <cc>andrew_coleman</cc>
    
    <cc>liam</cc>
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>124796</commentid>
    <comment_count>0</comment_count>
    <who name="Abel Braaksma">abel.braaksma</who>
    <bug_when>2016-02-01 18:38:38 +0000</bug_when>
    <thetext>I had some trouble finding out what fn:replace(&quot;abcd&quot;, &quot;a(?:b(c))&quot;, &quot;$1&quot;) was supposed to return, because of the normal parens nested within non-capturing parens. I.e., does it return &quot;cd&quot; or &quot;d&quot;?

According to the spec, the correct answer is &quot;cd&quot;, which probably makes sense (though an argument against it is: if you are *inside* non-capturing parentheses, you are not capturing anymore, so capturing parentheses here do not capture).
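
As a quick sanity check of the &quot;cd&quot; reading, here is a minimal Python sketch. Python&apos;s re module is only a stand-in for the XPath regex flavor (an assumption on my part), but it treats this particular pattern the same way:

```python
import re

# Python's re module stands in for the XPath regex flavor here (an assumption).
# (?:...) does not capture, so the nested (c) is numbered as group 1.
result = re.sub(r'a(?:b(c))', r'\1', 'abcd')
print(result)
```

The match is &quot;abc&quot; and group 1 is &quot;c&quot;, so only the trailing &quot;d&quot; is left untouched.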

While I don&apos;t think the spec is wrong here (assuming my interpretation is correct), I would like to ask the WG to consider a line of clarification here.

Proposal: in 5.6.1 Regular expression syntax, 3rd bullet, add a Note along these lines:

   &quot;If parenthesized sub-expressions are encountered nested within non-capturing 
   parentheses, these will act as capturing parentheses and numbering continues 
   as explained in the paragraphs above.&quot;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>124797</commentid>
    <comment_count>1</comment_count>
    <who name="Abel Braaksma">abel.braaksma</who>
    <bug_when>2016-02-01 19:14:23 +0000</bug_when>
    <thetext>Hmm, reading more, I see that we say in the same section:

    &quot;The sub-expressions are numbered according to the position of the opening 
    parenthesis in left-to-right order within the top-level regular expression:&quot;

but searching for &quot;top-level regular expression&quot; does not yield any results. What is it? Does this include or exclude nested expressions? Since each paren can start a new regex, this does not seem to make sense.

Should it perhaps mean &quot;when not inside a character class&quot;?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>124839</commentid>
    <comment_count>2</comment_count>
    <who name="Andrew Coleman">andrew_coleman</who>
    <bug_when>2016-02-04 15:08:41 +0000</bug_when>
    <thetext>At the teleconference on 2016-02-02, the WG agreed that the expression should return &quot;cd&quot; and that clarification text and an example should be added to the spec.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>124943</commentid>
    <comment_count>3</comment_count>
    <who name="Abel Braaksma">abel.braaksma</who>
    <bug_when>2016-02-09 20:36:47 +0000</bug_when>
    <thetext>This issue was fixed by new text added to the spec; see: https://lists.w3.org/Archives/Public/public-xsl-query/2016Feb/0017.html

However, this leaves one other location where the ambiguity is still in place, i.e. in the section on back-references:

&lt;quote&gt;
A back-reference is an additional kind of atom. The construct \N where N is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the resulting number NN is such that the back-reference is preceded by NN or more unescaped opening parentheses.
&lt;/quote&gt;

&quot;NN or more unescaped opening parentheses&quot; is too broad, for the same reason that prompted this bug.
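
To illustrate the difference, the following Python sketch counts both kinds of left parenthesis. The function name count_left_parens is hypothetical, and the rule used for &quot;capturing&quot; (unescaped, outside square brackets, not immediately followed by ?:) is an assumption about the intended reading, not spec text:

```python
def count_left_parens(regex):
    """Return (unescaped, capturing) counts of '(' in a regex string.

    Assumed rule: a '(' is capturing if it is not escaped, not inside
    square brackets, and not immediately followed by '?:'.
    """
    unescaped = 0
    capturing = 0
    depth = 0        # nesting depth of square brackets
    escaped = False  # True when the previous character was an unescaped backslash
    for i, ch in enumerate(regex):
        if escaped:
            escaped = False  # this character is escaped; skip it
            continue
        if ch == '\\':
            escaped = True
        elif ch == '[':
            depth += 1
        elif ch == ']' and depth:
            depth -= 1
        elif ch == '(':
            unescaped += 1
            if depth == 0 and not regex.startswith('?:', i + 1):
                capturing += 1
    return unescaped, capturing


counts = count_left_parens('a(?:b(c))')
print(counts)
```

For a(?:b(c)) this reports two unescaped left parentheses but only one capturing one, which is the gap the current wording leaves open.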

I suggest pointing here to the section above, something like &quot;NN or more opening [#LINK capturing parentheses]&quot;, where the link points to the section on &quot;Sub-expressions (groups)&quot;.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125390</commentid>
    <comment_count>4</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2016-03-08 20:49:18 +0000</bug_when>
    <thetext>Proposed resolution:

(0) Replace &quot;subexpression&quot; by &quot;sub-expression&quot; throughout. (The two spellings are currently used interchangeably.)

(a) In the paragraph that starts &quot;Sub-expressions (groups) within the regular expression are recognized.&quot; add at the end &quot;A left parenthesis is recognized as a capturing left parenthesis provided it is not followed by &quot;?:&quot; (see below), is not within a character group (square brackets), and is not escaped with a backslash. The sub-expression enclosed by a capturing left parenthesis and its matching right parenthesis is referred to as a capturing sub-expression.&quot;

(b) Change the sentence

The presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be counted by operations that number the groups within a regular expression, for example the fn:replace function.

to

The presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be treated as a capturing left parenthesis, which means it is not counted by constructs that number the groups within a regular expression, such as back-references and the fn:replace function.

(c) In the paragraph starting &quot;Back-references are allowed&quot;, replace the phrase &quot;unescaped opening parentheses&quot; by &quot;capturing left parentheses&quot;.

(d) In the paragraph starting &quot;A back-reference matches&quot;, replace the phrase &quot;unescaped left parentheses&quot; by &quot;capturing left parenthesis&quot;.

(e) In the specification of fn:replace and fn:analyze-string, replace &quot;parenthesized sub-expression&quot; (several times) by &quot;capturing sub-expression&quot;.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125391</commentid>
    <comment_count>5</comment_count>
    <who name="Liam R E Quin">liam</who>
    <bug_when>2016-03-08 21:56:24 +0000</bug_when>
    <thetext>A left parenthesis is recognized as a capturing left parenthesis provided it is not followed by &quot;?:&quot;

should read &quot;not immediately followed by&quot;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125399</commentid>
    <comment_count>6</comment_count>
    <who name="Abel Braaksma">abel.braaksma</who>
    <bug_when>2016-03-09 11:55:01 +0000</bug_when>
    <thetext>(In reply to Michael Kay from comment #4)
&gt; Proposed resolution:
I think the proposed resolution is sound and fixes the remaining issue. I agree with Liam&apos;s comment. In the same vein, you might consider &quot;is not escaped by a preceding backslash that is not itself escaped&quot;. But if we try to be too precise, there may be many other places that need updating, and I believe the consensus on this section is to allow some leniency as long as the leniency does not introduce ambiguities.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125442</commentid>
    <comment_count>7</comment_count>
    <who name="Andrew Coleman">andrew_coleman</who>
    <bug_when>2016-03-11 14:04:34 +0000</bug_when>
    <thetext>At the meeting on 2016-03-08, the WG agreed to adopt the proposal in comment 4.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125575</commentid>
    <comment_count>8</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2016-03-21 17:57:09 +0000</bug_when>
    <thetext>The changes have been applied.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>