5348 – [FO] Back-references: "sufficiently many"

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 5348 - [FO] Back-references: "sufficiently many"

Summary: [FO] Back-references: "sufficiently many"

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0 (show other bugs)
Version:	Recommendation
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2008-01-05 11:00 UTC by Michael Kay
Modified:	2016-02-04 01:06 UTC (History)
CC List:	4 users (show)

See Also:

Attachments

Description Michael Kay 2008-01-05 11:00:19 UTC

In the specification for back-references in regular expressions (repeated unchanged in Erratum E4), we use the phrase:

<quote>
The construct \n where n is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the back-reference is preceded by sufficiently many capturing subexpressions.
</quote>

So what happens if the regular expression uses \11, and it is preceded by 12 capturing subexpressions, but there is no subexpression 11 because the closing paren for group 11 has not yet been encountered? That is:

(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11(12)(13)\11)

Is \11 intepreted as a reference to the non-existent group 11, or as a reference to group 1 followed by the digit 1?

I think it should be the latter. This involves changing the text to:

...these digits are taken to be part of the back-reference if and only if the back-reference is preceded by a capturing subexpression with the relevant number  (so \12 is treated as a reference to captured subexpression 12 if the back-reference is preceded by the closing parenthesis that matches the 12th opening parenthesis).

The error condition described in erratum E4 as:

<quote>
 The regular expression is invalid if this subexpression does not exist or if its closing right parenthesis occurs after the back-reference.
</quote>

can then occur only for a single-digit back-reference.

Editorially, it might be appropriate to reorder the sentences in the resulting paragraph.

Comment 1 Michael Kay 2008-03-11 15:44:59 UTC

In action A-358-06 I was asked to review what Perl does about this.

There is of course no formal specification of Perl. The man page 

http://www.perl.com/doc/manual/html/pod/perlre.html

states: "Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parentheses before the backreference."

This clearly doesn't make sense. If \10 occurs after the 10th left paren, but before the right paren that matches the 10th left paren, then it cannot "refer back" to that substring.

The Java 5 statement is informal but more defensible: "In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit." But it still has the problem our text has, that you can be in the middle of subexpression 10 even though there have been 15 completed subexpressions.

Pragmatically, with Java 5:

Pattern.matches("(X)(\11)", "XX1") 

    true - the backreference is to subexp 1

Pattern.matches("(X)(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?(10)?(Y)(\\11)", "XYY")

    true - the backreference is to subexp 11

Pattern.matches("(X)(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?(10)?((Y)(\\11))", "XYX1")

    false. Here the back-reference \11 appears within the 11th subexpression. I can't find any string that matches this regex. It seems to be treating it as a reference to subexpression 11, which can never be matched, rather than treating it as a reference to subexpression 1.

Pattern.matches("(X)(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?(10)?((Y)(\\12))", "XYY")

   true. The back-reference \12 is recognized as referring to subexp 12, although it actually appears within subexp 11.

So the (provisional) conclusion for Java is that it does what Perl says: it recognizes the backreference \11 if there are 11 open parens; if there haven't been 11 close parens then the back-reference will never match anything.

I don't immediately have the ability to test what Perl does.

Comment 2 Liam R E Quin 2008-03-19 02:30:28 UTC

The pattern (a(\1)) is legal in Perl, but can't ever actually match anything in practice as far as I can tell.

I have not been able to come up with something that (a\1) matches either, nor
for that matter (\1).  And Perl developers agree that it won't match anything, for what that's worth.

In the example in comment 1, then, we have a pathological expression that won't ever match anything, but, the \11 is taken as a reference to the eleventh open parenthesis, not as a \1 followed by a digit 1, as in Java.

You can see this by running:
perl -Mre=debug -e '"1234567891011121311" =~ /(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11(12)(13)\11)/'


An error in this case might be more useful to our users, I suspect, although I wouldn't want to require it at this point, partly as it's a (useful) breaking change, and partly because it might be hard for implimentors to detect if they pass the patterns to libpcre or java.

I propose making no change to the documents.

Comment 3 Michael Kay 2008-04-22 17:13:17 UTC

On 22 April 2008 the WG decided as follows: a backreference \11 that occurs after the 11th open parenthesis, but before the closing paren that matches the 11th open paren, should be an error. The editor was asked to propose wording to achieve this. 

The proposed change affects text that currently appears in Erratum FO.E4:

http://www.w3.org/XML/2007/qt-errata/xpath-functions-errata.html#E4

The proposed revision of that paragraph, with changes highlighted using (*...*) for new text and (:...:) for deleted text, and (^...^) for moved text, is:

Back-references are allowed outside a character class expression. A back-reference is an additional kind of atom.  The construct \N where N is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the resulting number NN is such that the back-reference is preceded by (* NN or more opening parentheses *) (:sufficiently many capturing subexpressions:). (^The regular expression is invalid if (*a back-reference refers to a (:this:) subexpression (*that*) does not exist or (*whose*)(:if its:) closing right parenthesis occurs after the back-reference.^) 

Continue with unchanged text, moved into a new para:

A back-reference matches the string that was matched by the nth capturing subexpression within the regular expression, that is, the parenthesized subexpression whose opening left parenthesis is the nth unescaped left parenthesis within the regular expression.  For example, the regular expression ('|").*\1 matches a sequence of characters delimited either by an apostrophe at the start and end, or by a quotation mark at the start and end.

Comment 4 Michael Kay 2008-04-22 18:11:46 UTC

This change has been drafted as erratum E24, superseding erratum E4 which affects the same text.

Comment 5 Andy Agrawal 2009-03-21 02:15:04 UTC

The way this is worded seems to imply that at most 99 backreferences are allowed. Is this the intent?

The sentence I'm referring to is:

"The construct \N where N is a
single digit is always recognized as a back-reference; if this is followed by
further digits, these digits are taken to be part of the back-reference if and
only if the resulting number NN is such that the back-reference is preceded by
NN or more opening parentheses"

Is there any reason to impose a limit of 99? Of course, I can't imagine any real use cases that would need more than 99 backreferences, but why impose an arbitrary limit in the language itself?

Comment 6 Andy Agrawal 2009-03-21 03:19:53 UTC

The problem with making \11 illegal in the scenario mentioned is that you can't express \1 followed by the literal "1". Wouldn't it just be best to treat \11 as a backreference if there have been 11 closed subexpressions before, otherwise treat it as \1 followed by the literal "1" ?

Comment 7 Michael Kay 2009-03-23 20:32:48 UTC

>The way this is worded seems to imply that at most 99 backreferences are
allowed. Is this the intent?

No, this wasn't the intent. NN was intended as a convenient variable-name representing a number of two or more digits. I think if there were a restriction to two digits, we would not rely on this choice of variable-name to say so.

>The problem with making \11 illegal

Yes, whoever designed this syntax deserves to be shot. But we made a policy decision to be compatible with Perl, warts and all.

Michael Kay
(personal response)

P.S. This bug is marked CLOSED. If you feel there is an issue that the WG needs to address, please either re-open the bug or open a new one. CLOSED bugs don't get onto the agenda.

Comment 8 Abel Braaksma 2016-02-04 01:06:46 UTC

> The problem with making \11 illegal in the scenario mentioned is that you 
> can't express \1 followed by the literal "1".
Actually, you can: "(?:\1)1" does exactly that and is the recommended way explained in "Mastering Regular Expressions" by Friedl, to remove the ambiguity.

PS: this is no intend to reopen the bug, I just stumbled upon this while working on test case fn-matches-38, which was written in reference to this bug.