4106 – [FO] regex syntax: position of backreference

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 4106 - [FO] regex syntax: position of backreference

Summary: [FO] regex syntax: position of backreference

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0 (show other bugs)
Version:	Proposed Recommendation
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2006-12-21 23:35 UTC by Michael Kay
Modified:	2008-08-06 18:37 UTC (History)
CC List:	2 users (show)

See Also:

Attachments

Description Michael Kay 2006-12-21 23:35:45 UTC

In the syntax of regular expressions, a backreference is currently allowed as a charClassEsc. As such it can appear either outside square brackets, for example 

(abc)\1

or within square brackets

(abc)[\1]

However, it doesn't make sense to allow a backreference within square brackets, because constructs allowed within square brackets always match a single character (perhaps one of a set of possible characters, but never a sequence of more than one character), while a back-reference will in general match a sequence of characters.

I think that backreferences should appear in the syntax at the level of an atom:

[9] atom ::= Char | charClass | ( '(' regExp ')' ) | backReference

I have not been able to find a similar restriction documented for REs in Perl, Java, or .NET. However, none of these languages attempt to define the syntax of regular expressions using a BNF grammar, or to give a rigorous exposition of the semantics. Experiments with Java suggest that "(abc)[\1]" is accepted as a valid regular expression, but its semantics appear to be undefined: I am unable to identify any string that it matches.

Comment 1 Abel Braaksma 2006-12-22 10:50:11 UTC

Perhaps it's good to consider what Perl does here:

Perl handles backreferences inside a character class as octal escapes:
perl /[\7]/ matches U+07
perl /[\1]/ matches U+01
perl /[\11]/ matches the tab U+09

outside a character class, an octal escape must be of at least two digits to be treated as an octal escape. Otherwise, if it does not match a parenthesis, it will raise a runtime error. (which I still believe is the better approach and should've been adopted by XSLT, but it's too late to change the specs).

Of course, in XSLT there is no such thing as an octal escape, it uses entity references.

Comment 2 Abel Braaksma 2006-12-22 10:59:19 UTC

Considering the construct 

(low)[\1]

and taken the current specs, one might be tempted to treat it as matching the string "low" followed by any of the characters in the that string: [low] (expanded backreference). Taken the string:

helloworld

then the regular expression would:
not match: "hel"
match:     "lowo"
not match: "rld"

This could be a potentially very powerful construct, but I for one have not encountered it in any regular expression flavor (yet). For regular expression engines, it may involve a lot of re-engineering to allow for expandable character classes.

A better approach would be, I believe, to disallow the construct in its entirety.

Comment 3 Michael Kay 2007-02-13 16:42:35 UTC

The WG has approved this change in principle, subject to detailed wording.

Here is proposed text for the change to F+O.

Within section 7.6.1, the bullet point starting "Back-References are allowed" should change to read as follows:

# Back-references are allowed outside a character class expression. A back-reference is an additional kind of atom. The construct \n where n is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the back-reference is preceded by sufficiently many capturing subexpressions. A back-reference matches the string that was matched by the nth capturing subexpression within the regular expression, that is, the parenthesized subexpression whose opening left parenthesis is the nth unescaped left parenthesis within the regular expression. The closing right parenthesis of this subexpression must occur before the back-reference. For example, the regular expression ('|").*\1 matches a sequence of characters delimited either by an apostrophe at the start and end, or by a quotation mark at the start and end.

If no string is matched by the nth capturing subexpression, the back-reference is interpreted as matching a zero-length string.

Back-references change the following production:

[9] atom ::= Char | charClass | ( '(' regExp ')' )

to

[9] atom ::= Char | charClass | ( '(' regExp ')' ) | backReference

[9a] backReference ::= "\" [1-9][0-9]*

Note: within a character class expression, "\" followed by a digit is an error. Some other regular expression languages interpret this as an octal character reference.

Comment 4 Jim Melton 2007-02-26 00:35:55 UTC

The fix for this bug does not appear in the Recommendation of 23 January 2007. 
It will be considered for a future publication (either an Errata document or
some possible future version of the specification).

Comment 5 Michael Kay 2007-02-27 17:08:06 UTC

The change proposed in comment #3 was accepted, with one minor change, changing the Note at the end to:

Note: within a character class expression, "\" followed by a digit is invalid.
Some other regular expression languages interpret this as an octal character
reference.

[changes "an error" to "invalid" because that's the terminology used in the description of errors such as FORX0002]

This will be published in due course as an Erratum.

Comment 6 Jim Melton 2007-04-29 22:53:18 UTC

http://www.w3.org/Bugs/Public/show_bug.cgi?id=4106#c5 indicates the final resolution of this bug, so I am marking it FIXED.  Because the author of the bug and the author of the solution are the same person, I am also marking the bug CLOSED.