1851 – [F&O] back references to a group that was captured 0 times?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0
Version:	Last Call drafts
Hardware:	PC Windows 2000

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ashok Malhotra
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-08-16 19:51 UTC by Fred Zemke
Modified:	2005-10-12 22:27 UTC
CC List:	0 users

Description Fred Zemke 2005-08-16 19:51:11 UTC

Section 7.6.1.1 first bullet after Rule [4] describes the semantics of
sub-expressions (frequently called capture groups in other literature on
regular expressions) in fn:replace().  The last sentence describes what 
is captured by a parenthesized sub-group with a quantifier such as "*",
namely the last substring that matched.

Presumably this sentence also applies to the next bullet, about back 
references, in which case this sentence is applicable to more than just
fn:replace.  This should be clarified.

Assuming that this sentence does apply to back references, then I salute
you, because you have specified something that Posix and Perl neglected
to specify.

But there is still something to be specified: what if the back reference
refers to a repeating group that is matched 0 times?  Example:
pattern "(c)*b\1", string to be searched "abc".  The (c)* must match
the zero-length string immediately before "b" in the searched string.  
Then the "b" in the pattern matches the "b" in the searched string.
Now, what does the back reference \1 match?  The sentence that I cited 
does not answer this question, since there is no "last" match when 
there are no matches at all.

My informants tell me that the intended behavior is that \1 has no match;
therefore fn:matches would return false for this example.

On the other hand, consider the pattern "((c)*)b\1" with the same string
to be searched.  In this example, (c)* matches zero times, ((c)*) 
matches the zero-length string one time, and consequently \1 is tasked
with matching the zero-length string, which it can do immediately 
following "b" in the searched string.  Consequently fn:matches would 
return true.
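
For comparison only, the two patterns above can be tried against Python's `re` module. This is a sketch of how one non-XPath engine treats them, not a statement of what fn:matches is required to do; the variable names are illustrative.

```python
import re

SUBJECT = "abc"

# Group 1 is the repeated group itself; it participates zero times before "b",
# so Python's re makes the back reference \1 fail.
first = re.search(r"(c)*b\1", SUBJECT)

# Group 1 is the outer group; it matches the zero-length string exactly once,
# so \1 has an (empty) captured value to match after "b".
second = re.search(r"((c)*)b\1", SUBJECT)

print(first)                  # None
print(second.group(0))        # b
print(repr(second.group(1)))  # ''
```

In this engine the first pattern finds no match and the second matches "b", which agrees with the behavior described for the two examples above.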

Comment 1 Michael Kay 2005-08-17 13:36:50 UTC

First a meta-comment: as you observe, many specifications of regular expression
semantics are amazingly informal, and we already do very well compared with
other languages such as Perl and Java. It's good to get the specification
precise, but there may be a point at which it's better to leave things a little
fuzzy at the edges to allow implementors to do whatever the underlying library
does. If users are managing to write Perl and Java without precise guarantees of
behaviour in edge cases, perhaps this isn't a problem we need to solve. There's
also a danger that if we overspecify, some of the implementors who are less
concerned about 100% conformance may simply ignore us.

You're right that captured subgroups now apply not only to replace(), but also
to back-references (and also to xsl:analyze-string in XSLT).

My expectation is that the captured substring for a subexpression that's matched
zero times is the zero-length string. In XSLT we say this quite explicitly (see
http://www.w3.org/TR/xslt20/#regex-group). This also applies to cases such as

((a)|(b)) 

Michael Kay

Comment 2 Ashok Malhotra 2005-09-27 15:42:23 UTC

The WGs decided on 9/27 to fix this bug by adding the following sentence to the
fourth bullet in section 7.6.1:

If no strings were matched by the nth capturing subexpression the back-reference
is accepted as matching a zero-length string.

The WGs also decided that back-references should be spelt with a hyphen.
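
A rough way to experiment with the sentence adopted above is to emulate it in an engine that otherwise fails a back-reference to an unmatched group. The sketch below uses Python's `re` module and its conditional construct `(?(N)...)`; `lenient_backrefs` is a hypothetical helper written for this illustration, and its rewrite is deliberately naive (it ignores escaped backslashes and character classes and handles only \1 to \9).

```python
import re


def lenient_backrefs(pattern: str) -> str:
    # Hypothetical helper: rewrite each \N as (?(N)\N), so that the
    # back-reference is attempted only if group N took part in the match
    # and is otherwise treated as matching a zero-length string.
    return re.sub(
        r"\\([1-9])",
        lambda m: f"(?({m.group(1)})\\{m.group(1)})",
        pattern,
    )


original = r"(c)*b\1"
rewritten = lenient_backrefs(original)

strict = re.search(original, "abc")    # Python's native behavior
lenient = re.search(rewritten, "abc")  # emulating the resolved wording

print(rewritten)         # (c)*b(?(1)\1)
print(strict)            # None
print(lenient.group(0))  # b
```

When the group does capture something, the rewritten pattern behaves like the original: `re.fullmatch(rewritten, "cbc")` still requires the trailing "c".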

Comment 3 Weiran Zhang 2005-10-05 01:54:38 UTC

I think this may not be the best behavior. In effect, this is imposing a default
value of zero-length string on back-references, which makes it impossible to
differentiate among sub-expressions that actually matched the zero-length
string, sub-expressions that didn't participate in the match, and invalid
sub-expression references. A pattern like '(a)*\1' will match anything under
this interpretation. Conceptually it is also confusing to the user since
non-empty subexpressions should not match an empty string. None of the popular
implementations I know of (Perl, Java, etc.) behaves this way.
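
The distinction drawn in this comment can be observed in Python's `re` module, used here only as one example of an existing implementation (the comment names Perl and Java; Python is an assumption of this sketch). A group that never participated reports `None`, a group that matched the zero-length string reports `''`, and `(a)*\1` does not match arbitrary input unless the back-reference is made optional by hand.

```python
import re

# Group 1 never participates versus group 1 matching the empty string.
unset = re.search(r"(c)*b", "abc")
empty = re.search(r"((c)*)b", "abc")
print(unset.group(1))        # None
print(repr(empty.group(1)))  # ''

# With Python's native semantics, \1 fails when group 1 is unset.
print(re.search(r"(a)*\1", "xyz"))  # None

# Emulating "unset group matches a zero-length string" with a conditional:
# the pattern now matches at the start of any input.
anything = re.search(r"(a)*(?(1)\1)", "xyz")
print(anything.span())  # (0, 0)
```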