This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.
Section 7.6.1.1, first bullet after Rule [4], describes the semantics of sub-expressions (frequently called capture groups in other literature on regular expressions) in fn:replace(). The last sentence describes what is captured by a parenthesized sub-group with a quantifier such as "*", namely the last substring that matched. Presumably this sentence also applies to the next bullet, about back references, in which case it is applicable to more than just fn:replace. This should be clarified.

Assuming that this sentence does apply to back references, then I salute you, because you have specified something that POSIX and Perl neglected to specify. But there is still something to be specified: what if the back reference refers to a repeating group that is matched 0 times?

Example: pattern "(c)*b\1", string to be searched "abc". The (c)* must match the zero-length string immediately before "b" in the searched string. Then the "b" in the pattern matches the "b" in the searched string. Now, what does the back reference \1 match? The sentence that I cited does not answer this question, since there is no "last" match when there are no matches at all. My informants tell me that the intended behavior is that \1 has no match; therefore fn:matches would return false for this example.

On the other hand, consider the pattern "((c)*)b\1" with the same string to be searched. In this example, (c)* matches zero times, ((c)*) matches the zero-length string one time, and consequently \1 is tasked with matching the zero-length string, which it can do immediately following "b" in the searched string. Consequently fn:matches would return true.
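For readers who want to try the two patterns, here is a minimal sketch using Python's re module. Python is only a stand-in and is not the XPath regex dialect under discussion; it happens to treat a back-reference to a group that never participated as a failure, which corresponds to the "no match" reading described above. The names TEXT, unset_group and empty_group are illustrative.

```python
import re

TEXT = "abc"

# Group 1 is (c). It participates zero times, so it is left unset.
unset_group = re.search(r"(c)*b\1", TEXT)

# Group 1 is ((c)*). It participates once and captures the zero-length string.
empty_group = re.search(r"((c)*)b\1", TEXT)

print(unset_group)                  # None: \1 refers to an unset group and fails
print(empty_group.group(0))         # b
print(repr(empty_group.group(1)))   # ''
```

Under the other reading, where an unset group is treated as the zero-length string, the first search would succeed as well.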
First a meta-comment: as you observe, many specifications of regular expression semantics are amazingly informal, and we already do very well compared with other languages such as Perl and Java. It's good to get the specification precise, but there may be a point at which it's better to leave things a little fuzzy at the edges to allow implementors to do whatever the underlying library does. If users are managing to write Perl and Java without precise guarantees of behaviour in edge cases, perhaps this isn't a problem we need to solve. There's also a danger that if we overspecify, some of the implementors who are less concerned about 100% conformance may simply ignore us.

You're right that captured subgroups now apply not only to replace(), but also to back-references (and also to xsl:analyze-string in XSLT).

My expectation is that the captured substring for a subexpression that's matched zero times is the zero-length string. In XSLT we say this quite explicitly (see http://www.w3.org/TR/xslt20/#regex-group). This also applies to cases such as ((a)|(b)).

Michael Kay
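The ((a)|(b)) case can be illustrated with a small Python sketch. Python's re module is used only for convenience and reports a non-participating group as None; the helper regex_group is a hypothetical name that mimics the convention described above, where such a group is reported as the zero-length string.

```python
import re

# Against "a", group 2 (a) participates and group 3 (b) does not.
match = re.fullmatch(r"((a)|(b))", "a")

def regex_group(m, n):
    """Hypothetical helper: report a non-participating group as ''."""
    value = m.group(n)
    return "" if value is None else value

print(match.groups())                 # ('a', 'a', None)
print(repr(regex_group(match, 3)))    # ''
```

Once the zero-length default is applied, a group that did not participate can no longer be told apart from one that matched the empty string.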
The WGs decided on 9/27 to fix this bug by adding the following sentence to the fourth bullet in section 7.6.1: "If no strings were matched by the nth capturing subexpression the back-reference is accepted as matching a zero-length string." The WGs also decided that "back-references" should be spelt with a hyphen.
I think this may not be the best behavior. In effect, this is imposing a default value of zero-length string on back-references, which makes it impossible to differentiate among sub-expressions that actually matched the zero-length string, sub-expressions that didn't participate in the match, and invalid sub-expression references. A pattern like '(a)*\1' will match anything under this interpretation. Conceptually it is also confusing to the user since non-empty subexpressions should not match an empty string. None of the popular implementations I know of (Perl, Java, etc.) behaves this way.
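The effect on '(a)*\1' can be sketched in Python. This rests on two assumptions: Python's re module stands in for the "fail on unset group" behaviour (the comment names Perl and Java, not Python), and the zero-length default adopted by the WGs is emulated with a conditional group, (?(1)\1), which matches the empty string when group 1 is unset. The names NATIVE, ZERO_LENGTH_DEFAULT and matches are illustrative.

```python
import re

# Python's own behaviour: a back-reference to an unset group fails.
NATIVE = re.compile(r"(a)*\1")

# Emulation of the zero-length default: use \1 only if group 1 is set,
# otherwise match the empty string.
ZERO_LENGTH_DEFAULT = re.compile(r"(a)*(?(1)\1)")

def matches(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None

print(matches(NATIVE, "xyz"))               # False
print(matches(ZERO_LENGTH_DEFAULT, "xyz"))  # True
print(matches(ZERO_LENGTH_DEFAULT, ""))     # True
```

With the default in place the pattern succeeds on every input, including the empty string, which is the concern raised above.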