29253 – [XQ3TS] fn-matches-50

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 29253 - [XQ3TS] fn-matches-50

Summary: [XQ3TS] fn-matches-50

Status:	RESOLVED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XQuery 3 & XPath 3 Test Suite (show other bugs)
Version:	Candidate Recommendation
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	O'Neil Delpratt
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-10-30 17:18 UTC by Tim Mills
Modified:	2015-12-02 14:24 UTC (History)
CC List:	1 user (show)

See Also:

Attachments

Description Tim Mills 2015-10-30 17:18:46 UTC

I'd like to query the tests

   <test id="p303" regex="^((.)?a\2)+$" input="babadad" result="n"/>

and

   <test id="p908" regex="(\w)?(abc)\1b" input="abcab" result="y"/><!-- Changed MHK. Was "n" -->


In the former, 

^((.)?a\2)+$" 

1st iteration: \2 = 'b', \1 = 'bab'
2nd iteration: \2 = '',  \1 = 'a'
3rd iteration: \2 = 'd', \1 = 'dad'

thus the whole string is matched.

In the latter

If (\w)? matches 'a', then we're trying to match 'a(abc)ab' against 'abcab'
If (\w)? doesn't match 'a', then we're trying to match '(abc)b' against 'abcab'.

Comment 1 Michael Kay 2015-11-04 10:25:30 UTC

I have yet to find a specification of how captured groups work that isn't completely fuzzy around the edges, and our spec is no exception. It's useful to have evidence of what other respected regex engines do (e.g. Perl, PCRE) as prima facie evidence that there is a problem. 

I tried the online tester at https://regex101.com/ on p303 and this gives false (our expected result) for Python and PCRE, but true for Javascript. It also gives false in the Java JDK engine.

I can certainly see the logic for why it should be considered to match. The counter-argument is probably that each captured group only has one value, it does not have different values for different iterations of some outer expression. But in the absence of a more formal specification, it's difficult to give an authoritative answer.

For p908, the online tester gives false for all three engines, and also for Java. The only engine I can find that returns true is the Saxon one (a derivative of Jakarta). So I'll concede that one! The Jakarta engine has a rather moronic interpretation of the rule that a captured group returns what was matched by the last matching attempt, sometimes returning this even when that matching attempt was not part of a complete match of the entire regex.

Comment 2 Tim Mills 2015-11-04 10:39:20 UTC

I think the main difference between implementations is whether in a regex such as:

(.)?a\1

when the "zero case" of the ? quantifier is matched, whether \1 is unbound (and therefore can never match anything) or \1 is bound to the empty string.  I think in F&O, it's explicitly the case that \1 is bound to the empty string.

e.g. in the online tool, try

a(.)?

matching

a

It reports "No match groups were extracted.".

Comment 3 Michael Kay 2015-11-04 10:56:33 UTC

Surely (.?) can be bound to an empty string, but (.)? can't? I would have thought a group should never be bound to a string that doesn't match the regex fragment inside the parentheses.

Comment 4 Tim Mills 2015-11-04 11:14:03 UTC

As a bit of background, we were formerly passing these tests due to a bug in our regex implementation. 

e.g.  (\w)?(abc)\1b  matching against "abcab"

It first attempted to match (\w)? to 'a', setting \1 = 'a', but then failed to match 'abc' to 'bca'.  Then it backtracked, but forgot to clear the capture group.

So it matched

(\w)? -> matches empty string, but reported \1 as 'a'
'abc' -> matches 'abc'
\1 -> matches the incorrect capture of 'a'.
'b' -> matches 'b'

Comment 5 Tim Mills 2015-11-04 11:25:40 UTC

(In reply to Michael Kay from comment #3)
> Surely (.?) can be bound to an empty string, but (.)? can't? I would have
> thought a group should never be bound to a string that doesn't match the
> regex fragment inside the parentheses.

The specification says

"If no string is matched by the Nth capturing subexpression, the back-reference is interpreted as matching a zero-length string."

So (.?) would match an empty string, and bind the \1 to the empty string.

But (.)? can match the empty string, but no string is matched by thecapturing subexpression, yet the back-reference \1 is still bound to the empty string.

Comment 6 Michael Kay 2015-12-02 14:24:27 UTC

The WG agreed to change the expected results of these two tests, and this has been done.