15545 – [QT3TS] Possible error in re00972?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XQuery 3 & XPath 3 Test Suite
Version:	Member-only Editors Drafts
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jim Melton
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2012-01-13 16:09 UTC by Tim Mills
Modified:	2012-05-23 13:55 UTC
CC List:	3 users

Description Tim Mills 2012-01-13 16:09:11 UTC

This test is as follows:

(every $s in tokenize('33a33', ',') satisfies matches($s, '^(?:(\d*){0,2}a\1)$')) and (every $s in tokenize('33a34', ',') satisfies not(matches($s, '^(?:(\d*){0,2}a\1)$')))

Part of this involves checking that

matches('33a33', '^(?:(\d*){0,2}a\1)$') 

is true.

If I understand the spec correctly, (\d*) can be matched 0 to 2 times. On the first pass \d* matches '33'; on a second pass it matches '' (the empty string). The spec states that:

"If a sub-expression matches more than one substring (because it is within a construct that allows repetition), then only the last substring that it matched will be captured."

Thus the back-reference \1 has the value '', not the (presumably expected) value '33', and the match fails.

If my understanding is correct, there are related problems in the following tests.

re00973	  
re00974	  
re00975	  
re00976
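
The reading described above can be sketched in Python. This is only an illustration under stated assumptions: it is not part of QT3TS and not an XPath regex engine, and the helper name strict_reading is hypothetical. It assumes each pass of (\d*) takes the longest run of digits, that nothing is backtracked, and that \1 keeps only the text of the last pass.

```python
def strict_reading(subject, passes, suffix=""):
    # Hypothetical helper modelling one reading of '^(?:(\d*){n}a\1<suffix>)$':
    # every pass of (\d*) is greedy, there is no backtracking, and the
    # back-reference \1 holds only what the last pass captured.
    if passes < 1:
        raise ValueError("this sketch needs at least one pass")
    pos = 0
    trace = []
    for _ in range(passes):
        end = pos
        while end < len(subject) and subject[end].isdigit():
            end += 1
        trace.append(subject[pos:end])  # text captured on this pass
        pos = end
    capture = trace[-1]
    # The rest of the subject must be exactly 'a', then \1, then the suffix.
    matched = subject[pos:] == "a" + capture + suffix
    return trace, capture, matched


one_pass = strict_reading("33a33", 1)
two_passes = strict_reading("33a33", 2)
print(one_pass)    # (['33'], '33', True)
print(two_passes)  # (['33', ''], '', False)
```

Under these assumptions a single pass leaves \1 as '33' and the match succeeds, while a second pass overwrites \1 with '' and the match fails, which is the discrepancy reported here.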

Comment 1 Michael Kay 2012-01-13 17:12:00 UTC

I think I've just checked in some changes to these tests. This was a bit naughty: I was under the impression that they were still under development and had not yet been committed. I came to the conclusion that the spec here is underdefined: with a construct such as

(a*)*

that can match the input "aaaa" in various ways, we aren't prescriptive about what the contents of \1 should be. One can argue that the inner loop should be executed four times and the outer loop once, but the spec makes no attempt to mandate that. Your theory that the outer loop is executed twice, matching "aaaa" the first time and "" the second time, is equally defensible. In fact there's nothing in the spec to say that the implementation must terminate...
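
To make the ambiguity concrete, the following Python sketch enumerates the ways (a*)* could consume "aaaa" and records what \1 would end up holding in each case. It is an illustration only: the function name last_captures is hypothetical, and the outer loop is capped at max_iters passes purely so that the enumeration terminates, since empty passes could otherwise repeat forever.

```python
def last_captures(text, max_iters):
    r"""Map each possible final value of \1 to one decomposition producing it.

    Models (a*)* consuming all of `text` (assumed to contain only 'a') in
    1..max_iters outer passes, where each pass may capture an empty string.
    """
    def cut(rest, pieces):
        # Yield every way to cut `rest` into `pieces` consecutive substrings.
        if pieces == 1:
            yield (rest,)
            return
        for i in range(len(rest) + 1):
            for tail in cut(rest[i:], pieces - 1):
                yield (rest[:i],) + tail

    found = {}
    for n in range(1, max_iters + 1):
        for parts in cut(text, n):
            found.setdefault(parts[-1], parts)  # keep the first example seen
    return found


candidates = last_captures("aaaa", 2)
for capture, parts in candidates.items():
    print(repr(capture), "from passes", parts)
```

With at most two outer passes, \1 could already be any of 'aaaa', 'aaa', 'aa', 'a' or ''. The two readings discussed above correspond to the decompositions ('aaaa',) and ('aaaa', '').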

Comment 2 Tim Mills 2012-01-17 13:21:06 UTC

Test re00976 remains a problem.

(every $s in tokenize('22a22z', ',') satisfies matches($s, '^(?:(\d*){2,}?a\1z)$')) and (every $s in tokenize('22a22', ',') satisfies not(matches($s, '^(?:(\d*){2,}?a\1z)$')))

Here, (\d*){2,}? causes two passes of matching \d*.  The first matches '22', the second matches '', so \1 is '' and the match fails.

For what it's worth, my interpretation of the specification is that \d* must match the longest possible substring, inferred from the text:

"Without the '?', the regular expression matches the longest possible substring."
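
The greedy versus reluctant distinction in the quoted sentence can be seen in isolation with a short Python sketch. Note the assumption: Python's re module is used only as a stand-in, it is not an implementation of the XPath regex flavour, and nothing here shows how any engine treats the full re00976 pattern.

```python
import re

subject = "22a22z"

# Without '?', \d* takes the longest run of digits available.
greedy = re.match(r"\d*", subject).group()

# With '?', \d*? takes the shortest match, which here is the empty string.
reluctant = re.match(r"\d*?", subject).group()

# A reluctant quantifier still grows as far as needed for what follows it.
anchored = re.match(r"\d*?a", subject).group()

print(repr(greedy), repr(reluctant), repr(anchored))  # '22' '' '22a'
```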

Comment 3 O'Neil Delpratt 2012-05-18 09:54:37 UTC

Needs WG discussion on whether there is a spec issue here.

Comment 4 Michael Kay 2012-05-23 13:55:04 UTC

Spec bug 17160 has been raised to implement the WG decision that we should be a little liberal in this area.

I have modified the test case to allow either a match or non-match.