This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 29416 - [QT3] re00054, a test with character class expression [^-z], should throw FORX0002
Status: CLOSED INVALID
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: XQuery 3 & XPath 3 Test Suite
Version: Candidate Recommendation
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: O'Neil Delpratt
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2016-02-02 03:09 UTC by Abel Braaksma
Modified: 2016-03-09 11:58 UTC

Description Abel Braaksma 2016-02-02 03:09:46 UTC
The whole expression in this test is "^(?:[^-z]+)$" (without quotes).

I am reporting this because either
a) the rules are unclear or ambiguous, and the test is correct under one reading of the spec, or
b) the test is not correct.

In XSD 1.0, the production rules of [17] charRange apply. In the accompanying text, the author states that the rules are ambiguous and then goes on to say that they are not, given these constraints:

1) The [, ], - and \ characters are not valid character ranges; 
A: this does not apply

2) The ^ character is only valid at the beginning of a ·positive character group· if it is part of a ·negative character group·
A: this applies, and gets the meaning of negating the character group

3) The - character is a valid character range only at the beginning or end of a ·positive character group·. 
A: ambiguous in this case, as the production rules do not allow this here.

[14]: posCharGroup   ::= ( charRange | charClassEsc )+
[17]: charRange      ::= seRange | XmlCharIncDash
[18]: seRange        ::= charOrEsc '-' charOrEsc
[20]: charOrEsc      ::= XmlChar | SingleCharEsc
[21]: XmlChar        ::= [^\#x2D#x5B#x5D]
[22]: XmlCharIncDash ::= [^\#x5B#x5D]

Following these production rules, in part, we get:
4) it is a posCharGroup
5) it is a charRange
6) that range is "^" to "z"

Now back to rule (2) above: the "^" is only valid in this position if it is also part of a negative character group.

All in all, I think the expression should have been written differently, depending on the intended meaning:

If the intention was "from ^ to z", then it should have been written as [\^-z]
If the intention was "not from ^ to z", then [^^-z] appears to be allowed (though [^\^-z] makes more sense to me)
If the intention was "either ^, - or z", then [\^\-z]
If the intention was "not - or z", then [^\-z]

I think that the expression as written does not fit the production rules or description and should raise FORX0002.
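
The readings discussed in this report can be compared with a small sketch. It uses Python's re module purely as a stand-in: Python is not an XSD or XPath regex engine, so this only illustrates what each unambiguous spelling is meant to accept. The READINGS table, the accepted helper and the sample characters are hypothetical names chosen for the illustration.

```python
import re

# Hypothetical labels for the intended meanings listed in the report,
# each paired with the unambiguous spelling suggested there.
READINGS = {
    "from ^ to z": r"[\^-z]",
    "not from ^ to z": r"[^\^-z]",
    "either ^, - or z": r"[\^\-z]",
    "not - or z": r"[^\-z]",
}


def accepted(pattern, candidates="^-azA"):
    """Return the candidate characters that the character class matches."""
    return "".join(c for c in candidates if re.fullmatch(pattern, c))


for label, pattern in READINGS.items():
    print(f"{label:18} {pattern:10} accepts {accepted(pattern)!r}")
```

In Python the unescaped spelling [^^-z] behaves like [^\^-z], which matches the remark above that the escaped form is merely the more readable of the two.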
Comment 1 Michael Kay 2016-02-02 13:04:41 UTC
The rules for character ranges in XSD 1.0 are known to be a complete mess. XSD 1.1 indicates what the WG intended. Although we don't require support for XSD 1.1, in cases like this referring to the XSD 1.1 spec is the best way of sorting out the ambiguity.

The fact that the Schema WG chose to fix this bug in XSD 1.1 but not to issue a correction for 1.0 shouldn't inhibit us, I think, from having tests that assume the corrected interpretation.

The XSD 1.1 rules make it clear that [^-z] means "any character other than hyphen or "z"".
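
As a rough illustration of that reading, the sketch below runs the test's whole expression through Python's re module. This is an assumption made for demonstration only: Python is not an XPath regex engine, but for this pattern it also treats the hyphen after the negating "^" as a literal. The matches helper and the sample strings are hypothetical.

```python
import re

# The full expression from test re00054, compiled with Python's re module
# as a stand-in for an engine that follows the XSD 1.1 reading.
PATTERN = re.compile(r"^(?:[^-z]+)$")


def matches(s):
    """True if the whole string consists of characters other than '-' and 'z'."""
    return PATTERN.search(s) is not None


for s in ["abc", "^^", "a-b", "xyz", ""]:
    print(repr(s), matches(s))
```

Under this reading "^^" is accepted, since a caret inside the group is an ordinary character once the leading "^" has been consumed as negation.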
Comment 2 Abel Braaksma 2016-02-02 15:15:11 UTC
> The XSD 1.1 rules make it clear that [^-z] means "any character other than 
> hyphen or "z"".

Ok, thanks.

Is there room for a Note in the XP31 spec, along the lines of "Where XSD 1.0 shows ambiguity for character classes and ranges, refer to XSD 1.1. It is recommended that implementers prefer the production rules of XSD 1.1 over those of XSD 1.0 where such ambiguities arise."?
Comment 3 Michael Kay 2016-02-02 15:31:54 UTC
F+O 5.6.1 does say:

Note:

In [Schema 1.1 Part 2] the rules for the interpretation of hyphens within square brackets in a regular expression have been clarified; and the semantics of regular expressions are no longer tied to a specific version of Unicode.
Comment 4 O'Neil Delpratt 2016-03-04 10:56:44 UTC
I am marking this one as resolved as invalid. I could not find a better option in the list.
Comment 5 Abel Braaksma 2016-03-09 11:58:15 UTC
(In reply to O'Neil Delpratt from comment #4)
> I am marking this one as resolved as invalid. I could not find a better
> option in the list.
I agree. Let's leave it at that. I've closed the bug.