This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 1889 - Regex [+-] syntax
Summary: Regex [+-] syntax
Status: CLOSED FIXED
Alias: None
Product: XML Schema
Classification: Unclassified
Component: Datatypes: XSD Part 2
Version: unspecified
Hardware: All All
Importance: P1 normal
Target Milestone: ---
Assignee: C. M. Sperberg-McQueen
QA Contact: XML Schema comments list
URL:
Whiteboard: cluster: regex
Keywords: resolved
Depends on:
Blocks:
 
Reported: 2005-08-26 13:31 UTC by Sandy Gao
Modified: 2008-05-31 07:44 UTC

See Also:



Description Sandy Gao 2005-08-26 13:31:45 UTC
The text defining regular expressions in Appendix F Schema Part 2 Second
Edition (28 Oct 2004) seems to be inconsistent between the BNF and the
accompanying prose. See
http://lists.w3.org/Archives/Public/www-xml-schema-comments/2005JulSep/0030.html
Comment 1 C. M. Sperberg-McQueen 2005-09-07 21:59:40 UTC
It's embarrassing that so many inconsistencies should be present,
in such close proximity to material we worked over several times.
But Mike Kay's analysis appears to be correct, and I believe the
WG should classify this as an error and arrange for a 
corrigendum.
Comment 2 C. M. Sperberg-McQueen 2006-01-15 00:15:48 UTC
The WG classified this issue as a requirement at its telcon of 13 January 2006
and instructed the editors to prepare a proposal with the obvious fix.
Comment 3 C. M. Sperberg-McQueen 2006-09-21 00:00:28 UTC
At the face to face meeting of January 2006 in St. Petersburg,
the Working Group decided not to take further action on this
issue in XML Schema 1.1.  (This issue was not discussed
separately; it was one of those which were dispatched by a
blanket decision that all other open issues would be closed
without action, unless raised again in last-call comments.)  Some
members of the Working Group expressed regret over not being able
to resolve all the issues dealt with in this way, but on the
whole the Working Group felt it better not to delay Datatypes 1.1
in order to resolve all of them.

This issue should have been marked as RESOLVED /WONTFIX at that
time, but apparently was not.  I am marking it that way now, to
reduce confusion.
Comment 4 Michael Kay 2006-09-21 07:32:44 UTC
I fail to understand how five-and-a-half years after the XML Schema specification came out, the WG has failed to resolve a simple technical problem that has been known for nearly all that time, and can now deem that the problem will be allowed to remain in the next release of the specification. This isn't something that's difficult to resolve because of environment dependencies or implementation difficulties or political hassles or because it's at the boundaries of computer science. It's a simple straightforward bug. Schema implementors and schema authors have been tripping over this issue, even W3C working groups have been publishing schemas that work with some processors and not others. Moreover, the QT specifications are impacted because they refer normatively to the regex definitions in Schema Part 2. Closing this as WONTFIX seems to show a wanton disregard for quality. If it's not the purpose of a 1.1 release to fix such problems, what is the purpose?
Comment 5 Michael Kay 2006-11-17 00:49:17 UTC
Since there doesn't seem to be much effort going into resolving this, and since it accounts for a significant proportion of the problems I am having in matching the published test suite results, let me propose a solution.

PROPOSAL

(a) leave the grammar unchanged

(b) in each of the definitions in App. F, where the term being defined is spelt differently from the corresponding metasymbol, add a cross-reference. For example: "Definition: A regular expression (regExp) is composed from zero or more ·branch·es, separated by | characters." This is to remove any ambiguity about whether the term "XML Character" is a reference to the metasymbol XMLChar or to some other concept with a similar name...

(c) expand the definition of Character Range:

[Definition:] A character range (charRange) R identifies a set of characters C(R) containing all XML characters with UCS code points in a specified range. 

(d) replace the text below rule 22 as follows:

There are two forms of character range: a ·start-end range·, and a ·single-character range·. A character or ·single character escape· is taken as the start of a ·start-end range· if (a) it is valid as such, and (b) it is immediately followed by a hyphen. Otherwise (if it is valid as such) it is taken as a ·single-character range·.

[Definition:] A ·start-end range· (seRange) s-e identifies the set that contains all XML characters with UCS code points greater than or equal to the code point of s, but not greater than the code point of e.

For s-e to be a valid character range, it must satisfy the following rules in addition to those implied by the grammar:

    * If s is the first character in a ·character class expression·, then s is not ^.
    * The code point of e is greater than or equal to the code point of s.

Note:  The code point of a ·single character escape· is the code point of the single character in the set of characters that it identifies. 
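
As an illustration only, the two validity rules proposed above for a start-end range can be sketched in Python. The function name `start_end_range` and the `is_first` flag are hypothetical; the sketch ignores escapes and the restriction to XML characters, and it checks nothing beyond the two listed rules.

```python
def start_end_range(s: str, e: str, is_first: bool = False) -> range:
    """Return the code points covered by the range s-e, or raise ValueError.

    is_first (hypothetical flag): True when s is the first character of the
    character class expression, where '^' would signal negation instead.
    """
    if len(s) != 1 or len(e) != 1:
        raise ValueError("s and e must each be a single character")
    if is_first and s == "^":
        raise ValueError("'^' cannot start a range at the beginning of the expression")
    if ord(e) < ord(s):
        raise ValueError("code point of e is below code point of s")
    # All code points from s up to and including e.
    return range(ord(s), ord(e) + 1)


if __name__ == "__main__":
    print([chr(c) for c in start_end_range("0", "3")])
```

Under these assumptions, a reversed range such as z-a is rejected because the code point of e is smaller than that of s.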

[Definition:] A ·single XML character· (XMLChar) is a ·character range· that identifies the set of characters containing only itself. For a character to be a valid ·character range·, it must satisfy the following rules in addition to those implied by the grammar:

    * The ^ character is only valid at the beginning of a ·positive character group· if it is part of a ·negative character group·
    * The - character is a valid ·character range· only 
      (a) at the beginning of a ·positive character group·, or
      (b) if immediately followed by a ']' character 

Note: An unescaped - character is handled as follows. If it appears at the start of a ·positive character group· or immediately before a ']' character then it is taken as representing a literal hyphen. If it appears immediately before a '[' character it is taken as representing a subtraction operator (regardless of whether what follows is a valid ·character class expression·). If it appears immediately after a character or character escape that is valid as the start of a ·start-end range·, then it causes that character or character escape to be treated as the start of a ·start-end range·. If it appears anywhere else (for example, after another hyphen, or after the end of a ·start-end range· but not followed by '['), then it is an error.
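
The hyphen rules in the note above can be sketched as a small classifier. This is a rough Python illustration, not part of the proposal: the function name `classify_hyphens` and the result labels are made up here, the rules are applied in the order the note lists them, only two-character escapes such as \- are recognised, and scanning stops at the first subtraction operator.

```python
def classify_hyphens(expr: str) -> list[tuple[int, str]]:
    """Classify each unescaped '-' in the top-level group of expr.

    Returns (index, kind) pairs, kind being one of
    "literal", "subtraction", "range" or "error".
    """
    if not expr.startswith("["):
        raise ValueError("expected a character class expression starting with '['")
    start = 2 if expr.startswith("[^") else 1
    i = start
    can_start_range = False  # may the previous token begin a start-end range?
    after_range_op = False   # was the previous token a range hyphen?
    found = []
    while i < len(expr) and expr[i] != "]":
        if expr[i] != "-":
            # Ordinary character, or a two-character escape such as \-.
            i += 2 if expr[i] == "\\" else 1
            # The end of a range cannot also start a new range.
            can_start_range = not after_range_op
            after_range_op = False
            continue
        nxt = expr[i + 1 : i + 2]
        if i == start or nxt == "]":
            kind = "literal"
        elif nxt == "[":
            kind = "subtraction"
        elif can_start_range:
            kind = "range"
        else:
            kind = "error"
        found.append((i, kind))
        if kind == "subtraction":
            break  # what follows is a nested expression; not scanned here
        after_range_op = kind == "range"
        can_start_range = False
        i += 1
    return found


if __name__ == "__main__":
    for sample in ["[+-]", "[a-z-[aeiou]]", "[0-9-A-Z]"]:
        print(sample, classify_hyphens(sample))
```

Under this reading, the hyphen in [+-] is a literal, while the middle hyphen of [0-9-A-Z] is reported as an error because it follows the end of a start-end range and is not followed by '['.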

NOTE ON PROPOSAL

Some regex implementations are more permissive than this. For example, they allow - as the start or end of a start-end range, and they allow constructs such as [0-9-A-Z] meaning zero-to-nine, hyphen, or A-Z. 
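
As one concrete example of such permissive behaviour, Python's `re` module (which is not an XSD regex implementation and is used here purely as a stand-in) accepts [0-9-A-Z] and treats the middle hyphen as a literal:

```python
import re

# Python's re is used only to illustrate a permissive engine;
# it does not implement the XML Schema regex grammar.
permissive = re.compile(r"[0-9-A-Z]")

# Which single characters does the class accept?
accepted = {ch: permissive.fullmatch(ch) is not None for ch in "5-Qa"}

if __name__ == "__main__":
    print(accepted)
```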
Comment 6 Frans Englich 2007-05-23 09:10:47 UTC
As an implementor and user of Schema, I would strongly prefer to see problems like this fixed over having an earlier release.

Schema is a central piece of technology and its quirks and ambiguities create much grief. What makes me happy about the efforts going into 1.1 is that they can now be fixed.
Comment 7 Dave Peterson 2007-07-27 15:58:55 UTC
(In reply to comment #0)
> The text defining regular expressions in Appendix F Schema Part 2 Second
> Edition (28 Oct 2004) seems to be inconsistent between the BNF and the
> accompanying prose. See
> http://lists.w3.org/Archives/Public/www-xml-schema-comments/2005JulSep/0030.html

Inasmuch as the comment-list discussion of the referenced comment includes the point that '+' is a metacharacter and hence must always be escaped when a real point is intended, I intend to include that point as part of this bug and fix it as part of this bug's fix.
Comment 8 Michael Kay 2007-07-27 21:40:26 UTC
In reply to comment #7

(a) I haven't been able to locate the comment-list discussion that you refer to

(b) I can't see anywhere in the spec that suggests that because a character is a metacharacter, it needs to be escaped when used in a charGroup; so I don't see where the problem with "+" arises.
Comment 9 Dave Peterson 2007-07-28 03:41:13 UTC
(In reply to comment #8)
> In reply to comment #7

> (b) I can't see anywhere in the spec that suggests that because a character is
> a metacharacter, it needs to be escaped when used in a charGroup; so I don't
> see where the problem with "+" arises.

@*^$&# convoluted descriptions.  I believe you are right.  The definition of metacharacter says metacharacters have special meanings in REs, but that is not true when most special characters are used within character groups.  Fortunately we haven't acted on this believed-to-be-an-error, so we will ignore it unless someone else convinces the editors that this new (to me, at least) interpretation is wrong.
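
To make the point about character groups concrete, here is a small sketch using Python's `re` module as a stand-in (an assumption: it is not an XSD processor, though it agrees with the reading above for this case). Inside the group, '+' needs no escape, and the trailing '-' is a literal rather than a range operator; a range from '+' to '-' would also admit ',', which lies between them.

```python
import re

# '+' is unescaped inside the character group; '-' is last, so it is a literal.
sign = re.compile(r"[+-]")
signed_int = re.compile(r"[+-]?[0-9]+")

if __name__ == "__main__":
    for text in ["+", "-", ","]:
        print(repr(text), sign.fullmatch(text) is not None)
    print(signed_int.fullmatch("+42") is not None)
```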
Comment 10 C. M. Sperberg-McQueen 2008-05-30 04:01:21 UTC
A wording proposal intended to resolve this issue (and some other 
regex-related issues) is at 
http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.b1889.html
(member-only link).
Comment 11 Dave Peterson 2008-05-30 18:05:06 UTC
(In reply to comment #7)

> Inasmuch as the comment-list discussion of the referenced comment includes the
> point that '+' is a metacharacter and hence must always be escaped when a real
> point is intended, I intend to include that point as part of this bug and fix
> it as part of this bug's fix.

It is a fact that *outside a character range expression* an autonymous ("self-naming", not "autonomous") use of a metacharacter must be escaped.  But escaping '+' is also covered in bug 3659, at least for date/time datatypes, so I propose that be used to track plus signs needing escaping.  The rest of this bug is covered by the proposed fix, which was being considered today.
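
The contrast with use outside a character class can be illustrated the same way. Again this relies on Python's `re` module as a stand-in rather than an XSD processor, and the helper name `compiles` is hypothetical: a bare leading '+' is rejected as a quantifier with nothing to repeat, whereas the escaped form, or the form wrapped in a character class, is accepted.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print(compiles("+1"))     # unescaped '+' outside a character class
    print(compiles(r"\+1"))   # escaped '+'
    print(compiles("[+]1"))   # '+' inside a character class
```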
Comment 12 C. M. Sperberg-McQueen 2008-05-31 01:53:50 UTC
The wording proposal mentioned in comment #10 was adopted by the WG
at its call today.  We believe it resolves the issue in full, and I am
accordingly marking the issue as resolved.

Michael, as the originator of the issue, please indicate your acquiescence
in the resolution of the issue by changing the issue status to CLOSED,
or indicate dissent by reopening it, in the usual way.   Since you were
on the call and didn't object, I assume you assent, but for form's sake I'll
ask for this additional sign.  If you don't respond within the next
two weeks, we'll assume that silence implies consent.