11125 – Regex grammar for 1.1 renders some 1.0 regexes invalid

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 11125 - Regex grammar for 1.1 renders some 1.0 regexes invalid

Summary: Regex grammar for 1.1 renders some 1.0 regexes invalid

Status:	RESOLVED FIXED

Alias:	None

Product:	XML Schema
Classification:	Unclassified
Component:	Datatypes: XSD Part 2 (show other bugs)
Version:	1.1 only
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	David Ezell
QA Contact:	XML Schema comments list

URL:
Whiteboard:
Keywords:	resolved

Depends on:
Blocks:

Reported:	2010-10-22 16:34 UTC by David Ezell
Modified:	2011-03-19 21:44 UTC (History)
CC List:	3 users (show)

See Also:

Attachments

Description David Ezell 2010-10-22 16:34:02 UTC

From the telcon 2010-10-22:

Originating with Michael Kay's email at: http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2010Oct/0006.html

<dezell> MK: the rules weren't clear before (whether or not +- is allowed) or backtracking, etc.
<dezell> MK: now they are very clear, and many existing schemas may break.
<dezell> DE: will people use a "compatibility" mode.
<dezell> MSM: not great.
<dezell> DE: have we made a mistake by making the rules too explicit?
<dezell> MSM: no no no.  We need to figure out ways to get clarity without breaking things.
<dezell> MK: won't change the grammar, only the prose following rule 81 in the grammar.
<dezell> MSM: I think we need a story about the topic in the Note, about what happens with more than one hyphen.
<dezell> MK: we could change the rule to say if you can't parse as part of a character range then backtrack.
<dezell> ...: I think that's consistent with most regex systems.
<dezell> SG: so I'm not in favor of forcing backtracking if we can achieve the goal without it.
<dezell> MSM: 1) we need tests that illuminate this segment of the grammar
<dezell> ...: should show how existing regex libraries and tools work.
<dezell> ...: 2) then we should study current behavior.
<dezell> ...: 3) then we can decide what we need to do.

Comment 1 Michael Kay 2010-10-22 17:19:57 UTC

I asserted during the telcon that the rules for 1.0 second edition were unclear, and Dave Petersen disputed this.

I thought there was an open bug on this, but I can't find it.

1.02e tries to solve the problem with the rules:

(A) The [, ], - and \ characters are not valid character ranges; 
(B) The ^ character is only valid at the beginning of a ·positive character group· if it is part of a ·negative character group· 
(C) The - character is a valid character range only at the beginning or end of a ·positive character group·.

The problem is that rules (A) and (C) flatly contradict each other. If we assume that (C) is meant to take priority, then [+-] and [-+] are both allowed, but [a-z-+] is not. This is probably a reasonable way to define the rule. If this is the rule that we want, then in 1.0 2e we should delete "-" from the list of characters in rule (A), and in 1.1 we should change the paragraph that follows production 81 from

<old>
If a charGroupPart starts with a singleChar and this is immediately followed by a hyphen, and if the hyphen is part of the character group (that is, it is not being treated as a subtraction operator because it is followed by '['), then the hyphen must be followed by another singleChar, and the sequence (singleChar, hyphen, singleChar) is treated as a charRange. It is an error if either of the two singleChars in a charRange is a SingleCharNoEsc comprising an unescaped hyphen.</old>

to

<new>
If a charGroupPart starts with a singleChar and this is immediately followed by a hyphen, and if the hyphen is part of the character group (that is, it is not being treated as a subtraction operator because it is followed by '['), then: (a) the hyphen is treated as a singleChar if it is immediately followed by ']'; (b) in all other cases the hyphen must be followed by another singleChar, and the sequence (singleChar, hyphen, singleChar) is treated as a charRange. It is an error if either of the two singleChars in a charRange is a SingleCharNoEsc comprising an unescaped hyphen.</new>

Comment 2 Dave Peterson 2010-10-23 01:45:19 UTC

(In reply to comment #0)
> From the telcon 2010-10-22:

> <dezell> MK: won't change the grammar, only the prose following rule 81 in the
> grammar.

In some sense that prose is "part of the grammar"; had the productions and related material been updated to the format used in the other parts of 1.1, prose like this would be in a formal Constraint, referenced from the production display.  I like that format; it makes the intent clearer.  

> <dezell> MSM: I think we need a story about the topic in the Note, about what
> happens with more than one hyphen.
> <dezell> MK: we could change the rule to say if you can't parse as part of a
> character range then backtrack.

We already require lookahead (which is after all a form of backtrack) to see if the next character following a '-' is '['; it shouldn't be any worse to also lookahead to check if the next character is ']'.


(In reply to comment #1)
> I asserted during the telcon that the rules for 1.0 second edition were
> unclear, and Dave Petersen disputed this.

Let me be clear:  I only felt that the 2E rules were clear about the case we are concerned with here, namely "Is '[+-]' legal?"  I have no certainty that the 2E rules prevented ambiguities.

Beyond that, I concur with the changes Mike Kay has suggested above.  But I hope someone succeeds in double-checking that they don't introduce problems with other areas of the RE chapter.  (How do you prove a negative?)

Comment 3 Michael Kay 2010-10-24 23:30:03 UTC

One minor correction to comment #1: I believe that [a-z-+] is legal (allowing any lower-case letter or '-' or '+') both with XSD 1.1 as currently written and with my proposed revision. 

It is not legal in XSD 1.0 2e, with or without the proposed change. 

(But in fact the proposed change to XSD 1.0 is academic, since we have already made the decision in any future revision of XSD 1.0 to replace the regex appendix with that from XSD 1.1.)

Comment 4 David Ezell 2010-11-04 16:46:26 UTC

in Lyon, we examined the 1.0 and 1.1 specs, we mulled over the behavior of four specific examples.  Things ok in 1.0 but x in 1.1 are backward incompatible:
       behavior in 1.0, 1.1
[-+]               ok   ok
[+-]               ok   x
[a-z+-]            ok   x
[a-z-+]            x    ok
[--z]              ok   x
[a--k--z]          ok   x        

Problem is with paragraph following productin 81:

<quote>
If a charGroupPart starts with a singleChar and this is immediately followed by a hyphen, and if the hyphen is part of the character group 

<problem>(that is, it is not being treated as a subtraction operator because it is followed by '['),
</problem>

then the hyphen must be followed by another singleChar, and the sequence (singleChar, hyphen, singleChar) is treated as a charRange. It is an error if either of the two singleChars in a charRange is a SingleCharNoEsc comprising an unescaped hyphen.

</quote>

MK has suggested we need to accomodate the case where the '-' is the final character of a charGroup.

MSM observes that we might implement that by saying "the final character of a charGroup is followed either by ']' or by '-[', but that more work is needed.

Comment 5 C. M. Sperberg-McQueen 2011-01-18 01:38:35 UTC

Some additional data comes from looking carefully at things (again) and checking (again) some regexes using Xerophily (aka MSM's regex parser):

(1) In comment 4, DE summarizes some results for 1.0 and 1.1 rules for regexes, but in a couple of cases the results given don't agree with what Xerophily says.  These are correct:

[-+]               ok   ok
[+-]               ok   x
[a-z+-]            ok   x
[a-z-+]            x    ok

But these two are not correct:

[--z]              ok   x
[a--k--z]          ok   x        

Neither of these is accepted by 1.0, because in 1.0 an unescaped hyphen is not allowed as the end-point of a range, and may itself be a (single-character) range only at the beginning of end of a positive character group. 

(2) The grammar in 1.1 has an ambiguity we had not detected before, which may affect the rule after production 81.  A single-character escape (e.g. \n) satisfies both the non-terminal singleChar and the non-terminal charClassEsc, each of which appear on the right-hand side of the rule for charGroupPart, so there are two different ways in which a single-character escape can be a charGroupPart.  In the case of \n and others of the class, the difference is semantically unimportant: in both cases, the enclosing character group includes the character indicated.  (As a result, Xerophily does not register this ambiguity: both parses produce the same abstract syntax tree.)  

But in the case of \- the ambiguity may have consequences.  The prose following production 81 imposes certain constraints on charGroupPart strings that begin with a singleChar followed by a hyphen.  But \- can be either a singleChar or not a singleChar; the rule says nothing about a charGroupPart which begins with a charClassEsc which happens to be a singleCharEsc, and the rule may be thought not to apply to that parse.  For this reason, Xerophily currently produces two parses for [\--z]:  one for the range from hyphen to z, and one for the character class containing hyphen (escaped), hyphen (unescaped), and z.

We either need to remove the ambiguity, or we need to recast the wording of the prose rule to make it cover the case.

Having thought about this a bit, I think I favor changing the prose after production 81 along the lines suggested, to specify that if a charGroup part begins with a singleChar (or a charClassEsc which is a singleCharEsc) followed by a hyphen, then one of the following must be true:

  (1) The hyphen is followed by [ and the hyphen indicates character-class subtraction.

  (2) The hyphen is followed by ] and it is treated as a singleChar, the last charGroupPart of the character group.

  (3) The hyphen is followed by -[ and it is treated as a singleChar, the last charGroupPart of the character group.

  (4) The hyphen is followed by a singleChar and indicates a range.

Personally, I'd like to get rid of the constraint forbidding unescaped hyphens as character-range endpoints, but I'm not sure we can do so without adding ambiguity.

So in addition to the change just outlined, I think I favor asking each member of the WG to contribute two or more regular expressions involving character classes, with a strong preference to the twisted, the devious, and the deceptively simple-looking, and that we test the 1.0 and current 1.1 grammars and the proposed change(s), on all the samples provided as well as on a few thousand randomly generated test strings.

We should also decide whether we want to eliminate the ambiguity identified above or not.

Comment 6 C. M. Sperberg-McQueen 2011-01-18 15:37:10 UTC

In its discussion of the ambiguity in the grammar for charGroupPart, comment 5 appears to be wrong. (I'd say "it *is* wrong", but I've made so many howlers in working over this material that I'm trying to teach myself more caution in my conclusions.)

There is an ambiguity.  In some cases, the ambiguity makes a difference to the semantics, and in some cases it does not.  The claim of semantic ambiguity relies on reading the prose following production 81 as applying only to singleCharEsc when parsed as a singleChar and not when parsed as a charClassEsc.  This much seems correct in the analysis of comment 5.

But the difference is not \- vs \n and the other singleCharEsc strings; it's between situations in which a charGroupPart begins with a singleCharEsc followed by a hyphen.  So:

  [\-\n] is ambiguous in syntax but unambiguous in semantics
  [\--z] is ambiguous both syntactically and semantically
  [\n-z] is ambiguous both syntactically and semantically

At the moment, I tentatively lean toward moving the class of singleCharEsc out of the class of charClassEsc; this would mean 

- change production 89 from 

  [89] charClassEsc ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

to

  [89] charClassEsc ::= ( MultiCharEsc | catEsc | complEsc )

- change production 75 from 

  [75] charClass ::= charClassEsc | charClassExpr | WildcardEsc

to 

  [75] charClass ::= singleCharEsc | charClassEsc | charClassExpr | WildcardEsc

- leave productions 81 and 82 alone.

The changes to 89 and 75 will entail accompanying changes to the the neighboring prose.

An alternative would be to rework 81 and 82 instead.

Comment 7 Dave Peterson 2011-01-19 05:10:34 UTC

(In reply to comment #5)

> (2) The grammar in 1.1 has an ambiguity we had not detected before, which may
> affect the rule after production 81.  A single-character escape (e.g. \n)
> satisfies both the non-terminal singleChar and the non-terminal charClassEsc,
> each of which appear on the right-hand side of the rule for charGroupPart, so
> there are two different ways in which a single-character escape can be a
> charGroupPart.  In the case of \n and others of the class, the difference is
> semantically unimportant: in both cases, the enclosing character group includes
> the character indicated.  (As a result, Xerophily does not register this
> ambiguity: both parses produce the same abstract syntax tree.)  
> 
> But in the case of \- the ambiguity may have consequences.  The prose following
> production 81 imposes certain constraints on charGroupPart strings that begin
> with a singleChar followed by a hyphen.  But \- can be either a singleChar or
> not a singleChar; the rule says nothing about a charGroupPart which begins with
> a charClassEsc which happens to be a singleCharEsc, and the rule may be thought
> not to apply to that parse.

I'm inclined to say that   \-   is a singleChar as well as charClassExcape; being an existentialist rather than an intentionalist, I'd say that the prose applies, period, regardless of which way the   \-   was legitimized.  But lots of people think intentionally, so we need to clear it up.

This can be cleared up by mentioning both alternative ways of arriving at   \-   in the charGroupPart as though they were different (as suggested by MSM above) or by asserting that the manner of legitimizing does not change the fact that they are nonetheless a singleChar and hence the rule applies.

Comment 8 C. M. Sperberg-McQueen 2011-01-28 17:34:36 UTC

On the WG call of 28 January 2011 the WG agreed to change the prose constraint after production [81] as outlined in comment 5, and also to change the grammar as outlined in comment 6 in order to eliminate the ambiguity in the definition of charGroupPart.  We will NOT eliminate the rule forbidding unescaped hyphens as range end-points, as tentatively suggested in comment 5, on the grounds that it's not necessary to resolve this issue and it would be risky to make that change without a more careful analysis and more extensive testing than we are likely to manage with the time at our disposal.

Comment 9 David Ezell 2011-03-11 17:31:51 UTC

On the telcon:
RESOLVED: adopt the proposal as ammended.
AMMENDMENT:  in the paragraph starting "then the following rules apply"
1) make the items a list
2) make the last sentence item 'e' (or equivalent).

Editors to determine the final numbering scheme.

http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.b11125.html

Comment 10 C. M. Sperberg-McQueen 2011-03-19 21:44:41 UTC

The proposal adopted 11 March 2011 has now been integrated into the status quo document.