Protest the \- change (E2-18) from Bob Foster on 2004-04-07 (www-xml-schema-comments@w3.org from April to June 2004)

From: Bob Foster <bob@objfac.com>
Date: Wed, 07 Apr 2004 17:17:42 -0500
To: www-xml-schema-comments@w3.org
Message-ID: <40747E06.5060805@objfac.com>

I previously copied this address on the subject, but on 4/3/2004 Henry 
Thompson suggested I write a protest, even though the Errata seem to 
have been closed as of 3/16/2004. I take the latter as an indication 
that my previous mail didn't do the job.

The proposed change E2-18 unnecessarily introduces an incompatible 
change to the regular expression language accepted by patterns. This 
breaks a number of existing published schemas, including 
http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd and 
http://java.sun.com/dtd/jspxml.xsd.

The original problem reported is that the language in F.1 "The - 
character is a valid character range only at the beginning or end of a 
positive character group" contradicted the published grammar. The 
public record doesn't say so, but a further problem was that the 
published grammar was ambiguous in its treatment of patterns like "a-z", 
which could be interpreted as either one seRange or three 
XmlCharIncDash, and in fact, the pattern "---" was allowed by the 
grammar (- could appear anywhere).
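
To make the ambiguity concrete, here is a minimal Python sketch that 
enumerates every way a character-group body can be split into charRange 
tokens. It is an illustration only: the function name parses is 
hypothetical, and escapes (SingleCharEsc) and character references are 
left out for brevity, so charOrEsc is treated as XmlChar alone.

```python
# Characters excluded by XmlChar ([^\#x2D#x5B#x5D]) and by
# XmlCharIncDash ([^\#x5B#x5D]) in the published grammar.
XML_CHAR_EXCLUDED = "\\-[]"
XML_CHAR_INC_DASH_EXCLUDED = "\\[]"


def parses(s):
    """Return every split of s into charRange tokens (escapes ignored)."""
    if not s:
        return [[]]
    results = []
    # Alternative 1: seRange ::= charOrEsc '-' charOrEsc
    if (len(s) >= 3 and s[1] == "-"
            and s[0] not in XML_CHAR_EXCLUDED
            and s[2] not in XML_CHAR_EXCLUDED):
        for rest in parses(s[3:]):
            results.append([("seRange", s[:3])] + rest)
    # Alternative 2: a single XmlCharIncDash
    if s[0] not in XML_CHAR_INC_DASH_EXCLUDED:
        for rest in parses(s[1:]):
            results.append([("XmlCharIncDash", s[0])] + rest)
    return results
```

Under these assumptions, parses("a-z") yields two results (one seRange, 
or three single characters), while parses("---") yields exactly one, 
made of three single characters.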

There is an issue, but it should not be resolved by an incompatible 
change. Instead, the issue could be resolved by an Error that simply 
struck out the offending sentence quoted above, amended the grammar as 
shown below (to remove the character references already handled by the 
parser) and added a Clarification along the following lines:

[17]   charRange       ::=   seRange | XmlCharIncDash
[18]   seRange         ::=   charOrEsc '-' charOrEsc
[20]   charOrEsc       ::=   XmlChar | SingleCharEsc
[21]   XmlChar         ::=   [^\#x2D#x5B#x5D]
[22]   XmlCharIncDash  ::=   [^\#x5B#x5D]

"Clarification. The grammar for posCharGroup is ambiguous in that any 
seRange could also be interpreted as a sequence of three XmlCharIncDash. 
The ambiguity is to be resolved in favor of seRange, such that any 
three-character sequence where the first and third characters are each 
not one of #x2D, #x5B or #x5D ('-', '[' or ']') and the second character 
is a '-' is to be considered an seRange. This requires more than one 
token of lookahead."
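
The disambiguation rule in the Clarification could be sketched in Python 
roughly as follows. This is a minimal sketch, not a conforming 
implementation: the function name tokenize_pos_char_group and the tuple 
shapes are hypothetical, the scan is assumed to run greedily from left 
to right, and input containing '[', ']' or a backslash escape is simply 
rejected rather than handled.

```python
# #x2D, #x5B, #x5D: characters that may not start or end an seRange.
RANGE_EXCLUDED = "-[]"


def tokenize_pos_char_group(body):
    """Split a character-group body into tokens, preferring seRange."""
    if any(c in "[]\\" for c in body):
        raise ValueError("brackets and escapes are outside this sketch")
    tokens = []
    i = 0
    while i < len(body):
        # Look two characters ahead: X '-' Y with X and Y not excluded.
        if (i + 2 < len(body)
                and body[i + 1] == "-"
                and body[i] not in RANGE_EXCLUDED
                and body[i + 2] not in RANGE_EXCLUDED):
            tokens.append(("seRange", body[i], body[i + 2]))
            i += 3
        else:
            # Anything else, including a lone '-', is a single character.
            tokens.append(("char", body[i]))
            i += 1
    return tokens
```

With this sketch, "a-z" becomes a single seRange, "-a-z" and "a-z-" each 
become one literal '-' plus one seRange, and "---" becomes three literal 
characters.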

The result would not unduly tax processors, as this was the only 
sensible interpretation of the grammar prior to the errata, and it would 
not break any existing documents (either pre- or post-errata).

Bob Foster
http://xmlbuddy.com/

Received on Wednesday, 7 April 2004 18:17:40 UTC