This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
1. It appears that in the productions defining REs in Appendix G, we use REs (usually character classes) as though they were nonterminals. An example is the production for normal characters:

Char ::= [^.\?*+{}()|#x5B#x5D]

In productions, normally each nonterminal is the LHS of a production, and each terminal is a character string denoting itself. An RE other than a single character string denoting itself is neither. In the appendix, terminals are quoted strings and nonterminals are names linked to their defining production. These neither-fish-nor-fowl REs are displayed as unquoted strings. Perhaps they could be hyperlinked to a paragraph describing this modification to the standard production system. (But can the necessary productions for character classes be made without circularity? That may need some thought.)

2. Similarly, "#-escapes" representing characters via their Unicode code numbers are not normally allowed in our REs--at least I can't find anything that allows them. Nor can I find anything that makes an exception for REs-that-are-nonterminals-in-productions. At least a note, and some kind of special treatment within the RE seems appropriate. (Actually, I wish that the codes were explained in a text note near each use of such codes; I suspect that I'm not the only reader who doesn't have the codes memorized. Perhaps the special treatment could be a hyperlink to such an explanation.)

We do not currently define the production system we currently use. If we really want to have a non-standard production system which allows REs as additional RHS components, we need to define it. However, I think since we use the production system to define the REs, this could get very circular unless we are both careful, and lucky that the circularity can be avoided. Expressing a small positive character class as an "or" of single characters is easy enough. But I'm not sure how to deal with a large character class, such as the negative character class of the production quoted above.
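To make the two points above concrete, here is a minimal Python sketch. It is an illustration only, not part of the spec or of any proposal in this bug. The helper names `decode_hash_escapes` and `class_as_alternation` are hypothetical. The sketch decodes the "#x" escapes in the quoted production, and shows that a small positive class rewrites easily as an "or" of single-character terminals, whereas the negated class stands for every other character and so cannot reasonably be enumerated.

```python
import re

def decode_hash_escapes(text: str) -> str:
    """Replace each '#xN' escape (N in hexadecimal) with the character it names."""
    return re.sub(r"#x([0-9A-Fa-f]+)",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

def class_as_alternation(members: str) -> str:
    """Rewrite a small positive class as an 'or' of quoted single characters."""
    return " | ".join(f"'{c}'" for c in members)

# The characters excluded by the production quoted above.
excluded = decode_hash_escapes(r".\?*+{}()|#x5B#x5D")

# #x5B and #x5D turn out to be the square brackets.
print(excluded)

# A small positive class is easy to spell out as alternatives.
print(class_as_alternation("abc"))

# The negated class is everything else. Assumption: we count all Unicode
# code points here, which overstates what XML actually permits.
print(0x110000 - len(excluded))
```

The last figure is only an order-of-magnitude indication of why spelling out the negative class as alternatives is impractical.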
Unless I am mistaken, the grammar notation used is that of XML, in which there are some regex features. So I'm not sure the presence of regular expressions on the right hand side of rules is necessarily an error. On the other hand, the fact that we don't describe our notation more clearly is certainly an error.
The XML 1.0 Rec does indeed use regex-like constructs in its EBNF notation, described at http://www.w3.org/TR/REC-xml/#sec-notation. However, I've always felt that it's not very wise to use a regex notation when defining regex syntax, especially when the syntax used in the production rules is different from the regex syntax being described (and is itself defined rather informally). And it's easily avoided.
(In reply to comment #2)
> The XML 1.0 Rec does indeed use regex-like constructs in its EBNF notation,
> described at http://www.w3.org/TR/REC-xml/#sec-notation.

Examining that definition, I find that it is circular. Namely, the square-bracket notations are defined in terms of the "Char" nonterminal, which is in turn defined in terms of square-bracket notations. I hope we don't choose to knowingly appeal to a circular definition. I think we *can* do it better.
(In reply to comment #3)
> it is circular. Namely, the
> square-bracket notations are defined in terms of the "Char" nonterminal, which
> is in turn defined in terms of square-bracket notations.

Isn't the mistake here that the section in the XML rec refers to 'Char' (a particular production) when it should refer to something like 'Character' (one of all possible characters)? Wouldn't changing these definitions in the XML rec remove the circularity without breaking the definition of XML syntax as a whole?
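One way to picture the suggestion above is the following Python sketch, offered under stated assumptions rather than as the wording either spec should adopt. Here "character" is taken to be a primitive, namely an integer code point. The bracket notation is interpreted directly over code points, and XML's Char production becomes just one use of it. The names `in_bracket_class`, `XML_CHAR_RANGES` and `is_xml_char` are hypothetical; the ranges are those of the Char production in XML 1.0.

```python
def in_bracket_class(code_point: int, ranges, negated: bool = False) -> bool:
    """Interpret [...] or [^...] over raw code points, with no reference to Char."""
    matched = any(lo <= code_point <= hi for lo, hi in ranges)
    return matched != negated

# XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
#                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_RANGES = [
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]

def is_xml_char(code_point: int) -> bool:
    """Char is defined using the bracket notation, not the other way round."""
    return in_bracket_class(code_point, XML_CHAR_RANGES)

print(is_xml_char(ord("A")))   # an ordinary letter
print(is_xml_char(0xD800))     # a surrogate code point
```

In this arrangement the dependency runs one way only: the notation rests on code points, and Char rests on the notation.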
Since this issue is essentially about notation, not about the definition of conformant regular expressions, schema documents, simple type definitions, pattern facets, or processors, I'm marking it editorial. This means it may be dealt with after, not necessarily before, the next public working draft.
The bug was discussed on 23 Jan; the proposal was approved as amended, has been added to the source, and is now in the status quo and most recent LCWD. Hereby marked FIXED, and as originator I will immediately mark it CLOSED.