This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 5321 - REs are not production nonterminals
Summary: REs are not production nonterminals
Status: CLOSED FIXED
Alias: None
Product: XML Schema
Classification: Unclassified
Component: Datatypes: XSD Part 2
Version: 1.0/1.1 both
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: C. M. Sperberg-McQueen
QA Contact: XML Schema comments list
URL:
Whiteboard: cluster: regex, notation
Keywords: editorial, needsAgreement, needsDrafting
Depends on:
Blocks:
 
Reported: 2007-12-15 20:29 UTC by Dave Peterson
Modified: 2009-02-12 21:58 UTC

Description Dave Peterson 2007-12-15 20:29:39 UTC
1.  It appears that in the productions defining REs in Appendix G, we use REs (usually character classes) as though they were nonterminals.  An example is the production for normal characters:

  Char ::= [^.\?*+{}()|#x5B#x5D]

In productions, normally each nonterminal is the LHS of a production, and each terminal is a character string denoting itself.  An RE other than a single character string denoting itself is neither.

In the appendix, terminals are quoted strings and nonterminals are names linked to their defining production.  These neither-fish-nor-fowl REs are displayed as unquoted strings.  Perhaps they could be hyperlinked to a paragraph describing this modification to the standard production system.  (But can the necessary productions for character classes be made without circularity?  That may need some thought.)

2.  Similarly, "#-escapes" representing characters via their Unicode code points are not normally allowed in our REs--at least I can't find anything that allows them.  Nor can I find anything that makes an exception for REs-that-are-nonterminals-in-productions.  At least a note, and some kind of special treatment within the RE, seems appropriate.  (Actually, I wish that the codes were explained in a text note near each use of such codes; I suspect that I'm not the only reader who doesn't have the codes memorized.  Perhaps the special treatment could be a hyperlink to such an explanation.)
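
For readers who do not have the codes memorized, here is a minimal Python sketch of what the "#-escapes" in the production above stand for. It assumes the convention from the notation section of the XML 1.0 Rec, where "#xN" (N a hexadecimal integer) denotes the character with that code point; the function name expand_hash_escapes is hypothetical and not taken from any spec.

```python
import re

# Assumption: "#xN" (N hexadecimal) denotes the character whose code point
# is N, as in the notation section of the XML 1.0 Recommendation.
HASH_ESCAPE = re.compile(r"#x([0-9A-Fa-f]+)")

def expand_hash_escapes(text: str) -> str:
    """Replace each #xN escape with the character it denotes."""
    return HASH_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# #x5B is the left square bracket and #x5D is the right square bracket.
print(expand_hash_escapes("#x5B#x5D"))
print(expand_hash_escapes(r"[^.\?*+{}()|#x5B#x5D]"))
```

Read this way, the Char production excludes the two square brackets in addition to the other listed metacharacters.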

We do not currently define the production system we use.  If we really want to have a non-standard production system which allows REs as additional RHS components, we need to define it.  However, since we use the production system to define the REs, I think this could get very circular unless we are both careful and lucky enough that the circularity can be avoided.

Expressing a small positive character class as an "or" of single characters is easy enough.  But I'm not sure how to deal with a large character class, such as the negative character class of the production quoted above.
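
A rough Python sketch of the difference, under stated assumptions: terminals are written as quoted single characters, the universe of characters is every Unicode code point (U+0000 through U+10FFFF), and the helper name class_to_alternation is hypothetical.

```python
def class_to_alternation(chars: str) -> str:
    """Render a small positive character class as an 'or' of quoted terminals."""
    def quote(c: str) -> str:
        # Use single quotes only when the character is itself a double quote.
        return f"'{c}'" if c == '"' else f'"{c}"'
    return " | ".join(quote(c) for c in chars)

# The characters excluded by the Char production quoted above:
# . \ ? * + { } ( ) | [ ]
EXCLUDED = ".\\?*+{}()|[]"

# A small positive class such as [+-] expands to two alternatives.
print(class_to_alternation("+-"))

# The negative class would need one alternative per remaining character.
# Assumption: the universe is all 0x110000 Unicode code points.
print(0x110000 - len(set(EXCLUDED)))
```

The positive class needs only a couple of alternatives, while the negative class would need over a million under this assumption, which is why spelling it out as an "or" of terminals is not practical.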
Comment 1 C. M. Sperberg-McQueen 2008-01-06 19:57:46 UTC
Unless I am mistaken, the grammar notation used is that of XML, in which
there are some regex features.  So I'm not sure the presence of regular
expressions on the right hand side of rules is necessarily an error.

On the other hand, the fact that we don't describe our notation more clearly
is certainly an error.
Comment 2 Michael Kay 2008-01-06 22:18:40 UTC
The XML 1.0 Rec does indeed use regex-like constructs in its EBNF notation, described at http://www.w3.org/TR/REC-xml/#sec-notation. However, I've always felt that it's not very wise to use a regex notation when defining regex syntax, especially when the syntax used in the production rules is different from the regex syntax being described (and is itself defined rather informally). And it's easily avoided.
Comment 3 Dave Peterson 2008-01-06 23:28:52 UTC
(In reply to comment #2)
> The XML 1.0 Rec does indeed use regex-like constructs in its EBNF notation,
> described at http://www.w3.org/TR/REC-xml/#sec-notation.

Examining that definition, I find that it is circular.  Namely, the square-bracket notations are defined in terms of the "Char" nonterminal, which is in turn defined in terms of square-bracket notations.  I hope we don't choose to knowingly appeal to a circular definition.  I think we *can* do it better.
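
The circularity can be sketched as a cycle in a "is defined in terms of" graph. The Python below is only an illustration: the two-entry dependency map is a simplified reading of the XML notation section, not a transcription of it, and find_cycle is a hypothetical helper.

```python
# Simplified, hypothetical dependency map: each notion points at the
# notions its definition relies on.
DEPENDS_ON = {
    "square-bracket notation": ["Char"],
    "Char": ["square-bracket notation"],
}

def find_cycle(graph, start):
    """Return a list of nodes forming a cycle reachable from start, or None."""
    path = []

    def visit(node):
        if node in path:
            # Node is already on the current path: close the cycle.
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        return None

    return visit(start)

print(find_cycle(DEPENDS_ON, "Char"))
```

Any fix has to break one of the two edges, for example by defining the bracket notation in terms of something other than the Char production.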
Comment 4 Pete Cordell 2008-01-07 09:31:40 UTC
(In reply to comment #3)
> it is circular.  Namely, the
> square-bracket notations are defined in terms of the "Char" nonterminal, which
> is in turn defined in terms of square-bracket notations.  

Isn't the mistake here that the section in the XML rec refers to 'Char' (a particular production) when it should refer to something like 'Character' (one of all possible characters)?  Wouldn't changing these definitions in the XML rec remove the circularity without breaking the definition of XML syntax as a whole?
Comment 5 C. M. Sperberg-McQueen 2008-05-24 03:01:11 UTC
Since this issue is essentially about notation, not about the
definition of conformant regular expressions, schema documents,
simple type definitions, pattern facets, or processors, I'm marking it
editorial.  This means it may be dealt with after, not necessarily 
before, the next public working draft.
Comment 6 Dave Peterson 2009-02-12 21:58:08 UTC
The bug was discussed on 23 Jan; the proposal was approved as amended, has been added to the source, and is now in the status quo and the most recent LCWD.  Hereby marked FIXED, and as originator I will immediately mark it CLOSED.