2216 – R-224: Questions about metacharacters in regular expressions

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED FIXED

Alias:	None

Product:	XML Schema
Classification:	Unclassified
Component:	Datatypes: XSD Part 2
Version:	unspecified
Hardware:	All All

Importance:	P1 normal
Target Milestone:	---
Assignee:	C. M. Sperberg-McQueen
QA Contact:	XML Schema comments list

URL:
Whiteboard:	cluster: regex
Keywords:	resolved

Depends on:
Blocks:

Reported:	2005-09-14 19:18 UTC by Sandy Gao
Modified:	2009-04-21 19:21 UTC (History)
CC List:	0 users

Description Sandy Gao 2005-09-14 19:18:02 UTC

Appendix F in the Part 2 of XML Schema 1.0 defines 'metacharacter' thus: 

A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ]. 

It defines 'normal character' thus: 

[Definition:] A normal character is any XML character that is not a 
metacharacter. In regular expressions, a normal character is an atom that 
denotes the singleton set of strings containing only itself. 

Production [10], which I take to be defining normal characters, reads: 

Normal Character [10] Char ::= [^.\?*+()|#x5B#x5D] 

The metacharacters all need escapes, so production 24 is also relevant here: 

Single Character Escape [24] SingleCharEsc ::= '\' [nrt\|.?*+(){}
#x2D#x5B#x5D#x5E] 

I have some questions: 

1. shouldn't { and } (braces) be included in production [10]? [10] Char ::= 
[^.\?*+{}()|#x5B#x5D] 
2. shouldn't | (vertical bar) be among the characters defined as 
metacharacters? 
3. should ^ (#x5E) be included among the metacharacters? 
4. would it be possible to list the magic characters in the same order in 10 
and 24, to make eyeball-based comparisons easier? 

I suspect the answer to (2) is 'yes' and the answer to (3) is 'no', on the 
theory that the term 'metacharacter' is best reserved for characters which have 
special meaning at the top level of a regular expression and which must 
therefore have escapes to avoid ambiguity. Hyphen, circumflex, comma, n, r, and 
t all have special meaning only in special contexts (within character groups, 
within quantity-range specifications, or after backslash), and so aren't 
metacharacters in this sense. 

See:
http://lists.w3.org/Archives/Public/www-xml-schema-comments/2003JulSep/0009.html
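
The comparison behind questions 1 to 3 can be sketched as set arithmetic. This is only an illustrative Python sketch, not part of the spec: the names PROSE_METACHARS, PROD10_EXCLUDED and PROD24_ESCAPABLE are hypothetical, and the character sets are transcribed from the prose definition and productions [10] and [24] as quoted above.

```python
# Characters named in the prose definition of 'metacharacter' (as quoted).
PROSE_METACHARS = set(".\\?*+{}()[]")

# Characters excluded by production [10]: [^.\?*+()|#x5B#x5D]
PROD10_EXCLUDED = set(".\\?*+()|") | {"\x5b", "\x5d"}

# Characters allowed after '\' by production [24]:
# [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
PROD24_ESCAPABLE = set("nrt\\|.?*+(){}") | {"\x2d", "\x5b", "\x5d", "\x5e"}

# Question 1: called metacharacters in prose, yet accepted by production [10].
in_prose_not_prod10 = PROSE_METACHARS - PROD10_EXCLUDED

# Question 2: excluded by production [10], yet not called metacharacters.
in_prod10_not_prose = PROD10_EXCLUDED - PROSE_METACHARS

# Question 3: escapable per production [24], but in neither list above.
escapable_only = PROD24_ESCAPABLE - PROSE_METACHARS - PROD10_EXCLUDED

print("".join(sorted(in_prose_not_prod10)))  # {}
print("".join(sorted(in_prod10_not_prose)))  # |
print("".join(sorted(escapable_only)))       # -^nrt
```

The three differences correspond to the braces of question 1, the vertical bar of question 2, and the context-dependent characters (hyphen, circumflex, n, r, t) discussed under question 3.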

Comment 1 Sandy Gao 2005-09-14 19:18:21 UTC

Note that items 1 and 2 are covered by R-41 (bug 2019)

Comment 2 C. M. Sperberg-McQueen 2006-01-15 00:17:26 UTC

The WG classified this issue as a requirement at its telcon of 13 January 2006
and instructed the editors to prepare a proposal with the obvious fix.

Comment 3 C. M. Sperberg-McQueen 2006-09-21 00:00:30 UTC

At the face to face meeting of January 2006 in St. Petersburg,
the Working Group decided not to take further action on this
issue in XML Schema 1.1.  (This issue was not discussed
separately; it was one of those which were dispatched by a
blanket decision that all other open issues would be closed
without action, unless raised again in last-call comments.)  Some
members of the Working Group expressed regret over not being able
to resolve all the issues dealt with in this way, but on the
whole the Working Group felt it better not to delay Datatypes 1.1
in order to resolve all of them.

This issue should have been marked as RESOLVED/WONTFIX at that
time, but apparently was not.  I am marking it that way now, to
reduce confusion.

Comment 4 C. M. Sperberg-McQueen 2006-09-21 14:18:45 UTC

Since bug 1889 has been reopened, we should probably reopen all of the issues
relating to the grammar of regular expressions, including this one.

Comment 5 Dave Peterson 2008-05-18 05:32:20 UTC

(In reply to comment #0)
> Appendix F in the Part 2 of XML Schema 1.0 defines 'metacharacter' thus: 
> 
> A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ]. 
> 
> It defines 'normal character' thus: 
> 
> [Definition:] A normal character is any XML character that is not a 
> metacharacter. In regular expressions, a normal character is an atom that 
> denotes the singleton set of strings containing only itself. 
> 
> Production [10], which I take to be defining normal characters, reads: 
> 
> Normal Character [10] Char ::= [^.\?*+()|#x5B#x5D] 
> 
> The metacharacters all need escapes, so production 24 is also relevant here: 
> 
> Single Character Escape [24] SingleCharEsc ::= '\' [nrt\|.?*+(){}
> #x2D#x5B#x5D#x5E] 
> 
> I have some questions: 
 
> 3. should ^ (#x5E) be included among the metacharacters? 

> I suspect...the answer to (3) is 'no', on the 
> theory that the term 'metacharacter' is best reserved for characters which have 
> special meaning at the top level of a regular expression and which must 
> therefore have escapes to avoid ambiguity. Hyphen, circumflex, comma, n, r, and 
> t all have special meaning only in special contexts (within character groups, 
> within quantity-range specifications, or after backslash), and so aren't 
> metacharacters in this sense. 

Let me define characters used autonymously (self-naming) as those which act as single-character classes containing themselves, and metacharacters as those which are not being used autonymously, with the understanding that the same character in different occurrences in an RE can be one or the other.  I'll call the characters selected by the "metacharacter" nonterminal "top-level metacharacters" or "TLMs".  "top-level" refers to "outside of a character class expression".

At the top level, many of the TLMs can occur where other characters can occur autonymously; in those locations the TLM would have to be escaped to have autonymous effect.  There are other top-level places where a TLM cannot be a legal metacharacter and could presumably be used autonymously.  But the designers of the language apparently didn't want the users to have to wonder, so they made it both possible and required that the TLMs always be escaped.  (For that matter, a few TLMs cannot be used as metacharacters in a location where an autonymous character can occur, but that's the language design.)

Within character class expressions, only a few TLMs can be used as metacharacters; '^' (which is not a TLM) can also be so used.  The autonymous vs meta rules are different here: there is no blanket prohibition of potential metacharacters being used autonymously; rather, there are some rules specifying where they can and can't be so used.  (A few TLMs still can never be autonymous; those that can't be metacharacters here can always be autonymous; and for '-' and '^' the rules allow each at different places.)  But since '^' can't be used as a metacharacter at the top level, it is not in the TLM list.

All the TLMs and '^' are *permitted* to be escaped if their autonymous use is wanted; this is so that if a user is not sure if it can be meta at a given location and wants autonymous usage, they can just escape it and be sure to get the effect they want.  That's why '^' is in the single-character-escape list.

Are we having fun yet?  ;-)
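
The "just escape it and be sure" approach from this comment can be sketched as a small helper. This is a minimal Python sketch under stated assumptions: the function name escape_autonymous and the constant ESCAPABLE_PUNCTUATION are hypothetical (not from any XSD library), and the set of escapable characters is taken from production [24] as quoted in comment #0.

```python
# Punctuation that production [24] (as quoted) allows after a backslash.
# The letters n, r and t are deliberately left out: \n, \r and \t denote
# newline, return and tab, not the letters themselves.
ESCAPABLE_PUNCTUATION = set("\\|.?*+(){}-[]^")


def escape_autonymous(text: str) -> str:
    """Backslash-escape every character that might be meta somewhere,
    so that each character of `text` is used autonymously."""
    return "".join(
        "\\" + ch if ch in ESCAPABLE_PUNCTUATION else ch
        for ch in text
    )


print(escape_autonymous("a^b[1-2]"))  # a\^b\[1\-2\]
```

Escaping '^' and '-' at the top level is not required, since neither is a TLM, but it is permitted, which is what lets a helper like this escape without tracking context.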

Comment 6 C. M. Sperberg-McQueen 2008-05-30 04:01:22 UTC

A wording proposal intended to resolve this issue (and some other 
regex-related issues) is at 
http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.b1889.html
(member-only link).

Comment 7 C. M. Sperberg-McQueen 2008-05-31 02:00:45 UTC

The wording proposal mentioned in comment #6 was adopted by the WG
at its call today.  We believe it resolves the issue in full, and I am
accordingly marking the issue as resolved.