<!DOCTYPE TEI.2 PUBLIC '-//C. M. Sperberg-McQueen//DTD
          TEI Lite 1.0 plus SWeb (XML)//EN'
          '../../../../People/cmsmcq/lib/swebxml.dtd' [
<!ENTITY date.last.touched '29 May 2008'>

<!ENTITY rarr   "&#x2192;" ><!--/rightarrow /to A: =rightward arrow-->
<!ENTITY sect   "&#167;" ><!--=section sign-->

<!ENTITY charClass3  SYSTEM "images/charClass3.png" NDATA PNG >
]>
<?xml-stylesheet type="text/xsl" href="../../../../People/cmsmcq/lib/swebtohtml.xsl"?> 
<TEI.2 rend="w3c-public">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Regular expressions in XSD 1.0 and 1.1</title>
<author>C. M. Sperberg-McQueen</author>
</titleStmt>
<publicationStmt>
<pubPlace>Boston</pubPlace>
<pubPlace>Sophia-Antipolis</pubPlace>
<pubPlace>Tokyo</pubPlace>
<publisher>World Wide Web Consortium</publisher>
<date>2008</date>
</publicationStmt>
<sourceDesc>
<p>Created in electronic form.</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<front>
<titlePage>
<docTitle>
<titlePart>Regular expressions in XSD 1.0 and 1.1</titlePart>
</docTitle>
<docAuthor>C. M. Sperberg-McQueen</docAuthor>
<docDate>Begun 25-27 March 2008.
Last revised &date.last.touched;</docDate>
</titlePage>
</front>
<body>
<p>This note describes some issues relating to the regular-expression
language defined by XSD 1.0 and 1.1.  It is intended for the use of
the W3C XML Schema Working Group and of others interested in the
implementation of XSD.</p>
<div>
<head>Changes in the regex language from 1.0 forward</head>
<p>Distinct versions of the XSD spec provide slightly different versions
of the regular expression language.  The following sections provide a 
point-by-point comparison. The versions tabulated here are:
<list>
<item><label>1E</label> XSD 1.0, First Edition</item>
<item><label>PER</label> Proposed Edited Recommendation version of XSD 1.0, Second Edition</item>
<item><label>2E</label> XSD 1.0, Second Edition</item>
<item><label>D4</label> 16 July 2004 draft of XSD 1.1 (no changes from 2E)</item>
<item><label>D5</label> 24 February 2005 draft of XSD 1.1 (no changes from 2E)</item>
<item><label>D6</label> 16 January 2006 draft of XSD 1.1</item>
<item><label>LC1</label> 17 February 2006 Last-Call draft of XSD 1.1</item>
<item><label>D8</label> Current status-quo draft of XSD 1.1 (26 March 2008)</item>
<item><label>W</label> Proposed change for bug 1889 drafted by Michael Kay,
as transcribed into spec (not necessarily correctly) by MSM, December 2007.</item>
<item><label>X</label> miscellaneous draft proposals for various bugs</item>
<item><label>X1889</label> WG-internal draft proposal for bug 1889</item>
</list>
</p>
<p>The differences are given in the following sections.
Production numbers used in the section headings 
are those of 1.0; as may be seen, they vary erratically
in the drafts of 1.1.  Changes solely in the production
number are of course not shown.</p>

<div>
<head>Note about anchors</head>
<p>The PER adds a note about begin/end anchors; later versions
follow suit.
<list>
<item><label>1E</label>: <foreign>om.</foreign></item>
<item><label>PER, 2E, all 1.1 versions</label>: 
<q type="block">
<p>Note: Unlike some popular regular expression languages (including
those defined by Perl and standard Unix utilities), the regular
expression language defined here implicitly anchors all regular
expressions at the head and tail, as the most common use of regular
expressions in pattern is to match entire literals. For example, a
datatype derived from string such that all values must begin with
the character A (#x41) and end with the character Z (#x5a) would be
defined as follows:
</p>
<eg>
&lt;simpleType name='myString'>
 &lt;restriction base='string'>
  &lt;pattern value='A.*Z'/>
 &lt;/restriction>
&lt;/simpleType>
</eg>
<p>
In regular expression languages that are not implicitly anchored at
the head and tail, it is customary to write the equivalent regular
expression as:
<eg>
   ^A.*Z$
</eg>
where "^" anchors the pattern at the head and "$" anchors at the tail.
</p>
<p>
In those rare cases where an unanchored match is desired, including .*
at the beginning and ending of the regular expression will achieve the
desired results. For example, a datatype derived from string such
that all values must contain at least 3 consecutive A (#x41)
characters somewhere within the value could be defined as follows:
<eg>
&lt;simpleType name='myString'>
 &lt;restriction base='string'>
  &lt;pattern value='.*AAA.*'/>
 &lt;/restriction>
&lt;/simpleType>
</eg>
</p></q></item>
</list>
</p>
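<p>The effect of implicit anchoring can be tried out with a minimal
sketch in Python.  This is an illustration only, not an XSD regex
implementation: the helper names <code>xsd_style_match</code> and
<code>unanchored_match</code> are hypothetical, and Python's
<code>re</code> module defines a different regular-expression language
which merely overlaps with XSD's for the simple patterns used here
(for instance, its <code>.</code> excludes only newline).
<eg>
```python
import re


def xsd_style_match(pattern: str, literal: str) -> bool:
    """Anchored match: the pattern must cover the entire literal."""
    return re.fullmatch(pattern, literal) is not None


def unanchored_match(pattern: str, literal: str) -> bool:
    """Unanchored match: the pattern may cover any substring."""
    return re.search(pattern, literal) is not None


if __name__ == "__main__":
    # Columns: literal, anchored A.*Z, unanchored A.*Z, anchored .*AAA.*
    for literal in ["AbcZ", "xAbcZx", "xxAAAxx"]:
        print(literal,
              xsd_style_match("A.*Z", literal),
              unanchored_match("A.*Z", literal),
              xsd_style_match(".*AAA.*", literal))
```
</eg>
The anchored form rejects <code>xAbcZx</code> for the pattern
<code>A.*Z</code>, while the unanchored search accepts it; the pattern
<code>.*AAA.*</code> recovers unanchored behaviour under anchored matching.
</p>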

</div>

<div>
<head>Change to production 9 atom</head>
<p>D8 changes the name of the Char production to NormalChar; this affects
both production 9 for atom, and production 10 for Char.
<list>
<item><label>1E, PER, 2E, D4, D5, D6, LC1</label>: 
<eg>[9] atom ::= Char 
           | charClass 
           | ( '(' regExp ')' )</eg>
</item>
<item><label>D8</label>:
<eg>[72] atom ::= NormalChar
           | charClass 
           | ( '(' regExp ')' )</eg>
</item>
</list>
</p>
</div>

<div>
<head>Prose after production 9 atom, definition of metacharacter</head>
<p>D6 changes the prose definition of metacharacter.
<list>
<item><label>1E, PER, 2E, D4, D5</label>: 

<q type="block"><p>[Definition:] A <term>metacharacter</term> is
either ., \, ?, *, +, {, } (, ), [ or ]. These characters
have special meanings in regular expressions, but can be escaped to
form atoms that denote the sets of strings containing only
themselves, i.e., an escaped metacharacter behaves like a normal
character.</p></q>
</item>
<item><label>D6, LC1, D8, W</label> (adds |):
<q type="block"><p>[Definition:] A <term>metacharacter</term> is
either ., \, ?, *, +, {, } (, ), |, [, or ]. These characters
have special meanings in regular expressions, but can be escaped to
form atoms that denote the sets of strings containing only
themselves, i.e., an escaped metacharacter behaves like a normal
character.</p></q>
</item>
</list>
</p>
</div>

<div>
<head>Changes to production 10 Char</head>
<p>D6 changes the definition of Char, the production for
<soCalled>normal</soCalled> characters, to exclude braces; D8 renames it NormalChar.
<list>
<item><label>1E, PER, 2E, D4, D5</label>: 
<eg>[10] Char ::= [^.\?*+()|#x5B#x5D]</eg>
</item>
<item><label>D6, LC1</label> (excludes braces, {}):
<eg>[55] Char ::= [^.\?*+{}()|#x5B#x5D]</eg>
</item>
<item><label>D8</label>:
<eg>[73] NormalChar ::= [^.\?*+{}()|#x5B#x5D]</eg>
</item>
</list>
</p>
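<p>The relation between the prose definition of metacharacter and the
Char production in each version can be checked mechanically.  The
following Python sketch is one way to do so; the character sets are
transcribed from the prose and productions quoted in this and the
preceding section, and the names <code>is_normal_char</code> and
<code>mismatch</code> are hypothetical helpers introduced for the
illustration.
<eg>
```python
# Characters named as metacharacters in the prose of each version.
PROSE_1E = set(".\\?*+{}()[]")     # 1E, PER, 2E, D4, D5
PROSE_D6 = set(".\\?*+{}()|[]")    # D6, LC1, D8, W (adds |)

# Characters excluded by the Char / NormalChar production of each version.
EBNF_1E = set(".\\?*+()|[]")       # [10] Char, 1E through D5
EBNF_D6 = set(".\\?*+{}()|[]")     # [55] Char (D6, LC1), [73] NormalChar (D8)


def is_normal_char(ch: str, excluded: set) -> bool:
    """True if ch is a single character outside the excluded set."""
    return len(ch) == 1 and ch not in excluded


def mismatch(prose: set, ebnf: set) -> tuple:
    """Return (characters only in the prose, characters only in the EBNF)."""
    return (sorted(prose - ebnf), sorted(ebnf - prose))


if __name__ == "__main__":
    print("1E:", mismatch(PROSE_1E, EBNF_1E))
    print("D6:", mismatch(PROSE_D6, EBNF_D6))
    print("brace normal in 1E:", is_normal_char("{", EBNF_1E))
    print("brace normal in D6:", is_normal_char("{", EBNF_D6))
```
</eg>
Under these transcriptions the prose and the production disagree in the
1.0 versions (braces on one side, vertical bar on the other) and agree
from D6 onward.
</p>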
</div>

<div>
<head>Changes to production 11 charClass</head>
<p>The PER adds WildcardEsc as a right-hand side for charClass.
This ensures that the wildcard escape (<q><code>.</code></q>) is
legal as an atom (so <q><code>.*</code></q> and the like are
legal), even though it is no longer a character-class escape.
(See also below, section <ptr target="multicharesc" type="secnum"/>.)
<list>
<item><label>1E</label>: 
<eg>[11] charClass ::= charClassEsc 
                 | charClassExpr </eg>
</item>
<item><label>PER, 2E, D4-D8, W</label>: 
<eg>[11] charClass ::= charClassEsc
                 | charClassExpr
                 | WildcardEsc</eg>
</item>
</list>
</p>
<p>Within the part of the grammar reachable from <ident>charClass</ident>,
there are a number of changes of this kind, which affect not the
definition of some class of terminal symbols (like the changes to
<ident>Char</ident> reported in the preceding section) but the
overall structure of the grammar and the
reachability of one symbol from another.</p>
<p>A graph showing the various non-terminals in this part of the
grammar, and the changes in the (reachability relation of the) grammar
may be helpful:
<figure entity="charClass3" rend="100%">
</figure>
</p>
<p>In particular, note that W excludes <ident>charClassEsc</ident>
from <ident>posCharGroup</ident> (intentionally or unintentionally?).</p>
</div>

<div>
<head>Changes to prose after production 11 charClass</head>
<p>Draft W adjusts the prose to agree with the revised production.
<list>
<item><label>1E, PER, 2E, D4-D8</label>: 
<q type="block">
<p>A character class is either a character class escape or a character
class expression.</p></q>
</item>
<item><label>W</label>: 
<q type="block">
<p>A character class is either a character class escape or a character
class expression or a wildcard escape (WildcardEsc).</p></q>
</item>
</list>
</p>
<p>W also makes various other changes to the prose which will
<emph>not</emph> be registered here except insofar as they
affect or reflect the grammar or changes to it.</p>
</div>

<div>
<head>Changes to production 13 charGroup and accompanying prose</head>
<p>Draft W changes the presentation of character groups; among
other things, the change reduces the lookahead required to parse
expressions.
<list>
<item><label>1E, PER, 2E, D4-D8</label>: 
<q type="block">
<p>[Definition:] A character group is either a positive character
group, a negative character group, or a character class
subtraction.</p>
<eg>[13] charGroup ::= posCharGroup 
                 | negCharGroup
                 | charClassSub  </eg>
</q>
</item>
<item><label>W</label> changes prose and production and adds new prose:  
<q type="block">
<p>[Definition:]   A character group (charGroup) starts with either a 
positive character group or a negative character group, and is 
optionally followed by a subtraction operator '-' and a further 
character class expression. 
[Definition:]  A character group that contains a subtraction operator 
is referred to as a character class subtraction.</p>

<eg>[76] charGroup ::= (posCharGroup | negCharGroup) 
                   ('-' charClassExpr)?</eg>
<p>If the first character in a charGroup is '^', this is taken as
indicating that the charGroup starts with a negCharGroup. A
posCharGroup can itself start with '^' but only when it appears within
a negCharGroup, that is, when the '^' is preceded by another '^'.
</p>
<p>
A '-' character is recognized as a subtraction operator (and hence, as
terminating the posCharGroup or negCharGroup) if it is immediately
followed by a '[' character.
</p>
<p>
For any positive character group or negative character group G,
and any character class expression C, G-C is a valid character
group, identifying the set of all characters in C(G) that are not in
C(C).
</p>
</q>
</item>
</list>
</p>
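<p>The set semantics stated for character groups (C(G-C) is the set of
characters in C(G) that are not in C(C)) can be sketched in Python with
ordinary sets.  This is an illustration of the semantics only, not a
parser for the grammar; the function names are hypothetical, and the
small <code>UNIVERSE</code> is an assumption standing in for the set of
all XML characters.
<eg>
```python
import string

# Assumption for illustration only: a small universe, not all XML characters.
UNIVERSE = set(string.ascii_letters + string.digits)


def char_range(first: str, last: str) -> set:
    """C(first-last): every character from first to last, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}


def negate(positive: set) -> set:
    """C(^P): the characters of the universe that are not in C(P)."""
    return UNIVERSE - positive


def subtract(group: set, char_class: set) -> set:
    """C(G-C): the characters in C(G) that are not in C(C)."""
    return group - char_class


# [0-9-[135]]: a positive group minus a character class expression
digits_without_135 = subtract(char_range("0", "9"), set("135"))

# [^0-9-[a-z]]: a negative group minus a character class expression
neither_digit_nor_lower = subtract(negate(char_range("0", "9")),
                                   char_range("a", "z"))

if __name__ == "__main__":
    print("".join(sorted(digits_without_135)))
    print("".join(sorted(neither_digit_nor_lower)))
```
</eg>
</p>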
</div>

<div>
<head>Changes to production 14 posCharGroup and accompanying prose</head>
<p>This is part of Draft W's changes to the presentation of character groups.
<list>
<item><label>1E, PER, 2E, D4-D8</label>: 
<q type="block">
<p>[Definition:] A positive character group consists of one or more
character ranges or character class escapes, concatenated
together. A positive character group identifies the set of characters
containing all of the characters in all of the sets identified by its
constituent ranges or escapes.  
</p>
<eg>[14] posCharGroup ::= (charRange | charClassEsc)+</eg>
</q>
</item>
<item><label>W</label>:
<q type="block">
<p>[Definition:] A positive character group consists of one or more
character group parts, concatenated together. The set of characters
identified by a positive character group is the union of all of the
sets identified by its constituent character group parts.</p>
<eg>[77] posCharGroup ::= (charGroupPart)+</eg>
</q>
</item>
</list>
</p>
</div>

<div>
<head>Changes to description of negative character groups</head>
<p>This is part of Draft W's changes to the presentation of character groups.
There is no change to the definition of the non-terminal.
<list>
<item><label>1E, PER, 2E, D4-D8</label>: 
<q type="block">
<p>[Definition:] A negative character group is a positive character
group preceded by the ^ character. For all positive character groups
P, ^P is a valid negative character group, and C(^P) contains all XML
characters that are not in C(P).</p>
</q>
</item>
<item><label>W</label>:
<q type="block">
<p>[Definition:] A negative character group (negCharGroup) consists of
a ^ character followed by a positive character group. The set of
characters identified by a negative character group C(^P) is the set
of all characters that are not in C(P).</p>
</q>
</item>
</list>
</p>
</div>

<div>
<head>Definition of character group part</head>
<p>W refactors the makeup of positive character groups:
<list>
<item><label>1E, PER, 2E, D4-D8</label>: <foreign>om.</foreign></item>
<item><label>W</label>:
<q type="block">
<p>[Definition:]  A character group part (charGroupPart) is either a single unescaped character (SingleCharNoEsc), a single escaped character (SingleCharEsc), or a character range (charRange).</p>
<eg>
[79] charGroupPart ::= singleChar | charRange 
[80] singleChar    ::= SingleCharEsc | SingleCharNoEsc 
</eg>
<p>
If a charGroupPart starts with a singleChar and this is immediately
followed by a hyphen, and if the hyphen is part of the character group
(that is, it is not being treated as a subtraction operator because
it is followed by '['), then the hyphen must be followed by another
singleChar, and the sequence (singleChar, hyphen, singleChar) is
treated as a charRange. It is an error if either of the two
singleChars in a charRange is a SingleCharNoEsc comprising an
unescaped hyphen.
</p></q>
</item>
</list>
</p>
</div>

<div>
<head>Definition of character-class subtraction</head>
<p>W eliminates character-class subtraction as a separate non-terminal:
<list>
<item><label>1E, PER, 2E, D4-D8</label>: 
<q type="block">
<p>[Definition:] A character class subtraction is a character class
expression subtracted from a positive character group or negative
character group, using the - character.  
</p>
<eg>[16] charClassSub ::= ( posCharGroup | negCharGroup ) '-' charClassExpr</eg>
<p></p>
</q></item>
<item><label>W</label>: <foreign>om.</foreign>
</item>

</list>
</p>
</div>
<div>
<head>Changes to production 17 charRange</head>
<p>The PER changes the rule for character ranges; 2E changes it again.
<list>
<item><label>1E</label>: 
<eg>[17] charRange ::= seRange 
                 | XmlCharRef 
                 | XmlCharIncDash </eg>
</item>
<item><label>PER</label>: 
<eg>[17] charRange ::= seRange 
                 | XmlChar </eg>
</item>
<item><label>2E, D4-D8</label>: 
<eg>[17] charRange ::= seRange 
                 | XmlCharIncDash </eg>
</item>
<item><label>W</label>: 
<eg>[82] charRange ::= charOrEsc '-' charOrEsc  
                   /* Or should this be: singleChar '-' singleChar ? */</eg>
Note that this is the old definition of <ident>seRange</ident>.
</item>
</list>
</p>
</div>

<div>
<head>Changes to production 18 seRange</head>
<p>W deletes production 18.
<list>
<item><label>1E-D8</label>: 
<eg>[18] seRange ::= charOrEsc '-' charOrEsc</eg>
</item>
<item><label>W</label>: <foreign>om.</foreign></item>
</list>
</p>
</div>

<div>
<head>Changes to production 19 XmlCharRef</head>
<p>The PER deletes production 19.
<list>
<item><label>1E</label>: 
<eg>[19] XmlCharRef ::= ( '&amp;#' [0-9]+ ';' )
                  | (' &amp;#x' [0-9a-fA-F]+ ';' ) </eg>
</item>
<item><label>PER-D8, W</label>: <foreign>om.</foreign></item>
</list>
</p>
</div>

<div>
<head>Changes to productions 20 charOrEsc and 21 XmlChar</head>
<p>W deletes productions 20 and 21.
<list>
<item><label>1E-D8</label>: 
<eg>[20] charOrEsc ::= XmlChar | SingleCharEsc
[21] XmlChar ::= [^\#x2D#x5B#x5D]</eg>
</item>
<item><label>W</label>: <foreign>om.</foreign></item>
</list>
</p>
</div>

<div>
<head>Changes to production 22 XmlCharIncDash</head>
<p>The PER deletes production 22; 2E restores it.  W deletes it again.
<list>
<item><label>1E, 2E, D4-D8</label>: 
<eg>[22] XmlCharIncDash ::= [^\#x5B#x5D]</eg>
</item>
<item><label>PER, W</label>: <foreign>om.</foreign></item>
</list>
</p>
</div>

<div>
<head>Changes to prose after production 22</head>
<p>The notes after production 22 vary.
<list>
<item><label>1E</label>: 
<q type="block">
<p>A single XML character is a character range that identifies the set
of characters containing only itself. All XML characters are valid
character ranges, except as follows:
<list>
<item>The [, ], - and \ characters are not valid character ranges;</item>
<item>The ^ character is only valid at the beginning of a positive
character group if it is part of a negative character group</item>
<item>The - character is a valid character range only at the beginning
or end of a positive character group.</item></list></p></q>
</item>
<item><label>PER</label> (omits third bullet item):
<q type="block">
<p>A single XML character is a character range that identifies the set
of characters containing only itself. All XML characters are valid
character ranges, except as follows:
<list>
<item>The [, ], - and \ characters are not valid character ranges;</item>
<item>The ^ character is only valid at the beginning of a positive
character group if it is part of a negative character group</item>
</list></p></q>
</item>
<item><label>2E, D4-D8</label> (restores third bullet item, adds note
specifying a resolution of ambiguity):
<q type="block">
<p>A single XML character is a character range that identifies the set
of characters containing only itself. All XML characters are valid
character ranges, except as follows:
<list>
<item>The [, ], - and \ characters are not valid character ranges;</item>
<item>The ^ character is only valid at the beginning of a positive
character group if it is part of a negative character group</item>
<item>The - character is a valid character range only at the beginning
or end of a positive character group.</item></list></p>
<p>Note: The grammar for character range as given above is
ambiguous, but the second and third bullets above together remove the
ambiguity.</p></q>
Note that LC1 erroneously shows the third bullet as an addition vis-a-vis
D6; it's not clear how this error occurred.
</item>
<item><label>W</label>: <foreign>om.</foreign></item>
</list>
</p>
</div>
<div>
<head>Definition of SingleCharNoEsc</head>
<p>W adds the non-terminal <ident>SingleCharNoEsc</ident>.
<list>
<!-- one of (item label) -->
<item><label>W</label>:
<q type="block">
<eg>[87] SingleCharNoEsc ::= [^\#x5B#x5D]</eg>
<p>
A single unescaped character (SingleCharNoEsc) is any character except
'[' or ']'. There are special rules, described earlier, that
constrain the use of the characters '-' and '^' in order to
disambiguate the syntax.
</p>
<p>
A single unescaped character identifies the singleton set of
characters containing that character alone.
</p>
<p>
A single escaped character, when used within a character group,
identifies the singleton set of characters containing the character
denoted by the escape (see Character Class Escapes (&sect;G.1.1)).
</p></q></item>
</list>
</p>
</div>

<div>
<head>Changes to list of blocks</head>
<p>PER drops some surrogate blocks from the list of block codes, and
adds a note explaining their absence.  D6 adds some additional blocks.
<list>
<item><label>1E</label> includes <list>
<item>#xD800 - #xDB7F : HighSurrogates</item>
<item>#xDB80 - #xDBFF : HighPrivateUseSurrogates</item>
<item>#xDC00 - #xDFFF : LowSurrogates</item>
</list>
in the list of named Unicode blocks.
</item>
<item><label>PER, 2E</label>:  deletes those three blocks and adds: 
<q type="block"><p>Note: The blocks mentioned above exclude the
<code>HighSurrogates</code>, <code>LowSurrogates</code> and 
<code>HighPrivateUseSurrogates</code> blocks.
These blocks identify "surrogate" characters, which do not occur at
the level of the "character abstraction" that XML instance documents
operate on. </p></q>
</item>
<item><label>D6</label>:  adds numerous additional blocks:<list>
<item>#x0500 - #x052F : CyrillicSupplement</item>
<item>#x0750 - #x077F : ArabicSupplement</item>
<item>#x1380 - #x139F : EthiopicSupplement</item>
<item>#x1700 - #x171F : Tagalog</item>
<item>#x1720 - #x173F : Hanunoo</item>
<item>#x1740 - #x175F : Buhid</item>
<item>#x1760 - #x177F : Tagbanwa</item>
<item>#x1900 - #x194F : Limbu</item>
<item>#x1950 - #x197F : TaiLe</item>
<item>#x1980 - #x19DF : NewTaiLue</item>
<item>#x19E0 - #x19FF : KhmerSymbols</item>
<item>#x1A00 - #x1A1F : Buginese</item>
<item>#x1D00 - #x1D7F : PhoneticExtensions</item>
<item>#x1D80 - #x1DBF : PhoneticExtensionsSupplement</item>
<item>#x1DC0 - #x1DFF : CombiningDiacriticalMarksSupplement</item>
<item>#x27C0 - #x27EF : MiscellaneousMathematicalSymbols-A</item>
<item>#x27F0 - #x27FF : SupplementalArrows-A</item>
<item>#x2900 - #x297F : SupplementalArrows-B</item>
<item>#x2980 - #x29FF : MiscellaneousMathematicalSymbols-B</item>
<item>#x2A00 - #x2AFF : SupplementalMathematicalOperators</item>
<item>#x2B00 - #x2BFF : MiscellaneousSymbolsandArrows</item>
<item>#x2C00 - #x2C5F : Glagolitic</item>
<item>#x2C80 - #x2CFF : Coptic</item>
<item>#x2D00 - #x2D2F : GeorgianSupplement</item>
<item>#x2D30 - #x2D7F : Tifinagh</item>
<item>#x2D80 - #x2DDF : EthiopicExtended</item>
<item>#x2E00 - #x2E7F : SupplementalPunctuation</item>
<item>#x31C0 - #x31EF : CJKStrokes</item>
<item>#x31F0 - #x31FF : KatakanaPhoneticExtensions</item>
<item>#x4DC0 - #x4DFF : YijingHexagramSymbols</item>
<item>#xA700 - #xA71F : ModifierToneLetters</item>
<item>#xFE00 - #xFE0F : VariationSelectors</item>
<item>#xFE10 - #xFE1F : VerticalForms</item>
<item>#x10000 - #x1007F : LinearBSyllabary</item>
<item>#x10080 - #x100FF : LinearBIdeograms</item>
<item>#x10100 - #x1013F : AegeanNumbers</item>
<item>#x10140 - #x1018F : AncientGreekNumbers</item>
<item>#x10300 - #x1032F : OldItalic</item>
<item>#x10330 - #x1034F : Gothic</item>
<item>#x10380 - #x1039F : Ugaritic</item>
<item>#x103A0 - #x103DF : OldPersian</item>
<item>#x10400 - #x1044F : Deseret</item>
<item>#x10450 - #x1047F : Shavian</item>
<item>#x10480 - #x104AF : Osmanya</item>
<item>#x10800 - #x1083F : CypriotSyllabary</item>
<item>#x10A00 - #x10A5F : Kharoshthi</item>
<item>#x1D000 - #x1D0FF : ByzantineMusicalSymbols</item>
<item>#x1D100 - #x1D1FF : MusicalSymbols</item>
<item>#x1D200 - #x1D24F : AncientGreekMusicalNotation</item>
<item>#x1D300 - #x1D35F : TaiXuanJingSymbols</item>
<item>#x1D400 - #x1D7FF : MathematicalAlphanumericSymbols</item>
<item>#x20000 - #x2A6DF : CJKUnifiedIdeographsExtensionB</item>
<item>#x2F800 - #x2FA1F : CJKCompatibilityIdeographsSupplement</item>
<item>#xE0000 - #xE007F : Tags</item>
<item>#xE0100 - #xE01EF : VariationSelectorsSupplement</item>
<item>#xF0000 - #xFFFFF : SupplementaryPrivateUseArea-A</item>
<item>#x100000 - #x10FFFF : SupplementaryPrivateUseArea-B</item>
</list>
In addition, 
<list>
<item>The upper bound of block ArabicPresentationForms-B is
changed from #xFEFE to #xFEFF.</item>
<item>The Specials block from #xFEFF to #xFEFF is dropped.</item>
<item>For the Specials block beginning at #xFFF0, the upper bound
is changed from #xFFFD to #xFFFF.</item>
</list>

</item>
</list>
</p>
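<p>Since each block is simply a named range of code points, a block
escape such as <code>\p{IsGlagolitic}</code> amounts to a range test.
The following Python sketch shows the idea using a handful of entries
copied from the D6 list above; the table is deliberately incomplete, and
the names <code>block_of</code> and <code>matches_is_block</code> are
hypothetical.
<eg>
```python
# A few (first, last, name) entries copied from the D6 additions.
BLOCKS = [
    (0x0500, 0x052F, "CyrillicSupplement"),
    (0x1700, 0x171F, "Tagalog"),
    (0x2C00, 0x2C5F, "Glagolitic"),
    (0x10000, 0x1007F, "LinearBSyllabary"),
    (0x1D100, 0x1D1FF, "MusicalSymbols"),
    (0x100000, 0x10FFFF, "SupplementaryPrivateUseArea-B"),
]


def block_of(ch: str):
    """Return the block name for ch, or None if no sample block contains it."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if cp >= first and last >= cp:
            return name
    return None


def matches_is_block(ch: str, name: str) -> bool:
    r"""Emulate \p{IsName} against the sample table only."""
    return block_of(ch) == name


if __name__ == "__main__":
    print(block_of(chr(0x2C00)))
    print(matches_is_block(chr(0x1D11E), "MusicalSymbols"))
```
</eg>
</p>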
</div>

<div id="multicharesc">
<head>Changes to production 37 MultiCharEsc</head>
<p>PER separates out <q><code>.</code></q> (the wildcard escape) from the other
escapes.  As a result, the wildcard escape is no longer legal
as a positive character group, so <q><code>[.]</code></q>
and <q><code>[^.]</code></q> are no longer ambiguous: the positive
character group in each is the full-stop character, not the wildcard
escape.  The first matches the same strings as <q><code>\.</code></q>
(i.e. the full-stop character),
and the second the same as <q><code>[^\.]</code></q>
(i.e. any character <emph>but</emph> the full-stop character).
If the full stop were interpreted as the wildcard escape, the
first would match any character except newline and
carriage return, and the second would match the complement of that set,
i.e. newline or carriage return.
<list>
<item><label>1E</label>: 
<eg>[37] MultiCharEsc ::= '.' | ('\' [sSiIcCdDwW])</eg>
</item>
<item><label>PER, 2E, D4-D8, W</label>: 
<eg>[37]  MultiCharEsc ::= '\' [sSiIcCdDwW]
[37a] WildcardEsc  ::= '.'
</eg>
</item>
</list>
</p>
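<p>The two readings of a full stop inside brackets can be compared in a
short Python sketch.  Python's <code>re</code> module happens to treat
an unescaped full stop in a character class as a literal, which
coincides with the PER reading; the wildcard reading is emulated here by
writing out the equivalent classes by hand.  The constant and function
names are hypothetical, and the sketch is not an XSD regex implementation.
<eg>
```python
import re

LITERAL_POS = r"[.]"        # PER and later: same as \.
LITERAL_NEG = r"[^.]"       # PER and later: same as [^\.]
WILDCARD_POS = r"[^\n\r]"   # what [.] would mean if '.' were the wildcard
WILDCARD_NEG = r"[\n\r]"    # what [^.] would mean under that reading


def full(pattern: str, text: str) -> bool:
    """Whole-string match, mirroring XSD's implicit anchoring."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for sample in [".", "a", "\n"]:
        print(repr(sample),
              full(LITERAL_POS, sample), full(LITERAL_NEG, sample),
              full(WILDCARD_POS, sample), full(WILDCARD_NEG, sample))
```
</eg>
</p>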
</div>

<div>
<head>Changes to definition of <q><code>\i</code></q></head>
<p>After production 37a, D6 changes the definition of <q><code>\i</code></q>;
LC1 changes it again.
<list>
<item><label>1E through D5</label>: 
<q type="block">
<list>
<label>\i</label>
<item>the set of initial name characters, those matched by Letter | '_' | ':' </item>
</list>
</q>
</item>
<item><label>D6</label>: 
<q type="block">
<list>
<label>\i</label>
<item>the set of initial name characters, those matched by NameStartChar</item>
</list>
</q>
</item>
<item><label>LC1</label>: 
<q type="block">
<list>
<label>\i</label>
<item>the set of initial name characters, those matched by NameStartChar in [XML] 
or by  Letter | '_' | ':' in [XML 1.0]</item>
</list>
</q>
</item>
</list>
</p>
</div>

<!--*
<div>
<head>Vn Changes to production nn Nonterminal</head>
<p><label>Vn</label> 
<list>
<item><label>1E</label>: 
<eg></eg>
</item>
<item><label>PER</label>: 
<eg></eg>
</item>
<item><label>2E</label>: 
<eg></eg>
</item>
</list>
</p>
</div>
*-->

</div>
<div>
<head>Open issues</head>
<p>Several issues are open against our regular expression language.</p>

<div>
<head>Bug 1889 Regex [+-] syntax</head>
<p>Issue <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=1889">1889</xref>
was raised by Michael Kay on 15 August 2005.  It raises several points.
<list>
<item><p>What characters are allowed as the start- and end-characters of
ranges?</p>
<p>The grammar implies that backslash, hyphen, and square brackets are
not allowed without escaping, but the prose suggests that only
backslash is disallowed at the start of a range, and only backslash
and left square bracket are disallowed at the end.</p>
</item>
<item><p>In the prose, does <q>XML character</q> mean a character allowed
in XML, or a character that matches the non-terminal <ident>XmlChar</ident>?</p>
<p>The non-terminal <ident>XmlChar</ident> excludes backslash, hyphen,
and both square brackets (and so makes the prose description redundant and
confusing: a correct but incomplete and unnecessary description of
part, but not all, of the governing rule).</p>
</item>
<item><p>Is <q><code>-</code></q> a legal character range, and when?</p>
<p>The list after the production group headed <q>Character Range</q>
mentions it twice in what appear to be contradictory ways:
<q>The [, ], - and \ characters are not valid character ranges;</q>
and
<q>The - character is a valid character range only at the beginning or end of
a positive character group.</q>
</p></item>
</list>
</p>
</div>

<div>
<head>Bug 2123 R-134: Treatment of ^ in regexes</head>
<p>In bug <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=2123">2123</xref> 
(raised 5 April 2002), James Clark observes that the prose statement that 
<q>The ^ character is only valid at the beginning of a ·positive character
group· if it is part of a ·negative character group·</q> seems to contradict
the EBNF, which allows expressions like <q><code>[^X]</code></q> (which
is ambiguous) and <q><code>[^]</code></q> (which is not).</p>
<p>
He does not make clear whether he suggests changing the prose to say
clearly that it is overriding the grammar, or changing the grammar to
make the claim in the prose follow from the grammar, e.g. by defining
<ident>posCharGroup</ident> not as
<eg>posCharGroup ::= ( charRange | charClassEsc  )+</eg>
but (using the notation of XML 1.0) as 
<eg>posCharGroup ::= ( (charRange | charClassEsc) - '^') 
                     (charRange | charClassEsc)*
</eg>
or (perhaps more simply) as
<eg>posCharGroup ::= ( charRange | charClassEsc  )+ - ('^' Char*)</eg>
(Unfortunately, this won't work, because if the posCharGroup is part
of a negCharGroup, the caret <emph>is</emph> allowed.)
</p>
</div>

<div>
<head>Bug 2216 R-224: Questions about metacharacters in regular expressions</head>
<p>Bug <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=2216">2216</xref> 
(raised 10 July 2003) points out that:
<list>
<item>The prose says <q>A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ]. </q></item>
<item>The prose says <q>A normal character is any XML character that is not a 
metacharacter.</q></item>
<item>The EBNF of 1.0 defines <ident>Char</ident> (now <ident>NormalChar</ident>), under
the heading <q>Normal Character</q>, as <eg>[^.\?*+()|#x5B#x5D]</eg>
</item>
</list>
Braces { } are listed in the prose, but are not in the EBNF.
Vertical bar | is in the EBNF but not mentioned in the prose.
</p>
</div>

<div>
<head>Bug 3260 Use of "-" in regular expressions</head>
<p>Bug <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=3260">3260</xref>, 
raised 9 May 2006 by Michael Kay on behalf of the XML Query and XSL Working
Groups, says <q>(General comment) in regular expressions defining permitted values for data
types, the character "-" is sometimes preceded by "\" and sometimes not. The
schema regex syntax, I believe, does not permit the "\".</q>
</p>
</div>

<div>
<head>Bug 3659 Bugs in date/time regexes</head>
<p>Bug <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=3659">3659</xref>
reflects <xref href="http://lists.w3.org/Archives/Public/www-xml-schema-comments/2006JulSep/0024.html"
>email sent 6 September 2006</xref> to the XML Schema comments list by
Laurens Holst, saying that the regular expressions given for the date/time
types do not match the grammar and offering new ones.
</p>
</div>

<div>
<head>Bug 5431 Normal characters, character references</head>
<p>Bug <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=5431">5431</xref>
was raised 26 January 2008 by Dave Peterson to urge clarification of
confusing text about normal characters and (XML) character references.</p>
</div>

<div>
<head>Bug 5486 Regex: Names of Unicode code blocks</head>
<p>Bug <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=5486">5486</xref>
(raised 16 February 2008 by Michael Kay) points out that the normative reference
to Unicode is now to version 4.1, but the block names listed in the text
are those of 3.1.
</p>
</div>

<div>
<head>Bug 5321 REs are not production nonterminals</head>
<p>Bug <xref href="http://www.w3.org/Bugs/Public/show_bug.cgi?id=5321">5321</xref>,
raised 15 December 2007 by Dave Peterson, notes that the notation used
to define our regular expressions appears not to be formally defined or
even described anywhere.
</p>
</div>

</div>
<div>
<head>Test cases</head>
<p>Some test cases may help illustrate the ways in which the changes
listed above do, or do not, result in the grammar accepting different languages.
</p>
<div>
<head>Manually constructed tests</head>
<p>The following expressions were constructed manually in an attempt to
exercise the points of difference in the various grammars.  They are
<emph>not</emph> all proposed as legal regular expressions: for most
of them, part of the point is that some definitions of the regex language
accept them and other definitions do not.  A few strings are included
which really ought to be accepted by all definitions, or rejected by all,
as a kind of sanity check.</p>
<p>Some expressions which use or misuse ^ and $ as 
anchors, or appear to do so. (If we include the XPath 2.0
regex language, these become more directly relevant.)
<list>
<item><q><code>^a*$</code></q></item>
<item><q><code>a+b</code></q></item>
<item><q><code>.*a+b.*</code></q></item>
<item><q><code>a*^b$a+</code></q></item>
<item><q><code>a*\^b\$a+</code></q></item>
</list>
   <!--* 2 Char -> NormalChar is purely cosmetic *-->
</p>
<p>Either | is a metacharacter or it is not; these expressions test which.
<list>
<item><q><code>|</code></q></item>
<item><q><code>|+</code></q></item>
<item><q><code>|||</code></q></item>
<item><q><code>\||\|</code></q></item>
</list>
</p>
<p>
If { and } are defined as normal characters, 
<q><code>x{4}</code></q> has two parses, one matching
<q><code>xxxx</code></q> and one matching, well, 
<q><code>x{4}</code></q>.
<list>
<item><q><code>{</code></q></item>
<item><q><code>{}</code></q></item>
<item><q><code>x{4}</code></q> (How many parses?)</item>
<item><q><code>x{4,}</code></q></item>
<item><q><code>x{,4}</code></q></item>
</list>
</p>
<p>Separating wildcard escape out from the other multi-character
escapes has no effect at the top level.  So the first few of these should all
be accepted, as escapes.  But within character classes, wildcard
escape is no longer accepted as a positive character group.  The
last two are still legal, but have a different interpretation.
<list>
<item><q><code>\s</code></q></item>
<item><q><code>\d</code></q></item>
<item><q><code>\W</code></q></item>
<item><q><code>.</code></q></item>
<item><q><code>[.]</code></q></item>
<item><q><code>[.-[abc]]</code></q></item>
<item><q><code>[^.]</code></q></item>
<item><q><code>[^.-[\r]]</code></q></item>
</list>
</p>
   <!--* 6 wildcard escape in prose *-->
   <!--* 7 charGroup refactoring, does not change language *-->

<p>The redefinition of positive character group by draft W has
the effect (if I am reading things right) of excluding
multi-character escapes from positive character groups.
   <!--* 8 posCharGroup redefined, loses catEsc, complEsc, MultiCharEsc *-->
   <!--* 10 redefinition of charGroupPart *-->
<list>
<item><q><code>[abc]</code></q></item>
<item><q><code>[a-cx-z]</code></q></item>
<item><q><code>[\d\s]</code></q></item>
<item><q><code>[^\d\s]</code></q></item>
<item><q><code>[0-9-[135]]</code></q></item>
<item><q><code>[\d-[13579]]</code></q></item>
<item><q><code>[\p{Lu}-[AEIOU]]</code></q></item>
<item><q><code>[\p{IsBasicLatin}-[AEIOU]]</code></q></item>
<item><q><code>[\p{Lu}\p{Ll}-[\p{IsCherokee}]]</code></q></item>
<item><q><code>[\p{L}-[\p{Lm}]]</code></q></item>

<item><q><code>[-]</code></q></item>
<item><q><code>[a-kl-z]</code></q></item>
<item><q><code>[a-k-z]</code></q></item>
<item><q><code>[a\-k-z]</code></q></item>
<item><q><code>[a-k\-z]</code></q></item>
<item><q><code>[a\-k\-z]</code></q></item>
</list>
</p>
<p>
The changes to the charRange production (and others, perhaps)
affect the use of hyphen, especially.
<list>
<item><q><code>[+--]</code></q></item>
<item><q><code>[+-\-]</code></q></item>
<item><q><code>[+\--]</code></q> (redundant, but legal)</item>
<item><q><code>[\--=]</code></q></item>
<item><q><code>[--=]</code></q></item>
<item><q><code>[-\-=]</code></q></item>
<item><q><code>[\-\-=]</code></q> (redundant, but legal)</item>
<item><q><code>[--?]</code></q></item>
<item><q><code>[\--?]</code></q></item>
<item><q><code>[-\-?]</code></q></item>
<item><q><code>[\-\-?]</code></q></item>
<item><q><code>[--\?]</code></q></item>
<item><q><code>[\--\?]</code></q></item>
<item><q><code>[xyz0-9()*/+-]</code></q></item>
<item><q><code>[xyz0-9()*/+\-]</code></q></item>
<item><q><code>[^-+*/]</code></q></item>
<item><q><code>[^\-+*/]</code></q></item>
</list>
</p>
<p>The changes to the list of blocks are fairly straightforward:
some new names are legal, some old names are illegal.
<list>
<item><q><code>\p{IsBasicLatin}</code></q></item>
<item><q><code>\p{IsCherokee}</code></q></item>
<item><q><code>\p{IsCoptic}</code></q></item>
<item><q><code>\p{IsGothic}</code></q></item>
<item><q><code>\p{IsHighSurrogates}</code></q></item>
</list>
</p>
<p>A schema document which defines a simple type using all of the preceding
patterns is in this directory at 
<xref>dummy.xsd</xref>.  In its current form, it provides no help
keeping track of the prescribed interpretation of any of the legal expressions
above, but it will do as a way to see whether processors will accept them.
(Processors which quit after the first error, of course, won't check
all of the pattern elements in <xref>dummy.xsd</xref>.)</p>
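<p>Because processors which quit after the first error never reach the
later patterns, one workaround is to split the patterns out, one per
schema document.  The following Python sketch assumes only that the
patterns appear as <code>xsd:pattern</code> elements with 
<code>value</code> attributes; the function names 
<code>pattern_values</code> and <code>single_pattern_schema</code>,
and the type name <code>t</code>, are invented for illustration and
are not part of <xref>dummy.xsd</xref> or of any tool in this
directory.</p>
<eg><![CDATA[
```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Serialize the schema namespace with the conventional xsd: prefix, so
# that the QName in base="xsd:string" below resolves as intended.
ET.register_namespace("xsd", XSD_NS)


def pattern_values(schema_text):
    """Return the value of every xsd:pattern element, in document order."""
    root = ET.fromstring(schema_text)
    return [p.get("value") for p in root.iter("{%s}pattern" % XSD_NS)]


def single_pattern_schema(value):
    """Build a schema document declaring one simple type with one pattern."""
    schema = ET.Element("{%s}schema" % XSD_NS)
    simple = ET.SubElement(schema, "{%s}simpleType" % XSD_NS, name="t")
    restriction = ET.SubElement(
        simple, "{%s}restriction" % XSD_NS, base="xsd:string")
    ET.SubElement(restriction, "{%s}pattern" % XSD_NS, value=value)
    return ET.tostring(schema, encoding="unicode")


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as f:
        for number, value in enumerate(pattern_values(f.read()), start=1):
            # One output file per pattern (hypothetical naming scheme).
            with open("pattern%03d.xsd" % number, "w", encoding="utf-8") as out:
                out.write(single_pattern_schema(value))
```
]]></eg>
<p>Each generated document can then be handed to a processor
separately, so that an error in one pattern does not hide the
processor's verdict on the others.</p>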
</div>
<div>
<head>Generation of tests by overgenerating grammar</head>
<p>One simple way to generate test cases is to simplify the grammar
to make it less restrictive (which in turn means it will generate not
only strings legal against the real grammar, but strings rejected
by the real grammar).</p>
<p>The module <xref>maketests.pl</xref> implements this approach.
In order to avoid a mind-numbing series of single-character regular
expressions marching systematically through all of Unicode,
followed by an even more mind-numbing series of two-character
regular expressions, it defines a very restrictive set of 
characters in which (for example) <q><code>a</code></q>
and <q><code>d</code></q> stand in for all Unicode characters
not used by the regex language in special meanings.</p>
<p>[More detail desirable here.  How to use Prolog to generate
from the grammar.]</p>
<p>One drawback to this method is that strings are generated in order
of increasing length; for any given construct which requires a certain
minimum length (e.g. character-class subtraction), it can take a 
very long time before any strings appear which exercise that
part of the grammar.</p>
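<p>The size of the problem can be seen from a small Python sketch.  It
is not the code in <xref>maketests.pl</xref>; it takes the 
overgeneration idea to its extreme (every string over the alphabet is
generated), and the seven-symbol alphabet is an assumption made 
for illustration.</p>
<eg><![CDATA[
```python
from itertools import count, islice, product

# Hypothetical restricted alphabet: "a" and "d" stand in for all ordinary
# characters; the others have special roles inside character classes.
ALPHABET = ["a", "d", "[", "]", "-", "^", "\\"]


def overgenerate(alphabet=ALPHABET):
    """Yield every string over the alphabet, shortest strings first."""
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def count_shorter_than(length, alphabet_size):
    """Number of non-empty strings shorter than `length` characters."""
    return sum(alphabet_size ** n for n in range(1, length))


if __name__ == "__main__":
    print(list(islice(overgenerate(), 10)))
    # A subtraction such as [a-[d]] needs seven characters, so every
    # shorter string is generated before the first one can appear.
    print(count_shorter_than(len("[a-[d]]"), len(ALPHABET)))
```
]]></eg>
<p>With seven symbols there are 137,256 non-empty strings shorter than
seven characters, all of which precede the first string long enough to
contain a character-class subtraction.</p>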
</div>
<div>
<head>Random strings and not-quite-random strings</head>
<p>Another way of generating test cases is simply to generate
random strings.  By calling a random-number generator repeatedly,
and using the random numbers to construct a sequence of characters,
one can construct strings of arbitrary length, without having
to work through the sentences of the language systematically.</p>
<p>A very simple way to generate random strings is to ask
for random numbers in the range 32 to 2**16, and choose the
corresponding Unicode character for each.  This way, however, the
random number generator spends a lot of its time generating
different Unicode characters that do not have syntactically
distinct roles; more <soCalled>interesting</soCalled> test
cases are generated if the top of the range is set to 
126.  </p>
<p>[Need to find code from 2005 that works this way.]</p>
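<p>The following Python sketch is not that code; it is a minimal
reconstruction of the idea just described.  The function names, the
maximum length of 12, and the use of a seed are assumptions made for
illustration.</p>
<eg><![CDATA[
```python
import random


def random_string(length, low=32, high=126, rng=random):
    """Return `length` characters drawn uniformly from code points low..high."""
    return "".join(chr(rng.randint(low, high)) for _ in range(length))


def random_tests(n, max_length=12, high=126, seed=None):
    """Return n random candidate expressions of 1 to max_length characters."""
    rng = random.Random(seed)  # a fixed seed makes a test run repeatable
    return [
        random_string(rng.randint(1, max_length), high=high, rng=rng)
        for _ in range(n)
    ]


if __name__ == "__main__":
    for candidate in random_tests(10, seed=2008):
        print(candidate)
```
]]></eg>
<p>Raising <code>high</code> towards 2**16 reproduces the wider range
mentioned above; in that case the output would also need filtering,
since the range then includes surrogate code points, which are not 
legal XML characters.</p>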
<p>A different approach may yield more interesting test cases
more quickly:<list>
<item>Create a list of strings, including <list>
<item>each string that appears in the grammar as a literal token
(or would, if the grammar were written that way)</item>
<item>one or more strings for each syntactically distinct
set of characters</item>
<item>optionally, for each non-terminal <ident>N</ident> in the grammar
one or more strings known to be in <ident>L(N)</ident></item>
<item>optionally, for each non-terminal <ident>N</ident> in the grammar
one or more strings known to be similar to items 
in <ident>L(N)</ident>, but not themselves in <ident>L(N)</ident>
(<soCalled>false friends</soCalled>).
</item>
</list>
</item>
<item>Put the strings into a 1..<ident>n</ident> array.</item>
<item>Run the random number generator in the range 1..<ident>n</ident>
and concatenate the resulting strings.</item>
</list>
</p>
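<p>The steps above might be sketched in Python as follows.  The token
list is a small hypothetical sample put together for illustration (the
<soCalled>false friends</soCalled> are borrowed from the manually 
constructed tests); a real list would be derived from the grammar.</p>
<eg><![CDATA[
```python
import random

# Hypothetical token list, grouped as in the description above.
TOKENS = [
    # strings that appear (or could appear) as literal tokens
    "(", ")", "|", "?", "*", "+", "{", "}", ",", "[", "]", "^", "-", ".",
    "\\p{", "\\P{",
    # one representative for each syntactically distinct set of characters
    "a", "4", "\\d", "\\s", "\\-", "\\n",
    # strings known to be in L(N) for some non-terminal N
    "{2,4}", "[a-z]", "\\p{Lu}", "[0-9-[135]]",
    # false friends: similar to members of some L(N), but contested
    "{,4}", "[a-k-z]", "[+--]",
]


def random_concatenation(tokens, count, rng):
    """Concatenate `count` tokens, each chosen by a random index in 1..n."""
    n = len(tokens)
    return "".join(tokens[rng.randint(1, n) - 1] for _ in range(count))


if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed, so that runs are repeatable
    for _ in range(10):
        print(random_concatenation(TOKENS, rng.randint(1, 6), rng))
```
]]></eg>
<p>Because multi-character tokens such as <q><code>\p{</code></q> are
chosen in a single step, constructs which need several specific
characters in a row should turn up far sooner than they would
character by character.</p>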
<p>[Implementation needed to see whether this does produce
more 'interesting' tests faster.]</p>
</div>
</div>
</body>
<back>
<div>
<head>To do</head>
<list>
<item>Separate utility predicates from parser (make 
<xref>regex.dcg</xref> purer).</item>
<item>Write code to verify AST returned by parser,
as a way of implementing ad-hoc non-grammatical rules
for ambiguity resolution or special constraints.</item>
<item>Write code to dump ASTs in XML.</item>
<item>Write code to invoke parser and AST-validator with appropriate
options for a particular grammar.</item>
<item>Write code to invoke all parsers in sequence and determine
whether they agree or not.</item>
<item>Write code to generate test strings, parse with all grammars,
and report.</item>
<item>Write code to read an XSD schema document and test each
<code>//xsd:pattern/@value</code>: parse with all grammars,
and report.</item>
<item>Write code to read an XSD catalog file, and 
for each schema document listed, call the test-and-report
routine.</item>
<item>Draw dependency graph showing RHS &rarr; LHS references
in (all variants of the) grammar for character classes.</item>
</list>
</div>
</back>
</text>
</TEI.2>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:"/Library/SGML/Public/Emacs/sweb.ced"
sgml-omittag:t
sgml-shorttag:t
End:
-->
