Regular expressions in XSD 1.0 and 1.1

C. M. Sperberg-McQueen. Begun 25-27 March 2008. (Minor encoding fixes 9 December 2009.)

This note describes some issues relating to the regular-expression language defined by XSD 1.0 and 1.1. It is intended for the use of the W3C XML Schema Working Group and of others interested in the implementation of XSD.

Changes in the regex language from 1.0 forward

Distinct versions of the XSD spec provide slightly different versions of the regular expression language. The following sections provide a point-by-point comparison. The versions tabulated here are:

  1E     XSD 1.0, First Edition
  PER    Proposed Edited Recommendation version of XSD 1.0, Second Edition
  2E     XSD 1.0, Second Edition
  D4     16 July 2004 draft of XSD 1.1 (no changes from 2E)
  D5     24 February 2005 draft of XSD 1.1 (no changes from 2E)
  D6     16 January 2006 draft of XSD 1.1
  LC1    17 February 2006 Last-Call draft of XSD 1.1
  D8     Current status-quo draft of XSD 1.1 (26 March 2008)
  W      Proposed change for bug 1889 drafted by Michael Kay, as transcribed into spec (not necessarily correctly) by MSM, December 2007
  X      miscellaneous draft proposals for various bugs
  X1889  WG-internal draft proposal for bug 1889

The differences are given in the following sections. Production numbers used in the section headings are those of 1.0; as may be seen, they vary erratically in the drafts of 1.1. Changes solely in the production number are of course not shown.

Note about anchors

The PER adds a note about begin/end anchors; later versions follow suit. 1E: om. PER, 2E, all 1.1 versions:

Note: Unlike some popular regular expression languages (including those defined by Perl and standard Unix utilities), the regular expression language defined here implicitly anchors all regular expressions at the head and tail, as the most common use of regular expressions in pattern is to match entire literals. For example, a datatype derived from string such that all values must begin with the character A (#x41) and end with the character Z (#x5a) would be defined as follows:

  <simpleType name='myString'>
    <restriction base='string'>
      <pattern value='A.*Z'/>
    </restriction>
  </simpleType>

In regular expression languages that are not implicitly anchored at the head and tail, it is customary to write the equivalent regular expression as: ^A.*Z$ where "^" anchors the pattern at the head and "$" anchors at the tail.

In those rare cases where an unanchored match is desired, including .* at the beginning and ending of the regular expression will achieve the desired results. For example, a datatype derived from string such that all values must contain at least 3 consecutive A (#x41) characters somewhere within the value could be defined as follows:

  <simpleType name='myString'>
    <restriction base='string'>
      <pattern value='.*AAA.*'/>
    </restriction>
  </simpleType>
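The implicit anchoring described in the note can be imitated in regex engines that search by default. The following Python sketch is illustrative only: Python's re module is not an XSD regex engine, and the helper name xsd_style_match is hypothetical. It relies on re.fullmatch, which requires the whole string to match, and it assumes the pattern uses only constructs that mean the same thing in both languages.

```python
import re

def xsd_style_match(pattern: str, literal: str) -> bool:
    """Return True if the whole literal matches, as an XSD pattern facet requires.

    Assumption: `pattern` uses only constructs shared by Python's `re`
    syntax and XSD (here: literal characters, '.', and '*').
    """
    return re.fullmatch(pattern, literal) is not None

# 'A.*Z' in XSD behaves like '^A.*Z$' in an unanchored regex language.
print(xsd_style_match("A.*Z", "AbcZ"))        # True: the whole literal matches
print(xsd_style_match("A.*Z", "xAbcZ"))       # False: the leading 'x' is not covered
# An unanchored match has to be requested explicitly with '.*' on both sides.
print(xsd_style_match("AAA", "xxAAAyy"))      # False
print(xsd_style_match(".*AAA.*", "xxAAAyy"))  # True
```

Using re.search in place of re.fullmatch would give the unanchored behaviour that the note contrasts with XSD.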

Change to production 9 atom

D8 changes the name of the Char production to NormalChar; this affects both production 9 for atom and production 10 for Char.

1E, PER, 2E, D4, D5, D6, LC:

[9] atom ::= Char | charClass | ( '(' regExp ')' )

D8:

[72] atom ::= NormalChar | charClass | ( '(' regExp ')' )

Prose after production 9 atom, definition of metacharacter

D6 changes the prose definition of metacharacter. 1E, PER, 2E, D4, D5:

[Definition:] A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ]. These characters have special meanings in regular expressions, but can be escaped to form atoms that denote the sets of strings containing only themselves, i.e., an escaped metacharacter behaves like a normal character.

D6, LC, D8, W (adds |):

[Definition:] A metacharacter is either ., \, ?, *, +, {, } (, ), |, [, or ]. These characters have special meanings in regular expressions, but can be escaped to form atoms that denote the sets of strings containing only themselves, i.e., an escaped metacharacter behaves like a normal character.

Changes to production 10 Char

D6 changes the definition of Char, the production for normal characters, so that it excludes braces.

1E, PER, 2E, D4, D5:

[10] Char ::= [^.\?*+()|#x5B#x5D]

D6, LC (excludes braces, {}):

[55] Char ::= [^.\?*+{}()|#x5B#x5D]

D8:

[73] NormalChar ::= [^.\?*+{}()|#x5B#x5D]

Changes to production 11 charClass

The PER adds WildcardEsc as a right-hand side for charClass. This ensures that the wildcard escape (.) is legal as an atom (so .* and the like are legal), even though it is no longer a character-class escape. (See also the section on changes to production 37 MultiCharEsc, below.)

1E:

[11] charClass ::= charClassEsc | charClassExpr

PER, 2E, D4-D8, W:

[11] charClass ::= charClassEsc | charClassExpr | WildcardEsc

Within the part of the grammar reachable from charClass, a number of changes of this kind affect not the definition of some class of terminal symbols (like the changes to Char reported in the preceding section) but the overall structure of the grammar, changing the reachability of one symbol from another.

A graph showing the various non-terminals in this part of the grammar, and the changes in the (reachability relation of the) grammar may be helpful:

In particular, note that W excludes charClassEsc from posCharGroup (intentionally or unintentionally?).
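The reachability relation can also be checked mechanically. The Python sketch below is a hedged illustration, not part of any of the drafts: it records only the non-terminal references on each right-hand side, as tabulated in this note, and computes what is reachable from a given symbol. Two entries are assumptions, since the productions are not quoted here: that charClassExpr refers to charGroup and that negCharGroup refers to posCharGroup. W's charRange is entered as written (referring to charOrEsc, which W otherwise deletes).

```python
from collections import deque

# Non-terminal references only; literal tokens are omitted.
GRAMMAR_2E = {
    "charClass": ["charClassEsc", "charClassExpr", "WildcardEsc"],
    "charClassExpr": ["charGroup"],          # assumption: '[' charGroup ']'
    "charGroup": ["posCharGroup", "negCharGroup", "charClassSub"],
    "posCharGroup": ["charRange", "charClassEsc"],
    "negCharGroup": ["posCharGroup"],        # assumption: '^' posCharGroup
    "charClassSub": ["posCharGroup", "negCharGroup", "charClassExpr"],
    "charRange": ["seRange", "XmlCharIncDash"],
    "seRange": ["charOrEsc"],
    "charOrEsc": ["XmlChar", "SingleCharEsc"],
}

GRAMMAR_W = {
    "charClass": ["charClassEsc", "charClassExpr", "WildcardEsc"],
    "charClassExpr": ["charGroup"],          # assumption, as above
    "charGroup": ["posCharGroup", "negCharGroup", "charClassExpr"],
    "posCharGroup": ["charGroupPart"],
    "negCharGroup": ["posCharGroup"],        # assumption, as above
    "charGroupPart": ["singleChar", "charRange"],
    "singleChar": ["SingleCharEsc", "SingleCharNoEsc"],
    "charRange": ["charOrEsc"],              # as written in W; charOrEsc is left undefined
}

def reachable(grammar, start):
    """Return the set of symbols reachable from `start` in one or more steps."""
    seen = set()
    queue = deque(grammar.get(start, []))
    while queue:
        symbol = queue.popleft()
        if symbol in seen:
            continue
        seen.add(symbol)
        queue.extend(grammar.get(symbol, []))
    return seen

for name, grammar in (("2E", GRAMMAR_2E), ("W", GRAMMAR_W)):
    hit = "charClassEsc" in reachable(grammar, "posCharGroup")
    print(name, "posCharGroup reaches charClassEsc:", hit)
```

Under these assumptions the check reports that charClassEsc is reachable from posCharGroup in 2E but not in W, which is the exclusion noted above.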

Changes to prose after production 11 charClass

Draft W adjusts the prose to agree with the revised production. 1E, PER, 2E, D4-D8:

A character class is either a character class escape or a character class expression.

W:

A character class is either a character class escape or a character class expression or a wildcard escape (WildcardEsc).

W also makes various other changes to the prose which will not be registered here except insofar as they affect or reflect the grammar or changes to it.

Changes to production 13 charGroup and accompanying prose

Draft W changes the presentation of character groups; among other things, the change reduces the lookahead required to parse expressions. 1E, PER, 2E, D4-D8:

[Definition:] A character group is either a positive character group, a negative character group, or a character class subtraction.

[13] charGroup ::= posCharGroup | negCharGroup | charClassSub

W changes prose and production and adds new prose:

[Definition:] A character group (charGroup) starts with either a positive character group or a negative character group, and is optionally followed by a subtraction operator '-' and a further character class expression. [Definition:] A character group that contains a subtraction operator is referred to as a character class subtraction.

[76] charGroup ::= (posCharGroup | negCharGroup) ('-' charClassExpr)?

If the first character in a charGroup is '^', this is taken as indicating that the charGroup starts with a negCharGroup. A posCharGroup can itself start with '^' but only when it appears within a negCharGroup, that is, when the '^' is preceded by another '^'.

A '-' character is recognized as a subtraction operator (and hence, as terminating the posCharGroup or negCharGroup) if it is immediately followed by a '[' character.

For any positive character group or negative character group G, and any character class expression C, G-C is a valid character group, identifying the set of all characters in C(G) that are not in C(C).
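Draft W's rule for recognizing the subtraction operator (a '-' immediately followed by '[') lends itself to a single left-to-right scan. The Python sketch below illustrates that reading of the rule; the function name split_subtraction is hypothetical, and the code assumes that a backslash always escapes exactly the next character. It performs no validation of either part.

```python
def split_subtraction(body: str):
    """Split the text between the outer '[' and ']' of a character class
    expression into (group, subtrahend).

    `subtrahend` is None when no subtraction operator is present.
    Assumption: a backslash escapes exactly one following character.
    """
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        # A hyphen is a subtraction operator only if '[' follows at once.
        if ch == "-" and body[i + 1 : i + 2] == "[":
            return body[:i], body[i + 1 :]
        i += 1
    return body, None

print(split_subtraction("0-9-[135]"))      # ('0-9', '[135]')
print(split_subtraction("a-k-z"))          # ('a-k-z', None)
print(split_subtraction("\\d-[13579]"))    # ('\\d', '[13579]')
```

Because the decision depends only on the character after the hyphen, one character of lookahead suffices, which is consistent with the remark that W's presentation reduces the lookahead required.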

Changes to production 14 posCharGroup and accompanying prose

This is part of Draft W's changes to the presentation of character groups. 1E, PER, 2E, D4-D8:

[Definition:] A positive character group consists of one or more character ranges or character class escapes, concatenated together. A positive character group identifies the set of characters containing all of the characters in all of the sets identified by its constituent ranges or escapes.

[14] posCharGroup ::= (charRange | charClassEsc)+

W:

[Definition:] A positive character group consists of one or more character group parts, concatenated together. The set of characters identified by a positive character group is the union of all of the sets identified by its constituent character group parts.

[77] posCharGroup ::= (charGroupPart)+

Changes to description of negative character groups

This is part of Draft W's changes to the presentation of character groups. There is no change to the definition of the non-terminal. 1E, PER, 2E, D4-D8:

[Definition:] A negative character group is a positive character group preceded by the ^ character. For all positive character groups P, ^P is a valid negative character group, and C(^P) contains all XML characters that are not in C(P).

W:

[Definition:] A negative character group (negCharGroup) consists of a ^ character followed by a positive character group. The set of characters identified by a negative character group C(^P) is the set of all characters that are not in C(P).

Definition of character group part

W refactors the makeup of positive character groups.

1E, PER, 2E, D4-D8: om.

W:

[Definition:] A character group part (charGroupPart) is either a single unescaped character (SingleCharNoEsc), a single escaped character (SingleCharEsc), or a character range (charRange).

[79] charGroupPart ::= singleChar | charRange

[80] singleChar ::= SingleCharEsc | SingleCharNoEsc

If a charGroupPart starts with a singleChar and this is immediately followed by a hyphen, and if the hyphen is part of the character group (that is, it is not being treated as a substraction operator because it is followed by '['), then the hyphen must be followed by another singleChar, and the sequence (singleChar, hyphen, singleChar) is treated as a charRange. It is an error if either of the two singleChars in a charRange is a SingleCharNoEsc comprising an unescaped hyphen.

Definition of character-class subtraction

W eliminates character-class subtraction as a separate non-terminal.

1E, PER, 2E, D4-D8:

[Definition:] A character class subtraction is a character class expression subtracted from a positive character group or negative character group, using the - character.

[16] charClassSub ::= ( posCharGroup | negCharGroup ) '-' charClassExpr

W: om.

Changes to production 17 charRange

The PER changes the rule for character ranges; 2E changes it again.

1E:

[17] charRange ::= seRange | XmlCharRef | XmlCharIncDash

PER:

[17] charRange ::= seRange | XmlChar

2E, D4-D8:

[17] charRange ::= seRange | XmlCharIncDash

W:

[82] charRange ::= charOrEsc '-' charOrEsc /* Or should this be: singleChar '-' singleChar ? */

Note that W's production is the old definition of seRange.

Changes to production 18 seRange

W deletes production 18.

1E-D8:

[18] seRange ::= charOrEsc '-' charOrEsc

W: om.

Changes to production 19 XmlCharRef

The PER deletes production 19.

1E:

[19] XmlCharRef ::= ( '&#' [0-9]+ ';' ) | (' &#x' [0-9a-fA-F]+ ';' )

PER-D8, W: om.

Changes to productions 20 charOrEsc and 21 XmlChar

W deletes productions 20 and 21.

1E-D8:

[20] charOrEsc ::= XmlChar | SingleCharEsc

[21] XmlChar ::= [^\#x2D#x5B#x5D]

W: om.

Changes to production 22 XmlCharIncDash

The PER deletes production 22; 2E restores it. W deletes it again.

1E, 2E, D4-D8:

[22] XmlCharIncDash ::= [^\#x5B#x5D]

PER, W: om.

Changes to prose after production 22

The notes after production 22 vary. 1E:

A single XML character is a character range that identifies the set of characters containing only itself. All XML characters are valid character ranges, except as follows:

- The [, ], - and \ characters are not valid character ranges;
- The ^ character is only valid at the beginning of a positive character group if it is part of a negative character group
- The - character is a valid character range only at the beginning or end of a positive character group.

PER (omits third bullet item):

2E, D4-D8 (restores third bullet item, adds note specifying a resolution of ambiguity):

Note: The grammar for character range as given above is ambiguous, but the second and third bullets above together remove the ambiguity.

Note that LC1 erroneously shows the third bullet as an addition vis-a-vis D6; it's not clear how this error occurred.

W: om.

Definition of SingleCharNoEsc

W adds the non-terminal SingleCharNoEsc.

W:

[87] SingleCharNoEsc ::= [^\#x5B#x5D]

A single unescaped character (SingleCharNoEsc) is any character except '[' or ']'. There are special rules, described earlier, that constraint the use of the characters '-' and '^' in order to disambiguate the syntax.

A single unescaped character identifies the singleton set of characters containing that character alone.

A single escaped character, when used within a character group, identifies the singleton set of characters containing the character denoted by the escape (see Character Class Escapes (§G.1.1)).

Changes to list of blocks

PER drops some surrogate blocks from the list of block codes, and adds a note explaining their absence. D6 adds some additional blocks.

1E includes

  #xD800 - #xDB7F : HighSurrogates
  #xDB80 - #xDBFF : HighPrivateUseSurrogates
  #xDC00 - #xDFFF : LowSurrogates

in the list of named Unicode blocks.

PER, 2E: deletes those three blocks and adds:

Note: The blocks mentioned above exclude the HighSurrogates, LowSurrogates and HighPrivateUseSurrogates blocks. These blocks identify "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.

D6: adds numerous additional blocks:

  #x0500 - #x052F : CyrillicSupplement
  #x0750 - #x077F : ArabicSupplement
  #x1380 - #x139F : EthiopicSupplement
  #x1700 - #x171F : Tagalog
  #x1720 - #x173F : Hanunoo
  #x1740 - #x175F : Buhid
  #x1760 - #x177F : Tagbanwa
  #x1900 - #x194F : Limbu
  #x1950 - #x197F : TaiLe
  #x1980 - #x19DF : NewTaiLue
  #x19E0 - #x19FF : KhmerSymbols
  #x1A00 - #x1A1F : Buginese
  #x1D00 - #x1D7F : PhoneticExtensions
  #x1D80 - #x1DBF : PhoneticExtensionsSupplement
  #x1DC0 - #x1DFF : CombiningDiacriticalMarksSupplement
  #x27C0 - #x27EF : MiscellaneousMathematicalSymbols-A
  #x27F0 - #x27FF : SupplementalArrows-A
  #x2900 - #x297F : SupplementalArrows-B
  #x2980 - #x29FF : MiscellaneousMathematicalSymbols-B
  #x2A00 - #x2AFF : SupplementalMathematicalOperators
  #x2B00 - #x2BFF : MiscellaneousSymbolsandArrows
  #x2C00 - #x2C5F : Glagolitic
  #x2C80 - #x2CFF : Coptic
  #x2D00 - #x2D2F : GeorgianSupplement
  #x2D30 - #x2D7F : Tifinagh
  #x2D80 - #x2DDF : EthiopicExtended
  #x2E00 - #x2E7F : SupplementalPunctuation
  #x31C0 - #x31EF : CJKStrokes
  #x31F0 - #x31FF : KatakanaPhoneticExtensions
  #x4DC0 - #x4DFF : YijingHexagramSymbols
  #xA700 - #xA71F : ModifierToneLetters
  #xFE00 - #xFE0F : VariationSelectors
  #xFE10 - #xFE1F : VerticalForms
  #x10000 - #x1007F : LinearBSyllabary
  #x10080 - #x100FF : LinearBIdeograms
  #x10100 - #x1013F : AegeanNumbers
  #x10140 - #x1018F : AncientGreekNumbers
  #x10300 - #x1032F : OldItalic
  #x10330 - #x1034F : Gothic
  #x10380 - #x1039F : Ugaritic
  #x103A0 - #x103DF : OldPersian
  #x10400 - #x1044F : Deseret
  #x10450 - #x1047F : Shavian
  #x10480 - #x104AF : Osmanya
  #x10800 - #x1083F : CypriotSyllabary
  #x10A00 - #x10A5F : Kharoshthi
  #x1D000 - #x1D0FF : ByzantineMusicalSymbols
  #x1D100 - #x1D1FF : MusicalSymbols
  #x1D200 - #x1D24F : AncientGreekMusicalNotation
  #x1D300 - #x1D35F : TaiXuanJingSymbols
  #x1D400 - #x1D7FF : MathematicalAlphanumericSymbols
  #x20000 - #x2A6DF : CJKUnifiedIdeographsExtensionB
  #x2F800 - #x2FA1F : CJKCompatibilityIdeographsSupplement
  #xE0000 - #xE007F : Tags
  #xE0100 - #xE01EF : VariationSelectorsSupplement
  #xF0000 - #xFFFFF : SupplementaryPrivateUseArea-A
  #x100000 - #x10FFFF : SupplementaryPrivateUseArea-B

In addition:

- The upper bound of block ArabicPresentationForms-B is changed from #xFEFE to #xFEFF.
- The Specials block from #xFEFF to #xFEFF is dropped.
- For the Specials block beginning at #xFFF0, the upper bound is changed from #xFFFD to #xFFFF.

Changes to production 37 MultiCharEsc

PER separates out . (the wildcard escape) from other escapes. This has the effect of making wildcard escapes no longer legal as positive character groups, so [.] and [^.] are no longer ambiguous, and the positive character group in them is the full-stop character, not the wildcard escape; they match, respectively, the same as \. (i.e. the full-stop character) and the same as [^\.] (i.e. any character but the full-stop character). If the full-stop were interpreted as the wildcard escape, the first would match any character at all bar newline and carriage return, and the second would match the complement of those, i.e. newline or carriage return.

1E:

[37] MultiCharEsc ::= '.' | ('\' [sSiIcCdDwW])

PER, 2E, D4-D8, W:

[37] MultiCharEsc ::= '\' [sSiIcCdDwW]

[37a] WildcardEsc ::= '.'
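The two readings of [.] and [^.] can be spelled out as sets. The following Python sketch is purely illustrative: the four-character ALPHABET stands in for the full set of XML characters, and the function name bracket_dot is hypothetical.

```python
# A tiny sample alphabet standing in for "all XML characters" (an assumption
# made only to keep the sets small enough to print).
ALPHABET = {"a", ".", "\n", "\r"}

# The wildcard escape matches every character except newline and carriage return.
WILDCARD = ALPHABET - {"\n", "\r"}

def bracket_dot(negated: bool, dot_is_wildcard: bool) -> set:
    """Characters matched by [.] (or [^.] if `negated`) under either reading."""
    inner = WILDCARD if dot_is_wildcard else {"."}
    return ALPHABET - inner if negated else set(inner)

# PER and later: the dot inside brackets is the full-stop character.
print(sorted(bracket_dot(False, False)))  # ['.']
print(sorted(bracket_dot(True, False)))   # ['\n', '\r', 'a']
# The other reading: the dot inside brackets is the wildcard escape.
print(sorted(bracket_dot(False, True)))   # ['.', 'a']
print(sorted(bracket_dot(True, True)))    # ['\n', '\r']
```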

Changes to definition of \i

After production 37a, D6 changes the definition of \i; LC1 changes it again.

1E through D5:

\i the set of initial name characters, those matched by Letter | '_' | ':'

D6:

\i the set of initial name characters, those matched by NameStartChar

LC1:

\i the set of initial name characters, those matched by NameStartChar in [XML] or by Letter | '_' | ':' in [XML 1.0]

Open issues

Several issues are open against our regular expression language.

Bug 1889 Regex [+-] syntax

Issue 1889 was raised by Michael Kay 15 August 2005. It raises several points.

What characters are allowed as the start- and end-characters of ranges?

The grammar implies that backslash, hyphen, and square brackets are not allowed without escaping, but the prose suggests that only backslash is disallowed at the start of a range, and only backslash and left square bracket are disallowed at the end.

In the prose, does XML character mean a character allowed in XML, or a character that matches the non-terminal XmlChar?

The non-terminal XmlChar excludes backslash, hyphen, and both square brackets (and makes the prose description redundant and confusing, as a correct but incomplete and unnecessary description of part but not all of the governing rule).

Is - a legal character range, and when?

The list after the production group headed Character Range mentions it twice in what appear to be contradictory ways:

- The [, ], - and \ characters are not valid character ranges;

and

- The - character is a valid character range only at the beginning or end of a positive character group.

Bug 2123 R-134: Treatment of ^ in regexes

In bug 2123 (raised 5 April 2002), James Clark observes that the prose statement that The ^ character is only valid at the beginning of a ·positive character group· if it is part of a ·negative character group· seems to contradict the EBNF, which allows expressions like [^X] (which is ambiguous) and [^] (which is not).

He does not make clear whether he suggests changing the prose to say clearly that it is overriding the grammar, or changing the grammar to make the claim in the prose follow from the grammar, e.g. by defining posCharGroup not as

  posCharGroup ::= ( charRange | charClassEsc )+

but (using the notation of XML 1.0) as

  posCharGroup ::= ( (charRange | charClassEsc) - '^') (charRange | charClassEsc)*

or (perhaps more simply) as

  posCharGroup ::= ( charRange | charClassEsc )+ - ('^' Char*)

(Unfortunately, this won't work, because if the posCharGroup is part of a negCharGroup, the caret is allowed.)

Bug 2216 R-224: Questions about metacharacters in regular expressions

Bug 2216 (raised 10 July 2003) points out that:

- The prose says A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ].
- The prose says A normal character is any XML character that is not a metacharacter.
- The EBNF defines Char (now NormalChar, under the heading Normal Character) as [^.\?*+{}()|#x5B#x5D]
- Braces { } are not listed in the prose, but are in the EBNF.
- Vertical bar | is in the EBNF but not mentioned in the prose.

Bug 3260 Use of "-" in regular expressions

Bug 3260, raised 9 May 2006 by Michael Kay on behalf of the XML Query and XSL Working Groups, says (General comment) in regular expressions defining permitted values for data types, the character "-" is sometimes preceded by "\" and sometimes not. The schema regex syntax, I believe, does not permit the "\".

Bug 3659 Bugs in date/time regexes

Bug 3659 reflects email sent 6 September 2006 to the XML Schema comments list by Laurens Holst, saying that the regular expressions given for the date/time types do not match the grammar and offering new ones.

Bug 5431 Normal characters, character references

Bug 5431 was raised 26 January 2008 by Dave Peterson to urge clarification of confusing text about normal characters and (XML) character references.

Bug 5486 Regex: Names of Unicode code blocks

Bug 5486 (raised 16 February 2008 by Michael Kay) points out that the normative reference to Unicode is now to version 4.1, but the block names listed in the text are those of 3.1.

Bug 5321 REs are not production nonterminals

Bug 5321, raised 15 December 2007 by Dave Peterson, notes that the notation used to define our regular expressions appears not to be formally defined or even described anywhere.

Test cases

Some test cases may help illustrate the ways in which the changes listed above do, or do not, result in the grammar accepting different languages.

Manually constructed tests

The following expressions were constructed manually in an attempt to exercise the points of difference in the various grammars. They are not all proposed as legal regular expressions: for most of them, part of the point is that some definitions of the regex language accept them and other definitions do not. A few strings are included which really ought to be accepted by all definitions, or rejected by all, as a kind of sanity check.

Some expressions which use or misuse ^ and $ as anchors, or appear to do so. (If we include the XPath 2.0 regex language, these become more directly relevant.)

  ^a*$
  a+b
  .*a+b.*
  a*^b$a+
  a*\^b\$a+

Either | is a metacharacter or not.

  |
  |+
  |||
  \||\|

If { and } are defined as normal characters, x{4} has two parses, one matching xxxx and one matching, well, x{4}.

  {
  {}
  x{4} (How many parses?)
  x{4,}
  x{,4}

Separating wildcard escape out from the other multi-character escapes has no effect at the top level. So the first few of these should all be accepted, as escapes. But within character classes, wildcard escape is no longer accepted as a positive character group. The last two are still legal, but have a different interpretation.

  \s
  \d
  \W
  .
  [.]
  [.-[abc]]
  [^.]
  [^.-[\r]]

The redefinition of positive character group by draft W has the effect (if I am reading things right) of excluding multi-character escapes from positive character groups.

  [abc]
  [a-cx-z]
  [\d\s]
  [^\d\s]
  [0-9-[135]]
  [\d-[13579]]
  [\p{Lu}-[AEIOU]]
  [\p{IsBasicLatin}-[AEIOU]]
  [\p{Lu}\p{Ll}-[\p{IsCherokee}]]
  [\p{L}-[\p{Lm}]]
  [-]
  [a-kl-z]
  [a-k-z]
  [a\-k-z]
  [a-k\-z]
  [a\-k\-z]

The changes to the charRange production (and others, perhaps) affect the use of hyphen, especially.

  [+--]
  [+-\-]
  [+\--] (redundant, but legal)
  [\--=]
  [--=]
  [-\-=]
  [\-\-=] (redundant, but legal)
  [--?]
  [\--?]
  [-\-?]
  [\-\-?]
  [--\?]
  [\--\?]
  [xyz0-9()*/+-]
  [xyz0-9()*/+\-]
  [^-+*/]
  [^\-+*/]

The changes to the list of blocks are fairly straightforward: some new names are legal, some old names are illegal.

  \p{IsBasicLatin}
  \p{IsCherokee}
  \p{IsCoptic}
  \p{IsGothic}
  \p{IsHighSurrogates}

A schema document which defines a simple type using all of the preceding patterns is in this directory at dummy.xsd. In its current form, it provides no help keeping track of the prescribed interpretation of any of the legal expressions above, but it will do as a way to see whether processors will accept them. (Processors which quit after the first error, of course, won't check all of the pattern elements in dummy.xsd.)

Generation of tests by overgenerating grammar

One simple way to generate test cases is to simplify the grammar to make it less restrictive (which in turn means it will generate not only strings legal against the real grammar, but also strings rejected by the real grammar).

The module maketests.pl implements this approach. In order to avoid a mind-numbing series of single-character regular expressions marching systematically through all of Unicode, followed by an even more mind-numbing series of two-character regular expressions, it defines a very restrictive set of characters in which (for example) a and d stand in for all Unicode characters not used by the regex language in special meanings.

[More detail desirable here. How to use Prolog to generate from the grammar.]

One drawback to this method is that strings are generated in order of increasing length; for any given construct which requires a certain minimum length (e.g. character-class subtraction), it can take a very long time before any strings appear which exercise that part of the grammar.
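The cost of length-ordered generation is easy to quantify. The Python sketch below is not the method of maketests.pl; it takes overgeneration to its extreme, producing every string over a restricted alphabet, shortest first. The seven-symbol ALPHABET is a hypothetical choice made for illustration, with a and d standing in for ordinary characters as described above.

```python
from itertools import count, islice, product

# Hypothetical restricted alphabet: two stand-in characters plus a few
# symbols with special roles in character classes.
ALPHABET = ["a", "d", "-", "[", "]", "^", "\\"]

def candidates(alphabet=ALPHABET):
    """Yield every non-empty string over `alphabet`, shortest first."""
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

def strings_shorter_than(min_length, alphabet=ALPHABET):
    """Count the strings generated before any string of `min_length` appears."""
    return sum(len(alphabet) ** n for n in range(1, min_length))

print(list(islice(candidates(), 3)))   # ['a', 'd', '-']
# A subtraction such as [a-[d]] needs seven characters.
print(strings_shorter_than(7))         # 137256
```

Even with only seven symbols, well over a hundred thousand shorter candidates precede the first seven-character string, which is the drawback described above.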

Random strings and not-quite-random strings

Another way of generating test cases is simply to generate random strings. By calling a random-number generator repeatedly, and using the random numbers to construct a sequence of characters, one can construct strings of arbitrary length, without having to work through the sentences of the language systematically.

A very simple way to generate random strings is to ask for random numbers in the range 32 to 2**16, and choose the corresponding Unicode character for each. This way, however, the random number generator spends a lot of its time generating different Unicode characters that do not have syntactically distinct roles; more interesting test cases are generated if the top of the range is set to 126.

[Need to find code from 2005 that works this way.]
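Pending the original code, the idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the 2005 implementation: the function name random_string and its parameters are hypothetical, and the default upper bound of 126 follows the suggestion above.

```python
import random

def random_string(length, top=126, seed=None):
    """Build a string of `length` characters with code points in 32..top.

    Note: with top=2**16 the result may contain surrogate code points
    (#xD800-#xDFFF), which are not XML characters; filtering them out is
    left to the caller.
    """
    rng = random.Random(seed)  # a seed makes a test run reproducible
    return "".join(chr(rng.randint(32, top)) for _ in range(length))

# Ten candidate "regular expressions" of 12 characters each.
for i in range(10):
    print(repr(random_string(12, seed=i)))
```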

A different approach may yield more interesting test cases more quickly:

1. Create a list of strings, including:
   - each string that appears in the grammar as a literal token (or would, if the grammar were written that way);
   - one or more strings for each syntactically distinct set of characters;
   - optionally, for each non-terminal N in the grammar, one or more strings known to be in L(N);
   - optionally, for each non-terminal N in the grammar, one or more strings known to be similar to items in L(N), but not themselves in L(N) (false friends).
2. Put the strings into a 1..n array.
3. Run the random number generator in the range 1..n and concatenate the resulting strings.

[Implementation needed to see whether this does produce more 'interesting' tests faster.]
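A minimal Python sketch of the procedure follows. It does not settle whether the method produces more interesting tests faster. The TOKENS list is a hypothetical sample chosen for illustration (literal tokens, stand-in characters, a few well-formed fragments and a few false friends), and random.Random.choice takes the place of indexing a 1..n array.

```python
import random

# Hypothetical building blocks, not derived mechanically from any grammar.
TOKENS = [
    "[", "]", "^", "-", "(", ")", "|", "{", "}", ".", "*", "+", "?",  # literal tokens
    "a", "d",                        # stand-ins for ordinary characters
    "\\d", "\\p{Lu}", "[a-d]",       # strings believed to be in L(N) for some N
    "[a-", "-[", "\\p{", "[^]",      # false friends
]

def random_regex(n_tokens, tokens=TOKENS, seed=None):
    """Concatenate `n_tokens` randomly chosen entries of `tokens`."""
    rng = random.Random(seed)  # a seed makes a test run reproducible
    return "".join(rng.choice(tokens) for _ in range(n_tokens))

for i in range(5):
    print(random_regex(6, seed=i))
```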

To do

- Separate utility predicates from parser (make regex.dcg purer).
- Write code to verify AST returned by parser, as a way of implementing ad-hoc non-grammatical rules for ambiguity resolution or special constraints.
- Write code to dump ASTs in XML.
- Write code to invoke parser and AST-validator with appropriate options for a particular grammar.
- Write code to invoke all parsers in sequence and determine whether they agree or not.
- Write code to generate test strings, parse with all grammars, and report.
- Write code to read an XSD schema document and test each //xsd:pattern/@value: parse with all grammars, and report.
- Write code to read an XSD catalog file, and for each schema document listed, call the test-and-report routine.
- Draw dependency graph showing RHS → LHS references in (all variants of the) grammar for character classes.