Applications of Brzozowski derivatives to XML Schema processing

C. M. Sperberg-McQueen

29 June 2005

A paper accepted for presentation at Extreme Markup Languages 2005, Montréal



One of the main advances of SGML and XML over earlier methods of document representation is the idea that user-defined validity conditions for a document type could be formulated explicitly, and that document validity could be assessed automatically. Strategies for understanding, analysing, and implementing those constraints are thus a core problem for the use of markup-language technologies. Many discussions of these topics limit themselves to a brief discussion of content-model notation and its meaning; deeper treatments may examine the equivalence between content models and finite-state automata, possibly showing how to translate content models into automata. Vocabulary designers, users, and implementors would all benefit from more serious attention to other techniques of working with content models, notably including the notion of the Brzozowski derivative.
This paper describes some applications of Brzozowski derivatives to problems which arise in connection with schemas for XML vocabularies. It begins with a brief introduction to Brzozowski derivatives and a description of how to extend them to cover content models in XML Schema 1.0, and then shows how to use derivatives to validate document instances against content models, check the Unique Particle Attribution constraint, handle a weakened form of that constraint, check whether one content model accepts a subset of the language accepted by another, and define either of two competing meanings for extensions of the XML Schema 1.0 all-group.
Software developers can use Brzozowski derivatives to reduce the cost of operations in certain circumstances, or to develop a simple well-understood reference implementation against which to check the behavior of faster but more complex code. Vocabulary developers can use Brzozowski derivatives to analyse and understand different ways of formulating particular content models; in particular, it may be very useful in vocabulary design to be able to check mechanically whether one content model accepts only a subset of what another content model accepts.

1. Introduction

Brzozowski derivatives (in what follows they will often be called just “derivatives”) were introduced by Janusz Brzozowski in 1964 [Brzozowski 1964].
Brzozowski defines derivatives thus:
Given a set R of sequences and some finite sequence s, the derivative of R with respect to s is denoted by DsR and is DsR = {t | st ∈ R}.
In words, the derivative of R with respect to s is the set of strings t which can follow s in sentences of R, or: the set of strings t such that the concatenation of s and t is a sentence in R.
Regular sets of strings can, of course, be denoted by regular expressions, and Brzozowski's contribution was to show how, given (1) a regular expression E denoting the language R and (2) a string s, to calculate a regular expression D denoting the derivative of R with respect to s. He also proved (3) that of all the derivatives of an expression, only a finite number would be distinct from each other in terms of recognizing different languages, and (4) that even if equal expressions are not always detected, there will still be only a finite number of dissimilar derivatives, if certain simple tests of similarity are performed; he then showed (5) how to construct a finite-state automaton from the set of characteristic derivatives thus identified.
Brzozowski derivatives have applications to schemas and can simplify discussions (and implementations) of validation, unique particle attribution, and similar problems. They can also be used to define ‘weakened wildcards’ (see below).

2. Related work

In their work on content models, Anne Brüggemann-Klein and Derick Wood have regularly pointed out the relations between regular expressions, finite-state automata (particularly Glushkov automata), and Brzozowski derivatives [Brüggemann-Klein 1993a], [Brüggemann-Klein 1993b], [Brüggemann-Klein/Wood 1998]. An even earlier mention (with, however, no references to Brzozowski or to any other literature) appears in an unattributed white paper prepared at Software Exoterica [Wilmott 1991], where derivatives are treated as a tool for algebraic manipulations of content models and proofs of content-model equivalence.
The earliest suggestion, however, that Brzozowski derivatives ought to be applied in validation software appears to be due to Joe English [English 1999], who provides a terse summary and urges their obvious utility for handling complex content models. Surprisingly, Joe English's own Haskell-based XML parser, HXML, appears not to use Brzozowski derivatives; not being a validating parser, it has no occasion to use them. But the suggestion was taken up by a number of others: Martin Schmidt adapts English's code fragment and applies Brzozowski derivatives in a validating XML parser written as a Master's thesis project and forming part of the Haskell XML Toolbox [Schmidt 2002]. More prominently, James Clark uses derivatives in his RelaxNG validator, Jing [Clark 2002]. According to at least one report ([Foster 2003]), Brzozowski derivatives are also used in Sun's MSV validator [Kawaguchi 2003]. Bob Foster has described (mostly in blog articles) a number of applications of Brzozowski derivatives, including validation of expressions with integer-range exponents on sub-expressions [Foster 2004].
Both English and Clark credit their acquaintance with the idea to Mark Hopkins [Hopkins 1994], who uses Brzozowski derivatives in a set of regular-expression tools posted to Usenet in 1994, and stresses the advantage of not having to build large finite state automata for short inputs.
Brzozowski derivatives have an obvious importance to those concerned with Relax NG: they help reduce the cost of Relax NG's non-determinism and co-occurrence constraints. But they also have applications to other schema languages, and deserve to be better understood among those working with them, including especially those working with XML Schema (hereinafter XSD).[1]

3. A simple extension of Brzozowski derivatives

This section explains the extension of Brzozowski derivatives to XSD content models. No knowledge of Brzozowski's original formulation is required.

3.1. Notation

Brzozowski worked with a regular-expression notation which differs interestingly both from standard regular expressions and from XSD content model notation. So to apply the notion of derivatives to XSD content models, we need to extend Brzozowski's rules. This is easily done.
For concision, I will use the following notation as a shorthand instead of writing content models in the transfer syntax defined by the XML Schema specification [W3C 2001b]. A content model expression is one of the following:
  • ε (in the XML Schema transfer syntax, this can be written <xsd:sequence/>)
  • ∅ (in the transfer syntax: <xsd:choice/>)
  • an expanded name, in the sense of the XML Namespaces specification [W3C 2004] (in the transfer syntax, <xsd:element ref = "QName"/> or <xsd:element name = "NCName"> ... </xsd:element>). For convenience, we will allow ourselves to write expanded names as qualified names (QNames) whenever the bindings of namespace prefixes are clear. The symbol e will sometimes be used as a dummy name.
  • wc(KW, NS), where KW is in {strict, lax, skip} and NS is either the keyword ##any or a list of namespace names or ε (in the transfer syntax, <xsd:any namespace = "NS" processContents = "KW"/>); I'll use ε to denote the absence of a namespace name; since the empty string can never be used as a namespace name, this leads to no ambiguities; the transfer syntax uses the keyword ##local.
  • wc(KW, not(NS)), with KW as above and NS either a namespace name or ε (in the transfer syntax, <xsd:any namespace = "##other" processContents = "KW"/>).
  • F | G, where F and G are each content-model expressions (in the transfer syntax, <xsd:choice> F G </xsd:choice>).
  • F, G, where F and G are each content-model expressions (in the transfer syntax, <xsd:sequence> F G </xsd:sequence>).
  • F{n, m}, where F is a content-model expression (in the transfer syntax, this is minOccurs = "n" maxOccurs = "m" on the element representing F).
These expressions denote regular sets of sequences of XML elements as follows:
  • ε denotes the set consisting solely of the empty sequence.[2]
  • ∅ denotes the empty set.
  • An expanded name e denotes an element (or, strictly speaking, a sequence of length one whose sole member is an element) whose generic identifier (written as a QName) maps to e.[3]
  • wc(KW, NS) (wc for ‘wildcard’) denotes a sequence whose sole member is an element in namespace NS.
  • wc(KW, not(NS)) denotes a sequence whose sole member is an element not in namespace NS.
  • F | G denotes the union of the set denoted by F and that denoted by G.
  • F, G denotes the set of sequences fg where f ∈ F and g ∈ G.
  • F{n, m} denotes the set of sequences formed by concatenating at least n and at most m occurrences of members of the set denoted by F, where F is a content-model expression, n is a non-negative integer, and m is either a positive integer or the symbol unbounded.
In examples, the notations familiar from SGML and XML content models will occasionally be used: for some expression E, E* ≡ E{0, unbounded}, E+ ≡ E{1, unbounded}, E? ≡ E{0, 1}.
Content models are either content-model expressions or “all”-expressions. An “all”-expression is:
  • F&G, where F and G are each either simple content-model expressions or are themselves “all”-expressions.
A simple content-model expression is ε, ∅, or F{n, 1}, where F is an expanded name and 0 ≤ n ≤ 1. (For convenience, we'll allow ourselves to write just e instead of e{1, 1}, when e is an expanded name.) An “all”-expression has the following denotation:
  • F & G denotes the set of sequences formed by concatenating, in any order, exactly one member of each set denoted by the simple content-model expressions in F and G.
Some examples may help make things clear.
  • The content model expression “a, b, c{1, unbounded}” matches any of the following sequences of elements:
    • a b c
    • a b c c
    • a b c c c
    etc.
  • The content model expression “a, b, (c{1, unbounded} | (d){2, 4})” matches any of the sequences above, as well as any of:
    • a b d d
    • a b d d d
    • a b d d d d
  • The content model expression “a & b{0, 1} & c” matches any of the sequences
    • a c
    • c a
    • a b c
    • a c b
    • b a c
    • b c a
    • c a b
    • c b a
  • The content model expression “ε” matches the empty sequence.
  • The content model expression “∅” matches no sequences at all.
Note that while in XSD the operators | and , (comma) and & are all n-ary (i.e. they take an arbitrary number of arguments), we can treat them without loss of generality as binary operators, all of which obey the associative law. That is, for any content-model expressions E, F, and G, we have
(E | (F | G))
= ((E | F) | G)
= (E | F | G)
and
(E, (F, G))
= ((E, F), G)
= (E, F, G)
and
(E & (F & G))
= ((E & F) & G)
= (E & F & G)
The last set of equations requires some discussion. They will seem foreign to those used to the SGML & operator, which is not associative, but natural to those accustomed to working with the interleave operator of Relax NG and some recent discussions of content models by computer scientists. In SGML, the expression (a & (b & c)) is not at all the same as ((a & b) & c): the former requires an a to appear at the beginning or end of the sequence, while the latter requires c to appear at the beginning or end, so that a c b is accepted by the first expression, but not the second. In RelaxNG, the two expressions are equivalent. In XSD, the issue does not arise, since XSD does not allow all-expressions to nest within all-expressions and so does not need to specify either the associative or the non-associative interpretation.
The & operators of SGML and RelaxNG also differ in another way: as suggested by the term interleave, RelaxNG allows sub-parts of the left- and right-hand expressions to be interleaved. In RelaxNG, for example, the expression (a, b) & (d, e) accepts the sequence a d b e, as well as any other in which both a occurs before b and d occurs before e, and nothing else occurs. In SGML, the expression accepts only the two sequences “a b d e” and “d e a b”. Here, too, XSD restricts the operator in such a way that the XSD operator can be interpreted either way: in XSD, all arguments of the & operator must be simple content-model expressions, so that the question of interleaving sub-parts of the matching strings does not arise.
Let it suffice for now to observe that the validation semantics assigned to all-expressions below have been formulated in such a way that the binary &-operator to which we translate XSD all-expressions does obey the associative law.

3.2. Nullability

It is useful to know, for an expression E, whether ε is in L(E) (i.e. whether the language accepted by E includes the empty string). Following Brüggemann-Klein [Brüggemann-Klein 1993a], I will say that if ε is in L(E), then E is nullable, and define the predicate ν(E) which is true if and only if E is nullable.
The definition of nullable for content-model expressions is:
  • ν(ε) = true
  • ν(∅) = false
  • ν(e) = false for any expanded name e
  • ν(wc(KW, NS)) = false for any value of KW and NS
  • ν(F | G) = ν(F) ∨ ν(G)
  • ν(F, G) = ν(F) ∧ ν(G)
  • ν(F{n, m}) = (n = 0) ∨ ν(F)
  • ν(F & G) = ν(F) ∧ ν(G)
Some algebraic manipulations will become simpler if we define a related function, which following Brzozowski I call δ. For all regular expressions E,
  • δ(E) = ε if and only if L(E) includes ε.
  • Otherwise, δ(E) = ∅
That is, if ν(E) then δ(E) = ε else δ(E) = ∅.
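The clauses above translate almost directly into code. The following Python sketch (an illustration of the definitions in this paper, not any particular implementation) encodes content-model expressions as nested tuples tagged by their first element; the tags, the helper names, and the representation of expanded names as (namespace, local-name) pairs are all conventions adopted here for illustration.

# Content-model expressions as nested tuples, tagged by their first element:
#   ('eps',)             the empty sequence
#   ('empty',)           the empty set
#   ('name', n)          an expanded name n, e.g. a (namespace, localname) pair
#   ('wc', kw, ns)       a wildcard; kw is 'strict', 'lax', or 'skip'; ns is
#                        '##any', a tuple of namespace names (None = absence
#                        of a namespace), or ('not', ns)
#   ('choice', f, g)     f | g
#   ('seq', f, g)        f, g
#   ('rep', f, n, m)     f{n,m}; m is None for 'unbounded'
#   ('all', f, g)        f & g

def nullable(e):
    """nu(E): True iff the empty sequence is in L(E)."""
    tag = e[0]
    if tag == 'eps':
        return True
    if tag in ('empty', 'name', 'wc'):
        return False
    if tag == 'choice':
        return nullable(e[1]) or nullable(e[2])
    if tag in ('seq', 'all'):
        return nullable(e[1]) and nullable(e[2])
    if tag == 'rep':
        return e[2] == 0 or nullable(e[1])
    raise ValueError('unknown expression tag: %r' % (tag,))

def delta(e):
    """delta(E): ('eps',) if E is nullable, otherwise ('empty',)."""
    return ('eps',) if nullable(e) else ('empty',)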

3.3. Construction of derivatives

First we define the derivative of an expression E with respect to a single expanded name x. This we specify to be identical by definition to the derivative with respect to a sequence consisting just of one element whose expanded name is x.
  • Dx ε = ∅
  • Dx ∅ = ∅
  • Dx e = ε if x = e
  • Dx e = ∅ if e is an expanded name, and (x ≠ e)
  • Dx wc(KW, NS) = ε if x is in namespace NS, else ∅
  • Dx wc(KW, not(NS)) = ∅ if x is in namespace NS, else ε
  • Dx (F | G) = (Dx F) | (Dx G)
  • Dx (F, G) = ((Dx F), G) | (Dx G) if F is nullable, else (Dx F), G[4]
  • Dx F{n,m} = (Dx F), F{--n, --m} if F is not nullable or if n = 0 and m = unbounded, else (Dx F), F{--n, --m} | (Dx F{--n, --m}). (See definition of -- operator below.)
  • Dx (F & G) = ((Dx F) & G) | (F & (Dx G))
The -- operator decrements, with two special cases: --k = k - 1 for integer k > 0, but --0 = 0, and --unbounded = unbounded.[5]
To calculate the derivative of E with respect to a sequence x1, x2, x3, ..., xn, we first take the derivative with respect to x1, then the derivative of that expression with respect to x2, and so on until we are at the end of the sequence. Algebraically,
Ds E = D<x1, x2, ... xn> E = Dxn (... (Dx3 (Dx2 (Dx1 E))) ...)
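Continuing the Python sketch begun in section 3.2 (and reusing its tuple encoding and the nullable predicate defined there), the rules above can be transcribed as follows. No simplification is applied here, since that is the business of section 3.4; the wildcard test assumes the (namespace, local-name) representation of expanded names adopted in the earlier sketch.

def matches_wildcard(x, ns_spec):
    """True iff the expanded name x = (namespace, localname) matches the
    wildcard's namespace specification: '##any', a tuple of namespace
    names (None = no namespace), or ('not', ns)."""
    xns = x[0]
    if ns_spec == '##any':
        return True
    if isinstance(ns_spec, tuple) and ns_spec and ns_spec[0] == 'not':
        return xns != ns_spec[1]
    return xns in ns_spec

def decrement(n):
    """The -- operator: --0 = 0, --unbounded = unbounded, otherwise n - 1."""
    if n is None or n == 0:          # None stands for 'unbounded'
        return n
    return n - 1

def deriv(x, e):
    """D_x E for a single expanded name x."""
    tag = e[0]
    if tag in ('eps', 'empty'):
        return ('empty',)
    if tag == 'name':
        return ('eps',) if x == e[1] else ('empty',)
    if tag == 'wc':
        return ('eps',) if matches_wildcard(x, e[2]) else ('empty',)
    if tag == 'choice':
        return ('choice', deriv(x, e[1]), deriv(x, e[2]))
    if tag == 'seq':
        f, g = e[1], e[2]
        if nullable(f):
            return ('choice', ('seq', deriv(x, f), g), deriv(x, g))
        return ('seq', deriv(x, f), g)
    if tag == 'rep':
        f, n, m = e[1], e[2], e[3]
        if m == 0:
            # degenerate case F{0,0} (maxOccurs = 0): L(F{0,0}) = {eps},
            # so the derivative is the empty set; the rules in the text
            # implicitly assume m >= 1
            return ('empty',)
        rest = ('rep', f, decrement(n), decrement(m))
        if not nullable(f) or (n == 0 and m is None):
            return ('seq', deriv(x, f), rest)
        return ('choice', ('seq', deriv(x, f), rest), deriv(x, rest))
    if tag == 'all':
        return ('choice', ('all', deriv(x, e[1]), e[2]),
                          ('all', e[1], deriv(x, e[2])))
    raise ValueError('unknown expression tag: %r' % (tag,))

def deriv_seq(names, e):
    """Derivative with respect to a sequence of expanded names."""
    for x in names:
        e = deriv(x, e)
    return e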

3.4. Simplification of expressions

The following identities are useful. For any expressions E, F, and G,
  • E | E = E
  • E | F = F | E
  • (E | F) | G = E | (F | G)
  • ε, E = E, ε = E
  • ε & E = E & ε = E
  • ∅ | E = E | ∅ = E
  • ∅, E = E, ∅ = ∅
  • ∅ & E = E & ∅ = ∅
In what follows, we'll (silently) make use of these identities to simplify expressions generated by taking derivatives.
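In the Python sketch, a rudimentary simplifier that applies just these identities bottom-up might look as follows; it makes no attempt at full equivalence testing, which is all the later sections require of it.

def simplify(e):
    """Apply the identities listed above, bottom-up."""
    tag = e[0]
    if tag in ('eps', 'empty', 'name', 'wc'):
        return e
    if tag == 'rep':
        return ('rep', simplify(e[1]), e[2], e[3])
    f, g = simplify(e[1]), simplify(e[2])
    if tag == 'choice':
        if f == ('empty',):
            return g
        if g == ('empty',):
            return f
        if f == g:                               # E | E = E
            return f
        return ('choice', f, g)
    if tag == 'seq':
        if f == ('empty',) or g == ('empty',):   # the empty set annihilates ','
            return ('empty',)
        if f == ('eps',):
            return g
        if g == ('eps',):
            return f
        return ('seq', f, g)
    if tag == 'all':
        if f == ('empty',) or g == ('empty',):   # and '&'
            return ('empty',)
        if f == ('eps',):
            return g
        if g == ('eps',):
            return f
        return ('all', f, g)
    raise ValueError('unknown expression tag: %r' % (tag,))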

3.5. Characteristic derivatives

Brzozowski proves that every regular expression is guaranteed to have a finite number of distinct (i.e. non-equivalent) derivatives. The set of distinct derivatives of E with respect to the strings in Σ* (for some input alphabet Σ) is called the characteristic derivatives of E.
Brzozowski also defines a test of similarity for regular expressions: expressions are similar if they can be transformed into each other using only the first three identities listed in section 3.4. Since in practice the detection of equivalence may be expensive, it is encouraging to note that Brzozowski also proves that every regular expression is guaranteed to have a finite number of dissimilar derivatives. The consequence is that even if we fail to detect all equivalences among the derivatives of an expression, we can still avoid having the number of characteristic derivatives grow without bound.
One complication should be addressed here: like all standard work on regular languages, Brzozowski assumes that the input alphabet Σ is finite. Content models, however, are regular expressions over an infinite set, namely XML element types (or element type names). The following discussion silently makes the usual adjustment: for content models without wildcards, it suffices to assume an alphabet including all the names used in the content model, plus one name which matches nothing in the content model. Since in the absence of wildcards all non-matching names behave the same way, the single non-matching name successfully represents the infinite set of possible inputs which match nothing in the content model. When wildcards are present, it suffices to include one otherwise unknown name in each namespace mentioned in any wildcard, plus one name in a namespace unmentioned in any wildcard. In this way we can work with finite input alphabets which correctly stand in for the infinite input alphabet provided by XML.

4. Applications to XML Schema processing

This section identifies some important tasks in XSD processing and shows how Brzozowski derivatives provide convenient and clear ways to formulate and perform them.

4.1. Validation

We can use derivatives to validate without ever constructing anything recognizable as an automaton: for content model E and sequence s, E accepts s if and only if ν(Ds E).[6]
For strongly deterministic expressions, the time it takes to calculate the derivative of E with respect to a single input symbol is at most linear in the size of E, and the derivative is never larger than E.[7] This means that overall, validation of a sequence will be at worst linear in the size of the sequence. (Note that formal proofs of these claims remain a task for future work.)
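In terms of the Python sketch of section 3, the whole validation procedure is a few lines: take the derivative symbol by symbol, simplify after each step, and test the result for nullability. The content model used in the usage example is the "a, b, c{1, unbounded}" example of section 3.1; the element names are placeholders.

def accepts(e, names):
    """True iff the content model e accepts the given sequence of
    expanded names, i.e. nu(D_s E)."""
    for x in names:
        e = simplify(deriv(x, e))
        if e == ('empty',):          # no continuation is possible; fail early
            return False
    return nullable(e)

# Example: "a, b, c{1, unbounded}" (names in no namespace, for illustration)
A, B, C = (None, 'a'), (None, 'b'), (None, 'c')
model = ('seq', ('name', A), ('seq', ('name', B), ('rep', ('name', C), 1, None)))
print(accepts(model, [A, B, C, C]))   # True
print(accepts(model, [A, B]))         # False: at least one c is required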
For example, let us consider the strongly deterministic content model E =
( (script | style | meta)*,
  ( (title, (script | style | meta)*, (base, (script | style | meta)*)?)
    |
    (base, (script | style | meta)*, (title, (script | style | meta)*))
  )
)
which is a slight simplification of the definition of head in XHTML. If we seek to validate the sequence <meta, title, style> against this content model, we proceed as follows:
  1. Find Dmeta E.
    This proves to be E again (which is intuitively plausible: any number of meta elements are allowed at the beginning of the sequence).[8]
  2. Now find Dtitle E.
    This proves to be ((script | style | meta)*, (base, (script | style | meta)*)?).[9]
  3. Now find Dstyle ((script | style | meta)*, (base, (script | style | meta)*)?).
    This is (base, (script | style | meta)*)?[10]
    Since ν((base, (script | style | meta)*)?) is true, the sequence is valid against the content model.
For non-strongly-deterministic expressions, the derivative may be larger than the original expression and will take the form of a disjunction. If the expression is weakly deterministic, then the number of disjuncts will correspond to the number of states the (non-strongly-deterministic) Glushkov automaton could be in at the given point in the parse.
For example, consider the content model E = (e{1,5}, b{0,1}){1,5}. A straightforward calculation of the derivatives shows how the second and later occurrences of “e” in an input string cause more and more growth in the size of the derivative:
E =
(e{1,5}, b{0,1}){1,5}
De E =
e{0,4}, b?, (e{1,5}, b?){0,4}
Dee E =
  e{0,3}, b?, (e{1,5}, b?){0,4} 
| e{0,4}, b?, (e{1,5}, b?){0,3}
Deee E =
  e{0,2}, b?, (e{1,5}, b?){0,4}
| e{0,4}, b?, (e{1,5}, b?){0,3}
| e{0,3}, b?, (e{1,5}, b?){0,3} 
| e{0,4}, b?, (e{1,5}, b?){0,2}
Deeeb E =
  (e{1,5}, b?){0,4} 
| (e{1,5}, b?){0,3} 
| (e{1,5}, b?){0,3} 
| (e{1,5}, b?){0,2}
Deeee E =
  e?, b?, (e{1,5}, b?){0,4} 
| e{0,4}, b?, (e{1,5}, b?){0,3} 
| e{0,3}, b?, (e{1,5}, b?){0,3} 
| e{0,4}, b?, (e{1,5}, b?){0,2} 
| e{0,2}, b?, (e{1,5}, b?){0,3} 
| e{0,4}, b?, (e{1,5}, b?){0,2} 
| e{0,3}, b?, (e{1,5}, b?){0,2} 
| e{0,4}, b?, (e{1,5}, b?)?

4.2. Unique Particle Attribution

SGML DTDs, XML 1.0 DTDs, and XSD 1.0 all share a rule which restricts the forms of regular expressions legal in content models, in order to ensure that they do not require unbounded lookahead. ISO 8879 requires that content models be “unambiguous”. The sense assigned to ambiguity is somewhat broader than the sense usual in formal language study: the expression a?, a is ambiguous in the SGML sense, because when a parser is confronted with a sequence beginning with an a, lookahead is required to know whether it matches the first a in the expression, or the second. It is not ambiguous in the usual formal sense: for any word in the language, there is only one parse tree. In part to avoid the confusion caused by this unusual usage of the term ambiguity, the XML 1.0 spec requires instead that content models be deterministic. Determinism, however, may be defined in either of two ways, weak or strong. To avoid the confusion caused by the two flavors of determinism, XSD 1.0, in turn, imposes what it calls the Unique Particle Attribution constraint. All three of these names denote essentially the same requirement: that it be possible, without lookahead in the document instance, to know which token (or particle) in a content model is matched by a token in the document instance, and that there be at most one such token. This is what Brüggemann-Klein calls weak determinism, as opposed to strong determinism, which additionally requires that every repetition of a token be licensed by at most one occurrence indicator in the expression.[11]
Previous formulations of this constraint have either provided an algorithm translating expressions into automata, required that this translation be performed, and appealed to properties of the resulting automata (so Brüggemann-Klein and Wood in [Brüggemann-Klein 1993a] and [Brüggemann-Klein/Wood 1998]), or else they have refrained from requiring such a translation, but then found themselves unable to specify precisely what the constraint is, relying instead on nudges and winks and hopes that the reader understands what is meant (so ISO 8879 [ISO 1986] and the XML specification [W3C 2004]).[12] Using Brzozowski derivatives, however, the UPA constraint of XSD 1.0 can be specified constructively and precisely, without reference to the corresponding automata.[13] Two approaches offer themselves.

4.2.1. Testing initial determinism

4.2.1.1. Definition
The first idea is this: we define a property which holds of some content-model expressions: determinism with respect to initial x, where x is an input signal (for us: an expanded name). The UPA constraint can now be expressed thus:
  • The original content model E must be x-deterministic for all x.
  • For any sequence of input signals s, the derivative of E with respect to s must be x-deterministic for all x in the input alphabet.
Since taking the derivative of E with respect to every sequence s of input signals will eventually generate all of the characteristic derivatives of E, the second item can also be stated thus:
  • For all symbols x in the input alphabet, and for all characteristic derivatives F of the original content model, F must be deterministic with respect to initial x.
We can now define the function idet(x, E), which is true if and only if E is deterministic with respect to initial x.
  • idet(x, ε) = true
  • idet(x, ∅) = true
  • idet(x, e) = true
  • idet(x, wc(KW,NS)) = true for any value of KW and NS
  • idet(x, wc(KW,not(NS))) = true for any value of KW and NS
  • idet(x, (F | G)) = ((DxF = ∅ ∧ idet(x,G)) ∨ (DxG = ∅ ∧ idet(x,F)))
  • idet(x, (F, G)) = if ν(F), then ((DxF = ∅ ∧ idet(x, G)) ∨ (DxG = ∅ ∧ idet(x, F))), else (if F is not nullable) idet(x, F)
  • idet(x, (F & G)) = ((DxF = ∅ ∧ idet(x, G)) ∨ (DxG = ∅ ∧ idet(x, F)))
  • idet(x, F{n,m}) = idet(x, F)
It is useful to note that for all expressions E, if DxE = ∅, then idet(x,E) = true.
This approach was anticipated in part by Sam Wilmott [Wilmott 1991], but the treatment of repetition and of sequences differs there.[14]
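A Python sketch of idet follows, reusing deriv, simplify, and nullable from section 3; the test "DxF = ∅" is rendered here by simplifying the derivative and comparing it with the empty-set expression, which is sufficient given the identities of section 3.4. The full UPA check applies idet to every characteristic derivative of the content model and every symbol of a representative alphabet; one way of generating the characteristic derivatives is sketched in section 4.4.3 below.

def deriv_is_empty(x, e):
    """The test D_x E = empty-set, rendered as: the simplified derivative
    is the empty-set expression."""
    return simplify(deriv(x, e)) == ('empty',)

def idet(x, e):
    """True iff e is deterministic with respect to initial x."""
    tag = e[0]
    if tag in ('eps', 'empty', 'name', 'wc'):
        return True
    if tag in ('choice', 'all'):
        f, g = e[1], e[2]
        return ((deriv_is_empty(x, f) and idet(x, g)) or
                (deriv_is_empty(x, g) and idet(x, f)))
    if tag == 'seq':
        f, g = e[1], e[2]
        if nullable(f):
            return ((deriv_is_empty(x, f) and idet(x, g)) or
                    (deriv_is_empty(x, g) and idet(x, f)))
        return idet(x, f)
    if tag == 'rep':
        return idet(x, e[1])
    raise ValueError('unknown expression tag: %r' % (tag,))

Applied to the first example of the next section, idet reports a{2,4}, a and its derivative with respect to a as deterministic for initial a, but not the derivative with respect to aa; the calculations below show the same thing by hand.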
4.2.1.2. Examples
A few simple examples may illustrate the method. First, let E be a{2,4}, a. If we test the first few derivatives of E for initial determinism with respect to a, we have:
  • DεE = (a{2,4}, a)
    • idet(a, (a{2,4},a))
      = idet(a, a{2,4})[15]
      = true.
  • DaE = (a{1,3}, a)
    • idet(a, (a{1,3},a))
      = idet(a, a{1,3})
      = true
  • DaaE = (a{0,2}, a)
    • idet(a, (a{0,2}, a))
      = ((Da a{0,2} = ∅ ∧ idet(a, a)) ∨ (Da a = ∅ ∧ idet(a, a{0,2})))
      = ((false ∧ true) ∨ (false ∧ true))
      = false
At this point it is unnecessary to consider other derivatives of E: since idet(a, DaaE) is false, we know that E does not obey the UPA constraint. This is a simple instance of the general rule: if a choice between F and G would be non-deterministic, then the sequence (F{n,m}, G) is non-deterministic for n < m.
When n = m, however, the situation is different, as shown by a second example. Let E be a{2,2}, a. There are five characteristic derivatives of E (for any alphabet containing a and at least one other symbol, here represented by b). We have:
  • DεE = a{2,2}, a
    • idet(a, (a{2,2},a)) = idet(a, a{2,2}) = true.
    • idet(b, (a{2,2},a)) = idet(b, a{2,2}) = true.
  • DaE = a{1,1}, a
    • idet(a, (a{1,1},a)) = idet(a, a{1,1}) = true
    • idet(b, (a{1,1},a)) = idet(b, a{1,1}) = true
  • DaaE = a
    • idet(a, a) = true
    • idet(b, a) = true
  • DaaaE = ε
    • idet(a, ε) = true
    • idet(b, ε) = true
  • DbE = DabE = DaabE = ∅
    • idet(a, ∅) = true
    • idet(b, ∅) = true
Since idet(x,F) is true for all derivatives F of E and all symbols x in the alphabet, we can conclude that E obeys the UPA constraint.
As a final example, let E = (a, a?){2,4}. We have:
  • DεE = (a, a?){2,4}
    • idet(a, (a, a?){2,4}) = idet(a, (a, a?)) = idet(a, a) = true.
    • idet(b, (a, a?){2,4}) = idet(b, (a, a?)) = idet(b, a) = true.
  • DaE = a?, (a, a?){1,3}
    • idet(a, a?, (a, a?){1,3}) = ((Daa? = ∅ ∧ idet(a, (a, a?){1,3})) ∨ (Da(a, a?){1,3} = ∅ ∧ idet(a, a?))) = ((false ∧ true) ∨ (false ∧ true)) = false
No further examination is necessary to know that E is non-deterministic and thus violates the UPA constraint.
4.2.1.3. Reducing the cost of the calculation
In some cases, the work needed to test an expression for initial determinism can be reduced. This can be important for large maximum numbers of repetitions, especially when expressions with repetition operators nest. The expression (e{0,1000}){0,1000}, for example, has 1,000,002 (1000 × 1000 + 2) distinct derivatives.[16]
The key to reducing the cost of the calculation is noticing that for 1 < n < m and any expression E, the expression E{n, m} is deterministic for initial x if and only if E{--n, --m} is deterministic for initial x. This means that n can be reduced to 1, and m to 2, without changing the result of the determinism calculation (but saving some arbitrarily large number of derivatives in the process). When the minimum and maximum numbers of repetitions are the same, then they can be reduced to 2 but not to 1. That is, if n = m > 2, then E{n, m} is deterministic for initial x if and only if E{--n, --m} is deterministic for initial x.
More formally: every subexpression E of the form F{n,m} can be replaced by another expression E′, constructed so that E obeys the UPA constraint if and only if E′ does, but E′ has fewer characteristic derivatives. The value of E′ is:
  • If E = ε or ∅ or an expanded name or wc(KW,NS) or wc(KW,not(NS)), then E′ = E.
  • If E = F | G, then E′ = F′ | G′.
  • If E = F, G, then E′ = F′, G′.
  • If E = F{n, m}, then E′ = F′{j,k}, where
    • F′ is the recursive simplification of F.
    • If n = m > 1, then j = k = 2.
    • Else j = min(1,n), k = min(2,m), where for all natural numbers n, min(n, unbounded) = n
It's intuitively clear that for any expression E we can construct the E′ described. Informally, it's not hard to see that E is weakly deterministic if and only if E′ is: in expressions of the form F{n,m}, there are two possible sources of non-determinism other than non-determinism within F itself, each unaffected by the reduction of n to 1 or 2 and m to 2:
  • If n < m, then some derivative of E will have the form F{0,j} (with j = m - n). At that point, there may be weak non-determinism between the start set of F and the start set of whatever expressions can follow E. The magnitude of the difference between n and m is immaterial, so m can be reduced to n + 1 without affecting the determinism of the expression.
  • If m > 1, then F{n, m} is equivalent, for purposes of language recognition and determinism, to F, F{--n,--m} and there may be weak non-determinism between the start set of F and optional tokens at the end of F.[17] It does not matter whether F can be repeated a hundred times or only repeated once: if it can be repeated at all, this non-determinism must be checked for, and m can be reduced from 100, or from unbounded, to 2 without affecting the determinism of the expression. At the same time, n must be reduced by a commensurate amount: if n = m, both are reduced to 2, while if n < m, n can be reduced to 1.
It would be desirable to have a more formal proof that the reduction in exponents (a) reduces the number of distinct derivatives and (b) preserves strong and weak determinism / non-determinism. Such a formal proof remains a task for future work.
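A Python sketch of the E to E′ rewriting just described, over the tuple encoding of section 3.2; it is illustrative only, and (as just noted) the claim that the rewriting preserves the result of the determinism test is not formally proved here.

def reduce_exponents(e):
    """Shrink {n,m} exponents to {2,2} when n = m > 1, and otherwise to at
    most {1,2}, recursively, without changing the UPA status of e."""
    tag = e[0]
    if tag in ('eps', 'empty', 'name', 'wc'):
        return e
    if tag in ('choice', 'seq', 'all'):
        return (tag, reduce_exponents(e[1]), reduce_exponents(e[2]))
    if tag == 'rep':
        f, n, m = reduce_exponents(e[1]), e[2], e[3]
        if m is not None and n == m and n > 1:
            return ('rep', f, 2, 2)
        j = min(1, n)
        k = 2 if m is None else min(2, m)    # min(2, unbounded) = 2
        return ('rep', f, j, k)
    raise ValueError('unknown expression tag: %r' % (tag,))

# (e{0,1000}){0,1000}, with its million-odd derivatives, becomes (e{0,2}){0,2}
E = ('rep', ('rep', ('name', (None, 'e')), 0, 1000), 0, 1000)
print(reduce_exponents(E))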

4.2.2. Counting initial matches

A second approach is to define a function c which counts the number of tokens in the expression which can match an initial symbol x:
  • c(x, ε) = 0
  • c(x, ∅) = 0
  • c(x, e) = 1 if x = e, else 0
  • c(x, wc(KW,NS)) = 1 if x matches NS, 0 otherwise
  • c(x, wc(KW,not(NS))) = 0 if x matches NS, 1 otherwise
  • c(x, (F | G)) = c(x,F) + c(x,G)
  • c(x, (F, G)) = c(x,F) + c(x,G) if F is nullable, else c(x,F)
  • c(x, (F & G)) = c(x,F) + c(x,G)
  • c(x, F{n,m}) = c(x, F)
The XSD Unique Particle Attribution constraint, the XML 1.0 determinism rule, and the SGML ambiguity rule then all amount to this:
  • For all symbols x in the input alphabet, and for all characteristic derivatives F of the original content model, c(x, F) ≤ 1.
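A Python sketch of the counting function, reusing nullable and the wildcard matcher from section 3; as with idet, the full check applies it to every characteristic derivative and every symbol of the representative input alphabet.

def count_initial(x, e):
    """c(x, E): the number of particles in E that can match an initial x."""
    tag = e[0]
    if tag in ('eps', 'empty'):
        return 0
    if tag == 'name':
        return 1 if x == e[1] else 0
    if tag == 'wc':
        return 1 if matches_wildcard(x, e[2]) else 0
    if tag in ('choice', 'all'):
        return count_initial(x, e[1]) + count_initial(x, e[2])
    if tag == 'seq':
        n = count_initial(x, e[1])
        if nullable(e[1]):
            n += count_initial(x, e[2])
        return n
    if tag == 'rep':
        return count_initial(x, e[1])
    raise ValueError('unknown expression tag: %r' % (tag,))

On the derivatives worked through below it yields 1, 1, and 2 matching particles respectively, agreeing with the hand calculation.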
To illustrate, let us repeat the first example from the previous section: let E be a{2,4}, a. For the first few derivatives of E, we have:
  • DεE = (a{2,4}, a)
    • c(a, (a{2,4},a))
      = c(a, a{2,4})
      = 1.
  • DaE = (a{1,3}, a)
    • c(a, (a{1,3},a))
      = c(a, a{1,3})
      = 1
  • DaaE = (a{0,2}, a)
    • c(a, (a{0,2}, a))
      = c(a, a{0,2}) + c(a, a)
      = 1 + 1
      = 2
As before, the derivative of E with respect to aa decides the question: c(a, DaaE) > 1, and so E violates the UPA constraint.

4.3. Weakened wildcards

In discussions of XSD and the Unique Particle Attribution constraint, various alternative validation behaviors have occasionally been proposed; one of the most popular ideas is ‘weakened wildcards’. This is a weakened form of the Unique Particle Attribution constraint, which allows content models in which both an expanded name and a wildcard can match an input element, with the proviso that in cases of such competition, it is the expanded name, not the wildcard, which is used to match the input element. Competition between element tokens or between wildcards is still prohibited.
Either of the two approaches to the UPA constraint described above can be extended to handle weakened wildcards.
The idet predicate can be specialized in the obvious way into two predicates, edet (which is true just in case there is at most one element particle in the content model which can match the input symbol) and wdet (true if there is at most one matching wildcard). The weakened form of the constraint is then:
  • For all symbols x in the input alphabet, and for all characteristic derivatives F of the original content model, edet(x,F) and wdet(x,F) must both be true.
The c function can similarly be specialized to make functions that count just element tokens (ecount) or just wildcard tokens (wcount) which can match an initial x. The weak wildcard rule then says that for every member F of the characteristic derivatives of E and every symbol x in the input alphabet, it must be the case both that ecount(x,F) ≤ 1 and that wcount(x,F) ≤ 1.
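In the counting formulation, the specialization is a one-parameter variation on the c function of section 4.2.2, sketched below over the same tuple encoding; ecount corresponds to kind = 'name' and wcount to kind = 'wc'.

def count_initial_of_kind(x, e, kind):
    """Count only particles of the given kind ('name' or 'wc') that can
    match an initial x; otherwise identical to c(x, E)."""
    tag = e[0]
    if tag in ('eps', 'empty'):
        return 0
    if tag == 'name':
        return 1 if kind == 'name' and x == e[1] else 0
    if tag == 'wc':
        return 1 if kind == 'wc' and matches_wildcard(x, e[2]) else 0
    if tag in ('choice', 'all'):
        return (count_initial_of_kind(x, e[1], kind) +
                count_initial_of_kind(x, e[2], kind))
    if tag == 'seq':
        n = count_initial_of_kind(x, e[1], kind)
        if nullable(e[1]):
            n += count_initial_of_kind(x, e[2], kind)
        return n
    if tag == 'rep':
        return count_initial_of_kind(x, e[1], kind)
    raise ValueError('unknown expression tag: %r' % (tag,))

def weak_wildcard_ok(x, e):
    """The weakened constraint, for one symbol and one derivative."""
    return (count_initial_of_kind(x, e, 'name') <= 1 and
            count_initial_of_kind(x, e, 'wc') <= 1)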

4.4. Checking subsumption of content models

This topic has already been treated, using different formalisms, by Henry Thompson and Matthew Fuchs [Thompson/Tobin 2003], [Fuchs/Brown 2003]. Brzozowski derivatives provide a different way of considering the problem, which seems in some respects more convenient and concise.
Section 4.4.1 below extends the notation described above in section 3.1, adding (as Brzozowski does) the operators 'and' and 'not' to the expression language and showing how to take their derivatives. They are redundant and do not extend the expressive power of the language, but they are convenient to have when considering subsumption.
Section 4.4.2 expresses the subsumption test in several ways using derivatives, notably:
B subsumes R if and only if for all strings s, δ(Ds (R and ~B)) = ∅
(N.B. this definition of subsumption corresponds to shallow local validation, not to deep validation.)
Section 4.4.3 describes a relatively simple way to test subsumption: find the set of characteristic derivatives of (R and ~B) and test each one for nullability. Some speculations about complexity are also included.
Section 4.4.4 works in detail through some examples.
Section 4.4.5 is a series of proofs showing that the various formulations of subsumption offered in section 4.4.2 are in fact equivalent.

4.4.1. Extending the expression notation

It will be convenient, before we proceed, to extend the expression language defined earlier. In addition to the expressions defined above, the following are also legal expressions:
  • F and G, where F and G are each content-model expressions or “all”-expressions
  • ~F, where F is either a content-model expression or an “all”-expression
We interpret these expressions as follows:
  • F and G denotes the set of sequences which are both in F and in G. Note that this is not the same as F & G — this is a straightforward set intersection operation.
  • ~F denotes the set of sequences not in F
We need to define how to take derivatives with respect to expressions of these shapes. Brzozowski proves that for all strings s,
Ds (F and G) = (Ds F) and (Ds G)
Ds (~F) = ~(Ds F)
We'll use these below.
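In the Python sketch, the two operators can be represented by the additional tags ('and', f, g) and ('not', f); the clauses below are meant to be merged into the case analyses of nullable, deriv, and simplify from section 3 (placed before the generic binary decomposition in simplify). The nullability clauses follow directly from the set interpretation (ε is in an intersection iff it is in both operands, and in a complement iff it is not in the operand); the ∅-collapsing rule for 'and' is the obvious counterpart of the identities in section 3.4, since intersection with the empty set is empty.

# additional clauses for nullable(e):
    if tag == 'and':
        return nullable(e[1]) and nullable(e[2])
    if tag == 'not':
        return not nullable(e[1])

# additional clauses for deriv(x, e), transcribing the two rules above:
    if tag == 'and':
        return ('and', deriv(x, e[1]), deriv(x, e[2]))
    if tag == 'not':
        return ('not', deriv(x, e[1]))

# additional clauses for simplify(e): recurse into 'not', and let the empty
# set annihilate 'and'
    if tag == 'not':
        return ('not', simplify(e[1]))
    if tag == 'and':
        f, g = simplify(e[1]), simplify(e[2])
        if f == ('empty',) or g == ('empty',):
            return ('empty',)
        return ('and', f, g)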

4.4.2. Expressing the subsumption test

It is a well known result that for two regular expressions B and R, it is readily decidable whether B subsumes R, i.e. whether the language accepted by R is or is not a subset of that accepted by B. Since content models are so similar to regular expressions, it has seemed plausible that the same relation is decidable for content models. In the worst case, it is always possible to translate the two content models first into regular expressions and then into finite-state automata, and to exploit the standard algorithms. The presence of wildcards, the all connector, and repetition operators other than the Kleene star, however, has complicated the story somewhat. Uncertainty about the possible effect of these complications contributed to the decision to define complex type restriction not in terms of the subset relation between languages accepted by two types, but instead in terms of a complex set of constructive rules for checking content models of substantially similar structure. It is thus particularly significant that Brzozowski derivatives offer a way to define and check the subsumption relation between two content models.
To test whether expression B subsumes expression R, we want to test any one of the following propositions:
  1. for all strings s, s is in L(R) only if s is in L(B)
  2. for all strings s, δ(Ds R) = ε only if δ(Ds B) = ε
  3. the expression (R and ~B) denotes the empty set
  4. for all strings s, δ(Ds (R and ~B)) = ∅
A proof that these expressions are equivalent is given in section 4.4.5 below.

4.4.3. Calculating subsumption

The simplest way I can find to define the subsumption test using derivatives is this. We wish to establish that for all strings s over the input alphabet, δ(Ds (R and ~B)) = ∅.
Brzozowski proves that any expression E has "a finite number of types of derivatives", where two expressions E and F are of the same type if and only if L(E) = L(F), i.e. the two expressions are equivalent. The n different types (equivalence classes) of derivatives of E are called the characteristic derivatives of E. Brzozowski defines a straightforward way to generate all the characteristic derivatives of E, which involves examining increasingly long strings over the input alphabet until a fixed point is reached. The algorithm runs something like this:
We'll use variables n (current string length), X (the set of characteristic derivatives found so far), and f (a Boolean flag).
  1. Start with n = 0, X = ∅, f = false.
  2. For each s in the set of all strings of length n,
    1. Let F = Ds E.
    2. If F is equivalent to any member of X, do nothing.
    3. Otherwise, let X = X ∪ {F} and let f = true.
  3. If f = true, then reset f to false, increment n, and go to step 2.
  4. Otherwise, f = false, and X contains the set of characteristic derivatives.
The finiteness of the set of characteristic derivatives gives us one simple way to test proposition (4):
  1. Find the set of characteristic derivatives of (R and ~B).
  2. For each characteristic derivative, test whether it is nullable or not.
If no characteristic derivative of (R and ~B) is nullable, then B subsumes R.
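A Python sketch of both steps follows, built on deriv, simplify, and nullable from the earlier sketches (with the 'and'/'not' clauses of section 4.4.1). Rather than literally enumerating all strings of each length, it explores derivatives breadth-first, one input symbol at a time, which visits the same expressions; and rather than testing full equivalence it identifies expressions by a similarity key that applies the three choice identities (idempotence, commutativity, associativity), which by the result Brzozowski proves for dissimilar derivatives still leaves only a finite set to visit. The alphabet parameter is the finite representative alphabet discussed in section 3.5.

from collections import deque

def similarity_key(e):
    """A canonical key identifying expressions up to the identities
    E|E=E, E|F=F|E, and (E|F)|G=E|(F|G): nested choices are flattened,
    de-duplicated, and sorted."""
    tag = e[0]
    if tag == 'choice':
        parts = set()
        for child in (e[1], e[2]):
            k = similarity_key(child)
            if k[0] == 'choice*':
                parts.update(k[1])
            else:
                parts.add(k)
        return ('choice*', tuple(sorted(parts, key=repr)))
    if tag in ('seq', 'all', 'and'):
        return (tag, similarity_key(e[1]), similarity_key(e[2]))
    if tag == 'not':
        return ('not', similarity_key(e[1]))
    if tag == 'rep':
        return ('rep', similarity_key(e[1]), e[2], e[3])
    return e                         # eps, empty, name, wc

def characteristic_derivatives(e, alphabet):
    """One representative of each similarity class of derivatives of e
    with respect to strings over the given (finite, representative)
    alphabet."""
    e = simplify(e)
    seen = {similarity_key(e): e}
    queue = deque([e])
    while queue:
        current = queue.popleft()
        for x in alphabet:
            d = simplify(deriv(x, current))
            k = similarity_key(d)
            if k not in seen:
                seen[k] = d
                queue.append(d)
    return list(seen.values())

def subsumes(b, r, alphabet):
    """True iff B subsumes R, i.e. no characteristic derivative of
    (R and ~B) is nullable."""
    test = ('and', r, ('not', b))
    return not any(nullable(d) for d in characteristic_derivatives(test, alphabet))

# The example worked in section 4.4.4.3 below: B = (a & b) subsumes R = (a, b)
A, B = (None, 'a'), (None, 'b')
print(subsumes(('all', ('name', A), ('name', B)),
               ('seq', ('name', A), ('name', B)),
               [A, B]))              # True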
There may be some way to do this by analysing R and B separately, and performing some calculation on their two sets of characteristic derivatives, but in the general case this doesn't seem to simplify things; in the worst case, R has n characteristic derivatives and B has m, and we have to check n × m combinations. This is exactly what taking the characteristic derivatives of (R and ~B) does: since for any string s we have
Ds (R and ~B) = (Ds R) and (Ds ~B) = (Ds R) and ~(Ds B)
we end up calculating the derivatives of R in parallel to those of B, with the exception that we don't necessarily generate all n × m pairs of characteristic derivatives.
It seems clear that calculating the set of characteristic derivatives of (R and ~B) takes about as much work as (and is isomorphic to) calculating MR and MB (the Glushkov automata of R and B) and then calculating, for all strings over the alphabet, whether the string is in R and not in B.
That in turn is about the same amount of work as it takes to calculate the two Glushkov automata, negate the automaton for B, and calculate the automaton which is the intersection of MR and M~B.
It may be possible to reduce the cost of translation by translating content models not into standard finite state automata but into what might be called ‘counter-extended automata’ (described in an unpublished paper [Sperberg-McQueen 2004]). In the worst case, however, counter-extended automata explode in size in much the same way as a naive expansion of content models into standard regular expressions. I still don't know a good theory explaining when and how using extended automata will be cheaper than expanding to standard expressions and using standard automata, and by how much. This is another topic for further work.

4.4.4. Examples

4.4.4.1. Shorter all-group
Consider first the pair of expressions
B = (a & b & c)
R = (a & b)
Does B subsume R?
We test by seeing whether among the characteristic derivatives of E = (R and ~B) there is any which is nullable. The characteristic derivatives are found thus:
For n = 0:
  • s = ε, Ds E = (a & b) and ~(a & b & c). This is not nullable.
For n = 1:
  • s = “a”: Ds E = (b and ~(b & c)). Not nullable.
  • s = “b”: Ds E = (a and ~(a & c)). Not nullable.
  • s = “c”: Ds E = ∅. Not nullable.
For n = 2:
  • s = “aa”: Ds E = ∅. Not nullable.
  • s = “ab”: Ds E = (ε and ~(c)). Nullable.
  • s = “ac”: Ds E = ∅. Not nullable.
  • s = “ba”: Ds E = (ε and ~(c)). Nullable.
  • s = “bb”: Ds E = ∅. Not nullable.
  • s = “bc”: Ds E = ∅. Not nullable.
  • s = “ca”: Ds E = ∅. Not nullable.
  • s = “cb”: Ds E = ∅. Not nullable.
  • s = “cc”: Ds E = ∅. Not nullable.
For n = 3, all strings produce Ds E = ∅, which is not nullable.
Note that (as expected) the sequences “ab” and “ba” constitute counter-examples: this derivation is not legal because they are locally valid against R but not against B.
4.4.4.2. Another shorter all-group
If we rewrite the base type as B′ = (a & b & c?) then we get these results:
For n = 0, 1, 2:
  • s = ε: Ds E = (a & b) and ~(a & b & (c?)). Nullable = false.
  • s = “a”: Ds E = b and ~(b & (c?)). Nullable = false.
  • s = “b”: Ds E = a and ~(a & (c?)). Nullable = false.
  • s = “c”: Ds E = ∅. Nullable = false.
  • s = “aa”: Ds E = ∅. Nullable = false.
  • s = “ab”: Ds E = ε and ~(c?). Nullable = false.
  • s = “ac”: Ds E = ∅. Nullable = false.
  • s = “ba”: Ds E = ε and ~(c?). Nullable = false.
  • s = “bb”: Ds E = ∅. Nullable = false.
  • s = “bc”: Ds E = ∅. Nullable = false.
  • s = “ca”: Ds E = ∅. Nullable = false.
  • s = “cb”: Ds E = ∅. Nullable = false.
  • s = “cc”: Ds E = ∅. Nullable = false.
For n = 3, all strings get Ds E = ∅. Nullable = false.
As may be seen, the characteristic derivatives of (R and ~B′) are:
  • (a & b) and ~(a & b & (c?))
  • b and ~(b & (c?))
  • a and ~(a & (c?))
  • ε and ~(c?)
  • ∅
and none of them is nullable. Since for every string s, Ds E is equivalent to one of these five derivatives, we have established that for every string s, Ds E is not nullable, which means ε is not accepted by Ds E, which means E does not accept s. The absence of any string accepted by E is precisely what we mean when we say L(E) = ∅. So B′ does subsume R, Q.E.D.
4.4.4.3. Reduction of all-group to sequence
B = (a & b)
R = (a, b)
E = (R and ~B) = ((a, b) and ~(a & b))
We obtain the characteristic derivatives of E thus:
For n = 0, 1, 2:
  • s = ε: Ds E = (a, b) and ~(a & b). Nullable = false.
  • s = “a”: Ds E = b and ~(b). Nullable = false.
  • s = “b”: Ds E = ∅. Nullable = false.
  • s = “aa”: Ds E = ∅. Nullable = false.
  • s = “ab”: Ds E = ε and ~(ε). Nullable = false.
  • s = “ba”: Ds E = ∅. Nullable = false.
  • s = “bb”: Ds E = ∅. Nullable = false.
For n ≥ 3, for all strings s, Ds E = ∅. So the characteristic derivatives are those given above, none of them nullable. So B subsumes R. Q.E.D.
4.4.4.4. Translation of all-group to equivalent choice of sequences
B = (a & b)
R = ((a, b) | (b, a))
E = ((a, b) | (b, a)) and ~(a & b)
The characteristic derivatives are all reached by strings of length < 3:
  • s = ε: Ds E = ((a, b) | (b, a)) and ~(a & b). Nullable = false.
  • s = “a”: Ds E = b and ~(b). Nullable = false.
  • s = “b”: Ds E = a and ~(a). Nullable = false.
  • s = “aa”: Ds E = ∅. Nullable = false.
  • s = “ab”: Ds E = ε and ~(ε). Nullable = false.
  • s = “ba”: Ds E = ε and ~(ε). Nullable = false.
  • s = “bb”: Ds E = ∅. Nullable = false.
B subsumes R. Q.E.D.
4.4.4.5. A more complex example
It may be worth working an example with larger content models, just to give an idea of how this algorithm behaves when the expressions are not quite so simple.
Let's take
B = (a, ((b, c, d){0,5}, e{0,1}){0,4}, f)
R = (a, b, (c, d, b){2,3}, c, d, e, f)
E = (R and ~B) = ((a, b, (c, d, b){2,3}, c, d, e, f) and ~(a, ((b, c, d){0,5}, e{0,1}){0,4}, f))
B does subsume R (at least that was the intent when they were written), but it's not easy to see this at a glance. I won't go through the algorithm in full detail, but here are some notes.
Expression E has, if my calculations are correct, 18 characteristic derivatives, one each calculable as Ds E for s in the 13 prefixes of the string “abcdbcdbcdef”, four more for the last five prefixes of “abcdbcdbcdbcdef” (one of those prefixes has the same derivative as a known one), and one (the empty set) which applies to all other strings. Some of these are given below; there seemed little point in listing all of them.
The derivatives steadily get more and more complex as the string gets longer and exercises more and more heavily the non-determinism in the exponents of the base type:
For n = 0,
  • s = ε: Ds E = (a, b, ((c, d, b){2,3}), c, d, e, f) and ~(a, ((((b, c, d){0,5}), (e?)){0,4}), f)
For n = 1:
  • s = “a”: Ds E = (b, ((c, d, b){2,3}), c, d, e, f) and ~(((((b, c, d){0,5}), (e?)){0,4}), f).
  • s = b (etc.): Ds E = ∅
For n = 2:
  • s = “ab”: Ds E =
    (((c, d, b){2,3}), c, d, e, f) 
            and
            ~( ( (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,3})) 
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               ),
               f)
All other sequences of length 2 yield the derivative ∅.
For n = 3:
  • s = “abc”: Ds E =
    (d, b, ((c, d, b){1,2}), c, d, e, f)
            and 
            ~( ( (d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,3}))
               | (d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (d, ((b, c, d){0,4}), (e?))
               ), 
               f
             )
When we reach n = 11, the derivative has reached alarming proportions:
  • s = “abcdbcdbcdb”: Ds E =
    (c, d, e, f)
            and
            ~( ( (c, d, ((b, c, d)?), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,3}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?))
               | (c, d, ((b, c, d)?), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?))
               | (c, d, ((b, c, d)?), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?))
               | (c, d, ((b, c, d)?), (e?))
              ), f)
Alert eyes will note that the rudimentary expression simplifier used to produce the derivative just shown has not fully simplified some of these expressions: a smarter simplifier eliminates some of the subterms in the negated expression, producing
(c, d, e, f) 
        and 
        ~( ( (c, d, ((b, c, d)?), (e?), 
             ((((b, c, d){0,5}), (e?)){0,3}))
           | (c, d, ((b, c, d){0,4}), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d){0,4}), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d){0,4}), (e?))
           | (c, d, ((b, c, d){0,3}), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d){0,3}), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d){0,3}), (e?))
           | (c, d, ((b, c, d){0,2}), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d){0,2}), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d){0,2}), (e?))
           | (c, d, ((b, c, d)?), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d)?), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d)?), (e?))
         ), f)
which has 13 disjuncts in the inner expression, rather than 35. It may be worth noting that Brzozowski shows explicitly that even if only relatively rudimentary simplification is done on the characteristic derivatives, there is still only a finite number of them. More formally, he shows that not only is there a finite number of characteristic derivatives such that none is equal to the others, there is also a finite number, albeit a larger number, such that none is similar to the others, where similar expressions can be transformed into each other using only the identities
E | E = E
E | F = F | E
(E | F) | G = E | (F | G)
Any expression simplifier which exploits at least these identities will suffice for the task; further or more aggressive simplifications may reduce the number of distinct expressions found.
For both n = 11, s = “abcdbcdbcde”, and n = 14, s = “abcdbcdbcdbcde”, there is a marked simplification of the derivative: Ds E =
f 
        and
        ~( ( ((((b, c, d){0,5}), (e?)){0,3})
           | ((((b, c, d){0,5}), (e?)){0,2})
           | ((((b, c, d){0,5}), (e?))?)
           | ε
           ), f)
Note that the “e” at the end of each input string dramatically reduces the uncertainty in the expression.
The two strings which correspond to valid strings in R have a derivative of interesting form:
  • n = 12, s = “abcdbcdbcdef”: Ds E = ε and ~(ε)
  • n = 15, s = “abcdbcdbcdbcdef”: Ds E = ε and ~(ε)
It's interesting to see that while we do get rather large expressions in the course of the work, we don't in fact seem to have quite as much growth as the exponents in the base type would have led us to expect. A rewriting of the base type as a standard regular expression will have a Glushkov automaton with 67 states, and the similar standard Glushkov automaton for R will have 16 states. The worst case size of the FSA for their intersection will have 1072 states (67 × 16). A cautious intersection method, which constructs states only when they prove to be needed, won't construct nearly that many, of course; I infer that working with Brzozowski derivatives gives us a similarly thrifty approach.

4.4.5. Proof of equivalence for different definitions of subsumption

In section 4.4.2, I claimed that the following ways of formulating the subsumption test are all equivalent:
  1. for all strings s, s is in L(R) only if s is in L(B)
  2. for all strings s, δ(Ds R) = ε only if δ(Ds B) = ε
  3. the expression (R and ~B) denotes the empty set
  4. for all strings s, δ(Ds (R and ~B)) = ∅
In slightly more formal terms they are:
  1. (∀ s : string)(s ∈ L(R) ⇒ s ∈ L(B))
  2. (∀ s : string)(δ(Ds R) = ε ⇒ δ(Ds B) = ε)
  3. (R and ~B) = ∅
  4. (∀ s : string)(δ(Ds (R and ~B)) = ∅)
4.4.5.1. (1) is equivalent to (2)
Note that for all strings s and all expressions E, s is in L(E) if and only if ε is in L(DsE). This follows from the definition of derivatives: Ds E is defined specifically as the set of strings t such that the concatenation st is in L(E). For all strings s, the concatenation sε = εs = s, so sε is in L(E) if and only if s is.
By substitution of like for like in 1, we now have
(1′) (∀ s : string)(ε ∈ Ds R ⇒ ε ∈ Ds B)
Note that for all sets E, δ(E) = ε if and only if ε is in E. Substituting like for like in (1′), we get 2. Q.E.D.
4.4.5.2. (2) is equivalent to (3)
An expression E denotes the empty set if and only if
  5. ¬ (∃ s: string)(s ∈ L(E))
By the definition of derivatives, we then have 5 equivalent to
  6. ¬ (∃ s: string)(ε ∈ L(DsE))
The normal logical operations on 6 give us
  7. (∀ s: string) (ε ∉ L(DsE))
and the definition of δ(E) turns this into
  8. (∀ s: string) (δ(Ds E) = ∅)
Substituting the expression (R and ~B) for E, we have the following translation of 3, which will be shown equivalent to 2:
  9. (∀ s: string) (δ(Ds (R and ~B)) = ∅)
The rule for derivatives of and-expressions gives us:
  10. (∀ s: string) (δ((Ds R) and (Ds ~B)) = ∅)
The definition of δ turns this first into
  11. (∀ s: string) (¬ (δ((Ds R) and (Ds ~B)) = ε))
and then, in a second step, into
  12. (∀ s: string) (¬ (δ(Ds R) = ε ∧ δ(Ds ~B) = ε))
De Morgan's law then gives us:
  13. (∀ s: string) ((¬ (δ(Ds R) = ε)) ∨ (¬ (δ(Ds ~B) = ε)))
The definition of δ(~E) then gives us
  14. (∀ s: string) ((¬ (δ(Ds R) = ε)) ∨ (δ(Ds B) = ε))
And this is equivalent, by the definition of material implication, to
  15. (∀ s: string) ((δ(Ds R) = ε) ⇒ (δ(Ds B) = ε))
which is identical to 2.
Q.E.D.
4.4.5.3. (3) is equivalent to (4)
We have
3. (R and ~B) = ∅
Substituting "(R and ~B)" for E in 5 we get
  16. ¬ (∃ s: string)(s ∈ L(R and ~B))
and repeating the chain from 5 to 8, we get 4:
  17. (∀ s : string) δ(Ds (R and ~B)) = ∅
Q.E.D.

4.5. Alternative extensions to all-expressions

As noted above (section 3.1), the all-expressions of XSD 1.0 are restricted in such a way that they can readily be realized by, or interpreted as, either the & operator of ISO 8879 or the interleave operator of Relax NG.
It is sometimes suggested that the restrictions on XSD all-expressions be lifted; if this is done, it will be necessary to define precisely the intended semantics. This section shows how derivatives can be used to specify either the non-interleaving or the interleaving interpretation.

4.5.1. Interleaving groups

In section 3.3, a rule was given for constructing the derivatives of &-expressions. If the arguments of the & are allowed to be arbitrary expressions, rather than only simple content-model expressions and &-expressions, then the resulting expressions have an interleave semantics.
Consider the expression E = (a, b+) & ((c* | d+), e). In the interleaving interpretation of the operator, it should accept any string consisting of one a followed by one or more b symbols, interleaved with a single e preceded either by one or more d symbols or zero or more c symbols. So the strings “abeb” and “dabeb” (to name two) are included.
Tracing the validation process on the first of these examples may serve to illustrate the process.
  1. Da E = b+ & ((c* | d+), e)
  2. Dab E = b* & ((c* | d+), e)
  3. Dabe E = (b* | ∅) = b*
  4. Dabeb E = b*
Since ν(Dabeb E) = ν(b*) is true, it is clear that E accepts the sequence “abeb”.
Note the way in which the original subexpression “a, b+” is gradually consumed in steps 1 and 2, with the remainder reappearing in the &-expression in the derivative. The subexpression “(c* | d+), e” is consumed entirely in step 3: its derivative with respect to e is ε, which disappears when the expression is simplified (ε & b* = b*), while the other branch of the &-rule, in which e is matched against b*, reduces to ∅ and is likewise removed.

4.5.2. Non-interleaving groups

To suppress the interleave interpretation, it suffices to change the definition of the derivative Ds (F & G) to require that once F or G is started, it must be completed before the other expression can be recognized. The new rule is:
  • Dx (F & G) = ((Dx F), G) | ((Dx G), F)
This resembles the original rule, but on the right-hand side the & operator is replaced by the sequence operator (comma).
The result is that while the sequence “abde” is accepted, the sequence “abeb” is not:
  1. Da E = b+, ((c* | d+), e)
  2. Dab E = b*, ((c* | d+), e)
  3. Dabe E = (ε | ∅) = ε
  4. Dabeb E = (∅ | ∅) = ∅
It will be seen that the operator thus defined is not associative: in the expression “((F & G) & H)” the strings recognized by F and by G must be adjacent, since once the subexpression “(F & G)” is begun, it must be completed before the H can be recognized, while in the expression “(F & (G & H))” it is the G and the H which must be adjacent. It is an open question whether a non-interleaving & operator can be defined in an associative way.
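In terms of the Python sketch of section 3.3, the two readings differ only in the clause for the & operator in deriv; the first alternative below is the interleaving clause already given there, and the second is the non-interleaving clause just described. A validator would include one or the other, depending on the semantics chosen.

    # interleaving reading (as in section 3.3): the unconsumed remainder of
    # each operand stays under &, so members of the two operands may interleave
    if tag == 'all':
        return ('choice', ('all', deriv(x, e[1]), e[2]),
                          ('all', e[1], deriv(x, e[2])))

    # non-interleaving reading: once one operand has been started, the rest of
    # it must be completed (hence the sequence operator) before the other begins
    if tag == 'all':
        return ('choice', ('seq', deriv(x, e[1]), e[2]),
                          ('seq', deriv(x, e[2]), e[1]))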

5. Conclusion

This paper has reviewed the basic concepts of Brzozowski derivatives and illustrated their applicability to problems arising in checking and using XML Schema content models. Brzozowski derivatives provide a convenient way of validating documents against content models whose translation into finite-state automata would be prohibitively large. They allow us to define the SGML and XML rules governing determinism of content models concisely, in terms of properties of the content models themselves and without requiring that the expressions be translated into automata. Brzozowski derivatives can also be used in the concise formulation of alternative constraints, for example weakened wildcards or less restrictive versions of the XML Schema all-expression. And finally, derivatives provide a simple and direct way to test two content models to see whether one accepts any sequences not accepted by the other. In some cases, the use of Brzozowski derivatives may require less work than the use of finite-state automata; in others, even when the theoretical complexity of the problem is unaffected, the algorithm using derivatives may be more easily specified, coded, and understood than the corresponding algorithm using automata.

A. References

Brüggemann-Klein, Anne. 1993. “Regular expressions into finite automata.” Theoretical Computer Science 120.2 (1993): 197-213.

Brüggemann-Klein, Anne. 1993. Formal models in document processing. Habilitationsschrift, Freiburg i.Br., 1993. 110 pp. Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps (Cover pages archival copy also at http://www.oasis-open.org/cover/bruggDissert-ps.gz).

Brüggemann-Klein, Anne, and Derick Wood. 1998. “One-unambiguous regular languages.” Information and computation 140 (1998): 229-253.

Brzozowski, Janusz A. 1964. “Derivatives of regular expressions.” Journal of the ACM 11.4 (1964): 481-494.

Clark, James. 2002. “An algorithm for RELAX NG validation.” 13 February 2002. http://www.thaiopensource.com/relaxng/derivative.html

English, Joe. 1999. How to validate XML. http://www.flightlab.com/~joe/sgml/validate.html

Foster, Bob. 2003. “Derivative works [Parsing].” Blog entry, 27 January 2003, in Technical difficulties. http://bobfoster.com/plog/index.php?m=200301

Foster, Bob. 2004. “Derivatives of Bounded Rep[e]tition.” Blog entry, 17 May 2004, in City Life: Full Speed Ahead. http://jroller.com/comments/bobfoster/FullSpeedAhead/derivatives_of_bounded_repitition

Fuchs, Matthew, and Allen Brown. 2003. “Supporting UPA and restriction on an extension of XML Schema.” Extreme Markup Languages, Montréal, 2003. Available online at http://www.idealliance.org/papers/extreme03/html/2003/Fuchs01/EML2003Fuchs01.html

Glushkov, V. M. 1961. “The abstract theory of automata”, tr. J. M. Jackson, in Russian Mathematical Surveys, A translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk, ed. J. L. B. Cooper, vol. 16 (London: Cleaver-Hume, 1961), pp. 1-53.

Hopkins, Mark. 1994. “Regular Expression Software”. Posting to comp.compilers, 17 February 1994. http://compilers.iecc.com/comparch/article/94-02-109

International Organization for Standardization (ISO). 1986. ISO 8879-1986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.

Kawaguchi, Kohsuke. 2003. Sun Multi-Schema XML Validator. Version 1.2, April 2003. Available from http://www.sun.com/software/xml/developers/multischema/

Schmidt, Martin. 2002. Design and Implementation of a validating XML parser in Haskell. Master's thesis, Fachhochschule Wedel (Germany).

Sperberg-McQueen, C. M. 2004. “Notes on finite state automata with counters.” http://www.w3.org/XML/2004/05/msm-cfa.html

Thompson, Henry S., and Richard Tobin. 2003. “Using finite state automata to implement W3C XML Schema content model validation and restriction checking.” XML Europe 2003, London. Available online at http://www.idealliance.org/papers/dx_xmle03/papers/02-02-05/02-02-05.html (imperfectly rendered) or at http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html.

[Wilmott, Sam]. [1991]. Content model algebra. Exoterica white paper. [Dated and attributed from external evidence. Original at http://www.omnimark.com/white/cma/cma.html has disappeared, but copies appear to be available at http://www.oasis-open.org/cover/omnimarkContentModelAlgebra200101.html and http://home.chello.no/~mgrsby/sgmlintr/sgmlcont.htm]

World Wide Web Consortium (W3C). 2004. Extensible Markup Language (XML) 1.0 (Third Edition), ed. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler (Second Edition), François Yergeau (Third Edition). W3C Recommendation 4 February 2004. Published by the World Wide Web Consortium at http://www.w3.org/TR/REC-xml/.

World Wide Web Consortium (W3C). 2001. “XML Schema Part 0: Primer”, ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/.

World Wide Web Consortium (W3C). 2001. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

World Wide Web Consortium (W3C). 2001. XML Schema Part 2: Datatypes, ed. Paul V. Biron and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/

World Wide Web Consortium (W3C). 2004. Namespaces in XML 1.1, ed. Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin. W3C Recommendation 4 February 2004. Published by the World Wide Web Consortium at http://www.w3.org/TR/xml-names11.


Notes

[1] Despite their use in MSV and Bob Foster's essay on numeric exponents in content models [Foster 2004], Brzozowski derivatives appear not to be widely known among those working on XSD. I was unaware of Foster's work myself until after the first version of this paper had been completed. And MSV appears not to exploit all of their applications to XSD-related problems: among the parts of XSD not implemented by MSV are two addressed here by means of derivatives.
[2] For convenience, I will use the same notation to refer both to individual sequences and to the singleton sets consisting solely of those sequences. So in what follows, ε will be used indifferently both for the set containing just the empty sequence and for the empty sequence itself. When a sequence contains just a single item, the same notation may be used to denote that item, the length-one sequence containing just that item, and the singleton set containing just that sequence. This does not seem to lead to any ambiguities, and it helps keep the notation simpler. But if any reader of this document finds any ambiguities, the author would be glad to be informed so I can mend my text, or my notation.
[3] In some contexts, it may be useful to regard expanded names as denoting elements with a particular expanded name and a particular type.
[4] Or, using δ, we can say more compactly: “Dx (F, G) = ((Dx F), G) | (δ(F), Dx G)”
[5] Hence the special case for E = F{0,unbounded}.
[6] Recall that Ds E denotes, by definition, the set of strings which when appended to s will produce members of E. If ν(Ds E), then ε is in Ds E, which means that if we concatenate s with ε, the result is in E. But ε is the identity with respect to concatenation: for all sequences t, tε = t. If concatenating s and ε gives a string in E, then s is already in E, without the concatenation.
[7] Informally, a regular expression is weakly deterministic if it obeys the rule known in SGML as the ‘ambiguity rule’, in XML as the ‘determinism rule’, and in XSD as the ‘Unique Particle Attribution constraint’, which requires that there never be any doubt or non-determinism, in sequences which match the expression (or prefixes of such sequences), about which symbol in the expression matches each symbol in the sequence. A regular expression is strongly deterministic if it is not only deterministic which symbol in the expression is matched by each symbol in the input, but also deterministic which repetition operator fires, whenever more than one might apply. The expression “a?, a” is weakly non-deterministic (although wholly unambiguous), while “(a*)*” is weakly deterministic (there is only one symbol to match), but not strongly deterministic (since for the second and all subsequent occurrences of a it is non-deterministic whether the inner or outer * should fire). For more formal definitions of these terms, see Brüggemann-Klein [Brüggemann-Klein 1993a]. For a formal definition of deterministic (“one-unambiguous”) regular expressions, see Brüggemann-Klein and Wood [Brüggemann-Klein/Wood 1998].
[8] For those who wish to follow the details:
[9] The structure of the initial step is just as before: since E = (F, G) and F is nullable, this is
(((Dtitle F), G) | Dtitle G)
For the first alternative, we have
Dtitle F
= ((Dtitle H), F)
= ((Dtitle(script | style | meta)), F)
= ((Dtitle(script) | Dtitle(style) | Dtitle(meta)), F)
= ((∅ | ∅ | ∅), F)
= (∅, F)
= ∅
For the second alternative, we have
Dtitle G = (Dtitle (title, F, (base, F)?) | Dtitle (base, F, (title, F)))
= ((Dtitle(title), (F, (base, F)?)) | (Dtitle(base), (F, (title, F))))
= ((ε, (F, (base, F)?)) | (∅, (F, (title, F))))
= ((F, (base, F)?) | (∅))
= (F, (base, F)?)
[10] By the definition of derivatives for sequences, we have
Dstyle(F, (base, F)?)
= ((Dstyle(F), (base, F)?) | Dstyle((base, F)?))
= (((Dstyle(script) | Dstyle(style) | Dstyle(meta)), (base, F)?) | ∅)
= (((∅ | ε | ∅), (base, F)?) | ∅)
= ((ε, (base, F)?))
= (base, F)?
[11] In standard regular expressions, every expression which is weakly deterministic but not strongly deterministic has an equivalent which is both weakly and strongly deterministic: expressions like “(e*)*” can be reduced to “e*” without changing the language recognized. In XML Schema content models, similarly, “e{0,10}{1,10}” can be reduced to “e{0,100}”. In some cases, however, strong determinism cannot be so readily achieved. It seems unlikely that expressions like “(e{1,5},b{0,2}){1,5}” can be translated into strongly deterministic equivalents without either loss of weak determinism or exponential explosion in size, but as far as I know this is an open question.
[12] The XML specification comes close to having the worst of both approaches: the description in the text is mostly hand-waving, and to get a precise idea of what is involved, the reader must follow the references to published algorithms for translating regular expressions into automata and then mentally perform the specified translation.
[13] It would be interesting to see whether a precise definition of strong determinism could also be formulated in terms of the expressions and their derivatives, one which avoids appeal to the construction of automata. It seems possible, but I do not have such a definition to offer.
[14] In some respects, the definition of ambiguity in Sam Wilmott's paper [Wilmott 1991] is very attractive: it does not explicitly require generating every characteristic derivative, and it requires checking over all input symbols only in the cases “F | G”, “F, G”, and “F*”. But the definition given there requires that every associative grouping of sequences be examined: “ab is ambiguous if ... for some c and d, where c ≠ a and d ≠ b, ab = cd, and cd is ambiguous.” As formulated, this appears to refer to the language accepted by the expressions ab and cd, but I believe it should be taken to mean only that an expression of the form ((E, F), G) needs also to be examined in the form (E, (F, G)), not that the language recognized by an expression of the form ab (in conventional regular expression notation) needs to be examined to see if it can be recognized by some other expression of the form cd, which would really be expecting a bit much. The example given is “(x(y+λ))y” (in content model notation this would be “(x, y?), y”), which needs to be refactored into “x((y+λ)y)” (i.e. “x, (y?, y)”) for the non-determinism to be detected. The definition given here avoids the necessity for such qualifications and complications by specifying that the property must hold for all characteristic derivatives. It is unnecessary to refactor E = “(x, y?), y”, because DxE = “(y?, y)”, and the non-determinism will be detected there.
A more serious drawback for markup purposes is that the definition in Wilmott [Wilmott 1991] covers only regular expressions, not content models; this is a drawback since as Brüggemann-Klein and Wood [Brüggemann-Klein/Wood 1998] point out, translation from content models into conventional regular expressions (i.e. expressions without the +, ?, or & operators) does not preserve one-unambiguity.
[15] By the rule for idet(F, G).
[16] The original expression, plus one derivative each for strings of “e” of length 1 through 1,000,000, plus ∅, which is the derivative with respect to any other string.
[17] In technical terms, tokens in F which are in the follow set of tokens in the final set of F. For definitions of start set, final set, etc., see [Brüggemann-Klein 1993a] or [Brüggemann-Klein/Wood 1998].