- 1. Regular expressions with integer-range exponents (REIREs)
- 1.1. Definition
- 1.2. Paths, ambiguity, and determinism
- 1.3. Functions on REIREs

- 2. Counting regular attribute grammars
- 2.1. A simple example
- 2.2. Using an attribute grammar
- 2.3. Another example
- 2.4. Generation of CRAG from REIRE

- 3. Counter-extended finite-state automata (CFAs)
- 3.1. Definition
- 3.2. Properties of CFAs
- 3.3. Checking the UPA constraint
- 3.4. Checking instance validity
- 3.5. Making a CFA deterministic
- 3.6. Subset checking

- A. Standard construction of Glushkov automaton
- B. Standard construction of Thompson automaton
- C. To do
- D. Notation and terminology
- E. References

The XML Schema 1.0 specification defines content models which are
similar to regular expressions, but augment standard regular
expressions in various ways. Integer-range exponents are allowed in
addition to ***, *+*, and
*?* (or in the more austere regular expressions
found in some work, *** alone). Substitution
groups affect the language accepted in practice. Various forms of
wildcard are defined. And each symbol in a content model may be
associated with type and other information.

In its current form, this paper attempts to extend the basic
facts about regular expressions and finite state automata to deal with
regular expressions extended with integer-range exponents; the
basic proposal is to extend finite-state automata with counters,
and once that proposal has been outlined, the task of the paper will
be to show how various tasks may be accomplished using such
*counter-extended finite-state automata* (CFAs).

Although the idea of augmenting a finite-state automaton with one
or more counters feels like an obvious one, it has proven harder than
expected to find information on the network concerning the underlying
theory. These notes are an attempt to work out some ideas in the
absence of much guidance from the literature.

The regular expressions we are dealing with use a slightly
unusual notation, although they denote the same set of languages
as more conventional regular expressions.
We will work with regular expressions which have numeric exponents.

We say that a regular expression denotes a language (i.e. a set of
strings) over some input alphabet Σ; we can
refer to the language denoted by a regular expression *E* as
*L*(*E*).

The regular expression itself is a string which may contain
symbols from Σ in addition to the digits 0 through 9
and the meta-symbols
“`∅`”,
“`ε`”,
“`|`”,
“`,`”,
“`{`”,
“`INF`”,
“`}`”,
“`(`”, and
“`)`”.

As strings, regular expressions are defined by the following
rules:[1]

- The string “
`∅`” is a regular expression denoting the empty set. - The string “
`ε`” is a regular expression denoting {`ε`}, the set containing the empty string. - For any symbol
*x*∈ Σ (where Σ is the input alphabet we are interested in),*x*is a regular expression denoting {“*x*”}, i.e. the set whose sole member is the string consisting of the symbol*x*. - If
*F*and*G*are regular expressions, then “*F*`,`*G*” is a regular expression, denoting {“*x**y*” |*x*∈*L*(*F*) ∧`y`∈*L*(*G*)}, the set of strings made by concatenating any member of the language defined by*F*with any member of the language defined by*G*. - If
*F*and*G*are regular expressions, then “*F*`|`*G*” is a regular expression, denoting {“*x*” |*x*∈*F*∨*x*∈*G*}, the union of*L*(*F*) and*L*(*G*). - If
*F*is a regular expression,*n*is a non-negative integer,*m*is either a non-negative integer or the symbol*INF*, and*n*≤*m*, then “*F*`{`*n*`,`*m*`}`” is a regular expression, denoting {“*x*_{1}*x*_{2}...*x*_{k}” | (0 ≤*n*≤*k*≤*m*) ∧ for 1 ≤*i*≤*k*,*x*_{i}∈*F*}, the concatenation of at least*n*and at most*m*strings each of which is a member of*F*. The expression “`{`*n*`,`*m*`}`” is an integer-range exponent which is said to*dominate*expression*F*. - If
*F*is a regular expression, then “`(`*F*`)`” is a regular expression which denotes the same language as is denoted by*F*.

(Note on typography:
in the text, references to a specific symbol *a* will
take the form *a*, while a use of that symbol in a regular
expression will take the form `a`; these are the same
symbol for purposes of this paper, regardless of any differences of
typographic style or glyph.
Syntactic variables (e.g. the *x*, *F*, and *G* in the rules above) are
rendered as *x*, *F*, *G* in all cases.
In the rules above, I have enclosed strings, regular expressions,
string schemata, and regular-expression schemata in quotation marks
in the interests of making a clearer separation between the meta-level
discussion and the regular expressions themselves. In the following
discussion, the quotation marks will normally be dispensed with.)

Examples (assuming that Σ = {*a*, *b*, *c*}):

Regular expression | Denotation |
---|---|

∅ |
the empty set (the language with no members) |

ε |
{ε} = the set containing just the empty string |

a |
{“a”} |

b |
{“b”} |

a,b |
{“ab”} |

(a,b) |
{“ab”} |

(a,b,a)|(b,b,b) |
{“aba”,
“bbb”} |

(a{3,4},b{0,1}){1,2} |
{“aaa”,
“aaaa”,
“aaab”,
“aaaab”,
“aaaaaa”,
“aaaaaaa”,
“aaaaaab”,
“aaabaaa”,
“aaaaaaaa”,
“aaaaaaab”,
“aaaabaaa”,
“aaabaaaa”,
“aaabaaab”,
“aaaaaaaab”,
“aaaabaaaa”,
“aaaabaaab”,
“aaabaaaab”,
“aaaabaaaab”},
i.e. one or two occurrences of strings matching the inner
expression (three or four occurrences of a
followed optionally by a b) |

a{0,10}{0,10} |
strings of length zero to 100, consisting exclusively of the
symbol a |

Note that the repetition operators of conventional regular
expressions can be reduced to integer-range exponents as defined
here. For any regular expression *E*,
*E*`*` ⇒ *E*`{0,INF}`,
*E*`+` ⇒ *E*`{1,INF}`,
*E*`?` ⇒ *E*`{0,1}`.

The same symbol may occur more than once in a regular expression;
when it is necessary to distinguish the different occurrences of the
same symbol, we speak of the *positions* of the expression.
The expression `(a{2,3}, b{1,4}, a{0,INF}){3,4}` has three
positions: the first *a*, the *b*, and the
second *a*. We may write `a_1`,
`b_2`, `a_3` to distinguish the positions of the
expression.

If an integer-range exponent dominates an expression *E*, then
it also dominates each position *p* in *E*.

Note that as is usual for regular expressions, each string
matching a particular regular expression *E* can be thought of as
mapping out a path through the regular expression: each
symbol in the string can be associated with a particular position
in the regular expression. Thus the string “`abba`”
matches the regular expression `a(a|b)*a` and corresponds
to the path visiting *a_1*, then *b_3*
twice, then *a_4*.

For an unambiguous regular expression *E*, each word in *L*(*E*)
corresponds to exactly one path through *E*. If any word in *L*(*E*)
corresponds to more than one path through *E*, then *E* is weakly
ambiguous.[2] Brüggemann-Klein distinguishes
this weak ambiguity from a strong ambiguity attributed to Sippu and
Soisalon-Soininen. Informally, a weakly unambiguous regular
expression allows semantic actions to be associated with the symbols
of the expression, while a strongly unambiguous expression allows
semantic actions to be associated unambiguously not only with
the symbols but also with the repetition operators.

The expression `(a{1,2},b{0,1}){1,2}` is weakly
unambiguous because any `a` in any word in the language
can only match the single *a* in the expression, and
similarly for each `b` in any word in the language. It
is strongly ambiguous, by contrast, because the word `aa`
can be taken either as a single occurrence of the expression
`(a{1,2},b{0,1})`, thus firing the outermost exponent
once, or as two occurrences of the inner expression, thus firing
the outer exponent twice.

By analogy with finite state automata, we can distinguish the
notions of ambiguity and determinism, which are in a subset/superset
relation: all unambiguous automata (and expressions) are
deterministic, but not vice versa. An expression *E* is deterministic
if and only if for every prefix of any word ∈ *L*(*E*), the prefix
determines a unique path through *E*.[3] Again, we can distinguish weak
determinism (there is a unique sequence of symbols) and strong
determinism (there is a unique sequence of symbols and exponent
firings).

The Unique Particle Attribute constraint of XML Schema effectively
requires that content models be weakly deterministic. Given the
complications which arise from strong nondeterminism, it might be
desirable to consider requiring that they be strongly deterministic as
well. An algorithm for testing weak determinism is given below.

Following [Brüggemann-Klein 1993] we prepare the way for defining the algorithm
to map REIREs into counter-extended FSAs
by defining several functions on regular expressions:

*pos*(*E*) → the positions of*E**first*(*E*) → the positions of*E*which come first in some word ∈*L*(*E*).*last*(*E*) → the positions of*E*which come last in some word ∈*L*(*E*).*nullable*(*E*) → Boolean indicating whether ε ∈*L*(*E*)*follow*(*E*,*p*) → the positions of*E*which can come after*p*in some word ∈*L*(*E*)

The inductive definitions which follow match those given by
[Brüggemann-Klein 1993] but extend them to cover integer-range exponents.
The functions *first*(*E*), *last*(*E*),
and *nullable*(*E*) are:

- If
*E*= ∅, then*first*(*E*) =*last*(*E*) = ∅,*nullable*(*E*) = false. - If
*E*= ε, then*first*(*E*) =*last*(*E*) = ∅,*nullable*(*E*) = true. - If
*E*=*a*∈ Σ, then*first*(*E*) =*last*(*E*) = {*a*},*nullable*(*E*) = false. - If
*E*=*F*,*G*, then*first*(*E*) =if*nullable*(*F*) then*first*(*F*) ∪*first*(*G*)else*first*(*F*)*last*(*E*) =if*nullable*(*G*) then*last*(*F*) ∪*last*(*G*)else*last*(*G*)*nullable*(*E*) =*nullable*(*F*) ∧*nullable*(*G*) - If
*E*=*F*|*G*, then*first*(*E*) =*first*(*F*) ∪*first*(*G*)*last*(*E*) =*last*(*F*) ∪*last*(*G*)*nullable*(*E*) =*nullable*(*F*) ∨*nullable*(*G*) - If
*E*=*F*{*n*,*m*}, then*first*(*E*) =*first*(*F*),*last*(*E*) =*last*(*F*),*nullable*(*E*) = ((*n*= 0) ∨*nullable*(*F*))

The definition of *follow*(*E*,*p*)
is now straightforward.

For completeness, we stipulate that for all regular expressions *E*
and all positions *p*, if *p* ∉ *pos*(*E*), then *follow*(*E*,*p*) = ∅.
Otherwise, for all *p* ∈ *pos*(*E*):

- If
*E*= ∅, then*E*has no positions and*follow*(*E*,*p*) = ∅. - If
*E*= ε, then*E*has no positions and*follow*(*E*,*p*) = ∅. - If
*E*=*a*∈ Σ, then*follow*(*E*,*p*) = ∅. - If
*E*=*F*,*G*, then*follow*(*E*,*p*) =if*p*∈ (*pos*(*F*) \*last*(*F*)), then*follow*(*F*,*p*)if*p*∈*last*(*F*), then*follow*(*F*,*p*)if*p*∈*pos*(*G*), then*follow*(*G*,*p*)*follow*(*F*,*p*) ∪*follow*(*G*,*p*). - If
*E*=*F*|*G*, then*follow*(*E*,*p*) =*follow*(*F*,*p*) ∪*follow*(*G*,*p*)*first*(*E*) =*first*(*F*) ∪*first*(*G*)*last*(*E*) =*last*(*F*) ∪*last*(*G*)*nullable*(*E*) =*nullable*(*F*) ∨*nullable*(*G*) - If
*E*=*F*{*n*,*m*}, then*follow*(*E*,*p*) =We say that the transition fromif*p*∈ (*pos*(*F*) \*last*(*F*)) then*follow*(*F*,*p*)else (*p*∈*last*(*F*))*follow*(*F*,*p*) ∪*first*(*F*)*p*∈*last*(*F*) to*q*∈*first*(*F*) is*licensed*by the integer-range exponent`{n,m}`; in traversing the transition we say that the exponent*fires*.

Readers familiar with Brüggemann-Klein's work or with the construction of the
Glushkov automaton for standard regular expressions should note that
for the regular expressions we use here, *last*(*E*) and *follow*(*E*,*p*) include
all of the positions which *can*
come last, or follow *p*, in any word in *L*(*E*), even though in a path
involving repeated visits to a particular position, the follow set
of the position may vary on each visit, and it may be allowed to be
the last position only on some visits, not all. For example, for
*E* = `a{4,5}`, *last*(*E*) = {*a_1*}, even
though that position can be the final one in a path only if three or
four visits to that position precede the final visit. That is,
the last positions of *E* become final states of the extended
Glushkov automaton
*M*_{E} only when certain additional
conditions are met: specifically, the counters need to be in the right
position. Similarly, in the construction of the standard Glushkov
automaton, the follow set of *p* turns into a set of arcs starting in
*p* and leading to the members of the follow set. In the CFAs to be
constructed below, the arcs will be guarded so that they can be
traversed only when the counters have appropriate values.

Rather than embark directly on a description of how to
translate regular expressions with integer-range exponents
into FSAs
with counters, I turn next to a description of one way to translate
REIREs into grammars. Let us use the term *counting
grammar* to denote a grammar which ‘counts’
things (i.e. a grammar which can distinguish words having *n*
occurrences of some substring from those having *n*+1 occurrences,
for some *n* > 0).
Counting grammars can be written using conventional grammatical
formalisms like BNF, but for large *n*, conventional methods tend to
become cumbersome. Attribute grammars provide a simple way to write
counting grammars concisely; the attribute grammars described below
use regular grammars as their underlying context-free grammars, and
the attributes used are very restricted in form; they may thus be
termed *counting regular attribute grammars* (CRAGs).

Once CRAGs have been defined, the close relation between
conventional regular grammars and finite-state automata will suggest a
simple and direct way to extend finite-state automata with
counters. In other words, we have
*regular grammar* : *FSA* =
*CRAG* : *CFA*.

Before attempting a formal description of CRAGs and how to construct
them, let us consider first some simple examples. First, consider
the regular expression *E* = `a{2,4}`. A conventional
regular grammar to recognize *L*(*E*) is*S0* is the start symbol.
This has an obvious similarity to the finite-state automaton
*S0* is the start state and
accept states are marked with asterisks. Note that most arcs
leading into accept states are represented in the grammar by two
productions, one naming the (non-terminal corresponding to the) target
state and one not naming it. The latter results in a sentential
form without any non-terminals and thus terminates the
grammatical derivation; the longer form is used if there are outgoing
arcs from the target state; it allows the derivation to
continue. It will prove convenient to adopt a different approach to final
states, and so in what follows the grammars deviate slightly from the
usual regular-grammar notation. In conventional right-linear grammars,
every production has either the form `Q → a R` or the form
`Q → a`, where *Q* and *R* are non-terminals and
*a* is a terminal symbol. In the grammars below,
the second form does not appear;
every production has either the form `Q → a R` or the form
`Q → ε`.

S0 → a S1. S1 → a. S1 → a S2. S2 → a. S2 → a S3. S3 → a.where

a a a a S0 -----> S1 -----> *S2 -----> *S3 -----> *S4where

In this notation, the regular grammar for `a{2,4}`
becomes:*Q* for which a production of the form `Q
→ ε` appears in the grammar. We refer to these as
the *final productions* of the grammar, and to the
non-terminals on their left-hand sides as its *final
non-terminals*. By analogy, productions with the start symbol
(here conventionally written *S* or *S0*) on
the left-hand side will be referred to as *start
productions*. The non-terminals on the right-hand sides of
start productions are *initial non-terminals*.

S0 → a S1. S1 → a S2. S2 → a S3. S2 → a S4. S2 → ε. S3 → ε. S4 → ε.There is now a 1:1 relation between the states of the FSA and the non-terminals of the grammar, and the final states of the FSA are exactly those states

The counting grammar can be simpler if instead of using distinct
non-terminals to keep a tally of how many occurrences of `a`
have been encountered so far, we substitute an attribute which carries
that information. Augmented with such an attribute, and with appropriate
guards (or *semantic conditions*) on productions, our simple
example becomes

S → a A(1). A(n) → {n < 4} a A(n + 1). A(n) → {n > 1} ε.

The notation used is similar to that of Prolog definite-clause
grammars passing inherited grammatical attributes as
positional parameters. The expressions in braces are semantic
conditions which must hold for the production to fire.[4]

At this point the core ideas of CRAGs may be formulated this way:
at each point where the grammar needs a counter (in effect, one for
each integer-range exponent in the regular expression), an attribute
is used. Tests on productions may compare one or more attributes to
constants; the comparisons are invariably of the form
*counter* < *constant* or
*counter* ≥ *constant*. On the right
hand side, counters passed to non-terminals may be either incremented
or set to 1. There is a 1:1 mapping between the non-terminals
of the CRAG and the states of the assocated CFA. When following the
algorithm given below we generate the CRAG from a REIRE, there is
also a 1:1 mapping between the non-terminals other than *S* and the
symbol tokens of the REIRE. As a consequence of this last, each
non-terminal *N* in a CRAG has an associated symbol, and whenever
*N* appears in the right-hand side of a production rule, the
terminal symbol in the rule is *N*'s associated symbol. We write
*chi*(*N*) for the associated symbol of non-terminal *N*.

It may be useful to work out a pair of more elaborate examples.
Let us consider first `(a{2,3}, b{1,4}){3,4}`, and build
up its grammar from the grammars of its sub-expressions.

The grammar for `a{2,3}` is similar to that given above:`b{1,4}` is equally straightforward:

S → a A(1). A(n) → {n < 3} a A(n+1). A(n) → {n ≥ 2} ε.and that for

S → b B(1). B(n) → {n < 4} b B(n+1). B(n) → {n ≥ 1} ε.It may be noted in passing that since counters are invariably initialized to 1, the guard on the final production is always true and may be omitted when doing so seems clearer.

In composing two grammars *G*_{1} and *G*_{2} using the sequence
operator, we fuse the final productions of *G*_{1} with the start
productions of *G*_{2}. The left-hand side and guard are taken from the
*G*_{1} production, and the right-hand side from the *G*_{2} productions.
Note that since the start symbol never has any attributes, the
right-hand side of a start production can never depend on an inherited
attribute value; any attributes in the right-hand side of a
start production will be set to 1. In our example, the result is:`a{2,3}, b{1,4}`.
The next step is to modify the grammar to handle the outermost
exponent.

S → a A(1). A(n) → {n < 3} a A(n+1). A(n) → {n ≥ 2} b B(1). B(n) → {n < 4} b B(n+1). B(n) → {n ≥ 1} ε.This recognizes the language of

To handle the outermost exponents on an expression of the form
*E*`{n,m}`, we need to add

- a new counter attribute on all existing non-terminals (since the outermost expression doesn't add any new symbols from Σ, we don't get any new non-terminals), and
- new productions for each pair (
*Q*,*R*) with*Q*a final non-terminal of*E*and*R*an initial non-terminal of*E*; these will have the form`Q(c, ...) → {c < m ∧ G} chi(R) R(incr(c), ...)`, where*G*is the guard on the final production for*Q*, if it has one, and otherwise`true`, and*chi*(*R*) is the symbol ∈ Σ associated with*R*. - a new guard on the final productions of
*E*: if the final production has guard {*G*}, the new final production has guard {*c*≥*n*∧*G*} (where*c*is the new counter); if the final production has no guard, the new guard is just {*c*≥*n*}

In the case of `a{2,3}, b{1,4}`, the sole final non-terminal
is *B* and the sole initial non-terminal is *A*, so there is just one new
rule. The grammar for `(a{2,3}, b{1,4}){3,4}` is thus:

S → a A(1,1). A(n,m) → {n < 3} a A(n+1,m). A(n,m) → {n ≥ 2} b B(1,m). B(n,m) → {n < 4} b B(n+1,m). B(n,m) → {m < 4 ∧ n ≥ 1} a A(1, m+1). B(n,m) → {m ≥ 3 ∧ n ≥ 1} ε.

For every regular expression with integer-range exponents *E*,
there is a CRAG *G*(*E*) which accepts the same language and which
will be described below.

Formally, an attribute grammar is a tuple consisting of a context free
grammar, a semantic domain (types and operators), a set of attributes
each associated with a non-terminal, a set of attribute evaluation
rules, and a set of semantic conditions. By describing each of these
in turn, we can show how to construct *G*(*E*) for any regular expression *E*.

- The underlying context-free grammar has
*V*_{N}(the non-terminal vocabulary) is the set of positions in*E*, plus an additional start symbol (here invariably written*S*). For clarity, when it is useful to stress the difference between a position*p*and the use of that position as an element of*V*_{N}, the latter will be denoted*nt*(*p*).[5]*V*_{T}(the terminal vocabulary) is the set of symbols ∈ Σ actually used in*E*.*P*(the set of production rules) has one or more rules for each position pair*p*,*q*where*q*∈*follow*(*E*,*p*). The rules each take the form “*P*→`{`*G*`}`*a**Q*”, where*G*is an appropriate guard and*a*=*chi*(*q*). Details of the guard and the number of rules for a position pair are given below.*S*is the start symbol.

- The semantic domain of
*G*(*E*) is the set of non-negative integers, over which the legal operations are incrementation by 1 and comparison against constants using ≥ and <. - Attributes are associated with non-terminals other than
*S*according to rules given in more detail below. Informally, each attribute on a non-terminal represents a counter which keeps track of an integer-range exponent. Each non-terminal is associated with a position in*E*and has one attribute for each integer range exponent which dominates that position in*E*. The convention adopted here is to write the outermost exponent's attribute last in the sequence. All attributes in*G*(*E*) are inherited. - The attribute evaluation rules which assign values to the
attributes of non-terminals in the right-hand side of a production are
all implicit; they take the form of expressions written after the
non-terminal in the manner of parameters to a subroutine call. The
expressions all take one of the following forms:
“
`1`” (i.e. initialization to the constant 1), “`incr(`*c*`)`” (incrementing the value for some counter*c*inherited by the left hand side), or “*c*” (copying a value inherited by the left hand side). - The semantic conditions associated with a given production are
Boolean expressions written within braces at the beginning of the
production's right-hand side. All atomic semantic conditions in
the CRAG are either of the form “
*c*<*k*” or of the form “*c*≥*k*”, where*c*is the name of a counter and*k*an integer constant. Semantic conditions may be composed using Boolean operators; in this construction, only logical conjunction (∧) is used.

In order to handle strongly ambiguous regular expressions, it is
useful to allow our grammar to contain production rules which would be
redundant if their semantic conditions (guards) and attribute
evaluation rules were ignored. Informally, if *q* ∈ *follow*(*E*,*p*),
then we will have at least one rule with *p* on the left-hand side and
*q* on the right. If the transition from *p* to *q* is licensed by more
than one exponent, then there will be more than one rule,
one for each such exponent.

To describe the creation of production rules and attributes precisely,
it will be useful to define the functions
*fp* (for ‘final productions’),
*sp* (for ‘start productions’),
and *mp*
(‘medial productions’, i.e. any production which
is neither a start production nor a final production) on regular expressions.
So
*sp*(*E*) denotes the start productions,
*mp*(*E*) denotes the medial productions, and
*fp*(*E*) the final productions, in the grammar *G*(*E*) generated from *E*.

We also need some simple operations on individual productions:
for some production rule *p*, we have

*guard*(*p*) → the guard (semantic conditions) of*p**lhs*(*p*) → the non-terminal symbol on the left-hand side of*p*, including any attributes it carries*rhs*(*p*) → the right-hand side of*p*without any guard*atts*(*p*) → the attributes on the left-hand side symbol of*p*. Since individual attributes are passed as position parameters and have no names, we write`att_1`to denote the first,`att_2`the second, and so on. In the description of the algorithm, the expression*att_n*denotes the last attribute in the series; it is solely a short-hand for`att_1`or`att_2`, etc.*add_attribute*(*p*) → production*p*augmented with an additional attribute. If*p*is an initial production, then the new attribute appears only on the right-hand side, with the value 1. If*p*is a final production, then the new attribute appears only on the left-hand side, in the form of an*att_n*variable. Otherwise, the new attribute appears both on the left and the right hand side, in both places taking the form of an*att_n*variable; the production transfers the value without change from the non-terminal on the left to that on the right. Thus, for example, if*p*=`B(att_1) → {att_1 < 4} b B(incr(att_1))`, then*add_attribute*(*p*) =`B(att_1, att_2) → {att_1 < 4} b B(incr(att_1), att_2)`. In cases where it is not known how many attributes are already attached to the non-terminals in*p*, expressions in this paper may write*att_n*to refer to the newly added attribute.

A couple of utility functions are also worth defining:

*add_attribute*(*lhs*(*p*)) → the left-hand side of*p*, augmented with an additional attribute whose value is not specified. Thus, for example, if*p*=`B(att_1) → {att_1 < 4} b B(incr(att_1))`, then*add_attribute*(*lhs*(*p*)) =`B(att_1, att_2)`. In this paper, the expression*att_n*refers to the newly added attribute regardless of its actual name.*add_attribute*(*att_n*,*rhs*(*p*)) → the right-hand side of*p*, augmented with an additional attribute. In this paper,`att_n`refers to the newly added attribute; in the operation of the function, the notation*add_n*will be resolved to an appropriate form. If*p*=`B(att_1) → {att_1 < 4} b B(incr(att_1))`, then*add_attribute*(*att_n*,*rhs*(*p*)) =`b B(incr(att_1), att_2)`.*reset*(*rhs*(*p*)) → the right-hand side of*p*, with all attributes on the non-terminal reset to 1. This function can be composed with*add_attribute*, in which case the result is to add a new attribute and reset all existing ones. If*p*=`B(att_1) → {att_1 < 4} b B(att_1 + 1)`, then*reset*(*rhs*(*p*)) =`b B(1)`, and*add_attribute*(*incr*(*att_n*),*reset*(*rhs*(*p*))) =`b B(1, incr(att_2))`.

We can describe the creation of a CRAG from a regular expression
inductively:[6]

- If
*E*= ∅,*G*(*E*) has no production rules and no attributes. - If
*E*= ε,*G*(*E*) has the production rule`S → ε`; there are no attributes. - If
*E*=*x*∈ Σ, then let*p*be the single position in*pos*(*E*). Then:*sp*(*E*) = {`S →`*x**nt*(*p*)},*fp*(*E*) = {*nt*(*p*) → ε},*mp*(*E*) = ∅. - If
*E*=*F*,*G*, then*sp*(*E*) = if*nullable*(*F*) then*sp*(*F*) ∪*sp*(*G*), else*sp*(*F*)*fp*(*E*) = if*nullable*(*G*) then*fp*(*F*) ∪*fp*(*G*), else*fp*(*G*)*mp*(*E*) =*mp*(*F*) ∪*mp*(*G*) ∪ {*lhs*(*p*) → {*guard*(*p*)}*rhs*(*q*) |*p*∈*fp*(*F*) ∧*q*∈*sp*(*G*)} - If
*E*=*F*|*G*, then*sp*(*E*) =*sp*(*F*) ∪*sp*(*G*)*fp*(*E*) =*fp*(*F*) ∪*fp*(*G*)*mp*(*E*) =*mp*(*F*) ∪*mp*(*G*) - If
*E*=*F*{*n*,*m*}, for integer*m*, thenThe values of counter*sp*(*E*) = {*add_attribute*(*p*) |*p*∈*sp*(*F*)}*fp*(*E*) = {*add_attribute*(*lhs*(*p*)) → {*att_n*≥*n*∧*guard*(*p*)} ε |*p*∈*fp*(*F*)}*mp*(*E*) = {*add_attribute*(*p*) |*p*∈*mp*(*F*)} ∪ {*add_attribute*(*lhs*(*p*)) → {*att_n*<*m*∧*guard*(*p*)}*add_attribute*(incr(*att_n*),*reset*(*rhs*(*q*))) |*p*∈*fp*(*F*) ∧*q*∈*sp*(*F*)}*att_n*run from 1 to*m*. For all values*v*<*m*,*incr*(*v*) =*v*+ 1. For*v*=*m*,*incr*(*v*) is undefined. - If
*E*=*F*{*n*,INF}, thenThe values of counter*sp*(*E*) = {*add_attribute*(*p*) |*p*∈*sp*(*F*)}*fp*(*E*) = {*add_attribute*(*lhs*(*p*)) → {*att_n*≥*n*∧*guard*(*p*)} ε |*p*∈*fp*(*F*)}*mp*(*E*) = {*add_attribute*(*p*) |*p*∈*mp*(*F*)} ∪ {*add_attribute*(*lhs*(*p*)) → {*guard*(*p*)}*add_attribute*(incr(*att_n*),*reset*(*rhs*(*q*))) |*p*∈*fp*(*F*) ∧*q*∈*sp*(*F*)}*att_n*run from 1 to*n*+ 1. For all values*v*≤*n*,*incr*(*v*) =*v*+ 1. For*v*=*n*+ 1,*incr*(*v*) =*n*+ 1. Note that for this case, unlike the previous, no guard of the form “*att_n*<*m*” is added to the productions in*mp*(*E*).

The examples given earlier may serve as illustrations of this process.

Note that although the description given above describes attributes as having
type integer, in practice all attributes range over finite subranges of
integer. The number of possible attribute values and their combinations
is thus finite and the attributes can be ‘compiled out’ of
the grammar to create a (much larger) context-free grammar without
attributes. Thus, any results which apply to extended affix or attribute
grammars over finite lattices apply to CRAGs.

CFAs can be defined in direct parallel with the CRAGs just described.
In augmenting finite state automata with counters, we need to add the
following things to the usual account of FSAs:

- Counters. Each counter reflects an integer exponent in the source regular expression, and is ‘visible to’ each state corresponding to a position in the subexpression dominated by the exponent.
- Guards. An arc may have zero or more guards of the form
`counter X < k`or`counter X ≥ k`for some counter*X*and some constant*k*. I believe no two guards on the same arc will ever concern the same counter. - Increments. An arc may increment exactly one counter.
- Clears. An arc may clear one or more counters. Informally, whenever we traverse an arc and increment an outer exponent, in the same arc we clear each counter associated with an inner exponent.
- Conditions on accept states. If and only if the outermost expression
is an integer exponent, then the accept states are the states in
*last(E)*on condition that*the outermost counter is greater than or equal to the minOccurs in the outermost exponent*.

More formally, a CFA is
[7]
a tuple (*M*, *C*, *C**C*, δ, *F*), where:

*M*is a classical FSA, i.e. a tuple (*Q*, Σ, δ,*F*).*C*is a set of counters*C*_{1},*C*_{2}, ...*C*_{n}, for*n*≥ 0.*m*is a function*C*→ integer, where for any counter*C*_{i},*m*(*C*_{i}) is the maximum allowed value of*C*_{i}. As with the grammars given above, if the integer-range exponent is specified as`{n,INF}`, the range of the counter is 1 ..*n*+ 1.*C**C*(counter-conditions) is the set of tuples (*c*_{1},*c*_{2}, ...,*c*_{n}), for*n*= |*C*| (the number of counters in the CFA) ≥ 0, where for all*i*(0 <*i*≤*n*), 0 ≤*c*_{i}≤*m*(*C*_{i}). In illustrations, we may set*c*_{i}to 0 when it is not ‘relevant’ (e.g. the subexpression dominated by the corresponding exponent has not been entered or has been left).*GA*is a set of*guards*and*actions*associated with the elements of the transition function of*M*Individual arcs (members of δ_{M}) may be written in the form “*p*→ {*G*}*x*{*A*}*q*”, where*p*and*q*∈*Q*,*x*∈ Σ,*G*is the*guard*, a description of the members of*C**C*for which this particular arc applies, and*A*is the*action*, a description of how to modify the current member of*C**C*(and thus implicitly description of the members of*C**C*in the image). When the current counter state is a member of the set described by*G*, we say that the guard has been satisfied.*F*is a function from*Q*_{M}× 2^{CC}onto Boolean. If for a particular state*q*and a particular counter condition*c**c*we have*F*(*q*,*c**c*) =`true`, then for counter condition*c**c*, state*q*is an accept state. [This needs to be recast in terms of the final states of*M*rather than in terms of all states of*M*.]

A CFA matches an input string if there is a path from the
start state to a final state such that the guards on the transition
arcs are satisfied at the time the arcs are traversed, and the
final state is a final

We can define the CFA *C*(*E*), the *extended Glushkov automaton* of
regular expression *E*, as follows:

*Q*is*pos*(*E*) ∪ {*q*_{I}}. N.B.*q*_{I}∉*pos*(*E*)- Σ is the input alphabet of
*E*. *C*is a set of counters corresponding 1:1 with the integer-range exponents of*E*- For each integer-range exponent {
*p*,*q*} in*E*,*m*maps from the corresponding*C*_{i}to*q*if*q*is an integer; if*q*=`INF`, then*m*maps*C*_{i}to*p*+ 1. *C**C*(counter-conditions) is the Cartesian product of the possible states of the counters in*C*.- δ (the transition function) is a mapping constructed in the same way that production rules were constructed when building CRAGs.
*q*_{I}is the start state.*F*(*q*,*c**c*) =`true`if and only if in the CRAG*G*(*E*) there is a final production of the form`q → {`*c**c*`} ε`.

A CFA *M* is *weakly deterministic* if for every
state *p*, counter-condition *c**c*, and symbol *x*, there
is only one possible next state. I.e. for any
two arcs in δ of *M*,
*p* `→ {`*G*_{1}`}` *x* `{`*A*_{1}`}` *q*_{1}
and
*p* `→ {`*G*_{2}`}` *x* `{`*A*_{2}`}` *q*_{2},
if any *c**c* satisfies both *G*_{1} and *G*_{2}, then
*q*_{1} = *q*_{2}.

A CFA *M* is *strongly deterministic* if for every
state *p*, counter-condition *c**c*, and symbol *x*, there
is only one arc in δ. I.e. δ is a function.

Conjecture: the extended Glushkov automaton of *E* is
weakly / strongly deterministic if and only if *E* is
weakly / strongly deterministic.

Note that the extended Glushkov automaton for *E* has size
proportional to that of *E*. Construction using a
straightforward algorithm will take cubic time.

Conjecture: if it is possible to extend Brüggemann-Klein's notion of
star normal form to REIREs, then it should be possible to
construct the extended Glushkov automaton in quadratic time.

A content model obeys the Unique Particle Attribution constraint if
and only if its extended Glushkov automaton is weakly
deterministic.

An instance element *e* satisfies a content model *E* if (a) the
string of generic identifiers of the children of *E* is a word in *L*(*E*)
and (b) the other conditions imposed by the XML Schema specification
(consistency of type assignment, etc.) are met. Condition (a) can be
checked by constructing the extended Glushkov automaton of *E* and
checking whether the automaton recognizes the string.

Conjecture: for strongly deterministic CFAs, instance validity
checking is linear in the size of the input.

[It will be convenient if we can provide a simple algorithm for
making a CFA deterministic. No low-cost algorithm is currently known,
although the author has some hopes.]

Since a CFA has only a finite set of counters, each with a
finite range, any CFA can be transformed into a conventional
FSA and determinized, at cost proportional, for each subexpression,
to the product of the maxima in all of the integer-range exponents
which dominate the subexpression.

If an expression is weakly deterministic, then any strong
non-determinism comes from nested exponents. Note that in some cases,
it is possible to rewrite the expression `(E{n,m}){p,q}` as
*E*`{n*p, m*q}`, which is strongly deterministic.

[It is certainly possible to check the subset relation
between two CFAs; in the general case, this may have
cost proportional to the product of the maxima of all
nested exponents. A full description of the algorithm
remains to be worked out.]

For comparison, it may be useful to give an account of
the standard method of constructing the Glushkov automaton.

The following account of the algorithm follows that given by
[Brüggemann-Klein 1993].

The basic idea is to construct an FSA with one state for each
symbol (∈ Σ) in the regular expression. In order to
simplify the discussion, we first ‘mark’ the
regular expression by subscripting each symbol with a number. The
first symbol in the regular expression gets the subscript 1, the
second the subscript 2, etc.[8] The expression `(a | b)* a a b` thus
becomes `(a1 | b2)* a3 a4 b5`. We refer to these
subscripted symbols as *positions*. For any
particular position *p* we denote the symbol
represented by *p* as *chi*(*p*).

For any expression *E*, we can calculate
three functions:

- first
*regular expression*→*set of positions*: positions which match the first symbol of some word in*L*(*E*)- last
*regular expression*→*set of positions*: positions which match the last symbol of some word in*L*(*E*)- follow
*regular expression*×*position*→*set of positions*: the set of positions which can follow position*x*in a path through*E*

These functions are defined as follows for
regular expressions *E*, *F*, *G*
over alphabet Σ, and for any symbol *a*
∈ Σ:

- when
*E*=*ε*:*first*(*E*) =*last*(*E*) = ∅*E*has no positions, so it is not in the domain of*follow*

- when
*E*=*∅*:*first*(*E*) =*last*(*E*) = ∅*E*has no positions, so it is not in the domain of*follow*

- when
*E*=*a*:*first*(*E*) =*last*(*E*) = {*a*}*follow*(*E*,*p*) = ∅

- when
*E*=*F*,*G*:*first*(*E*) = if ε ∈*L*(*F*) then (*first*(*F*) ∪*first*(*G*)) else*first*(*F*)*last*(*E*) = if ε ∈*L*(*G*) then (*last*(*F*) ∪*last*(*G*)) else*last*(*G*)*follow*(*E*,*p*) =if*p*∈ (*pos*(*F*) \*last*(*F*)) then*follow*(*F*,*p*),else if*p*∈*last*(*F*) then (*follow*(*F*,*p*) ∪*first*(*G*)),else if*p*∈*pos*(*G*) then*follow*(*G*,*p*)

- when
*E*=*F*|*G*:*first*(*E*) =*first*(*F*) ∪*first*(*G*)*last*(*E*) =*last*(*F*) ∪*last*(*G*)*follow*(*E*,*p*) = if*p*∈*pos*(*F*) then*follow*(*F*,*p*), else if*p*∈*pos*(*G*) then*follow*(*G*,*p*)

- when
*E*=*F**:*first*(*E*) =*first*(*F*)*last*(*E*) =*last*(*F*)*follow*(*E*,*p*) = if*p*∈ (*pos*(*F*) \*last*(*F*)) then*follow*(*F*,*p*), else if*p*∈*last*(*F*) then (*follow*(*F*,*p*) ∪*first*(*F*))

The Glushkov automaton of *E*, written *M*_{E}, is the
tuple (*Q*_{E} ∪ {*q*_{I}}, Σ, δ_{E}, *q*_{I},
*F*_{E}), where:

*Q*_{E}=*pos*(*E*), i.e. the set of states of*M*_{E}is the set of positions of*E*, to which is added a new initial state*q*_{I}.- Σ is the input alphabet.
- δ
_{E}(the transition function) is a mapping (not necessarily a function) from*Q*_{E}× Σ into*Q*, constructed thus:- For all
*x*∈ Σ, let the transition function δ_{E}(*q*_{I},*x*) = {*p*|*p*∈*first*(*E*) ∧*chi*(*p*) =*x*} - For all
*p*∈*pos*(*E*) and all*x*∈ Σ, let δ_{E}(*p*,*x*) = {*q*|*q*∈*follow*(*E*,*p*) ∧*chi*(*q*) =*x*}

- For all
*q*_{I}is the start state.*F*_{E}(the set of final states) = if ε ∈*L*(*E*) then (*last*(*E*) ∪*q*_{I}) else*last*(*E*).

For comparison, we should also include an account of Thompson's
method of constructing an automaton, augmented as needed.

- clean up definition of CFA. The old definition (now in the note)
won't work because if |
*C*| = 0, the transition function is empty (*C**C*is in its signature).

Additionally (these aren't sorted, and some of them have
been done; the list needs to be weeded):

- for comparison purposes (or where necessary to work things out
mentally), walk through classical regex (or, seq, and star only)
and conventional regex (classical with plus and opt) showing:
- definition of regex
- definition of fsa
- regex → fsa: description, algorithm, implementation
- regex → (regular) grammar: description, algorithm, implementation
- determinism and non-determinism
- determinizing fsa
- minimizing fsa
- validation
- UPA checking
- subset checking for fsa
- star normal form: definition, algorithm for translating into it, proof of equivalence

- define FSA with counters
- extend Glushkov algorithm to handle counters
- recast Glushkov algorithm to produce regular attribute grammar
- show that for classical automata the extended Glushkov algorithm produces the same automata as the one cited by ABK
- show how to do UPA checking with FSAC
- show how to do validation using FSAC
- show how to do subsumption checking using FSAC

Although I have tried to explain the notation used in context, it may be
useful to list the conventions used for symbols, identifiers, etc.*a*, *b*, *c*:
symbols in Σ, the input alphabet of an FSA, CFA, CRAG, etc.
*c*:
a counter in a CRAG.
**chi(***N*): the unique symbol ∈ Σ associated with
the non-terminal *N*
**CRAG**: a counting regular attribute grammar: an attribute
grammar whose underlying CFG is a right-linear grammar (possibly modified by
the use of final productions of the form `Q → ε`
instead of `Q → a`), whose semantic domain is restricted to
the natural numbers, whose operations are restricted to initializing
attributes to zero, incrementing them, and testing them against integer
constants using the < and ≥ operators.
*E*, *F*: (syntactic variables representing) regular expressions
*G*: in regular expression contexts, a (syntactic
variable representing a) regular expression
(usually a constituent part of *E*); in CRAG contexts, a (syntactic
variable representing a) guard (i.e. a
Boolean expression which must evaluate to true on every production used
in a derivation)
*G*(*E*): the CRAG constructed from regular expression *E*
*i*, *j*: integer (natural-number) indexes to
distinguish expressions
*k*: an integer constant
*n*, *m*, *p*, *q*: in exponents, these denote
integer constants; the second exponent may be a natural number or the literal
`INF`.
*p*, *q*: in discussion of a regular expression *E*, positions
of *E*
*G*_{1}, *G*_{2}: grammars (usually CRAGs)
*L*(*E*):
for any regular expression *E*, the language denoted (and: accepted) by *E*
*N*: a non-terminal in a CRAG
*Q*: a non-terminal in a CRAG
*R*: a non-terminal in a CRAG
*S*: the start symbol in a grammar; also the initial state of
an FSA
*x*: (syntactic variable representing)
a symbol in Σ

Alblas, Henk. 1991.
“Introduction to attribute grammars”.
*Attribute grammars, applications and systems:
International Summer School SAGA,
Prague, Czechoslovakia, June 4-13, 1991, Proceedings*,
pp. 1-15.
Berlin: Springer, 1991.
Lecture Notes in Computer Science, 545.

Brüggemann-Klein, Anne.
1993.
“Regular expressions into finite automata.”
*Theoretical Computer Science*
120.2 (1993): 197-213.

[1] Equivalently, regular expressions may
be described as those strings recognized by the following
grammar; terminal symbols are quoted or spelled in all upper
case:

regular_expression ::= '∅' | 'ε' | INPUT_SYMBOL | regular_expression ',' regular_expression | regular_expression '|' regular_expression | regular_expression quantifier | '(' regular_expression ')' quantifier ::= '{' lowerbound ',' upperbound '}' lowerbound ::= INTEGER upperbound ::= INTEGER | 'INF'

[2] I here follow the treatment in Brüggemann-Klein 1993,
which is worth reading in its own right.

[3] Or
alternatively if and only if every string in Σ* defines at most one path
through the expression; this is a stronger constraint, since it
also requires uniqueness among strings which are not prefixes
of words in *L*(*E*); it is not clear to me whether the difference
matters.

[4] The reader struggling to relate the grammar given in the
text to the definition of attribute grammars offered in [Alblas 1991]
may wish to inspect this alternate version. I have used underscore
followed by an integer where Alblas uses subscripts, and following
Alblas I have put the semantic conditions associated with each
production into an *if* statement and given an error
message for the cases where the semantic conditions evaluate to false.

nonterminals: S, A. terminals: a. start symbol: S. semantic domain: type Tcounter = integer. operator incr. description of attributes: count: Tcounter, inh of A. productions and attribute evaluation rules: (1) S → a A. [count of A := 1] (2) A_0 → a A_1. [count of A_1 := incr count of A_1; if not (count of A_0 < 4) then error message "At most four 'a' symbols are allowed." fi] (3) A → ε. [if not (count of A ≥ 2) then error message "At least two 'a' symbols are required." fi]

[5] If
the reader is troubled by allowing objects
other than identifiers to be non-terminal symbols in a context-free
grammar, no particular harm is done by imagining an arbitrary set of
identifiers (disjoint from *V*_{T}) placed into a 1:1 mapping with the
positions in *E*, with *nt*(*p*) returning the appropriate
identifier. This is indeed what has been done in most examples
in this paper.

[6] This exposition owes a great deal
to [Brüggemann-Klein 1993]'s exposition of the algorithm
for creating the Glushkov automaton of a conventional regular
expression; it is little more than a copy of that algorithm
with suitable additions.

[7]
Alternatively, a CFA might be defined as a tuple
(*Q*, Σ, *C*, *C**C*, δ, *q*_{I},
*F*), where:

*Q*is a set of state families, each corresponding to a state in a conventional FSA. But note that the relevant condition of the CFA is not given wholly by the current state family, but by the current state family and the current counter values. Each state family in the CFA thus corresponds to a possibly large set of states in the corresponding classical FSA.- Σ is the input alphabet.
*C*is a set of counters*C*_{1},*C*_{2}, ...*C*_{n}, for*n*≥ 0.*m*is a function*C*→ integer, where for any counter*C*_{i},*m*(*C*_{i}) is the maximum allowed value of*C*_{i}. As with the grammars given above, if the integer-range exponent is specified as`{n,INF}`, the range of the counter is 1 ..*n*+ 1.*C**C*(counter-conditions) is the set of tuples (*c*_{1},*c*_{2}, ...,*c*_{n}), for*n*= |*C*| (the number of counters in the CFA) ≥ 0, where for all*i*(0 <*i*≤*n*), 0 ≤*c*_{i}≤*m*(*C*_{i}). In illustrations, we may set*c*_{i}to 0 when it is not ‘relevant’ (e.g. the subexpression dominated by the corresponding exponent has not been entered or has been left).- δ (the transition function) is a mapping (not necessarily a function)
from
*Q*× 2^{CC}× Σ into*Q*× 2^{CC}. We will write individual arcs (members of δ) in the form “*p*→ {*G*}*x*{*A*}*q*”, where*p*and*q*∈*Q*,*x*∈ Σ,*G*is the*guard*, a description of the members of*C**C*for which this particular arc applies, and*A*is the*action*, a description of how to modify the current member of*C**C*(and thus implicitly description of the members of*C**C*in the image). When the current counter state is a member of the set described by*G*, we say that the guard has been satisfied. *q*_{I}is the start state.*F*is a function from*Q*× 2^{CC}onto Boolean. If for a particular state*q*and a particular counter condition*c**c*we have*F*(*q*,*c**c*) =`true`, then for counter condition*c**c*, state*q*is an accept state.

It may be seen that when |*C*| = 0, this definition is equivalent to
the standard definition of a finite state automaton.

[8] Brüggemann-Klein numbers each
symbol so as to distinguish it from other occurrences of the same
symbol: the first *a* becomes *a*_{1}, the second *a*_{2},
etc. I find it simpler just to number the symbols right through the
expression.