Notes on finite state automata with counters

C. M. Sperberg-McQueen

10-20 May 2004

Periodic revisions; last touched 19 October 2004

$Id: msm-cfa.html,v 1.5 2004/10/19 22:23:00 cmsmcq Exp $



The XML Schema 1.0 specification defines content models which are similar to regular expressions, but augment standard regular expressions in various ways. Integer-range exponents are allowed in addition to *, +, and ? (or in the more austere regular expressions found in some work, * alone). Substitution groups affect the language accepted in practice. Various forms of wildcard are defined. And each symbol in a content model may be associated with type and other information.
In its current form, this paper attempts to extend the basic facts about regular expressions and finite state automata to deal with regular expressions extended with integer-range exponents; the basic proposal is to extend finite-state automata with counters, and once that proposal has been outlined, the task of the paper will be to show how various tasks may be accomplished using such counter-extended finite-state automata (CFAs).
Although the idea of augmenting a finite-state automaton with one or more counters feels like an obvious one, it has proven harder than expected to find information on the network concerning the underlying theory. These notes are an attempt to work out some ideas in the absence of much guidance from the literature.

1. Regular expressions with integer-range exponents (REIREs)

1.1. Definition

The regular expressions we are dealing with use a slightly unusual notation, although they denote the same set of languages as more conventional regular expressions. We will work with regular expressions which have numeric exponents.
We say that a regular expression denotes a language (i.e. a set of strings) over some input alphabet Σ; we can refer to the language denoted by a regular expression E as L(E).
The regular expression itself is a string which may contain symbols from Σ in addition to the digits 0 through 9 and the meta-symbols “”, “ε”, “|”, “,”, “{”, “INF”, “}”, “(”, and “)”.
As strings, regular expressions are defined by the following rules:[1]
  • The string “” is a regular expression denoting the empty set.
  • The string “ε” is a regular expression denoting {ε}, the set containing the empty string.
  • For any symbol x ∈ Σ (where Σ is the input alphabet we are interested in), x is a regular expression denoting {“x”}, i.e. the set whose sole member is the string consisting of the symbol x.
  • If F and G are regular expressions, then “F, G” is a regular expression, denoting {“xy” | xL(F) ∧ yL(G)}, the set of strings made by concatenating any member of the language defined by F with any member of the language defined by G.
  • If F and G are regular expressions, then “F|G” is a regular expression, denoting {“x” | xFxG}, the union of L(F) and L(G).
  • If F is a regular expression, n is a non-negative integer, m is either a non-negative integer or the symbol INF, and nm, then “F{n,m}” is a regular expression, denoting {“x1x2...xk” | (0 ≤ nkm) ∧ for 1 ≤ ik, xiF}, the concatenation of at least n and at most m strings each of which is a member of F. The expression “{n,m}” is an integer-range exponent which is said to dominate expression F.
  • If F is a regular expression, then “(F)” is a regular expression which denotes the same language as is denoted by F.
(Note on typography: in the text, references to a specific symbol a will take the form a, while a use of that symbol in a regular expression will take the form a; these are the same symbol for purposes of this paper, regardless of any differences of typographic style or glyph. Syntactic variables (e.g. the x, F, and G in the rules above) are rendered as x, F, G in all cases. In the rules above, I have enclosed strings, regular expressions, string schemata, and regular-expression schemata in quotation marks in the interests of making a clearer separation between the meta-level discussion and the regular expressions themselves. In the following discussion, the quotation marks will normally be dispensed with.)
Examples (assuming that Σ = {a, b, c}):
Regular expression Denotation
the empty set (the language with no members)
ε {ε} = the set containing just the empty string
a {“a”}
b {“b”}
a,b {“ab”}
(a,b) {“ab”}
(a,b,a)|(b,b,b) {“aba”, “bbb”}
(a{3,4},b{0,1}){1,2} {“aaa”, “aaaa”, “aaab”, “aaaab”, “aaaaaa”, “aaaaaaa”, “aaaaaab”, “aaabaaa”, “aaaaaaaa”, “aaaaaaab”, “aaaabaaa”, “aaabaaaa”, “aaabaaab”, “aaaaaaaab”, “aaaabaaaa”, “aaaabaaab”, “aaabaaaab”, “aaaabaaaab”}, i.e. one or two occurrences of strings matching the inner expression (three or four occurrences of a followed optionally by a b)
a{0,10}{0,10} strings of length zero to 100, consisting exclusively of the symbol a
Note that the repetition operators of conventional regular expressions can be reduced to integer-range exponents as defined here. For any regular expression E, E*E{0,INF}, E+E{1,INF}, E?E{0,1}.
The same symbol may occur more than once in a regular expression; when it is necessary to distinguish the different occurrences of the same symbol, we speak of the positions of the expression. The expression (a{2,3}, b{1,4}, a{0,INF}){3,4} has three positions: the first a, the b, and the second a. We may write a_1, b_2, a_3 to distinguish the positions of the expression.
If an integer-range exponent dominates an expression E, then it also dominates each position p in E.

1.2. Paths, ambiguity, and determinism

Note that as is usual for regular expressions, each string matching a particular regular expression E can be thought of as mapping out a path through the regular expression: each symbol in the string can be associated with a particular position in the regular expression. Thus the string “abba” matches the regular expression a(a|b)*a and corresponds to the path visiting a_1, then b_3 twice, then a_4.
For an unambiguous regular expression E, each word in L(E) corresponds to exactly one path through E. If any word in L(E) corresponds to more than one path through E, then E is weakly ambiguous.[2] Brüggemann-Klein distinguishes this weak ambiguity from a strong ambiguity attributed to Sippu and Soisalon-Soininen. Informally, a weakly unambiguous regular expression allows semantic actions to be associated with the symbols of the expression, while a strongly unambiguous expression allows semantic actions to be associated unambiguously not only with the symbols but also with the repetition operators.
The expression (a{1,2},b{0,1}){1,2} is weakly unambiguous because any a in any word in the language can only match the single a in the expression, and similarly for each b in any word in the language. It is strongly ambiguous, by contrast, because the word aa can be taken either as a single occurrence of the expression (a{1,2},b{0,1}), thus firing the outermost exponent once, or as two occurrences of the inner expression, thus firing the outer exponent twice.
By analogy with finite state automata, we can distinguish the notions of ambiguity and determinism, which are in a subset/superset relation: all unambiguous automata (and expressions) are deterministic, but not vice versa. An expression E is deterministic if and only if for every prefix of any word ∈ L(E), the prefix determines a unique path through E.[3] Again, we can distinguish weak determinism (there is a unique sequence of symbols) and strong determinism (there is a unique sequence of symbols and exponent firings).
The Unique Particle Attribute constraint of XML Schema effectively requires that content models be weakly deterministic. Given the complications which arise from strong nondeterminism, it might be desirable to consider requiring that they be strongly deterministic as well. An algorithm for testing weak determinism is given below.

1.3. Functions on REIREs

Following [Brüggemann-Klein 1993] we prepare the way for defining the algorithm to map REIREs into counter-extended FSAs by defining several functions on regular expressions:
  • pos(E) → the positions of E
  • first(E) → the positions of E which come first in some word ∈ L(E).
  • last(E) → the positions of E which come last in some word ∈ L(E).
  • nullable(E) → Boolean indicating whether ε ∈ L(E)
  • follow(E, p) → the positions of E which can come after p in some word ∈ L(E)
The inductive definitions which follow match those given by [Brüggemann-Klein 1993] but extend them to cover integer-range exponents. The functions first(E), last(E), and nullable(E) are:
  • If E = ∅, then first(E) = last(E) = ∅, nullable(E) = false.
  • If E = ε, then first(E) = last(E) = ∅, nullable(E) = true.
  • If E = a ∈ Σ, then first(E) = last(E) = {a}, nullable(E) = false.
  • If E = F, G, then
    first(E) =
    if nullable(F) then first(F) ∪ first(G)
    else first(F)
    last(E) =
    if nullable(G) then last(F) ∪ last(G)
    else last(G)
    nullable(E) = nullable(F) ∧ nullable(G)
  • If E = F|G, then
    first(E) = first(F) ∪ first(G)
    last(E) = last(F) ∪ last(G)
    nullable(E) = nullable(F) ∨ nullable(G)
  • If E = F{n,m}, then first(E) = first(F), last(E) = last(F), nullable(E) = ((n = 0) ∨ nullable(F))
The definition of follow(E,p) is now straightforward.
For completeness, we stipulate that for all regular expressions E and all positions p, if ppos(E), then follow(E,p) = ∅. Otherwise, for all ppos(E):
  • If E = ∅, then E has no positions and follow(E,p) = ∅.
  • If E = ε, then E has no positions and follow(E,p) = ∅.
  • If E = a ∈ Σ, then follow(E,p) = ∅.
  • If E = F, G, then follow(E,p) =
    if p ∈ (pos(F) \ last(F)), then follow(F,p)
    if plast(F), then follow(F,p)
    if ppos(G), then follow(G,p)
    follow(F,p) ∪ follow(G,p).
  • If E = F|G, then follow(E,p) = follow(F,p) ∪ follow(G,p)
    first(E) = first(F) ∪ first(G)
    last(E) = last(F) ∪ last(G)
    nullable(E) = nullable(F) ∨ nullable(G)
  • If E = F{n,m}, then follow(E,p) =
    if p ∈ (pos(F) \ last(F)) then follow(F,p)
    else (plast(F)) follow(F,p) ∪ first(F)
    We say that the transition from plast(F) to qfirst(F) is licensed by the integer-range exponent {n,m}; in traversing the transition we say that the exponent fires.
Readers familiar with Brüggemann-Klein's work or with the construction of the Glushkov automaton for standard regular expressions should note that for the regular expressions we use here, last(E) and follow(E,p) include all of the positions which can come last, or follow p, in any word in L(E), even though in a path involving repeated visits to a particular position, the follow set of the position may vary on each visit, and it may be allowed to be the last position only on some visits, not all. For example, for E = a{4,5}, last(E) = {a_1}, even though that position can be the final one in a path only if three or four visits to that position precede the final visit. That is, the last positions of E become final states of the extended Glushkov automaton ME only when certain additional conditions are met: specifically, the counters need to be in the right position. Similarly, in the construction of the standard Glushkov automaton, the follow set of p turns into a set of arcs starting in p and leading to the members of the follow set. In the CFAs to be constructed below, the arcs will be guarded so that they can be traversed only when the counters have appropriate values.

2. Counting regular attribute grammars

Rather than embark directly on a description of how to translate regular expressions with integer-range exponents into FSAs with counters, I turn next to a description of one way to translate REIREs into grammars. Let us use the term counting grammar to denote a grammar which ‘counts’ things (i.e. a grammar which can distinguish words having n occurrences of some substring from those having n+1 occurrences, for some n > 0). Counting grammars can be written using conventional grammatical formalisms like BNF, but for large n, conventional methods tend to become cumbersome. Attribute grammars provide a simple way to write counting grammars concisely; the attribute grammars described below use regular grammars as their underlying context-free grammars, and the attributes used are very restricted in form; they may thus be termed counting regular attribute grammars (CRAGs).
Once CRAGs have been defined, the close relation between conventional regular grammars and finite-state automata will suggest a simple and direct way to extend finite-state automata with counters. In other words, we have regular grammar : FSA = CRAG : CFA.

2.1. A simple example

Before attempting a formal description of CRAGs and how to construct them, let us consider first some simple examples. First, consider the regular expression E = a{2,4}. A conventional regular grammar to recognize L(E) is
S0 → a S1.
S1 → a.
S1 → a S2.
S2 → a.
S2 → a S3.
S3 → a.
where S0 is the start symbol. This has an obvious similarity to the finite-state automaton
     a         a          a          a 
S0 -----> S1 -----> *S2 -----> *S3 -----> *S4
where S0 is the start state and accept states are marked with asterisks. Note that most arcs leading into accept states are represented in the grammar by two productions, one naming the (non-terminal corresponding to the) target state and one not naming it. The latter results in a sentential form without any non-terminals and thus terminates the grammatical derivation; the longer form is used if there are outgoing arcs from the target state; it allows the derivation to continue. It will prove convenient to adopt a different approach to final states, and so in what follows the grammars deviate slightly from the usual regular-grammar notation. In conventional right-linear grammars, every production has either the form Q → a R or the form Q → a, where Q and R are non-terminals and a is a terminal symbol. In the grammars below, the second form does not appear; every production has either the form Q → a R or the form Q → ε.
In this notation, the regular grammar for a{2,4} becomes:
S0 → a S1.
S1 → a S2.
S2 → a S3.
S2 → a S4.
S2 → ε.
S3 → ε.
S4 → ε.
There is now a 1:1 relation between the states of the FSA and the non-terminals of the grammar, and the final states of the FSA are exactly those states Q for which a production of the form Q → ε appears in the grammar. We refer to these as the final productions of the grammar, and to the non-terminals on their left-hand sides as its final non-terminals. By analogy, productions with the start symbol (here conventionally written S or S0) on the left-hand side will be referred to as start productions. The non-terminals on the right-hand sides of start productions are initial non-terminals.

2.2. Using an attribute grammar

The counting grammar can be simpler if instead of using distinct non-terminals to keep a tally of how many occurrences of a have been encountered so far, we substitute an attribute which carries that information. Augmented with such an attribute, and with appropriate guards (or semantic conditions) on productions, our simple example becomes
S → a A(1).
A(n) → {n < 4} a A(n + 1).
A(n) → {n > 1} ε.
The notation used is similar to that of Prolog definite-clause grammars passing inherited grammatical attributes as positional parameters. The expressions in braces are semantic conditions which must hold for the production to fire.[4]
At this point the core ideas of CRAGs may be formulated this way: at each point where the grammar needs a counter (in effect, one for each integer-range exponent in the regular expression), an attribute is used. Tests on productions may compare one or more attributes to constants; the comparisons are invariably of the form counter < constant or counterconstant. On the right hand side, counters passed to non-terminals may be either incremented or set to 1. There is a 1:1 mapping between the non-terminals of the CRAG and the states of the assocated CFA. When following the algorithm given below we generate the CRAG from a REIRE, there is also a 1:1 mapping between the non-terminals other than S and the symbol tokens of the REIRE. As a consequence of this last, each non-terminal N in a CRAG has an associated symbol, and whenever N appears in the right-hand side of a production rule, the terminal symbol in the rule is N's associated symbol. We write chi(N) for the associated symbol of non-terminal N.

2.3. Another example

It may be useful to work out a pair of more elaborate examples. Let us consider first (a{2,3}, b{1,4}){3,4}, and build up its grammar from the grammars of its sub-expressions.
The grammar for a{2,3} is similar to that given above:
S → a A(1).
A(n) → {n < 3} a A(n+1).
A(n) → {n ≥ 2} ε.
and that for b{1,4} is equally straightforward:
S → b B(1).
B(n) → {n < 4} b B(n+1).
B(n) → {n ≥ 1} ε.
It may be noted in passing that since counters are invariably initialized to 1, the guard on the final production is always true and may be omitted when doing so seems clearer.
In composing two grammars G1 and G2 using the sequence operator, we fuse the final productions of G1 with the start productions of G2. The left-hand side and guard are taken from the G1 production, and the right-hand side from the G2 productions. Note that since the start symbol never has any attributes, the right-hand side of a start production can never depend on an inherited attribute value; any attributes in the right-hand side of a start production will be set to 1. In our example, the result is:
S → a A(1).
A(n) → {n < 3} a A(n+1).
A(n) → {n ≥ 2} b B(1).
B(n) → {n < 4} b B(n+1).
B(n) → {n ≥ 1} ε.
This recognizes the language of a{2,3}, b{1,4}. The next step is to modify the grammar to handle the outermost exponent.
To handle the outermost exponents on an expression of the form E{n,m}, we need to add
  • a new counter attribute on all existing non-terminals (since the outermost expression doesn't add any new symbols from Σ, we don't get any new non-terminals), and
  • new productions for each pair (Q,R) with Q a final non-terminal of E and R an initial non-terminal of E; these will have the form Q(c, ...) → {c < m ∧ G} chi(R) R(incr(c), ...), where G is the guard on the final production for Q, if it has one, and otherwise true, and chi(R) is the symbol ∈ Σ associated with R.
  • a new guard on the final productions of E: if the final production has guard {G}, the new final production has guard {cnG} (where c is the new counter); if the final production has no guard, the new guard is just {cn}
In the case of a{2,3}, b{1,4}, the sole final non-terminal is B and the sole initial non-terminal is A, so there is just one new rule. The grammar for (a{2,3}, b{1,4}){3,4} is thus:
S → a A(1,1).
A(n,m) → {n < 3} a A(n+1,m).
A(n,m) → {n ≥ 2} b B(1,m).
B(n,m) → {n < 4} b B(n+1,m).
B(n,m) → {m < 4 ∧ n ≥ 1} a A(1, m+1).
B(n,m) → {m ≥ 3 ∧ n ≥ 1} ε.

2.4. Generation of CRAG from REIRE

For every regular expression with integer-range exponents E, there is a CRAG G(E) which accepts the same language and which will be described below.
Formally, an attribute grammar is a tuple consisting of a context free grammar, a semantic domain (types and operators), a set of attributes each associated with a non-terminal, a set of attribute evaluation rules, and a set of semantic conditions. By describing each of these in turn, we can show how to construct G(E) for any regular expression E.
  • The underlying context-free grammar has
    • VN (the non-terminal vocabulary) is the set of positions in E, plus an additional start symbol (here invariably written S). For clarity, when it is useful to stress the difference between a position p and the use of that position as an element of VN, the latter will be denoted nt(p).[5]
    • VT (the terminal vocabulary) is the set of symbols ∈ Σ actually used in E.
    • P (the set of production rules) has one or more rules for each position pair p, q where qfollow(E,p). The rules each take the form “P{G} a Q”, where G is an appropriate guard and a = chi(q). Details of the guard and the number of rules for a position pair are given below.
    • S is the start symbol.
  • The semantic domain of G(E) is the set of non-negative integers, over which the legal operations are incrementation by 1 and comparison against constants using ≥ and <.
  • Attributes are associated with non-terminals other than S according to rules given in more detail below. Informally, each attribute on a non-terminal represents a counter which keeps track of an integer-range exponent. Each non-terminal is associated with a position in E and has one attribute for each integer range exponent which dominates that position in E. The convention adopted here is to write the outermost exponent's attribute last in the sequence. All attributes in G(E) are inherited.
  • The attribute evaluation rules which assign values to the attributes of non-terminals in the right-hand side of a production are all implicit; they take the form of expressions written after the non-terminal in the manner of parameters to a subroutine call. The expressions all take one of the following forms: “1” (i.e. initialization to the constant 1), “incr(c)” (incrementing the value for some counter c inherited by the left hand side), or “c” (copying a value inherited by the left hand side).
  • The semantic conditions associated with a given production are Boolean expressions written within braces at the beginning of the production's right-hand side. All atomic semantic conditions in the CRAG are either of the form “c < k” or of the form “ck”, where c is the name of a counter and k an integer constant. Semantic conditions may be composed using Boolean operators; in this construction, only logical conjunction (∧) is used.
The description just given glosses over some questions about attributes and production rules; details follow.
In order to handle strongly ambiguous regular expressions, it is useful to allow our grammar to contain production rules which would be redundant if their semantic conditions (guards) and attribute evaluation rules were ignored. Informally, if qfollow(E,p), then we will have at least one rule with p on the left-hand side and q on the right. If the transition from p to q is licensed by more than one exponent, then there will be more than one rule, one for each such exponent.
To describe the creation of production rules and attributes precisely, it will be useful to define the functions fp (for ‘final productions’), sp (for ‘start productions’), and mp (‘medial productions’, i.e. any production which is neither a start production nor a final production) on regular expressions. So sp(E) denotes the start productions, mp(E) denotes the medial productions, and fp(E) the final productions, in the grammar G(E) generated from E.
We also need some simple operations on individual productions: for some production rule p, we have
  • guard(p) → the guard (semantic conditions) of p
  • lhs(p) → the non-terminal symbol on the left-hand side of p, including any attributes it carries
  • rhs(p) → the right-hand side of p without any guard
  • atts(p) → the attributes on the left-hand side symbol of p. Since individual attributes are passed as position parameters and have no names, we write att_1 to denote the first, att_2 the second, and so on. In the description of the algorithm, the expression att_n denotes the last attribute in the series; it is solely a short-hand for att_1 or att_2, etc.
  • add_attribute(p) → production p augmented with an additional attribute. If p is an initial production, then the new attribute appears only on the right-hand side, with the value 1. If p is a final production, then the new attribute appears only on the left-hand side, in the form of an att_n variable. Otherwise, the new attribute appears both on the left and the right hand side, in both places taking the form of an att_n variable; the production transfers the value without change from the non-terminal on the left to that on the right. Thus, for example, if p = B(att_1) → {att_1 < 4} b B(incr(att_1)), then add_attribute(p) = B(att_1, att_2) → {att_1 < 4} b B(incr(att_1), att_2). In cases where it is not known how many attributes are already attached to the non-terminals in p, expressions in this paper may write att_n to refer to the newly added attribute.
A couple of utility functions are also worth defining:
  • add_attribute(lhs(p)) → the left-hand side of p, augmented with an additional attribute whose value is not specified. Thus, for example, if p = B(att_1) → {att_1 < 4} b B(incr(att_1)), then add_attribute(lhs(p)) = B(att_1, att_2). In this paper, the expression att_n refers to the newly added attribute regardless of its actual name.
  • add_attribute(att_n, rhs(p)) → the right-hand side of p, augmented with an additional attribute. In this paper, att_n refers to the newly added attribute; in the operation of the function, the notation add_n will be resolved to an appropriate form. If p = B(att_1) → {att_1 < 4} b B(incr(att_1)), then add_attribute(att_n, rhs(p)) = b B(incr(att_1), att_2).
  • reset(rhs(p)) → the right-hand side of p, with all attributes on the non-terminal reset to 1. This function can be composed with add_attribute, in which case the result is to add a new attribute and reset all existing ones. If p = B(att_1) → {att_1 < 4} b B(att_1 + 1), then reset(rhs(p)) = b B(1), and add_attribute(incr(att_n), reset(rhs(p))) = b B(1, incr(att_2)).
We can describe the creation of a CRAG from a regular expression inductively:[6]
  • If E = ∅, G(E) has no production rules and no attributes.
  • If E = ε, G(E) has the production rule S → ε; there are no attributes.
  • If E = x ∈ Σ, then let p be the single position in pos(E). Then:
    sp(E) = {S → x nt(p)},
    fp(E) = {nt(p) → ε},
    mp(E) = ∅.
  • If E = F, G, then
    sp(E) = if nullable(F) then sp(F) ∪ sp(G), else sp(F)
    fp(E) = if nullable(G) then fp(F) ∪ fp(G), else fp(G)
    mp(E) = mp(F) ∪ mp(G) ∪ {lhs(p) → {guard(p)} rhs(q) | pfp(F) ∧ qsp(G)}
  • If E = F|G, then
    sp(E) = sp(F) ∪ sp(G)
    fp(E) = fp(F) ∪ fp(G)
    mp(E) = mp(F) ∪ mp(G)
  • If E = F{n,m}, for integer m, then
    sp(E) = {add_attribute(p) | psp(F)}
    fp(E) = {add_attribute(lhs(p)) → {att_nnguard(p)} ε | pfp(F)}
    mp(E) = {add_attribute(p) | pmp(F)} ∪ {add_attribute(lhs(p)) → {att_n < mguard(p)} add_attribute(incr(att_n), reset(rhs(q))) | pfp(F) ∧ qsp(F)}
    The values of counter att_n run from 1 to m. For all values v < m, incr(v) = v + 1. For v = m, incr(v) is undefined.
  • If E = F{n,INF}, then
    sp(E) = {add_attribute(p) | psp(F)}
    fp(E) = {add_attribute(lhs(p)) → {att_nnguard(p)} ε | pfp(F)}
    mp(E) = {add_attribute(p) | pmp(F)} ∪ {add_attribute(lhs(p)) → {guard(p)} add_attribute(incr(att_n), reset(rhs(q))) | pfp(F) ∧ qsp(F)}
    The values of counter att_n run from 1 to n + 1. For all values vn, incr(v) = v + 1. For v = n + 1, incr(v) = n + 1. Note that for this case, unlike the previous, no guard of the form “att_n < m” is added to the productions in mp(E).
The examples given earlier may serve as illustrations of this process.
Note that although the description given above describes attributes as having type integer, in practice all attributes range over finite subranges of integer. The number of possible attribute values and their combinations is thus finite and the attributes can be ‘compiled out’ of the grammar to create a (much larger) context-free grammar without attributes. Thus, any results which apply to extended affix or attribute grammars over finite lattices apply to CRAGs.

3. Counter-extended finite-state automata (CFAs)

3.1. Definition

CFAs can be defined in direct parallel with the CRAGs just described. In augmenting finite state automata with counters, we need to add the following things to the usual account of FSAs:
  • Counters. Each counter reflects an integer exponent in the source regular expression, and is ‘visible to’ each state corresponding to a position in the subexpression dominated by the exponent.
  • Guards. An arc may have zero or more guards of the form counter X < k or counter X ≥ k for some counter X and some constant k. I believe no two guards on the same arc will ever concern the same counter.
  • Increments. An arc may increment exactly one counter.
  • Clears. An arc may clear one or more counters. Informally, whenever we traverse an arc and increment an outer exponent, in the same arc we clear each counter associated with an inner exponent.
  • Conditions on accept states. If and only if the outermost expression is an integer exponent, then the accept states are the states in last(E) on condition that the outermost counter is greater than or equal to the minOccurs in the outermost exponent.
More formally, a CFA is [7] a tuple (M, C, CC, δ, F), where:
  • M is a classical FSA, i.e. a tuple (Q, Σ, δ, F).
  • C is a set of counters C1, C2, ... Cn, for n ≥ 0.
  • m is a function C → integer, where for any counter Ci, m(Ci) is the maximum allowed value of Ci. As with the grammars given above, if the integer-range exponent is specified as {n,INF}, the range of the counter is 1 .. n + 1.
  • CC (counter-conditions) is the set of tuples (c1, c2, ..., cn), for n = |C| (the number of counters in the CFA) ≥ 0, where for all i (0 < in), 0 ≤ cim(Ci). In illustrations, we may set ci to 0 when it is not ‘relevant’ (e.g. the subexpression dominated by the corresponding exponent has not been entered or has been left).
  • GA is a set of guards and actions associated with the elements of the transition function of M Individual arcs (members of δM) may be written in the form “p → {G} x {A} q”, where p and qQ, x ∈ Σ, G is the guard, a description of the members of CC for which this particular arc applies, and A is the action, a description of how to modify the current member of CC (and thus implicitly description of the members of CC in the image). When the current counter state is a member of the set described by G, we say that the guard has been satisfied.
  • F is a function from QM × 2CC onto Boolean. If for a particular state q and a particular counter condition cc we have F(q, cc) = true, then for counter condition cc, state q is an accept state. [This needs to be recast in terms of the final states of M rather than in terms of all states of M.]
A CFA matches an input string if there is a path from the start state to a final state such that the guards on the transition arcs are satisfied at the time the arcs are traversed, and the final state is a final
We can define the CFA C(E), the extended Glushkov automaton of regular expression E, as follows:
  • Q is pos(E) ∪ {qI}. N.B. qIpos(E)
  • Σ is the input alphabet of E.
  • C is a set of counters corresponding 1:1 with the integer-range exponents of E
  • For each integer-range exponent {p,q} in E, m maps from the corresponding Ci to q if q is an integer; if q = INF, then m maps Ci to p + 1.
  • CC (counter-conditions) is the Cartesian product of the possible states of the counters in C.
  • δ (the transition function) is a mapping constructed in the same way that production rules were constructed when building CRAGs.
  • qI is the start state.
  • F(q, cc) = true if and only if in the CRAG G(E) there is a final production of the form q → {cc} ε.

3.2. Properties of CFAs

A CFA M is weakly deterministic if for every state p, counter-condition cc, and symbol x, there is only one possible next state. I.e. for any two arcs in δ of M, p → {G1} x {A1} q1 and p → {G2} x {A2} q2, if any cc satisfies both G1 and G2, then q1 = q2.
A CFA M is strongly deterministic if for every state p, counter-condition cc, and symbol x, there is only one arc in δ. I.e. δ is a function.
Conjecture: the extended Glushkov automaton of E is weakly / strongly deterministic if and only if E is weakly / strongly deterministic.
Note that the extended Glushkov automaton for E has size proportional to that of E. Construction using a straightforward algorithm will take cubic time.
Conjecture: if it is possible to extend Brüggemann-Klein's notion of star normal form to REIREs, then it should be possible to construct the extended Glushkov automaton in quadratic time.

3.3. Checking the UPA constraint

A content model obeys the Unique Particle Attribution constraint if and only if its extended Glushkov automaton is weakly deterministic.

3.4. Checking instance validity

An instance element e satisfies a content model E if (a) the string of generic identifiers of the children of E is a word in L(E) and (b) the other conditions imposed by the XML Schema specification (consistency of type assignment, etc.) are met. Condition (a) can be checked by constructing the extended Glushkov automaton of E and checking whether the automaton recognizes the string.
Conjecture: for strongly deterministic CFAs, instance validity checking is linear in the size of the input.

3.5. Making a CFA deterministic

[It will be convenient if we can provide a simple algorithm for making a CFA deterministic. No low-cost algorithm is currently known, although the author has some hopes.]
Since a CFA has only a finite set of counters, each with a finite range, any CFA can be transformed into a conventional FSA and determinized, at cost proportional, for each subexpression, to the product of the maxima in all of the integer-range exponents which dominate the subexpression.
If an expression is weakly deterministic, then any strong non-determinism comes from nested exponents. Note that in some cases, it is possible to rewrite the expression (E{n,m}){p,q} as E{n*p, m*q}, which is strongly deterministic.

3.6. Subset checking

[It is certainly possible to check the subset relation between two CFAs; in the general case, this may have cost proportional to the product of the maxima of all nested exponents. A full description of the algorithm remains to be worked out.]

A. Standard construction of Glushkov automaton

For comparison, it may be useful to give an account of the standard method of constructing the Glushkov automaton.
The following account of the algorithm follows that given by [Brüggemann-Klein 1993].
The basic idea is to construct an FSA with one state for each symbol (∈ Σ) in the regular expression. In order to simplify the discussion, we first ‘mark’ the regular expression by subscripting each symbol with a number. The first symbol in the regular expression gets the subscript 1, the second the subscript 2, etc.[8] The expression (a | b)* a a b thus becomes (a1 | b2)* a3 a4 b5. We refer to these subscripted symbols as positions. For any particular position p we denote the symbol represented by p as chi(p).
For any expression E, we can calculate three functions:
first
regular expressionset of positions: positions which match the first symbol of some word in L(E)
last
regular expressionset of positions: positions which match the last symbol of some word in L(E)
follow
regular expression × positionset of positions: the set of positions which can follow position x in a path through E
These functions are defined as follows for regular expressions E, F, G over alphabet Σ, and for any symbol a ∈ Σ:
  • when E = ε:
    • first(E) = last(E) = ∅
    • E has no positions, so it is not in the domain of follow
  • when E = :
    • first(E) = last(E) = ∅
    • E has no positions, so it is not in the domain of follow
  • when E = a:
    • first(E) = last(E) = {a}
    • follow(E,p) = ∅
  • when E = F, G:
    • first(E) = if ε ∈ L(F) then (first(F) ∪ first(G)) else first(F)
    • last(E) = if ε ∈ L(G) then (last(F) ∪ last(G)) else last(G)
    • follow(E,p) =
      if p ∈ (pos(F) \ last(F)) then follow(F,p),
      else if plast(F) then (follow(F,p) ∪ first(G)),
      else if ppos(G) then follow(G,p)
  • when E = F | G:
    • first(E) = first(F) ∪ first(G)
    • last(E) = last(F) ∪ last(G)
    • follow(E,p) = if ppos(F) then follow(F,p), else if ppos(G) then follow(G,p)
  • when E = F*:
    • first(E) = first(F)
    • last(E) = last(F)
    • follow(E,p) = if p ∈ (pos(F) \ last(F)) then follow(F,p), else if plast(F) then (follow(F,p) ∪ first(F))
The Glushkov automaton of E, written ME, is the tuple (QE ∪ {qI}, Σ, δE, qI, FE), where:
  • QE = pos(E), i.e. the set of states of ME is the set of positions of E, to which is added a new initial state qI.
  • Σ is the input alphabet.
  • δE (the transition function) is a mapping (not necessarily a function) from QE × Σ into Q, constructed thus:
    • For all x ∈ Σ, let the transition function δE(qI, x) = {p | pfirst(E) ∧ chi(p) = x}
    • For all ppos(E) and all x ∈ Σ, let δE(p, x) = {q | qfollow(E,p) ∧ chi(q) = x}
  • qI is the start state.
  • FE (the set of final states) = if ε ∈ L(E) then (last(E) ∪ qI) else last(E).

B. Standard construction of Thompson automaton

For comparison, we should also include an account of Thompson's method of constructing an automaton, augmented as needed.

C. To do

  1. clean up definition of CFA. The old definition (now in the note) won't work because if |C| = 0, the transition function is empty (CC is in its signature).
Additionally (these aren't sorted, and some of them have been done; the list needs to be weeded):
  • for comparison purposes (or where necessary to work things out mentally), walk through classical regex (or, seq, and star only) and conventional regex (classical with plus and opt) showing:
    • definition of regex
    • definition of fsa
    • regex → fsa: description, algorithm, implementation
    • regex → (regular) grammar: description, algorithm, implementation
    • determinism and non-determinism
    • determinizing fsa
    • minimizing fsa
    • validation
    • UPA checking
    • subset checking for fsa
    • star normal form: definition, algorithm for translating into it, proof of equivalence
  • define FSA with counters
  • extend Glushkov algorithm to handle counters
  • recast Glushkov algorithm to produce regular attribute grammar
  • show that for classical automata the extended Glushkov algorithm produces the same automata as the one cited by ABK
  • show how to do UPA checking with FSAC
  • show how to do validation using FSAC
  • show how to do subsumption checking using FSAC

D. Notation and terminology

Although I have tried to explain the notation used in context, it may be useful to list the conventions used for symbols, identifiers, etc.
a, b, c: symbols in Σ, the input alphabet of an FSA, CFA, CRAG, etc.
c: a counter in a CRAG.
chi(N): the unique symbol ∈ Σ associated with the non-terminal N
CRAG: a counting regular attribute grammar: an attribute grammar whose underlying CFG is a right-linear grammar (possibly modified by the use of final productions of the form Q → ε instead of Q → a), whose semantic domain is restricted to the natural numbers, whose operations are restricted to initializing attributes to zero, incrementing them, and testing them against integer constants using the < and ≥ operators.
E, F: (syntactic variables representing) regular expressions
G: in regular expression contexts, a (syntactic variable representing a) regular expression (usually a constituent part of E); in CRAG contexts, a (syntactic variable representing a) guard (i.e. a Boolean expression which must evaluate to true on every production used in a derivation)
G(E): the CRAG constructed from regular expression E
i, j: integer (natural-number) indexes to distinguish expressions
k: an integer constant
n, m, p, q: in exponents, these denote integer constants; the second exponent may be a natural number or the literal INF.
p, q: in discussion of a regular expression E, positions of E
G1, G2: grammars (usually CRAGs)
L(E): for any regular expression E, the language denoted (and: accepted) by E
N: a non-terminal in a CRAG
Q: a non-terminal in a CRAG
R: a non-terminal in a CRAG
S: the start symbol in a grammar; also the initial state of an FSA
x: (syntactic variable representing) a symbol in Σ

E. References

Alblas, Henk. 1991. “Introduction to attribute grammars”. Attribute grammars, applications and systems: International Summer School SAGA, Prague, Czechoslovakia, June 4-13, 1991, Proceedings, pp. 1-15. Berlin: Springer, 1991. Lecture Notes in Computer Science, 545.

Brüggemann-Klein, Anne. 1993. “Regular expressions into finite automata.” Theoretical Computer Science 120.2 (1993): 197-213.


Notes

[1] Equivalently, regular expressions may be described as those strings recognized by the following grammar; terminal symbols are quoted or spelled in all upper case:
regular_expression ::= '∅'
                     | 'ε'
                     | INPUT_SYMBOL
                     | regular_expression ',' regular_expression
                     | regular_expression '|' regular_expression
                     | regular_expression quantifier
                     | '(' regular_expression ')'

quantifier ::= '{' lowerbound ',' upperbound '}'
lowerbound ::= INTEGER
upperbound ::= INTEGER | 'INF'
[2] I here follow the treatment in Brüggemann-Klein 1993, which is worth reading in its own right.
[3] Or alternatively if and only if every string in Σ* defines at most one path through the expression; this is a stronger constraint, since it also requires uniqueness among strings which are not prefixes of words in L(E); it is not clear to me whether the difference matters.
[4] The reader struggling to relate the grammar given in the text to the definition of attribute grammars offered in [Alblas 1991] may wish to inspect this alternate version. I have used underscore followed by an integer where Alblas uses subscripts, and following Alblas I have put the semantic conditions associated with each production into an if statement and given an error message for the cases where the semantic conditions evaluate to false.
nonterminals: S, A.
terminals: a.
start symbol: S.

semantic domain:
  type Tcounter = integer.
  operator incr.

description of attributes:
  count: Tcounter, inh of A.

productions and attribute evaluation rules:
(1)  S → a A.
     [count of A := 1]

(2)  A_0 → a A_1.
     [count of A_1 := incr count of A_1;
      if not (count of A_0 < 4) then
         error message "At most four 'a' symbols are allowed."
      fi]

(3)  A → ε.
     [if not (count of A ≥ 2) then
         error message "At least two 'a' symbols are required."
      fi]
[5] If the reader is troubled by allowing objects other than identifiers to be non-terminal symbols in a context-free grammar, no particular harm is done by imagining an arbitrary set of identifiers (disjoint from VT) placed into a 1:1 mapping with the positions in E, with nt(p) returning the appropriate identifier. This is indeed what has been done in most examples in this paper.
[6] This exposition owes a great deal to [Brüggemann-Klein 1993]'s exposition of the algorithm for creating the Glushkov automaton of a conventional regular expression; it is little more than a copy of that algorithm with suitable additions.
[7] Alternatively, a CFA might be defined as a tuple (Q, Σ, C, CC, δ, qI, F), where:
It may be seen that when |C| = 0, this definition is equivalent to the standard definition of a finite state automaton.
[8] Brüggemann-Klein numbers each symbol so as to distinguish it from other occurrences of the same symbol: the first a becomes a1, the second a2, etc. I find it simpler just to number the symbols right through the expression.