1.1. Definition
The regular expressions we are dealing with use a slightly
unusual notation, although they denote the same set of languages
as more conventional regular expressions.
We will work with regular expressions which have numeric exponents.
We say that a regular expression denotes a language (i.e. a set of
strings) over some input alphabet Σ; we can
refer to the language denoted by a regular expression E as
L(E).
The regular expression itself is a string which may contain
symbols from Σ in addition to the digits 0 through 9
and the meta-symbols
“∅”,
“ε”,
“|”,
“,”,
“{”,
“INF”,
“}”,
“(”, and
“)”.
As strings, regular expressions are defined by the following
rules:[
1]
- The string “∅” is a regular expression
denoting the empty set.
- The string “ε” is a regular expression
denoting {ε}, the set containing the empty string.
- For any symbol x ∈ Σ (where Σ is the input
alphabet we are interested in), x is a regular expression
denoting {“x”}, i.e. the set whose sole
member is the string consisting of the symbol x.
- If F and G are regular expressions, then “F,
G” is a regular expression, denoting
{“xy”
| x ∈ L(F) ∧ y ∈ L(G)},
the set of strings made by concatenating any member of the language
defined by F with any member of the
language defined by G.
- If F and G are regular expressions, then
“F|G” is a regular expression, denoting
{“x” | x ∈ F ∨ x
∈ G}, the union of L(F) and L(G).
- If F is a regular expression, n is a non-negative integer,
m is either a non-negative integer or the symbol INF,
and n ≤ m, then
“F{n,m}” is a
regular expression, denoting {“x1x2...xk” | (0 ≤ n ≤ k
≤ m) ∧ for 1 ≤ i ≤ k, xi ∈
F}, the concatenation of at least n and at most
m strings each of which is a member of F. The
expression “{n,m}” is
an integer-range exponent which is said to dominate
expression F.
- If F is a regular expression, then
“(F)” is a regular expression which
denotes the same language as is denoted by F.
(Note on typography:
in the text, references to a specific symbol a will
take the form a, while a use of that symbol in a regular
expression will take the form a; these are the same
symbol for purposes of this paper, regardless of any differences of
typographic style or glyph.
Syntactic variables (e.g. the x, F, and G in the rules above) are
rendered as x, F, G in all cases.
In the rules above, I have enclosed strings, regular expressions,
string schemata, and regular-expression schemata in quotation marks
in the interests of making a clearer separation between the meta-level
discussion and the regular expressions themselves. In the following
discussion, the quotation marks will normally be dispensed with.)
Examples (assuming that Σ = {
a,
b,
c}):
Regular expression |
Denotation |
∅ |
the empty set (the language with no members) |
ε |
{ε} = the set containing just the empty string |
a |
{“a”} |
b |
{“b”} |
a,b |
{“ab”} |
(a,b) |
{“ab”} |
(a,b,a)|(b,b,b) |
{“aba”,
“bbb”} |
(a{3,4},b{0,1}){1,2} |
{“aaa”,
“aaaa”,
“aaab”,
“aaaab”,
“aaaaaa”,
“aaaaaaa”,
“aaaaaab”,
“aaabaaa”,
“aaaaaaaa”,
“aaaaaaab”,
“aaaabaaa”,
“aaabaaaa”,
“aaabaaab”,
“aaaaaaaab”,
“aaaabaaaa”,
“aaaabaaab”,
“aaabaaaab”,
“aaaabaaaab”},
i.e. one or two occurrences of strings matching the inner
expression (three or four occurrences of a
followed optionally by a b) |
a{0,10}{0,10} |
strings of length zero to 100, consisting exclusively of the
symbol a |
Note that the repetition operators of conventional regular
expressions can be reduced to integer-range exponents as defined
here. For any regular expression E,
E* ⇒ E{0,INF},
E+ ⇒ E{1,INF},
E? ⇒ E{0,1}.
The same symbol may occur more than once in a regular expression;
when it is necessary to distinguish the different occurrences of the
same symbol, we speak of the positions of the expression.
The expression (a{2,3}, b{1,4}, a{0,INF}){3,4} has three
positions: the first a, the b, and the
second a. We may write a_1,
b_2, a_3 to distinguish the positions of the
expression.
If an integer-range exponent dominates an expression E, then
it also dominates each position p in E.
1.2. Paths, ambiguity, and determinism
Note that as is usual for regular expressions, each string
matching a particular regular expression E can be thought of as
mapping out a path through the regular expression: each
symbol in the string can be associated with a particular position
in the regular expression. Thus the string “abba”
matches the regular expression a(a|b)*a and corresponds
to the path visiting a_1, then b_3
twice, then a_4.
For an unambiguous regular expression
E, each word in
L(
E)
corresponds to exactly one path through
E. If any word in
L(
E)
corresponds to more than one path through
E, then
E is weakly
ambiguous.[
2] Brüggemann-Klein distinguishes
this weak ambiguity from a strong ambiguity attributed to Sippu and
Soisalon-Soininen. Informally, a weakly unambiguous regular
expression allows semantic actions to be associated with the symbols
of the expression, while a strongly unambiguous expression allows
semantic actions to be associated unambiguously not only with
the symbols but also with the repetition operators.
The expression (a{1,2},b{0,1}){1,2} is weakly
unambiguous because any a in any word in the language
can only match the single a in the expression, and
similarly for each b in any word in the language. It
is strongly ambiguous, by contrast, because the word aa
can be taken either as a single occurrence of the expression
(a{1,2},b{0,1}), thus firing the outermost exponent
once, or as two occurrences of the inner expression, thus firing
the outer exponent twice.
By analogy with finite state automata, we can distinguish the
notions of ambiguity and determinism, which are in a subset/superset
relation: all unambiguous automata (and expressions) are
deterministic, but not vice versa. An expression
E is deterministic
if and only if for every prefix of any word ∈
L(
E), the prefix
determines a unique path through
E.[
3] Again, we can distinguish weak
determinism (there is a unique sequence of symbols) and strong
determinism (there is a unique sequence of symbols and exponent
firings).
The Unique Particle Attribute constraint of XML Schema effectively
requires that content models be weakly deterministic. Given the
complications which arise from strong nondeterminism, it might be
desirable to consider requiring that they be strongly deterministic as
well. An algorithm for testing weak determinism is given below.
1.3. Functions on REIREs
Following [Brüggemann-Klein 1993] we prepare the way for defining the algorithm
to map REIREs into counter-extended FSAs
by defining several functions on regular expressions:
- pos(E) → the positions of E
- first(E) → the positions of E
which come first in some word ∈ L(E).
- last(E) → the positions of E
which come last in some word ∈ L(E).
- nullable(E) → Boolean indicating whether
ε ∈ L(E)
- follow(E, p) → the positions of E which
can come after p in some word ∈ L(E)
The inductive definitions which follow match those given by
[Brüggemann-Klein 1993] but extend them to cover integer-range exponents.
The functions
first(
E),
last(
E),
and
nullable(
E) are:
- If E = ∅, then first(E) = last(E) = ∅,
nullable(E) = false.
- If E = ε, then first(E) = last(E) = ∅,
nullable(E) = true.
- If E = a ∈ Σ, then
first(E) = last(E) = {a},
nullable(E) = false.
- If E = F, G, then
first(
E) =
if nullable(F) then first(F) ∪ first(G)
else first(F)
last(
E) =
if nullable(G) then last(F) ∪ last(G)
else last(G)
nullable(E) = nullable(F) ∧ nullable(G)
- If E = F|G, then
first(E) = first(F) ∪ first(G)
last(E) = last(F) ∪ last(G)
nullable(E) = nullable(F) ∨ nullable(G)
- If E = F{n,m}, then
first(E) = first(F),
last(E) = last(F),
nullable(E) = ((n = 0) ∨ nullable(F))
The definition of follow(E,p)
is now straightforward.
For completeness, we stipulate that for all regular expressions
E
and all positions
p, if
p ∉
pos(
E), then
follow(
E,
p) = ∅.
Otherwise, for all
p ∈
pos(
E):
- If E = ∅, then E has no positions and follow(E,p) =
∅.
- If E = ε, then E has no positions and follow(E,p) =
∅.
- If E = a ∈ Σ, then follow(E,p) =
∅.
- If E = F, G, then follow(E,p) =
if p ∈ (pos(F) \ last(F)), then follow(F,p)
if p ∈ last(F), then follow(F,p)
if p ∈ pos(G), then follow(G,p)
follow(F,p) ∪ follow(G,p).
- If E = F|G, then follow(E,p) = follow(F,p) ∪ follow(G,p)
first(E) = first(F) ∪ first(G)
last(E) = last(F) ∪ last(G)
nullable(E) = nullable(F) ∨ nullable(G)
- If E = F{n,m}, then follow(E,p) =
if p ∈ (pos(F) \ last(F)) then follow(F,p)
else (p ∈ last(F)) follow(F,p) ∪ first(F)
We say that the transition from p ∈ last(F) to q ∈ first(F)
is licensed by the integer-range exponent {n,m};
in traversing the transition we say that the exponent
fires.
Readers familiar with Brüggemann-Klein's work or with the construction of the
Glushkov automaton for standard regular expressions should note that
for the regular expressions we use here, last(E) and follow(E,p) include
all of the positions which can
come last, or follow p, in any word in L(E), even though in a path
involving repeated visits to a particular position, the follow set
of the position may vary on each visit, and it may be allowed to be
the last position only on some visits, not all. For example, for
E = a{4,5}, last(E) = {a_1}, even
though that position can be the final one in a path only if three or
four visits to that position precede the final visit. That is,
the last positions of E become final states of the extended
Glushkov automaton
ME only when certain additional
conditions are met: specifically, the counters need to be in the right
position. Similarly, in the construction of the standard Glushkov
automaton, the follow set of p turns into a set of arcs starting in
p and leading to the members of the follow set. In the CFAs to be
constructed below, the arcs will be guarded so that they can be
traversed only when the counters have appropriate values.