W3C

XML Schema: Formal Description

W3C Working Draft, 25 September 2001

This version:
http://www.w3.org/TR/2001/WD-xmlschema-formal-20010925/
Latest version:
http://www.w3.org/TR/xmlschema-formal/
Previous version:
http://www.w3.org/TR/2001/WD-xmlschema-formal-20010320/
Editors:
Allen Brown (Microsoft) allenbr@microsoft.com
Matthew Fuchs (Commerce One) matthew.fuchs@commerceone.com
Jonathan Robie (Software AG) jonathan.robie@SoftwareAG-USA.com
Philip Wadler (Avaya) wadler@avaya.com

Abstract

XML Schema: Formal Description is a formal description of XML types and validity as specified by XML Schema Part 1: Structures.

Status of this document

This is a W3C Working Draft for review by members of the W3C and other interested parties in the general public.

It has been reviewed by the XML Schema Working Group and the Working Group has agreed to its publication. Note that not that all sections of the draft represent the current consensus of the WG. Different sections of the specification may well command different levels of consensus in the WG. Public comments on this draft will be instrumental in the WG's deliberations.

Please review and send comments to www-xml-schema-comments@w3.org (archive).

This specification of the formal description of XML Schema is a working draft of the XML Schema Working Group and has been adopted by the working group as the formalization of the XML Schema Definition Language, in fulfillment of its commitment to provide such a formalization at the conclusion of the Candidate Recommendation phase for XML Schema. This material is current with the Proposed Recommendation draft but not perfectly aligned with it. In the foreseeable future, work on this document will focus on improving the alignment with the normative parts of XML Schema and resolving the issues on the current issues list (http://www.w3.org/2001/03/xmlschema-fd-issues.html).

As with all working drafts, the reader is advised that all the material within is subject to change. The reader is advised that material found here may change drastically from draft to draft. In particular, future versions will incorporate some or all of the XML Schema Part 2: Datatypes specification, the key/keyref facility, and a bidirectional mapping between the XML Schema component model and the corresponding part of this formalization. Parts of this specification, such as the normalization of names, may eventually find a broader use. However the reader should be advised that these features will probably undergo revision before the working group warrants them for such uses.

The Schema Working Group expects this work to be of use to other Working Groups within the W3C as well as to others building tools and systems hoping to leverage XML Schemas.

A list of current W3C working drafts can be found at http://www.w3.org/TR/. They may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".

Table of contents

1 Introduction

2 Overview
  2.1 Normalization
  2.2 Components
  2.3 Documents and Forests

3 Structures
  3.1 Names
  3.2 Wildcards
  3.3 Atomic datatypes
  3.4 Content groups
  3.5 Components
  3.6 Constraints
  3.7 Ordered Forests
  3.8 Inference rules

4 Normalization

5 Derivation
  5.1 Base chain
  5.2 Extension
  5.3 Restriction
  5.4 Constraints

6 Validation
  6.1 Content validation
  6.2 ID/IDREF validation

7 XML Schema to Formalization Mapping
  7.1 Conversions for content model particles
  7.2 Correspondences from XML Schema to sorts

8 Index of notations

Appendix A Auxiliary Grammar

Appendix B Infidelities in the current formalization of XML Schema

Appendix C Bibliography


1  Introduction

This formalization is a formal, declarative system for describing and naming XML Schema information, specifying XML instance type information, and validating instances against schemas. The goals of the formalization are to:

Many potential applications of XML Schema may benefit from the definition of a formal model. We have focused on the material in Part I (Structures), as this is the most complex.

A basic understanding of first-order predicate logic, which is part of most computer science curricula, is adequate to understand this document. Where other notations are used, they are explained before use. Though the mathematical notation used in this formalization may be somewhat daunting for those not accustomed to formalisms, it should be possible to prepare a prose description directly from this formalism, which may be more approachable than a description based on an ad hoc understanding of XML Schema [11,3].

The approach followed here follows the best practices currently used in the programming languages community, although somewhat adapted for XML.  The hallmark of this approach is the use of context free grammars to provide syntactic checking and the use of inference rules to provide the semantics associated with each piece of syntax.  This means there is, essentially, one inference rule per context free grammar production.   This set of inference rules is not intended to be in any way minimal, but it is helpful from both a pedagogical and implementation standpoint - for each syntactic construct it is straightforward to identify its underlying semantics.

The remainder of this document is organized as follows.

2  Overview

This section uses a running example to introduce our representation of a schema, which uses a mathematical notation that is easier to manipulate formally than the XML syntax of Schema. This formalism divides a schema into a set of components, where each component can match against a particular fragment of XML in an instance.  Particular attention will be given to the use of normalization to provide a unique name for each component of a schema.

Here is a sample schema written in W3C XML Schema syntax. 

<xsd:schema targetNamespace = "http://www.example.com/baz.xsd"
    xmlns = "http://www.example.com/baz.xsd"
    xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="t"/>
  <xsd:simpleType name="b">
    <xsd:list itemType="xsd:integer"/>
  </xsd:simpleType>
  <xsd:complexType name="t">
    <xsd:attribute name="b" type="xsd:string"/>
    <xsd:attribute name="c" type="b" use="optional"/>
  </xsd:complexType>
  <xsd:complexType name="u">
    <xsd:complexContent>
      <xsd:extension base="t">
        <xsd:choice>
          <xsd:element name="d">
            <xsd:complexType>
              <xsd:sequence>
                <xsd:element name="a"
                    type="xsd:string"
                    minOccurs="1"
                    maxOccurs="unbounded"/>
              </xsd:sequence>
            </xsd:complexType>
          </xsd:element>
          <xsd:element name="e" type="xsd:string"/>
        </xsd:choice>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema> 

And here is an XML document which matches the above schema

<baz:a xmlns:baz="http://www.example.com/baz.xsd" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:type="baz:u"
    b="zero"
    c="1 2">
  <d>
    <a>three</a>
    <a>four</a>
  </d>
</baz:a> 

2.1  Normalization

We use a normalized form of a schema, which assigns a unique universal name to each component of a schema, and "flattens" the structure. (A component is anything which may be defined or declared: an element, an attribute, a simple type, a complex type, a group, an attribute group, or an identity constraint.)

Here are the normalized (universal) names of the components in our sample schema. For each name, we list two forms: the long form is the name proper, while the short form is an abbreviated version we use in examples to improve readability.

long form
       short form
http://www.example.com/baz.xsd#+element::a
       +a
http://www.example.com/baz.xsd#type::s
       s
http://www.example.com/baz.xsd#type::t
       t
http://www.example.com/baz.xsd#-type::t/attribute::b
       -@t/b
http://www.example.com/baz.xsd#-type::t/attribute::c
       -@t/c
http://www.example.com/baz.xsd#type::u
       u
http://www.example.com/baz.xsd#-type::u/element::d
       u/d
http://www.example.com/baz.xsd#type::u/element::d/type::*
       u/d/ *
http://www.example.com/baz.xsd#-type::u/element::d/type::*/element::a
       -u/d/ */a
http://www.example.com/baz.xsd#-type::u/element::e
       -u/e

The normalized names reflect the nesting structure of the original schema. Note that the nesting of declarations in W3C XML Schema also determines their scope. In the representation of a Schema, the names of nested components are constructed by appending their schema names to the normalized names of their parents; if a component is anonymous, it is given the name *. The syntax of names is chosen to be similar to XPath [5,12], but unlike XPath, * does not function as a wildcard in normalized names.

These normalized names clearly distinguish local names from global names. Where previously it might be possible to confuse the global a element with the local a element, now these are given distinct names, a and u/d/*/a respectively. The latter indicates that the global type u contains a local element d which contain an anonymous type * which contains a local element a. Each attribute or element contains at most one anonymous type, so using the name * for such types leads to no confusion.  In order to reflect the attribute form and element form attributes from XSDL, these names also reflect whether an element is to be qualified or not by a "-" or a "+" immediately following the # in the name.  A "+" indicates the name is to be qualified.   By default, only top-level elements (such as the "a") are qualified.   However, for example, if the declaration of the "b" attribute of type "t" had been:

<xsd:attribute name="b" type="xsd:string" formDefault="qualified"/>

then its normalized name would have been:

http://www.example.com/baz.xsd#+type::t/attribute::b

NOTE:  The normalized names described in this document exist only for the purposes described herein.  There is a separate effort underway by the Schema WG to generalize these names so they will be suitable for a variety of other uses.   When that effort is complete, this document will adopt that syntax.

2.2  Components

Next, we show how components are represented in this notation. Each component is one of six sorts: element, attribute, simple type, complex type, attribute group, or model group. Regardless of sort, each component is represented uniformly as a constructor (one of attribute, element, simpleType, complexType, attributeGroup, modelGroup, or identityConstraint) of a record with six fields:

name
is the name of the component;
base
is the name of the base component of the structure;
derivation
specifies how the component is derived from the base,
it is one of restriction or extension;
refinement
is the set of permitted derivations from this component as base,
it is a subset of {restriction, extension};
abstract
is a boolean, representing whether this type is abstract;
content
is the content of the component, describing the instance information which can match this component, following the context free grammar below.

Note that all seven sorts have a very consistent structure consisting of six properties, which have the same interpretation regardless of the sort.

2.2.1  Component content

The formalism uses standard regular expression notation [2] for component content.   Component content corresponds roughly to a union of both model groups and attributes groups.  In what follows, g stands for a content group (as does g1, g2, and so on). The constructors for groups include the following:

e       
empty sequence
Ø       
empty choice
g1,g2    
sequence, g1 followed by g2
g1 | g2    
choice, g1 or g2
g1 & g2  
all, g1 and g2 in either order
g{m,n}
repetition, g repeated between minimum m and maximum n times   (m is a natural number, n is a natural number or 8)
mixed(g)   
mixed content in group g
@w[g]   
attribute, with name in nameset w and content in group g
w[g]   
element, with name in nameset w and content in group g
p           
atomic datatype (such as xsi:string or xsi:integer)
x           
component name 

EDITOR'S NOTE: The grammar above does not distinguish between the precedence of the various connectors ("|", "&", and ","). This will not be a concern as in our examples we will always parenthesize appropriately. Rather than significantly expanding the grammar, we will provide a full grammar with parentheses in an appendix in a future version.

A nameset corresponds to a set of names.  A nameset may be a single element or attribute name, a substitution group, or the equivalent of an XSDL wildcard.  The syntax for defining a nameset will be given later.

Need to explain what a nameset is.

Here is an example of a model group in our notation, which corresponds to an a element with content of type u in our running example

+a[
    (-@t/b[xsi:string] & -@t/c[xsi:integer{0,8}]{0,1}),
    (-u/d[u/d/*/a[xsi:string]{1,8}] | -u/d/*/e[xsi:string])
] 

Note that the group constructors are used uniformly in several contexts. Repetition (g{m,n}) is used for lists of atomic datatypes, to indicate whether an attribute is optional, and for elements. Similarly, all (g1 & g2) is used for attributes and for elements.

2.2.2 Example Schema Components

Here are the components of the normalized schema represented in this notation

element(
  name = +a,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {restriction,extension},
  abstract = false,
  content = a[t]
) 
simpleType(
  name = s,
  base = xsi:UrSimpleType,
  derivation = restriction,
  refinement = {restriction},
  abstract = false,
  content = xsi:integer{0,8}
) 
complexType(
  name = t,
  base = xsi:UrType,
  derivation = restriction,
  refinement = {restriction,extension},
  abstract = false,
  content = @t/b & @t/c
) 
attribute(
  name = -@t/b,
  base = xsi:UrAttribute,
  derivation = restriction,
  refinement = {restriction},
  abstract = false,
  content = @t/b[xsi:string]
) 
attribute(
  name = -@t/c,
  base = xsi:UrAttribute,
  derivation = restriction,
  derivation = {restricition},
  abstract = false,
  content = @t/c[s]{0,1}
)
complexType(
  name = u,
  base = t,
  derivation = extension,
  refinement = {restriction,extension},
  abstract = false,
  content = (-@t/b & @t/c),
  (u/d | u/e)
) 
element(
  name = u/d,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d[u/d/*]
) 
complexType(
  name = u/d/*,
  base = xsi:UrType,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d/*{0,8}
) 
element(
  name = -u/d/*/a,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d/*/a[xsi:string]
) 
element(
  name = -u/e,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = -u/e[xsi:string]
) 

Note that we do not nest declarations to express their scope. The scope of a declaration is reflected in its normalized name.

2.3  Documents and Forests

It is not clear to me from our discussions whether this notation will stay. There's some suggestion that we need a notation that borders on the verbosity of the underlying XML.

There is also a compact notation for representing XML, both before and after normalization. This notation is not intended in any way to be used as a replacement for traditional XML syntax.  Having a more compact, but equally expressive, equivalent notation will aid significantly in the exposition below (in particular it keeps the inference rules to a reasonable size).  Here is the original document in this notation:

a[
  @xsi:type["u"],
  @b["zero"],
  @c[1,2],
  d[
    a["three"],
    a["four"]
    ]
  ] 

Note that attributes and elements are represented uniformly, as are sequences of attributes, sequences of elements, and lists of atomic datatypes. Note also that from this perspective a document is just a element without a parent. Therefore we will drop the term "document" and simply use the term element. The term "this document" will continue to refer to the document you are reading.

Here is the normalized element:

a[
  @xsi:type["u"],
  @t/b["zero"],
  @t/c[1,2],
  u/d[
    u/d/*/a["three"],
    u/d/*/a["four"]
    ]
  ] 

Given an input document such as this, validation can both establish the validity of the document and assert which types, elements, and attributes were used.  This can be conveyed textually by writing the document using normalized names.

Finally, here is the normalized element with type information added

a[
  u types @t/b[xsi:string types "zero"],
  @t/c[s types 1,2],
  u/d[
    u/d/* types u/d/*/a[xsi:string types "three"],
    u/d/*/a[xsi:string types "four"]
    ]
  ] 

Unlike the xsi:type convention, this notation allows one to uniformly express information about element and attribute types. The type of an element or attribute is indicated by writing x[t types d] where x is an attribute or element name, t is a type name, and d is the content of the attribute or element.

We will use the term forest or ordered forest when referring to lists of attributes and elements, as in the content of a element. For example, while the above is an element its content:

  u types @t/b[xsi:string types "zero"],
  @t/c[s types 1,2],
  u/d[
    u/d/* types u/d/*/a[xsi:string types "three"],
    u/d/*/a[xsi:string types "four"]
    ]

is an ordered forest. Note that a single element is itself an ordered forest.

2.4  Forest Grammars

The content of the components described above takes advantage of derivation and namesets to provide a compact description of the actual allowed content of an element in an instance.  This can be viewed as an intentional description of a regular expression.  However, we can also provide the full extensional description of the regular expression, given a finite set of components.  Doing so allows us to describe validation in terms of a more primitive model and would allow XML Schema to evolve without needing to change the core of the formalism -- as XML Schema evolves, the mapping from the intensional descriptions contained by the components to the extensional descriptions may change, but a formalism built around the extensional versions remains stable.

As an example of this conversion, we will unroll the content of the "a" element from our sample schema.  Note that "a" has declared type "t", and "t" has one subtype, "u".   Therefore the contents of an "a" are either the contents of "t", with an optional xsi:type attribute, or the contents of "u" with an obligatory xsi:type attribute with value "u":

a[(xsi:type="t"? & @t/b & @t/c) | ((xsi:type="u" & @t/b & @t/c), (u/d | u/e))]

A validator can validate against this extended regular expression and map back to the components by way of the value of xsi:type.  At this lowest level, xsi:type functions as an implied attribute, but takes on added signficance once we map from this level back into the level of schema.  It should also be noted that this expression does not follow the single attribution and unambiguity constraints of DTDs or Schema.  Those constraints are applied at the component level.

Another aspect of this translation is to treat element names in regular expressions as namesets containing the names of the elements in the substitution group of the base element.  For example, suppose there were an element "foo" whose content was just an "a" element, and that there also existed an element "f" in the substitution group of "a".  The the content of the "foo" component would be:

foo[a]

and the full regular expression would expand to:

foo[(a | f)].

2.n Description of general forest grammars and general description of how the component grammars can be converted into full forest grammars. Convert some extended type to a full grammar.

3  Structures

This section defines the structures used in this formalism: names, namesets, atomic datatypes, model groups, components, and forests.

3.1  Names

It is not clear how much of this needs to remain here now that NUNs are moving out. Perhaps we stick with this, as why make changes that will just be rendered irrelevant?

The other significant issue here, is that we don't take into account the 208 prefixing issues. However, to be complete, we need to do so. The least disturbing way I can find to do so is to include them in the names - so that a name also specifies if it matches a qualified or unqualified element/attribute (keeps it out of the components).

A namespace is a URI reference, and a local name is an NCName, as in the Namespace recommendation. We let i range over namespaces and the "non-namespace" (where items not in a namespace reside) and j range over local names.  In order to cover the case of naming components that are not in a namespace, i may be empty.  Using these names we can create unique identifiers for the components of a schema.  These names are for the abstract components of the schema and ultimately are not dependent on the XML Schema transfer syntax or any configuration of files the schema may be stored in.

    
    
namespace
    i
::=
URI  reference |    e
local name
    j
:=
NCName

A symbol space is one of the six symbol spaces in XML Schema. We let ss range over symbol spaces.

    
    
symbol space
    ss
::=
element    
   
|
attribute    
   
|
type    
   
|
attributeGroup    
   
|
modelGroup    
   
|
notation

A symbol name consists of a symbol space paired with a local name or with * (the latter names an anonymous component). A name consists of a URI reference followed by a sequence of one or more symbol names. We let sn range over symbol names, and x range over names.

    
    
symbol name
    sn
::=
ss::j
    symbol space ss, local name j
   
|
ss::*
    symbol space ss, anonymous name
name
    x
::=
i#sn1/…/snn
    namespace i, symbol names sn1,…,sn n

Here n = 1 is the scope length of the name.

An example of a name is:

http://www.example.com/baz.xsd#type::u/element::d/type::*/element::a

The scope length of this name is 4.

In this document we've found it convenient to use a short form of names.  Symbol names in the attribute symbol space are written @j, all other symbol names are written j (or * for the oxymoronic anonymous names) - in our examples we've ensured our local names are unique across all symbol spaces. The URI reference may be dropped when it is obvious from context. For example, the short form of the name above is u/d/*/a. Additional examples of names and short forms appear in Section 2.1.

Before normalization, all names in a forest have scope length equal to one. It is helpful to define functions to extract the namespace from a name, and to extract the symbol space and local name of the last symbol name. We define namespace(x) = i, symbol(x) = ss, and local(x) = j when x = i#sn1/…/snn and snn = ss::j.

We also introduce several subclasses of names. An attribute name is the name of an attribute component, and similarly for element, simple type, and complex type names. We let a, e, s, k range over attribute, element, simple type, and complex type names.

    
    
attribute name
    a
::=
x    
element name
    e
::=
x    
simple type name
    s
::=
x    
complex type name
    k
::=
x

A type name is a simple or complex type name. We let t range over type names.

    
    
type name
    t
::=
s
    simple type name
   
|
k
    complex type name

The class of a name must be consistent with its symbol space.

symbol(a)
=
attribute
symbol(e)
=
element
symbol(t)
=
type

3.2  Namesets

Generalization of wildcards and substitution groups - note that x1 UNION x2 is the same thing as putting x2 in the substitution group of x1. All leaves in regex's are now namesets.

A nameset denotes a set of qualified names. The form *:* denotes any name in any namespace (note this is a namespace name, not a prefix), i:* denotes any name in namespace i, and the form x denotes the single name x. Namesets may be combined by union or difference. Namesets constructed in this way have the functionality of wildcards. We let w range over wildcard items, and w range over wildcards.

    
    
nameset
    w
::=
*:*
    any namespace, any local name
   
|
i:*
    namespace i, any local name
   
|
x
    full name
   
|
w1UNION w2
    all names in w1 or in w2
   
|
w1DIFFERENCE w2
    all names in w1 and not in w2
   
|
w%strict
    a nameset with strict processing
   
|
w%lax
    a nameset with lax processing
   
|
w%skip
    a nameset with skip processing

For example, the nameset *:* DIFFERENCE(baz:* UNION xsi:string)%strict denotes any name in any namespace, except for names in namespace baz, and the local name string in namespace xsi. Since it will be convenient to write wildcards with minimal parentheses, we will assume the precedence UNION > DIFFERENCE > %.

3.3  Atomic datatypes

This part will require significant work. I'd like to treat a simple type as a union over facets - treating the value space as a facet, rather than something intrisic to some simple type. It makes unions just disjunctive normal forms of facets.

We take as given the atomic datatypes from XML Schema Part 2. We let p range over atomic datatypes, and c range over constants of such datatypes.

Typically, an atomic datatype is either a primitive datatype, or is derived from another atomic datatype by specifying a set of facets. Note lists of atomic datatypes are specified using repetition while unions are specified using alternation, as defined below.

An example of an atomic datatype is xsi:string, and a constant of that datatype is "zero".

3.4  Content groups

Let g range over content groups.

    
    
group
    g
::=
e
    empty sequence
   
|
Ø
    empty choice
   
|
g1 , g2
    sequence, g1 followed by g2
   
|
g1 | g2
    choice, g1 or g2
   
|
g1 & g2
    all, g1 and g2 in either order
   
|
g{m,n}
    repetition of g between m and n times
   
|
@w[g]
    attribute with name in w and content in g
   
|
w[g]
    element with name in w and content in g
   
|
mixed(g)
    mixed content in group g
   
|
p
    atomic datatype
   
|
x
    component name
minimum
    m
    natural number
maximum
    n
::=
m
    natural number
   
|
8
    unbounded

An example of a group appears in Section 2.2.1.

The empty sequence matches only the empty forest; it is an identity for sequence and all. The empty choice matches no forest; it is an identity for choice.

e,g
=
g
=
g ,e
Ø|g
=
g
=
g |e
e&g
=
g
=
g &e

For use with repetitions, we extend arithmetic to include 8 in the obvious way. For any n we have n + 8 = 8+ n = 8 and n = 8 is always true, and 8 < n is always false.

3.5  Components

A sort is one of the six sorts of component in XML Schema. We let srt range over sorts.

    
    
sort
    srt
::=
attribute    
   
|
element    
   
|
simpleType    
   
|
complexType    
   
|
attributeGroup    
   
|
modelGroup    
   
|
identityConstraint    

(Shortcoming: we make no further use of attributeGroup or modelGroup in this document.)

A derivation specifies how a component is derived from its base. We let der range over derivations, and ders range over sets of derivations.

    
    
derivation
    der
::=
extension    
   
|
refinement    
derivation set
    ders
::=
{der1,…,der l}

We let b range over booleans.

    
    
boolean
    b
::=
true    
   
|
false

A component is a record consisting of a constructor (srt) with six fields.

name
is the name of the component (x);
base
is the name of the base component of the structure (x);
derivation
specifies how the component is derived from the base (der);
refinement
is the set of permitted derivations from this component as base (ders);
abstract
is a boolean, representing whether this type is abstract (b);
content
is the content of the structure, a model group (g).

We let cmp range over components.

    
    
component
    cmp
::=
srt(    
   
    name = x    
   
    base = x    
   
    derivation = der    
   
    refinement = ders    
   
    abstract = b    
   
    content = g    
   
)

Examples of components appear in Section 2.2.2.

This general model of components is far looser than the is strictly allowed by XSDL. While this is not a problem because the creation of components is strongly controlled by the languages they are generated from (currently either an XSDL Schema or a set of XSDL components as described in Structures[11]), a "tighter" grammar, more closely corresponding to what is accepted by XSDL, is found in appendix A.

For a given schema, we assume a fixed dereferencing map that takes names to the corresponding components. We write deref(x) = cmp if name x corresponds to component cmp.

The dereferencing map must map attribute names to attribute components, and similarly for elements, simple types, and complex types.

deref(a).sort
=
attribute
deref(e).sort
=
element
deref(s).sort
=
simpleType
deref(k).sort
=
complexType

3.6  Constraints

Recall that the group constructs are used uniformly in several contexts. Repetition (g{m,n}) is used for lists of atomic datatypes, to indicate whether an attribute is optional, and for elements. Similarly, all (g1 & g2) is used for attributes and for elements. The advantage of this approach is that the semantics of groups is uniform, and need be given only once. Thus, for instance, how repetition acts is defined once, not separately for attributes and elements.

The group grammar is rather more accommodating than that implicit in XML schemas. It permits attributes to have complex local type definitions, multiple occurrences of the declaration of a given attribute name, and the mentioning of an attribute anywhere in the defintion of a group. As we do not generally translate our group grammars back into XML schemas, this permissiveness is of little consequence. We can, nonetheless, define various sub-grammars of the group grammar to constrain matters in so as to be consistent with the XML schemas specification.

However, it is helpful to define additional syntactic categories that specify various subsets of groups. These syntactic classes are then used to constrain the content of components, for instance, to indicate that the content of an element component should be an element, and that the content of a type component should consist of attributes followed by elements.

An attribute group contains only attribute names, combined using the all connector (&). An element group contains no attribute names. (Shortcoming: all groups in element groups should be further constrained as in Schema Part 1, Section 5.7, All Group Limited.)

    
 &n