XML Schema: Formal Description

W3C Working Draft, 25 September 2001

This version:: http://www.w3.org/TR/2001/WD-xmlschema-formal-20010925/
Latest version:: http://www.w3.org/TR/xmlschema-formal/
Previous version:: http://www.w3.org/TR/2001/WD-xmlschema-formal-20010320/
Editors:: Allen Brown (Microsoft) allenbr@microsoft.com
Matthew Fuchs (Commerce One) matthew.fuchs@commerceone.com
Jonathan Robie (Software AG) jonathan.robie@SoftwareAG-USA.com
Philip Wadler (Avaya) wadler@avaya.com

Abstract

XML Schema: Formal Description is a formal description of XML types and validity as specified by XML Schema Part 1: Structures.

Status of this document

This is a W3C Working Draft for review by members of the W3C and other interested parties in the general public.

It has been reviewed by the XML Schema Working Group and the Working Group has agreed to its publication. Note that not that all sections of the draft represent the current consensus of the WG. Different sections of the specification may well command different levels of consensus in the WG. Public comments on this draft will be instrumental in the WG's deliberations.

Please review and send comments to www-xml-schema-comments@w3.org (archive).

This specification of the formal description of XML Schema is a working draft of the XML Schema Working Group and has been adopted by the working group as the formalization of the XML Schema Definition Language, in fulfillment of its commitment to provide such a formalization at the conclusion of the Candidate Recommendation phase for XML Schema. This material is current with the Proposed Recommendation draft but not perfectly aligned with it. In the foreseeable future, work on this document will focus on improving the alignment with the normative parts of XML Schema and resolving the issues on the current issues list (http://www.w3.org/2001/03/xmlschema-fd-issues.html).

As with all working drafts, the reader is advised that all the material within is subject to change. The reader is advised that material found here may change drastically from draft to draft. In particular, future versions will incorporate some or all of the XML Schema Part 2: Datatypes specification, the key/keyref facility, and a bidirectional mapping between the XML Schema component model and the corresponding part of this formalization. Parts of this specification, such as the normalization of names, may eventually find a broader use. However the reader should be advised that these features will probably undergo revision before the working group warrants them for such uses.

The Schema Working Group expects this work to be of use to other Working Groups within the W3C as well as to others building tools and systems hoping to leverage XML Schemas.

A list of current W3C working drafts can be found at http://www.w3.org/TR/. They may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".

1 Introduction

2 Overview
  2.1 Normalization
  2.2 Components
  2.3 Documents and Forests

3 Structures
  3.1 Names
  3.2 Wildcards
  3.3 Atomic datatypes
  3.4 Content groups
  3.5 Components
  3.6 Constraints
  3.7 Ordered Forests
  3.8 Inference rules

4 Normalization

5 Derivation
  5.1 Base chain
  5.2 Extension
  5.3 Restriction
  5.4 Constraints

6 Validation
6.1 Content validation
6.2 ID/IDREF validation

7 XML Schema to Formalization Mapping
7.1 Conversions for content model particles
7.2 Correspondences from XML Schema to sorts

8 Index of notations

Appendix A Auxiliary Grammar

Appendix B Infidelities in the current formalization of XML Schema

Appendix C Bibliography

1 Introduction

This formalization is a formal, declarative system for describing and naming XML Schema information, specifying XML instance type information, and validating instances against schemas. The goals of the formalization are to:

Provide a semantic framework for software systems that use the W3C XML Schema specification, such as the W3C XML Query Algebra.
Specify names for all components of an XML Schema, so that they can be uniquely identified by URIs. Such unique identifiers may be useful to XML Query, RDF, and topic maps, among others.
Formally define validation at a declarative level.
Define the mapping from the current XML Schema syntax onto the structures described here, as well as the mapping between the XML Schema component mode and our component model.

Many potential applications of XML Schema may benefit from the definition of a formal model. We have focused on the material in Part I (Structures), as this is the most complex.

A basic understanding of first-order predicate logic, which is part of most computer science curricula, is adequate to understand this document. Where other notations are used, they are explained before use. Though the mathematical notation used in this formalization may be somewhat daunting for those not accustomed to formalisms, it should be possible to prepare a prose description directly from this formalism, which may be more approachable than a description based on an ad hoc understanding of XML Schema [11,3].

The approach followed here follows the best practices currently used in the programming languages community, although somewhat adapted for XML. The hallmark of this approach is the use of context free grammars to provide syntactic checking and the use of inference rules to provide the semantics associated with each piece of syntax. This means there is, essentially, one inference rule per context free grammar production. This set of inference rules is not intended to be in any way minimal, but it is helpful from both a pedagogical and implementation standpoint - for each syntactic construct it is straightforward to identify its underlying semantics.

The remainder of this document is organized as follows.

Section 2 gives an overview of the formalism.
Section 3 defines the basic structures used in the formalization including names, groups, and components.
Section 4 describes normalization.
Section 5 describes refinement.
Section 6 describes validation.
Section 7 describes the mapping from the XML Schema concrete syntax into our structures. (A mapping from XML Schema components to these components will appear in a later draft.)
Section 8 indexes the notations used.
Appendix B lists lacunae in the current articulation of the formalism relative to XML Schema

2 Overview

This section uses a running example to introduce our representation of a schema, which uses a mathematical notation that is easier to manipulate formally than the XML syntax of Schema. This formalism divides a schema into a set of components, where each component can match against a particular fragment of XML in an instance. Particular attention will be given to the use of normalization to provide a unique name for each component of a schema.

Here is a sample schema written in W3C XML Schema syntax.

<xsd:schema targetNamespace = "http://www.example.com/baz.xsd"
    xmlns = "http://www.example.com/baz.xsd"
    xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="t"/>
  <xsd:simpleType name="b">
    <xsd:list itemType="xsd:integer"/>
  </xsd:simpleType>
  <xsd:complexType name="t">
    <xsd:attribute name="b" type="xsd:string"/>
    <xsd:attribute name="c" type="b" use="optional"/>
  </xsd:complexType>
  <xsd:complexType name="u">
    <xsd:complexContent>
      <xsd:extension base="t">
        <xsd:choice>
          <xsd:element name="d">
            <xsd:complexType>
              <xsd:sequence>
                <xsd:element name="a"
                    type="xsd:string"
                    minOccurs="1"
                    maxOccurs="unbounded"/>

              </xsd:sequence>
            </xsd:complexType>
          </xsd:element>
          <xsd:element name="e" type="xsd:string"/>
        </xsd:choice>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>

And here is an XML document which matches the above schema

<baz:a xmlns:baz="http://www.example.com/baz.xsd" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:type="baz:u"
    b="zero"
    c="1 2">
  <d>
    <a>three</a>
    <a>four</a>
  </d>
</baz:a>

2.1 Normalization

We use a normalized form of a schema, which assigns a unique universal name to each component of a schema, and "flattens" the structure. (A component is anything which may be defined or declared: an element, an attribute, a simple type, a complex type, a group, an attribute group, or an identity constraint.)

Here are the normalized (universal) names of the components in our sample schema. For each name, we list two forms: the long form is the name proper, while the short form is an abbreviated version we use in examples to improve readability.

long form

short form

http://www.example.com/baz.xsd#+element::a

+a

http://www.example.com/baz.xsd#type::s

s

http://www.example.com/baz.xsd#type::t

t

http://www.example.com/baz.xsd#-type::t/attribute::b

-@t/b

http://www.example.com/baz.xsd#-type::t/attribute::c

-@t/c

http://www.example.com/baz.xsd#type::u

u

http://www.example.com/baz.xsd#-type::u/element::d

u/d

http://www.example.com/baz.xsd#type::u/element::d/type::*

u/d/ *

http://www.example.com/baz.xsd#-type::u/element::d/type::*/element::a

-u/d/ */a

http://www.example.com/baz.xsd#-type::u/element::e

-u/e

The normalized names reflect the nesting structure of the original schema. Note that the nesting of declarations in W3C XML Schema also determines their scope. In the representation of a Schema, the names of nested components are constructed by appending their schema names to the normalized names of their parents; if a component is anonymous, it is given the name *. The syntax of names is chosen to be similar to XPath [5,12], but unlike XPath, * does not function as a wildcard in normalized names.

These normalized names clearly distinguish local names from global names. Where previously it might be possible to confuse the global a element with the local a element, now these are given distinct names, a and u/d/*/a respectively. The latter indicates that the global type u contains a local element d which contain an anonymous type * which contains a local element a. Each attribute or element contains at most one anonymous type, so using the name * for such types leads to no confusion. In order to reflect the attribute form and element form attributes from XSDL, these names also reflect whether an element is to be qualified or not by a "-" or a "+" immediately following the # in the name. A "+" indicates the name is to be qualified. By default, only top-level elements (such as the "a") are qualified. However, for example, if the declaration of the "b" attribute of type "t" had been:

<xsd:attribute name="b" type="xsd:string" formDefault="qualified"/>

then its normalized name would have been:

http://www.example.com/baz.xsd#+type::t/attribute::b

NOTE: The normalized names described in this document exist only for the purposes described herein. There is a separate effort underway by the Schema WG to generalize these names so they will be suitable for a variety of other uses. When that effort is complete, this document will adopt that syntax.

2.2 Components

Next, we show how components are represented in this notation. Each component is one of six sorts: element, attribute, simple type, complex type, attribute group, or model group. Regardless of sort, each component is represented uniformly as a constructor (one of attribute, element, simpleType, complexType, attributeGroup, modelGroup, or identityConstraint) of a record with six fields:

name: is the name of the component;
base: is the name of the base component of the structure;
derivation: specifies how the component is derived from the base,
it is one of restriction or extension;
refinement: is the set of permitted derivations from this component as base,
it is a subset of {restriction, extension};
abstract: is a boolean, representing whether this type is abstract;
content: is the content of the component, describing the instance information which can match this component, following the context free grammar below.

Note that all seven sorts have a very consistent structure consisting of six properties, which have the same interpretation regardless of the sort.

2.2.1 Component content

The formalism uses standard regular expression notation [2] for component content. Component content corresponds roughly to a union of both model groups and attributes groups. In what follows, g stands for a content group (as does g₁, g₂, and so on). The constructors for groups include the following:

e: empty sequence
Ø: empty choice
g₁,g₂: sequence, g₁ followed by g₂
g₁ | g₂: choice, g₁ or g₂
g₁ & g₂: all, g₁ and g₂ in either order
g{m,n}: repetition, g repeated between minimum m and maximum n times (m is a natural number, n is a natural number or 8)
mixed(g): mixed content in group g
@w[g]: attribute, with name in nameset w and content in group g
w[g]: element, with name in nameset w and content in group g
p: atomic datatype (such as xsi:string or xsi:integer)
x: component name

EDITOR'S NOTE: The grammar above does not distinguish between the precedence of the various connectors ("|", "&", and ","). This will not be a concern as in our examples we will always parenthesize appropriately. Rather than significantly expanding the grammar, we will provide a full grammar with parentheses in an appendix in a future version.

A nameset corresponds to a set of names. A nameset may be a single element or attribute name, a substitution group, or the equivalent of an XSDL wildcard. The syntax for defining a nameset will be given later.

Need to explain what a nameset is.

Here is an example of a model group in our notation, which corresponds to an a element with content of type u in our running example

+a[
    (-@t/b[xsi:string] & -@t/c[xsi:integer{0,8}]{0,1}),
    (-u/d[u/d/*/a[xsi:string]{1,8}] | -u/d/*/e[xsi:string])
]

Note that the group constructors are used uniformly in several contexts. Repetition (g{m,n}) is used for lists of atomic datatypes, to indicate whether an attribute is optional, and for elements. Similarly, all (g₁ & g₂) is used for attributes and for elements.

2.2.2 Example Schema Components

Here are the components of the normalized schema represented in this notation

element(
  name = +a,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {restriction,extension},
  abstract = false,
  content = a[t]
)

simpleType(
  name = s,
  base = xsi:UrSimpleType,
  derivation = restriction,
  refinement = {restriction},
  abstract = false,
  content = xsi:integer{0,8}
)

complexType(
  name = t,
  base = xsi:UrType,
  derivation = restriction,
  refinement = {restriction,extension},
  abstract = false,
  content = @t/b & @t/c
)

attribute(
  name = -@t/b,
  base = xsi:UrAttribute,
  derivation = restriction,
  refinement = {restriction},
  abstract = false,
  content = @t/b[xsi:string]
)

attribute(
  name = -@t/c,
  base = xsi:UrAttribute,
  derivation = restriction,
  derivation = {restricition},
  abstract = false,
  content = @t/c[s]{0,1}
)

complexType(
  name = u,
  base = t,
  derivation = extension,
  refinement = {restriction,extension},
  abstract = false,
  content = (-@t/b & @t/c),
  (u/d | u/e)
)

element(
  name = u/d,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d[u/d/*]
)

complexType(
  name = u/d/*,
  base = xsi:UrType,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d/*{0,8}
)

element(
  name = -u/d/*/a,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d/*/a[xsi:string]
)

element(
  name = -u/e,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = -u/e[xsi:string]
)

Note that we do not nest declarations to express their scope. The scope of a declaration is reflected in its normalized name.

2.3 Documents and Forests

It is not clear to me from our discussions whether this notation will stay. There's some suggestion that we need a notation that borders on the verbosity of the underlying XML.

There is also a compact notation for representing XML, both before and after normalization. This notation is not intended in any way to be used as a replacement for traditional XML syntax. Having a more compact, but equally expressive, equivalent notation will aid significantly in the exposition below (in particular it keeps the inference rules to a reasonable size). Here is the original document in this notation:

a[
  @xsi:type["u"],
  @b["zero"],
  @c[1,2],
  d[
    a["three"],
    a["four"]
    ]
  ]

Note that attributes and elements are represented uniformly, as are sequences of attributes, sequences of elements, and lists of atomic datatypes. Note also that from this perspective a document is just a element without a parent. Therefore we will drop the term "document" and simply use the term element. The term "this document" will continue to refer to the document you are reading.

~~Here is the normalized element:~~

a[
  @xsi:type["u"],
  @t/b["zero"],
  @t/c[1,2],
  u/d[
    u/d/*/a["three"],
    u/d/*/a["four"]
    ]
  ]

Given an input document such as this, validation can both establish the validity of the document and assert which types, elements, and attributes were used. This can be conveyed textually by writing the document using normalized names.

Finally, here is the normalized element with type information added

a[
  u types @t/b[xsi:string types "zero"],
  @t/c[s types 1,2],
  u/d[
    u/d/* types u/d/*/a[xsi:string types "three"],
    u/d/*/a[xsi:string types "four"]
    ]
  ]

Unlike the xsi:type convention, this notation allows one to uniformly express information about element and attribute types. The type of an element or attribute is indicated by writing x[t types d] where x is an attribute or element name, t is a type name, and d is the content of the attribute or element.

We will use the term forest or ordered forest when referring to lists of attributes and elements, as in the content of a element. For example, while the above is an element its content:

  u types @t/b[xsi:string types "zero"],
  @t/c[s types 1,2],
  u/d[
    u/d/* types u/d/*/a[xsi:string types "three"],
    u/d/*/a[xsi:string types "four"]
    ]

is an ordered forest. Note that a single element is itself an ordered forest.

2.4 Forest Grammars

The content of the components described above takes advantage of derivation and namesets to provide a compact description of the actual allowed content of an element in an instance. This can be viewed as an intentional description of a regular expression. However, we can also provide the full extensional description of the regular expression, given a finite set of components. Doing so allows us to describe validation in terms of a more primitive model and would allow XML Schema to evolve without needing to change the core of the formalism -- as XML Schema evolves, the mapping from the intensional descriptions contained by the components to the extensional descriptions may change, but a formalism built around the extensional versions remains stable.

As an example of this conversion, we will unroll the content of the "a" element from our sample schema. Note that "a" has declared type "t", and "t" has one subtype, "u". Therefore the contents of an "a" are either the contents of "t", with an optional xsi:type attribute, or the contents of "u" with an obligatory xsi:type attribute with value "u":

a[(xsi:type="t"? & @t/b & @t/c) | ((xsi:type="u" & @t/b & @t/c), (u/d | u/e))]

A validator can validate against this extended regular expression and map back to the components by way of the value of xsi:type. At this lowest level, xsi:type functions as an implied attribute, but takes on added signficance once we map from this level back into the level of schema. It should also be noted that this expression does not follow the single attribution and unambiguity constraints of DTDs or Schema. Those constraints are applied at the component level.

Another aspect of this translation is to treat element names in regular expressions as namesets containing the names of the elements in the substitution group of the base element. For example, suppose there were an element "foo" whose content was just an "a" element, and that there also existed an element "f" in the substitution group of "a". The the content of the "foo" component would be:

foo[a]

and the full regular expression would expand to:

foo[(a | f)].

2.n Description of general forest grammars and general description of how the component grammars can be converted into full forest grammars. Convert some extended type to a full grammar.

3 Structures

This section defines the structures used in this formalism: names, namesets, atomic datatypes, model groups, components, and forests.

3.1 Names

It is not clear how much of this needs to remain here now that NUNs are moving out. Perhaps we stick with this, as why make changes that will just be rendered irrelevant?

The other significant issue here, is that we don't take into account the 208 prefixing issues. However, to be complete, we need to do so. The least disturbing way I can find to do so is to include them in the names - so that a name also specifies if it matches a qualified or unqualified element/attribute (keeps it out of the components).

A namespace is a URI reference, and a local name is an NCName, as in the Namespace recommendation. We let i range over namespaces and the "non-namespace" (where items not in a namespace reside) and j range over local names. In order to cover the case of naming components that are not in a namespace, i may be empty. Using these names we can create unique identifiers for the components of a schema. These names are for the abstract components of the schema and ultimately are not dependent on the XML Schema transfer syntax or any configuration of files the schema may be stored in.

namespace

::=

URI reference | e

local name

NCName

A symbol space is one of the six symbol spaces in XML Schema. We let ss range over symbol spaces.

symbol space

::=

element

attribute

type

attributeGroup

modelGroup

notation

A symbol name consists of a symbol space paired with a local name or with * (the latter names an anonymous component). A name consists of a URI reference followed by a sequence of one or more symbol names. We let sn range over symbol names, and x range over names.

symbol name

::=

ss::j

symbol space ss, local name j

ss::*

symbol space ss, anonymous name

name

::=

i#sn₁/…/sn_n

namespace i, symbol names sn₁,…,sn_n

Here n = 1 is the scope length of the name.

An example of a name is:

http://www.example.com/baz.xsd#type::u/element::d/type::*/element::a

The scope length of this name is 4.

In this document we've found it convenient to use a short form of names. Symbol names in the attribute symbol space are written @j, all other symbol names are written j (or * for the oxymoronic anonymous names) - in our examples we've ensured our local names are unique across all symbol spaces. The URI reference may be dropped when it is obvious from context. For example, the short form of the name above is u/d/*/a. Additional examples of names and short forms appear in Section 2.1.

Before normalization, all names in a forest have scope length equal to one. It is helpful to define functions to extract the namespace from a name, and to extract the symbol space and local name of the last symbol name. We define namespace(x) = i, symbol(x) = ss, and local(x) = j when x = i#sn₁/…/sn_n and sn_n = ss::j.

We also introduce several subclasses of names. An attribute name is the name of an attribute component, and similarly for element, simple type, and complex type names. We let a, e, s, k range over attribute, element, simple type, and complex type names.

attribute name

::=

element name

::=

simple type name

::=

complex type name

::=

A type name is a simple or complex type name. We let t range over type names.

type name

::=

simple type name

complex type name

The class of a name must be consistent with its symbol space.

symbol(a)

attribute

symbol(e)

element

symbol(t)

type

3.2 Namesets

Generalization of wildcards and substitution groups - note that x1 UNION x2 is the same thing as putting x2 in the substitution group of x1. All leaves in regex's are now namesets.

A nameset denotes a set of qualified names. The form *:* denotes any name in any namespace (note this is a namespace name, not a prefix), i:* denotes any name in namespace i, and the form x denotes the single name x. Namesets may be combined by union or difference. Namesets constructed in this way have the functionality of wildcards. We let w range over wildcard items, and w range over wildcards.

nameset

::=

*:*

any namespace, any local name

i:*

namespace i, any local name

x

full name

w₁UNION w₂

all names in w₁ or in w₂

w₁DIFFERENCE w₂

all names in w₁ and not in w₂

w%strict

a nameset with strict processing

w%lax

a nameset with lax processing

w%skip

a nameset with skip processing

For example, the nameset *:* DIFFERENCE(baz:* UNION xsi:string)%strict denotes any name in any namespace, except for names in namespace baz, and the local name string in namespace xsi. Since it will be convenient to write wildcards with minimal parentheses, we will assume the precedence UNION > DIFFERENCE > %.

3.3 Atomic datatypes

This part will require significant work. I'd like to treat a simple type as a union over facets - treating the value space as a facet, rather than something intrisic to some simple type. It makes unions just disjunctive normal forms of facets.

We take as given the atomic datatypes from XML Schema Part 2. We let p range over atomic datatypes, and c range over constants of such datatypes.

Typically, an atomic datatype is either a primitive datatype, or is derived from another atomic datatype by specifying a set of facets. Note lists of atomic datatypes are specified using repetition while unions are specified using alternation, as defined below.

An example of an atomic datatype is xsi:string, and a constant of that datatype is "zero".

3.4 Content groups

Let g range over content groups.

group

::=

empty sequence

empty choice

g₁ , g₂

sequence, g₁ followed by g₂

g₁ | g₂

choice, g₁ or g₂

g₁ & g₂

all, g₁ and g₂ in either order

g{m,n}

repetition of g between m and n times

@w[g]

attribute with name in w and content in g

w[g]

element with name in w and content in g

mixed(g)

mixed content in group g

atomic datatype

component name

minimum

natural number

maximum

::=

natural number

unbounded

An example of a group appears in Section 2.2.1.

The empty sequence matches only the empty forest; it is an identity for sequence and all. The empty choice matches no forest; it is an identity for choice.

e,g

g ,e

Ø|g

g |e

e&g

g &e

For use with repetitions, we extend arithmetic to include 8 in the obvious way. For any n we have n + 8 = 8+ n = 8 and n = 8 is always true, and 8 < n is always false.

3.5 Components

A sort is one of the six sorts of component in XML Schema. We let srt range over sorts.

sort

srt

::=

attribute

element

simpleType

complexType

attributeGroup

modelGroup

identityConstraint

(Shortcoming: we make no further use of attributeGroup or modelGroup in this document.)

A derivation specifies how a component is derived from its base. We let der range over derivations, and ders range over sets of derivations.

derivation

der

::=

extension

refinement

derivation set

ders

::=

{der₁,…,der_l}

We let b range over booleans.

boolean

::=

true

false

A component is a record consisting of a constructor (srt) with six fields.

name: is the name of the component (x);
base: is the name of the base component of the structure (x);
derivation: specifies how the component is derived from the base (der);
refinement: is the set of permitted derivations from this component as base (ders);
abstract: is a boolean, representing whether this type is abstract (b);
content: is the content of the structure, a model group (g).

We let cmp range over components.

component

cmp

::=

srt(

name = x

base = x

derivation = der

refinement = ders

abstract = b

content = g

)

Examples of components appear in Section 2.2.2.

This general model of components is far looser than the is strictly allowed by XSDL. While this is not a problem because the creation of components is strongly controlled by the languages they are generated from (currently either an XSDL Schema or a set of XSDL components as described in Structures[11]), a "tighter" grammar, more closely corresponding to what is accepted by XSDL, is found in appendix A.

For a given schema, we assume a fixed dereferencing map that takes names to the corresponding components. We write deref(x) = cmp if name x corresponds to component cmp.

The dereferencing map must map attribute names to attribute components, and similarly for elements, simple types, and complex types.

deref(a).sort

attribute

deref(e).sort

element

deref(s).sort

simpleType

deref(k).sort

complexType

3.6 Constraints

Recall that the group constructs are used uniformly in several contexts. Repetition (g{m,n}) is used for lists of atomic datatypes, to indicate whether an attribute is optional, and for elements. Similarly, all (g₁ & g₂) is used for attributes and for elements. The advantage of this approach is that the semantics of groups is uniform, and need be given only once. Thus, for instance, how repetition acts is defined once, not separately for attributes and elements.

The group grammar is rather more accommodating than that implicit in XML schemas. It permits attributes to have complex local type definitions, multiple occurrences of the declaration of a given attribute name, and the mentioning of an attribute anywhere in the defintion of a group. As we do not generally translate our group grammars back into XML schemas, this permissiveness is of little consequence. We can, nonetheless, define various sub-grammars of the group grammar to constrain matters in so as to be consistent with the XML schemas specification.

However, it is helpful to define additional syntactic categories that specify various subsets of groups. These syntactic classes are then used to constrain the content of components, for instance, to indicate that the content of an element component should be an element, and that the content of a type component should consist of attributes followed by elements.

An attribute group contains only attribute names, combined using the all connector (&). An element group contains no attribute names. (Shortcoming: all groups in element groups should be further constrained as in Schema Part 1, Section 5.7, All Group Limited.)

attribute group

::=

empty sequence

ag₁ & ag₂

all

attribute nameset

element group

::=

empty sequence

empty choice

eg₁ ,eg₂

sequence

eg₁ |eg₂

choice

eg₁ &eg₂

all

eg{m,n}

repetition

nameset

Given the above, we can specify the allowed contents of the four sort of component as follows. An attribute content is either an attribute or an optional attribute, where the content is a simple type name. An element content is an element where the content is a type name. A simple type content is an atomic datatype or a list of atomic datatype. A complex type content is an attribute group followed by either a simple type content or an element group.

attribute content

::=

a[s]

attribute

a[s]{ 0,1}

optional attribute

element content

::=

e[t]

element

union content

::=

simple type

uc | uc

union of simple types

simple content

::=

union content

uc{m,n}

list of a union content

complex content

::=

ag ,sc

attributes follwed by simple content

ag ,eg

attributes followed by elements

ag,mixed( eg)

mixed content

The content of each sort of component corresponds to the syntax above. That is, the content of an attribute component is always an attribute content ac; the content of an element component is always an element content ec; the content of a simple type component is always a simple content sc; and the content of a complex type component is always a complex content kc. Further, for an attribute or element component the name of the component is the same as the name of the attribute or element in its content.

It is easy to confirm that the example components in Section 2.2.2 satisfy these constraints.

3.7 Ordered Forests

A forest is a sequence of items, where each item is either an atomic datatype, or an attribute (with a list of items of a single atomic datatype as content), or an element (with a forest as content). We let d range over forests.

ordered forest

::=

empty forest

d₁,d₂

sequence

constant of atomic datatype

a[d]

attribute, with name a and content d

e[d]

element, with name e and content d

Examples of forests appear in Section 2.3.

A typed forest is an ordered forest with added type information for each element and attribute. We let td range over typed ordered forests.

typed ordered forest

::=

empty forest

td₁,td₂

sequence

constant of atomic datatype

a[s types td]

attribute, with simple type name s

e[t types td]

element, with type name t

Add a section describing any additional rules for the full forest grammars.

3.8 Inference rules

Operations such as normalization and matching are described with an inference rule notation. Originally developed by logicians [15,16,7], this notation is now widely used for describing type systems and semantics of programming languages [10]. In this notation, when all judgements above the line hold, then the judgement below the line holds as well. Here is an example of a rule used later on. We write d in g if forest d matches group g.

Sequence:

d₁ in g₁ d₂ in g₂

d₁ ,d₂ in g₁, g₂

The rule says that if both d₁ in g₁ and d₂ in g₂ hold, then d₁ ,d₂ in g₁ , g₂ holds as well. For instance, take d₁ = a[1], d₂ = b["two"], g₁ = a[xsi:integer], and g₂ = b[xsi:string]. Then since both

a[1] in a[xsi:integer ] and b["two"] in b[xsi:string ]

hold, we may conclude that

a[1] ,b["two"] in a[xsi:integer ] ,b[xsi: string]

holds.

4 Normalization

This section needs to be drastically revised, as normalization will need to occur as part of validation. Therefore the inference rules are certainly misplaced.

Normalization of a forest replaces each name by the corresponding normalized name, and adds type information to the forest. Normalization is performed with respect to a given schema; in our formalism the schema is determined by the dereferencing map, deref(). Section 2.3 gives an example of a document before and after normalization.

Prior to normalization, all names in the forest have exactly one symbol name. We build normalized names by extending a name with an additional symbol name. Let x₁ and x₂ be two names, with namespace(x₁) = namespace(x₂), and where the second forest has only one symbol name. We write x₁ |> x₂ for the operation that extends x₁ by adding the symbol name of x₂, defined as follows.

i#sn₁/…/sn_l |> i#sn

i#sn₁/…/sn_l/sn

On the other hand, if x₁ and x₂ are from different namespaces (namespace(x₁ ) = i, namespace(x₂) = j and i ? j), then x₁ |> x₂ is defined as:

i#sn₁/…/sn_l |> j#sn

j#sn

Finally, for each sort there is a root type (AnyType, AnyElement, etc.) These are fixed points, i.e.,

x |> AnyElement

AnyElement.

(Since these constants will eventually be defined in a fixed namespace, this is really subsumed by the previous rule.)

We write x |- a ?a' and x |- e ?e' to indicate that in the context specified by name x that attribute name a normalizes to a' or that element name e normalizes to e'. To normalize an attribute or element we extend the context (if the extended name is in the domain of deref(); these are the `extend' rules), or use the element name directly (otherwise; these are the `reset' rules). We write x in dom(deref()) and x notin dom(deref()) to indicate whether or not x is in the domain of the dereferencing map; that is, whether or not deref(x) is defined. If x₁ and x₂ are in different namespaces, then x₁ |> x₂ is undefined, and we say x notin deref() holds.

Extend Attribute Transitive:

y |> a in dom(deref())
x <: y

x |- a ? y |> a

If the attribute name does not appear for the current type, check for that name among supertypes.

Extend Element Transitive:

y |> e in dom(deref())
x <: y

x |- e ? y |> e

If the element name does not appear for the current type, check for that name among supertypes.

Extend Attribute Base:

a in dom(deref())

x |- a ? a

An attribute name can be normalized if it corresponds to a known name.

Extend Element Base:

e in dom(deref())

x |- e ? e

An attribute name can be normalized if it corresponds to a known name.

EDITORS NOTE: These inference rules will need to be revisited with regards to the typing of wildcards, and inclusion of model and attribute groups.

We write x |- d ?dt to indicate that in the context specified by name x that forest d normalizes to typed forest td. We write @xsi:type notin d if d does not contain an xsi:type attribute. Note that the type names in xsi:type attributes always refer to global types and need not be normalized.

Constant:

x |- c ? c'

Empty Document:

x |- e ? e

Sequence:

x |- d₁ ? td₁ x |- d₂ ? td₂

x |- d₁,d₁ ? td₁ , td₂

Attribute:

x |- a ? a'
deref(a').content = a'[s] or deref(a').content = a'[s]{0,1} s |- d ? td

x |- a[d] ? a'[s types td]

Untyped Element:

x |- e ? e'
@xsi:type notin d
deref(e').content = e'[t]
t |- d ? td

x |- e[d] ? e'[t types td]

Typed Element:

x |- e ? e'
deref(e').content = e'[t]t <: t't |- d ? td

x |- e[@xsi:type[t], d] ? e'[t types td]

The (Typed Element) rule uses the relation x < :x', which is defined Section 5.1.

5 Derivation

5.1 Base chain

We write x <: :x' if starting from the component with name x one can reach the component with name x' by successively following base links.

Reflexive:

x <: x

Transitive:

x <: x' x' <: x''

x <: x''

Base:

deref(x).base = x'

x <: x'

5.2 Extension

We write g <:_extg' if group g is derived from group g' by adding attributes and elements. It is specified using attribute groups ag and element groups ag as defined in Section 3.6.

Extend:

(ag & ag') , eg , eg' < :_ext ag , eg

5.3 Restriction

We write g < :_res g' if the instances of group g are a subset of the instances of group g'. That is, g <:_res g' if for every forest d such that d in g it is also the case that d in g'.

5.4 Constraints

A schema must satisfy certain constraints on derivation to be well-formed. Define x <:_der x' to be x <:_res x' if der = restriction and x <:_ext x' if der = extension. We write |- x to indicate that the component with name x is well-formed with respect to derivation. (N.B. This use of turnstile differs from that in normalization. One of us [ALB] suggests that the normalization turnstile should be replaced with |~.)

Refinement:

x' = deref(x).base
der = deref(x').derivation
deref(x).content < :_der deref(x').content

|- x

6 Validation

6.1 Mapping component models to regular tree grammars

We have already introduced a linear representation of an ordered (typed) forest. Content models are, in effect, grammars for such forests. Indeed, content models can be converted to regular tree grammars [ref]. We call the result of this conversion the tree closure of the content model. This conversion process, roughly speaking,

replaces component names with the regular closures of their content models
replaces an individual element (component) name with a choice of all the element names from the same substitution group as the original element
replaces an individual type (component) name with the a choice of all the type names consisting of the original type names and all the subtypes thereof
replaces wild cards with a choice of element names (we assume name sets to be finite)
replaces a mixed content model with an interleaving og (atomic) string datatypes with the unmixed constituents of the original contnet model

From the tree closure we can construct by well known methods [ref] a regular tree automaton. Validation is then a question of whether a forest of the kind defined earlier is accepted by the automaton constructed from the tree closure. We contrast this "compiled" version of validation with its "interpreted" alternative. In the latter case we would deconstruct forests and content groups in parallel, demonstrating that each constituent of the forest had a corresponding constituent in the content group. As for compiled validation, the terminal symbols of the tree grammar are, in fact, arbitrary strings, constrained only by the atomic datatypes of which they are declared to be instances: strings, dates, integers, etc. Of course, all of these lexical forms have corresponding regular expression definitions are just special cases of regular tree grammars. For the primitive atomic datatype p we will assume that π(p) yields the regular tree grammar for that type and that a is the generic

Let tk range over tree closures.

tree closure

::=

empty sequence

empty choice

tk₁ , tk₂

sequence, tk₁ followed by tk₂

tk₁ | tk₂

choice, tk₁ or tk₂

tk₁ & tk₂

all, tk₁ and tk₂ in either order

tk{m,n}

repetition of tk between m and n times

@a[tk]

attribute with name a and content in tk

e[tk]

element with name e and content in tk

π(p)

regular tree grammar for a primitive atomic datatype

minimum

natural number

maximum

::=

natural number

unbounded

We see that (syntactically) a tree closure is a kind of content group. It is natural then to think of the relation ship between a content group and its tree closure as simply another kind of reduction relationship on content groups, hence we write it as < :_tk. < :_tk is preserved by rewriting content models using any of the matching rules of section 3.4

empty sequence:

e < :_tk e

empty choice:

Ø < :_tk Ø

nonempty Sequence:

tk₁ < :_tkg₁ tk₂ < :_tk g₂

tk₁ , tk₂ < :_tk g₁ , g₂

nonempty choice:

tk₁ < :_tkg₁ tk₂ < :_tk g₂

tk₁ | tk₂ < :_tk g₁ | g₂

Ordered alternative:

tk₁ < :_tkg₁ tk₂ < :_tk g₂

tk₁ & tk₂ < :_tk g₁ & g₂

Group repetition:

tk < :_tk g

tk{m,n} < :_tk g{m,n}

attribute nameset:

tk < :_tk ag w = w' UNION {a} tk' < :_tk w'[tk]

@a[tk] | tk' < :_tk @w[ag]

element nameset:

tk < :_tk eg w = w' UNION {e} tk' < :_tk w'[tk]

e[tk] | tk' < :_tk w[eg]

mixed content:

tk₁ < :_tkmixed(eg₁) tk₂ < :_tk mixed(eg₂)

xsd:string{0,1} , tk₁ , xsd:string{0,1} , tk₂ , xsd:string{0,1} < :_tk mixed(eg₁,eg₂)

A fine point : What is the status of a string followed by a string in a mixed content model? It is presumed here that it's still just a string not a sequence of strings.

Model group component:

deref(x)..content = eg tk < :_tk eg x.srt = modelGroup

tk < :_tk x

Attribute group component:

deref(x)..content = ag tk < :_tk ag x.srt = attributeGroup

tk < :_tk x

Element component:

deref(x)..content = c tk < :_tk w[tk'] tk' < :_tk c

w = {y s.t. y <: x deref(y).abstract = false y.srt = element}

tk < :_tk x

attribute component:

deref(x)..content = c tk < :_tk c x.srt = group

@x[tk] < :_tk x

unsubtyped concrete complex Type component:

deref(x).content = false deref(x)..content = c tk < :_tk c x.srt = complexType

There is no y s.t. y <: x

tk < :_tk x

unsubtyped abstract complex Type component:

deref(x).content != abstract deref(x)..content = c tk < :_tk c x.srt = complexType

There is no y s.t. y <: x

Ø < :_tk x

subtyped concrete complex Type component:

y₁ ... y_m are the immediate subtypes of x by extension

z₁ ... z_n are the immediate subtypes of x by restriction

tk < :_tk deref(x)..content x.srt = complexType deref(x).abstract = false

sk₁ < :_tk deref(y₁)..content _... sk_m < :_tk deref(y_m)..content

rk₁ < :_tk deref(z₁)..content _... rk_n < :_tk deref(z_n)..content

tk | (tk , (( sk₁ | ... | sk_m ) | ( rk₁ | ... | rk_n ))) < :_tk x

subtyped abstract complex Type component:

y₁ ... y_m are the immediate subtypes of x by extension

z₁ ... z_n are the immediate subtypes of x by restriction

tk < :_tk deref(x)..content x.srt = complexType deref(x).abstract = true

sk₁ < :_tk deref(y₁)..content _... sk_m < :_tk deref(y_m)..content

rk₁ < :_tk deref(z₁)..content _... rk_n < :_tk deref(z_n)..content

tk , (( sk₁ | ... | sk_m ) | ( rk₁ | ... | rk_n )) < :_tk x

unsubtyped concrete simple Type component:

deref(x).content = false deref(x)..content = c tk < :_tk c x.srt = simpleType

There is no y s.t. y <: x

tk < :_tk x

unsubtyped abstract simple Type component:

deref(x).content != abstract deref(x)..content = c tk < :_tk c x.srt = simpleType

There is no y s.t. y <: x

Ø < :_tk x

subtyped concrete simple Type component:

z₁ ... z_n are the immediate subtypes of x by restriction

tk < :_tk deref(x)..content x.srt = complexType deref(x).abstract = false

rk₁ < :_tk deref(z₁)..content _... rk_n < :_tk deref(z_n)..content

tk | rk₁ | ... | rk_n < :_tk x

subtyped abstract simple Type component:

z₁ ... z_n are the immediate subtypes of x by restriction

tk < :_tk deref(x)..content x.srt = complexType deref(x).abstract = true

rk₁ < :_tk deref(z₁)..content _... rk_n < :_tk deref(z_n)..content

rk₁ | ... | rk_n < :_tk x

Primitive Atomic datatype:

π(p) < :_tk p

6.2 Content validation by inference rules

This section is largely made obsolete by the previous section. Some of the rules above depend on element and type reducibility defined in this section. In some future iteration of this document we elide the irrelevant rules.

We write c in_p p if constant c matches atomic datatype p. We do not specify this relation further, but simply assume it is as defined in XML Schema Part 2.

We write x in_w w and e in_w w if name x belongs to nameset w. We write e notin_w w if it is not the case that x in_w w.

Any Namespace:

x in_w*:*

Given Namespace:

namespace(x) = i

x in_w i:*

Singleton Namespace:

deref(x) <: deref(i#j)

x in_w i#j

Wildcard Union 1:

x in_W w₁

x in_W w₁ UNION w₂

Wildcard Union 2:

x in_W w₂

x in_W w₁ UNION w₂

Wildcard Difference:

x in_W w₁ e notin_W w₂

e in_W w₁ DIFFERENCE w₂

We write d |?_unmix d' (read ``d unmixes to d''') if d is a sequence of elements and characters and d' is the same sequence with all character data removed. This is used for processing mixed content.

Unmix Empty:

e |?_unmix e

Unmix Sequence:

d₁ |?_unmixd'₁ d₂ |?_unmix d'₂

d₁ , d₂ |?_unmix d'₁ , d'₂

Unmix Element:

e[d] |?_unmix e[d']

Unmix Character Data:

c |?_unmix e

We write d |?_inter d';d'' (read "d interleaves d' and d'' ") if some interleaving of d' and d'' yields d. This relates one sequence to many pairs of sequences. This is used for processing g₁ &g₂.

Interleave Empty:

e |?_inter e;e

Interleave Sequence:

d₁ |?_interd'₁;d"₁ d₂ |?_inter d'₂;d"₂

d₁ , d₂ |?_inter d'₁ , d'₂ ;d"₁,d"₂

Interleave Item 1:

d |?_inter d;e

Interleave Item 2:

d |?_inter e;d

We write d in g if forest d matches group g.

Empty:

e in e

Sequence:

d₁ in g₁ d₂ in g₂

d₁ , d₂ in g₁ , g₂

Choice 1:

d in g₁

d in g₁ | g₂

Choice 2:

d in g₂

d in g₁ | g₂

Interleave:

d |?_inter d'₁;d"₂ d₁ in g₁ d₂ in g₂

d in g₁ & g₂

Repetition 1:

d₁ in g d₂ in g{m,n}

d₁ ,d₂ in g{m +1,n+1}

Repetition 2:

d₁ in g d₂ in g{0,n}

d₁ ,d₂ in g{0,n+1}

Repetition 3:

e in g{0,n}

Attribute:

d in g a in_ww deref(a).content=g

a[d] in @w

Element:

d in g e in_ww deref(e).content=g

e[d] in w

Attribute Skip:

a in_ww

a[d] in @w%skip

Element Skip:

e in_ww

e[d] in w%skip

Mixed Group:

d |?_unmix d' d' in g

d in mixed(g}

Typed Attribute:

d in s

e[s types d] in e[s]

Typed Element:

d in t t <: t'

e[t types d] in e[t']

Atomic Datatype:

c in_p p

c in p

Element Refinement:

d in e e <: e'

d in e'

Component Name:

d in g g = deref(x).content

d in x

The (Typed Element) and (Element Derivation) rules use the relation x < :x', which is defined in Section 5.1. The check that t < :t' in the (Typed Element) rule is redundant, as it is also performed during normalization.

When processing a normalized document with types, the ( Attribute) and (Element) rules are not required, the (Typed Attribute) and ( Typed Element) rules are used instead. However, we will write td in g whenever |- d ?td and d in g.

6.3 ID/IDREF validation

Append section on identity constraints.

We write nc in IDREFs(td) if nmtoken constant nc appears as the content of an attribute or element of type IDREF with typed forest td. We write nc in IDs(td) if nmtoken constant nc appears as the content of an attribute or element of type ID somewhere with typed forest td. We write nc in ! IDs(td) if nmtoken constant nc appears exactly once as the content of an attribute or element of type ID somewhere with typed forest td. We write nc in IDs!(td) if nc in! IDs(td). We write td OK if typed forest td is valid with respect to ID and IDREF, that is, if no ID is defined twice, and every IDREF has a corresponding ID.

IDREFs(d) ? IDREFs(td) IDs(td) ? IDs!(tc)

td OK

A value nc counts as a reference whenever there is an element or attribute of type IDREF in an instance having nc as its value:

A-Refs:

nc in IDREFs(a[IDREF types nc])

E-Refs:

nc in IDREFs(e[IDREF types nc])

A value nc counts as a referenece whenever there is an element among whose children nc counts as a referenece:

Child-Refs:

nc in IDREFs(td')

nc in IDREFs(e[t' types td'])

A value nc counts as a reference among a sequence of instances whenever it counts as a reference among a subsequence of those instances:

Subref 1:

nc in IDREFs(td₁)

nc in IDREFs(e[td₁ , td₂])

Subref 2:

nc in IDREFs(td₂)

nc in IDREFs(e[td₁ , td₂])

The notation in! indicates that a structure of a given shape occurs uniquely within another. An instance td satisfies the id-uniqueness criterion whenever any NCNAME that is among the IDs of td is also among its uniquely occurring IDs:

Not-ID:

nc notin IDs(td)

nc in IDs(td) ? nc in! IDs(td)

ID:

nc in! IDs(td)

nc in IDs(td) ? nc in! IDs(td)

A value nc counts as an ``id'' whenever there is an element or attribute of type ID in an instance having nc as its value::

A-ID:

nc in! IDs(a[ID types nc])

E-ID:

nc in! IDs(e[ID types nc])

A value nc counts as an ``id'' among a sequence of instances whenever it counts as an ``id'' among a subsequence of those instances, and does not count as an ``id'' among the remaining subsequence of those instances:

ID 1:

nc in IDs(td₁)

nc in IDs(td₁ , td₂)

ID 2:

nc in IDs(td₂)

nc in IDs(td₁ , td₂)

A value nc counts as a unique ``id'' among a sequence of instances whenever it counts as an ``id'' among a subsequence of those instances, and does not count as an ``id''among the remaining subsequence of those instances:

UN-ID 1:

nc in! IDs(td₁) nc notin IDs(td₂)

nc in! IDs(td₁ , td₂)

UN-ID 2:

nc in! IDs(td₂) nc notin IDs(td₁)

nc in IDs(td₁ , td₂)

Lastly, we have the following obvious relationship between occurrence in IDs and unique occurrence in IDs:

InnIDs-InIDs:

nc in! IDs(td)

nc in IDs(td)

7 XML Schema to Formalization Mapping

EDITOR'S NOTE: Future versions of this document will contain a mapping from the XML Schema components described in [11] to sorts. Both XML Schema and models make a distinction between the underlying model (components in the case of XML Schema, sorts in the current case) and the surface syntax (the set of valid <schema> elements accepted for either). In both cases, the underlying model is considered more significant.

This section will describe the mapping from XML Schema instance syntax to sorts. This requires performing two mutually recursive activities:

converting from XML Schema surface syntax into a set of sorts
for each sort, building its contents (note that component contents do not nest, the content of one component contains only references to other components).

To do so, we also introduce two new relations:

d -? g (eg writes as g) takes the xsdl syntax represented by eg and returns a content description fragment, valid according to the grammar production rules enumerated previously. Generally, f is the name (or names, if eg is a sequence) of the sort corresponding to eg.
e[d] =? cmp (e corresponds to cmp) takes an xsdl structure definition, such as a complexType definition, and maps it into an sort, such as a complex.

In any language of the complexity of XML Schema, there are both syntactic constraints expressible directly in the syntax of the language, and semantic constraints which further limit allowed expressions. In the following we enumerate those fragments of XML Schema which correspond to semantically valid expressions. While this could be done by enumerating all such sequences, that would include enumerating all the different cases of optionality and order - a daunting task. Instead, we reuse the grammars for content models developed above to specify fragments of XML and then pattern match against portions of this. The goal is to cover all fragments of XML Schema which are semantically coherent - other fragments may be syntactically valid, but not correspond to useable constructs. In each of the inference rules, the left hand side will be a string in this grammar. Strings in bold match strings in the XML instance. Strings in italics pattern match as variables to be used in the inference rule. As these represent fragments of schemas valid against the Schema for Schemas, we assume they have been validated and all default/fixed values for attributes are present. The only particular addition to this is to consider ``*'' as the default value for the name attribute of a nested complexType.

The following fragment gives an example:

anElem[@attr1[val1], @attr2[val2], nest[eg], ev]

Here we have:

An element with GI anElem with a required attribute named attr1 whose value is assigned to the variable val1, and a second attribute named attr2 whose value is assigned to the variable val2.
Within anElem another element named nest whose contents is assigned to the variable eg.
And finally, another nested element whose value is assigned to the variable ev.

Note that we provide a sequence for the attributes. This is purely for the convenience of the reader - in an actual instance these would appear in any order.

We also make the following additions to the grammar specified up to this point:

attribute model

::=

attribute[ag , eg]*

attribute defintion

content model root

::=

sequence[ag , eg]

sequence group

all[ag , eg]

all group

choice[ag , eg]

choice group

simple content

::=

simple[ag , eg]

simple type defintion

This formalism extends derivation to cover all sorts. Therefore, every kind of sort has a base sort from which all user-defined sorts must, ultimately, be derived. Of these, only two, for complex and simple types, are currently available from XML Schema. For the time being, the rest (such as for elements and attributes) are restricted to sorts and are not available to users.

7.1 Conversions for content model particles

Write All:

x |- ev₁ -? e_1

...x |- ev_n -? e_n

x |- all[ev₁...ev_n] -? (e₁&...&e_n)

If there is an idall element with content ev₁,...,ev_n and each ev_i converts to e_i in the context of x, then the element converts to (e₁,...,e_n).

Write Sequence:

x |- ev₁ -? e_1

...x |- ev_n -? e_n

x |- sequence[@minOccurs[m] , @maxOccurs[k], ev₁...ev_n] -? (e₁&...&e_n){m.k}

Write Choice:

x |- ev₁ -? e_1

...x |- ev_n -? e_n

x |- choice[@minOccurs[m] , @maxOccurs[k], ev₁...ev_n] -? (e₁|...|e_n){m.k}

For a sequence or choice with minOccurs or maxOccurs, where one or the other has a value other than 1, that sequence or choice converts to a representation with min and max values indicated at the end.

Write Model Ref:

g = deref(x |> n).content

x |- group[@ref[n]] -? g

A reference to a group writes as the contents of the corresponding group sort.

Write AttGroup Ref:

g = deref(x |> n).content

x |- attributeGroup[@ref[n]] -? g

A reference to an attributeGroup writes as the contents of the corresponding group sort.

Write Complex:

x |- complexType[t] =? c

x |- complexType[t] -? c.name

If a complexType corresponds to some sort, c, then it writes as the name of that sort.

Write Attribute:

x |- attribute[@name[s] , ag , eg] =? c

x |- attribute[@name[s] , ag , eg] -? c.name

If an attribute corresponds to some sort, c, then it writes as the name of that sort.

Write Attribute Ref:

x |- attribute[@ref[s] , ag , ev₁,...,ev_n] ? deref(x |> s).name

If an attribute (in a content model) refers to some other attribute, c, then it writes as the name of the sort corresponding to that attribute.

Write Element Ref 1:

x |- element[@name[s] , ag , eg] =? c

x |- element[@name[s] , ag , eg] -? c.name

If an element corresponds to some sort, then it writes as the name of that sort.

Write Occurs:

x |- element[ag , eg] -? y

x |- element[@minOccurs[m] , @maxOccurs[k] , ag , eg] -? y{m, k}

If an element has minOccurs or maxOccurs attributes, one of which has a value other than 1, and the element without the attributes writes as y, then the element with the attributes writes as y{m, n}, where m and n are the values of the attributes.

Write Element Ref 2:

x |- element[@ref[s] , ag , eg] -? deref( x |> s).name

If an element (in a content model) refers to some other element, c, then it writes as the name of the sort corresponding to that attribute.

7.2 Correspondences from XML Schema to sorts

Valid Schema Doc:

x|- e₁=?srt₁
...
x|- e_k =? srt_k

|- schema[@targetNamespace[x], e₁, ..., e_k]

A schema element is accepted if all of its children convert to components in the context of its targetNamespace.

Map Complex A:

y = x |> n
y |- cm -? elems y |- am -? atts

x |- complexType[@abstract[a] , @final[f] , @block[b] ,
@mixed[m] , @name[n] , cm , am]
=? complex(name = y, base = xsi:AnyType,
derivation = restriction,
refinement = if (f == ##all) then {restriction, extension} else f,
abstract = a
content = (atts, if (m == true) then mixed(elems) else elems))

Given a complex type which is not a derivation, we determine its name and convert its attribute and content definitions into component content (within the context of the name). We then use these to create a complex sort for the complex type.

Map Restriction 1:

y = x |> n
z = deref(x |> p)
x.content = (am₂ ,cm₂)
y |- cm -? cont y |- am -? atts

x |- complexType[@abstract[a] , @final[f] , @block[b] ,
@mixed[m] , @name[n] ,
complexContent[restriction[@base[p],cm , am]]
=? complex(name = y, base = x.name,
derivation = restriction,
refinement = if (f == ##all) then {restriction, extension} else f,
abstract = a
content = (((am₂ - attrs), attrs), if (m == true) then mixed(cont) else cont))

If a complex type is a restriction of another type, then we proceed as above, except we retrieve the contents of the super type and the set of attributes of the derived type is a combination of its own and those of the parent. In this case, am₂ - attrs represents the attributes defined in am₂ less any attributes in attrs with the same name. (Note, this definition may require more work.)

Map Complex Extension:

y = x |> n
z = deref( x |> p)
x.content = (am₂ ,cm₂)
y |- cm -? cont y |- am -? atts

x |- complexType[@abstract[a] , @final[f] , @block[b] ,
@mixed[m] , @name[n] ,
complexContent[extension[@base[p],cm , am]]
=? complex(name = y, base = z.name,
derivation = extension,
refinement = if (f == ##all) then {restriction, extension} else f,
abstract = a
content = (((am₂ - attrs), attrs), if (m == true) then mixed(cm₂, cont) else cm₂ , cont))

If the complex type is an extension of the complex type, then we retrieve the contents of the super type. The attributes of the derived type is the union of its attributes and those of the supertype. Its content model is a sequence consisting of the content model of the supertype followed by that of the subtype declaration.

Map Restriction 2:

y = x |> n
z = deref( x |> p)
x.content = (st₂ ,am₂)
y |- st -? simp y |- am -? attrs

x |- complexType[@abstract[a] , @final[f , @block[b] ,
@mixed[m] , @name[n] ,
simpleContent[restriction[@base[p],st , am]]
=? complex(name = y, base = z.name,
derivation = extension,
refinement = if (f == ##all) then {restriction, extension} else f,
abstract = a
content = ((am₂ , attrs), if (m == true) then mixed(simp) else cm₂ , simp))

If a complex type is a restriction with simple content, then the base type refers to a complex type with simple content. The content of that type (z, above) must be a simple type and 0 or more attributes. We convert the simple type description and create a new complex sort.

Map Model Group Definition

y = x |> n
y |- mg --> g

x |- group[@name[n], mg]
=? modelGroup(name = y,
base = AnyGroup,
derivation = restriction,
refinement = {},
abstract = false
content = g)

Map Attribute Group Definition

y = x |> n
y |- mg --> g

x |- attributeGroup[@name[n], mg]
=? attributeGroup(name = y,
base = AnyGroup,
derivation = restriction,
refinement = {},
abstract = false
content = g)

Map Global Element 1:

y = x |> n
z = x |> s
c = deref( x |> t)
c in complex

x |- element[@abstract[a] , @final[f , @block[b] , @name[n] ,
@nillable[u], @substitutionGroup[s] , @type[t]]
=? element(name = y,
base = if (z |- null) then z else AnyElement,
derivation = c.derivation,
refinement = {restriction, extension},
abstract = a
content = c.name)

To create an element sort using a type reference to a complex type.

Map Global Element 2:

y = x |> n
z = x |> s
y |- t ? c

x |- element[@abstract[a] , @final[f , @block[b] , @name[n] , @form[o] , default[d] ,
@fixed[i] , @nillable[u], @substitutionGroup[s] , @type[t]]
=? element(name = y,
base = if (z |- null) then z else AnyElement,
derivation = c.derivation,
refinement = {restriction, extension},
abstract = a
content = c.name)

To create an element containing an inlined type definition, find the sort corresponding to the type definition and create the element sort using the name of the type definition for the content.

Map Local Element 1:

y = x |> n
c = x |> t

x |- element[@default[d], @fixed[i] , @name[n] , @form[o] ,
@maxOccurs[m] , @minOccurs[n] , @nullable[u], @type[t]]
=? element(name = y,
base = AnyElement,
derivation = restriction,
refinement = {},
abstract = false
content = c.name)

A local element which references a type creates a new element structure, whose content is the name of the referenced type.

Map Local Element 2:

y = x |> n
y |- t ? c

x |- element[@default[d], @fixed[i] , @name[n] , @form[o] ,
@maxOccurs[m] , @minOccurs[n] , @nullable[u], t]
=? element(name = y,
base = AnyElement,
derivation = restriction,
refinement = {},
abstract = false
content = c.name)

A local element with an inlined type declaration.

Map Attribute 1:

y = x |> n
c = x |> t

x |- attribute[@name[nd], @form[o] , @type[tn] , @use[u] , @value[v]]
=? attribute(name = y,
base = AnyAttribute,
derivation = restriction,
refinement = {},
abstract = false
content = c)

An attribute declaration referencing a type.

Map Attribute 2:

y = x |> n
y |- s ? c

x |- attribute[@name[nd], @form[o] , @use[u] , @value[v], s]
? attribute(name = y,
base = AnyAttribute,
derivation = restriction,
refinement = {},
abstract = false
content = c.name)

An attribute with an inlined type definition.

8 Mapping from XSDL Components for Formalization Components

This section specifies the mapping from the components described in XML Schemas, Part II, to the sorts described in this document. The consequent of each rule provides the XSDL component on the left and the corresponding formal component on the right. We've represented each XSDL component type as a structure similar to our own, so, for example, an XSDL particle is represented as a structure with name particle and three fields, minOccurs, maxOccurs, and term. The value of each field is a variable which can be used either in the antecedent, or in the second part of the consequent, which specifies the corresponding formal structure. In the case of a particle or model group, the formal correspondance is to a piece of content group syntax, but in the other cases it is to a sort. As there are a number of optional fields in the XSDL components, if a field does not have a value we assume it is null.

We represent the mapping from an XSDL component to a formal component by:

x ~> s (x converts to s). If x is null, then s is null.

Global Element:

s = global
u |> n |- t ~> t'
e ~> e'

element(name=n,
targetNamespace=u,
typeDefinition=t,
scope=s,
valueConstraint=v,
nillable=b,
identityConstraintDefinitions=d,
substitutionGroup=e,
substitutionGroupExclusions=x,
disallowedSubstitutions=d,
abstract=a,
annotation=z)
~>
element(name= u|>n
abstract= a
base= if e' = null then UrElement else e'.name
content= u|>n[t'.name]
derivation= if t' != e'.type then deref(t').derivation else restriction
refinement= if substitutions EO d then {} else x)

local element

s=local
y = e |> n
y |- t ~> t'

e |- element(name=n,
targetNamespace=u,
typeDefinition=t,
scope=s,
valueConstraint=v,
nillable=b,
identityConstraintDefinitions=d,
substitutionGroup=e,
substitutionGroupExclusions=x,
disallowedSubstitutions=d,
annotation=z)
~>
element(name= y,
abstract= false,
base= UrElement,
content= y[t'.name],
derivation= restriction,
refinement= {} )

complex type

y = if n == null then e/type::* else t |> n
b ~> b'
y |- a ~> a'
y |- g ~> g'

e |- complexType(name= n
targetNamespace= t
baseTypeDefinition= b
derivationMethod= m
final= f
abstract= a
attributeUsePairs= a
attributeWildcard= aw
contentType= g
prohibitedSubstitutions= r
annotations= ann)
~>
complex(name= y
abstract= a,
base= if b' = null then UrType else b'
content= (a', g'),
derivation= if b' == null then restrictionion else m
refinement= {refinement, restriction} - r)

Attribute Type - Global

sc=global
t |- s ~> s'
n' = t |> n

|- attribute(
name= n
target namespace= t
type definition= s
scope= sc
valueConstraint= vc
annotation= an)
~>
attribute(name= n'
abstract= false,
base= AnyAttribute
content= n'[s'.name]
derivation= restriction
refinement= {})

Attribute Type - Local

sc=local
t |- s ~> s'
n' = x |> n

x |- attribute(
name= n
target namespace= t
type definition= s
scope= sc
valueConstraint= vc
annotation= an)
~>
attribute(name= n'
abstract= false,
base= AnyAttribute
content= n'[s'.name]
derivation= restriction
refinement= {})

Model Group - Sequence

{x|- p_1 ~> g_1
...
x|- p_n ~> g_n

{x |- modelGroup(
compositor= sequence,
particles= p_1,...,p_n
annotation)
~>
(g_1 , ... , g_n)

Model Group - Choice

{x|- p_1 ~> g_1
...
x|- p_n ~> g_n

{x |- modelGroup(
compositor= choice,
particles= p_1,...,p_n
annotation)
~>
(g_1 | ... | g_n)

Model Group - All

{x|- p_1 ~> g_1
...
x|- p_n ~> g_n

{x |- modelGroup(
compositor= all,
particles= p_1,...,p_n
annotation)
~>
(g_1 & ... & g_n)

Particle

x|- t ~> g

{x |- particle(
minOccurs= m
maxOccurs= n
term= t)
~>
g{m,n}

8 Index of notations

8: infinity (§3.4)
e: empty sequence (§3.4), empty forest (§3.7)
Ø: empty choice (§3.4), empty set
g₁ , g₂: sequence group (§3.4)
g₁|g₂: choice group (§3.4)
g₁ &g₂: all group (§3.4)
g{m,n}: repetition group (§3.4)
a[g]: attribute group (§3.4)
e[g]: element group (§3.4)
d₁ ,d₂: sequence forest (§3.7)
a[d]: attribute forest (§3.7)
e[d]: element forest (§3.7)
a[s types td]: typed attribute forest (§3.7)
e[t types td]: typed element forest (§3.7)
d in g: forest d matches group g (§6)
|-: turnstile (§4,§5.4)
x₁ |> x₂: x1 extends x2 (§4)
x |- d ?dt: d normalizes to dt in context x (§4)
c in_p p: c matches datatype p (§6.1)
e in_v v: e matches wildcard item v (§6.1)
e in_w w: e matches wildcard w (§6.1)
d |?_unmix d': d unmixes to d' (§6.1)
d |?_inter d';d'': d interleaves d' and d'' (§6.1)
d ? d': d is a subset of d' (§6.2)
nc in IDREFs(td): nc is in the set of IDREFs in td (§6.2)
d -? g: d writes as g (§7)
e[d] =? cmp: e corresponds to cmp (§7)
x < :x': derivation (§5.1)
g < :_extg': extension (§5.2)
g < :_resg': restriction (§5.2)
x < :_der x': extension or restriction (§5.4)
a: attribute name (§3.1)
b: boolean (§3.5)
c: constant of atomic datatype (§3.3)
d: forest (§3.7)
e: element name (§3.1)
g: model group (§3.4)
i: namespace (§3.1)
j: local identifier (§3.1)
k: complex type name (§3.1)
l: scope length of name (§3.1)
m: minimum in repetition (§3.4)
n: maximum in repetition (§3.4)
p: atomic datatype (§3.3)
s: simple type name (§3.1)
t: type name (§3.1)
u: union datatype (§3.6)
v: wildcard item (§§3.2)
u: wildcard expression (§§3.2)
w: wildcard (§§3.2)
x: name (§3.1)
ag: attribute group (§3.6)
eg: element group (§3.6)
ac: attribute content (§3.6)
ec: element content (§3.6)
kc: complex type content (§3.6)
sc: simple type content (§3.6)
uc: union type content (§3.6)
der: derivation (§3.5)
srt: sort (§3.5)
deref(x): dereferencing (§3.5)
local(x): local name (§3.1)
namespace(x): namespace (§3.1)
symbol(x): symbol space (§3.1)

Appendix A Auxiliary Grammar

This section contains stricter definitions of some of the non-terminals found in the context free grammars found throughout the main body of the description. In general, these more restricted forms are not required, as sorts will be constructed from XSDL transfer syntax or XSDL components which already enforce these restrictions. In addition, they are considerably longer than the versions in the main body of the text. We separate them out here so they do not interfere with understanding the main text.

We start by redefining sorts so as to constrain the values of the different fields. To do this we introduce new non-terminals for the different types of sorts. A sort is one of the six sorts of component in XML Schema. We let srt range over sorts.

sort

srt

::=

at

el

st

ct

agr

mgr

complex type

::=

complexType(

name = t

base = t

derivation = der

refinement = ders

abstract = b

content = ag,eg

)

element

::=

element(

name = e

base = e

derivation = der

refinement = ders

abstract = b

content = e[t]

)

attribute group

agr

::=

attributeGroup(

name = an

base = an

derivation = der

refinement = ders

abstract = b

content = ag

)

model group

mgr

::=

modelGroup(

name = mn

base = mn

derivation = der

refinement = ders

abstract = b

content = eg

)

attribute

::=

attribute(

name = a

base = a

derivation = der

refinement = ders

abstract = b

content = a[p]

)

simple types

::=

simpleType(

name = t

base = t

derivation = der

refinement = ders

abstract = b

content = p

)

Appendix B Infidelities in the current formalization of XML Schema

This document is based on the syntax of XML Schema in the proposed recommendation draft. The Formalization Taskforce will continue in operation until XML Schema becomes a recommendation.

Also, in a few places the specification is not quite equivalent to that in XML Schema, notably for the definition of derivation. We believe that with extra time it should be possible to model all or most of XML Schema.

Restriction. This document allows one type to be a subtype of another if the content group of the subtype validates a subset of the forests validated by the content of the base type. XML Schema places several further restrictions on this relationship which are not currently modelled.
Content models. Impose additional constraints on content models until they are exactly like XML Schema.
- mixed element content is not fully integrated;
- unambiguity constraint on content models;
- checking that abstract components are not instantiated;
- subtyping permissions such as final and block;
- null elements;
- formal model of atomic datatypes.
- support for form and form default
- support for block, use, and value
- mapping from sorts to XML Schema components and coverage of the post-schema-validation Infoset
Identity constraints. Due to the drastic changes in the identity constraints at the London WG face-to-face, identity constraints have been dropped from this version.
Schema composition.
- schema import is not considered
- schema include is not considered
- schema redefine is not considered

Appendix C Bibliography

[1] Serge Abiteboul and Peter Buneman and Dan Suciu, Data on the Web : From Relations to Semistructured Data and XML Morgan Kaufmann, 1999

[2] Alfred V. Aho, "Algorithms for Finding Patterns in Strings", Handbook of Theoretical Computer Science, Jan van Leeuwen, MIT Press Elsevier, Cambridge, Massachusetts, 1990

[3] Paul V. Biron and Ashok Malhotra, XML Schema Part 2: Datatypes, , World Wide Web Consortium, 2001, available at http://www.w3.org/TR/xmlschema-2/

[4] Tim Bray and Jean Paoli and C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, , World Wide Web Consortium, 1998, available at http://www.w3.org/TR/REC-xml

[5] James Clark and Steve DeRose, XML path language (XPath) version 1.0, World Wide Web Consortium, 1999, available at http://www.w3.org/TR/xpath

[6] David C. Fallside, XML Schema Part 0: Primer , World Wide Web Consortium, 2001, available at http://www.w3.org/TR/xmlschema-0/

[7] Gottlob Frege, "Begriffsschrift, a formula language, modeled upon that of arithmetic, for pure thought (1879)", From Frege to Godel: A Source Book in Mathematical Logic, 1879-1931, ed. Jan van Heijenoort, , Harvard University Press, Cambridge, Massachusetts, 1967

[8] Haruo Hosoya and Jerome Vouillon and Benjamin C. Pierce, "Regular Expression Types for XML", Proceedings of the International Conference on Functional Programming (ICFP), 2000

[9] Tova Milo and Dan Suciu, "Type Inference for Queries on Semistructured Data", Proceedings of the ACM Symposium on Principles of Database Systems, 1999

[10] John C. Mitchell, Foundations for Programming Languages, , MIT Press, 1996, Cambridge, Massachusetts

[11] Henry S. Thompson, David Beech, Murray Maloney and Noah Mendelsohn, XML Schema Part 1: Structures, , World Wide Web Consortium, 2001, available at http://www.w3.org/TR/xmlschema-1/

[12] Philip Wadler, "A formal semantics of patterns in XSLT", Proceedings of Markup Technologies, ed. B. Tommie Usdin and Deborah A. Lapeyre and C. M. Sperberg-McQueen, Philadelphia, 1999

[13] Philip Wadler, "New Language, Old Logic", Dr Dobbs Journal, December, 2000

[14] Allen L. Brown, Jr., "Derivation, tolerance and validity: a formal model of type definition in XML Schemas", Proceedings of XML Europe 2000, ed. Pamela Gennusa, Paris, 2000

[15] Gaisi Takeuti, Proof Theory, Studies in Logic and the Foundations of Mathematics, vol. 81, North-Holland/American Elsevier, Amsterdam, 1975

[16] Dag Prawitz, Natural Deduction: A Proof-Theoretical Study, , Almqvist & Wiksell, Stockholm, 1965,

[17] Solomon Feferman, "Theories of Finite Type Related to Mathematical Practice", Handbook of Mathematical Logic, ed. Jon Barwise, Elsevier North Holland, Amsterdam, 1977

XML Schema: Formal Description

W3C Working Draft, 25 September 2001

Abstract

Status of this document

Table of contents

1 Introduction

2 Overview

2.1 Normalization

2.2 Components

2.2.1 Component content

2.2.2 Example Schema Components

2.3 Documents and Forests

2.4 Forest Grammars

3 Structures

3.1 Names

3.2 Namesets

3.3 Atomic datatypes

3.4 Content groups

3.5 Components

3.6 Constraints

3.7 Ordered Forests

3.8 Inference rules

4 Normalization

5 Derivation

5.1 Base chain

5.2 Extension

5.3 Restriction

5.4 Constraints

6 Validation

6.1 Mapping component models to regular tree grammars

6.2 Content validation by inference rules

6.3 ID/IDREF validation

7 XML Schema to Formalization Mapping

7.1 Conversions for content model particles

7.2 Correspondences from XML Schema to sorts

8 Mapping from XSDL Components for Formalization Components

8 Index of notations

Appendix A Auxiliary Grammar

Appendix B Infidelities in the current formalization of XML Schema

Appendix C Bibliography