Validation in W3C XML Schema

Henry S. Thompson

C. M. Sperberg-McQueen

9 January 2002

1. Questions to be answered

These questions are taken from a message from the W3C XML Query WG:

If an XML Schema processor is used to parse an XML file which has no associated schema, does the XML Schema specification state what type information is associated with the attributes and elements in the PSVI?
In the PSVI, what type information is associated with attributes and elements found in laxly-validated XML?
In the PSVI, what type information is associated with attributes and elements found in skip-validated XML?

Some subsidiary or related questions, which probably also need to be publicly answered:

What is the space of possible distributions of values for the PSVI properties [validation context], [validation attempted], [validity] and [type definition] on element information items (EIIs)? On attribute information items (AIIs)?
What choices does a conformant processor have with respect to what it attempts to validate and what it doesn't?

2. Fundamental references to the REC

The following three quotes are fundamental background for answering the above questions:

Initiating validation: 5.2 Assessing Schema Validity, clause 3:
The processor starts from Schema-Validity Assessment (Element) (§3.3.4) with no stipulated declaration or definition, and either ·strict· or ·lax· assessment ensues, depending on whether or not the element information and the schema determine either an element declaration (by name) or a type definition (via xsi:type) or not.
Recursion after lax validation: Validation Rule: Schema-Validity Assessment (Element):
[Definition:] If either case of clause 1 above holds, the element information item has been strictly assessed.

If the item cannot be ·strictly assessed·, because neither clause 1.1 nor clause 1.2 above are satisfied, [Definition:] an element information item's schema validity may be laxly assessed if its ·context-determined declaration· is not skip by ·validating· with respect to the ·ur-type definition· as per Element Locally Valid (Type) (§3.3.4).

Note the use of the word may in the second quoted paragraph above: the implication is that processors have a choice here.
PSVI when laxly validating: The ur-type will be assigned as the [type definition] of laxly validated elements, as a consequence of the quote above. This is ratified by Schema Information Set Contribution: Assessment Outcome (Element), last paragraph:
Note that if an element is ·laxly assessed·, then the [type definition] and [member type definition] properties, or their alternatives, are based on the ·ur-type definition·.

But note also that lax assessment is not the only alternative to strict assessment: as pointed out above, no assessment is also an option.
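The three possibilities just described (strict, lax, or no assessment) can be sketched as a small decision function. This is only an illustration under simplifying assumptions: the function name and its boolean parameters are hypothetical, and the two cases of clause 1 are collapsed into two flags.

```python
def assessment_mode(has_declaration, has_xsi_type,
                    context_is_skip, processor_opts_for_lax):
    """Return 'strict', 'lax' or 'none' for one element information item.

    All four parameters are hypothetical flags standing in for what the
    element information and the schema determine.
    """
    if context_is_skip:
        # A skip context rules out assessment altogether.
        return "none"
    if has_declaration or has_xsi_type:
        # One of the two cases of clause 1 holds: strict assessment.
        return "strict"
    # Lax assessment is permitted ("may"), never required.
    return "lax" if processor_opts_for_lax else "none"
```

Note that the last line is where the processor's choice enters: two conformant processors may return different answers for the same element.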

3. Answers to the Questions

3.1. Parsing without a schema

Some clarification of the question is required before giving the answer. Even in the absence of any user- or file-specified schema document, a schema is nonetheless always present when schema-validity assessment is performed: the schema composed of all those components a conformant schema processor is required to have built in. These are the ur-type definition, the simple ur-type definition, the primitive and derived built-in simple type definitions, and the attribute declarations from the XMLSchema-instance namespace. Let's call this the built-in schema. So the revised question is "What type information is associated with the attributes and elements in the PSVI of a document schema-validated using only the components of the built-in schema, assuming no built-in type definition is mandated for use in validation?".

Since there are no element declarations available in the built-in schema, and by stipulation none of the available types is called for (if one were, only the first or second outcome below would be possible), the alternative quoted above at Initiating validation must be used by the processor to begin validation. It follows that there are four possible answers to the question:

(1) A built-in simple type is assigned to the document element, because of an xsi:type declaration on the document element. In this case there are no other elements and no attributes.
(2) No type is assigned and an error is noted, because an xsi:type declaration is present on the document element but its contents are not valid with respect to that declaration, or because it has attributes, or because the named type definition is not present in the built-in schema. What happens to the other elements and attributes in the document, if any, depends on the processor's error recovery policy.
(3) The ur-type is assigned to the document element and the simple ur-type is assigned to all its attributes, because the processor chooses to assess laxly (see Recursion after lax validation). The children of the document element, if any, will have the same four possibilities, recursively.
(4) No type is assigned to any elements or attributes, because the processor chooses not to assess the document element laxly (see Recursion after lax validation).

One would hope that a processor which chooses option (3) would not, during the resulting recursion, ever choose option (4), but the REC does not rule this out.
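The four outcomes can be made concrete with a minimal sketch that walks a parsed document and records a type name per item. It is not a schema processor. It assumes that xsi:type values use the prefix xs for the XML Schema namespace, it checks only three built-in simple types, it stops at the first error as its (arbitrary) recovery policy, and it uses a single flag for the processor's lax-or-nothing choice. The function name, the path notation and the "ERROR" marker are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSI_TYPE = f"{{{XSI}}}type"

# Tiny illustrative subset of the built-in simple types. Assumption: the
# prefix "xs" is bound to the XML Schema namespace in the instance.
BUILTIN_SIMPLE = {
    "xs:string": lambda s: True,
    "xs:integer": lambda s: re.fullmatch(r"\s*[+-]?\d+\s*", s) is not None,
    "xs:boolean": lambda s: s.strip() in {"true", "false", "1", "0"},
}

def assess(elem, lax=True, psvi=None, path=None):
    """Record a hypothetical [type definition] per item; return the dict."""
    psvi = {} if psvi is None else psvi
    path = path or elem.tag
    named = elem.get(XSI_TYPE)
    # Attributes from the XMLSchema-instance namespace are set aside here.
    plain_attrs = [a for a in elem.attrib if not a.startswith(f"{{{XSI}}}")]
    if named is not None:
        check = BUILTIN_SIMPLE.get(named)
        ok = (check is not None and not plain_attrs and len(elem) == 0
              and check(elem.text or ""))
        psvi[path] = named if ok else "ERROR"    # first or second outcome
        return psvi   # on error this sketch stops: its own recovery policy
    if not lax:
        return psvi                              # fourth outcome: no types
    psvi[path] = "xs:anyType"                    # third outcome: ur-type
    for a in plain_attrs:
        psvi[f"{path}/@{a}"] = "xs:anySimpleType"
    for i, child in enumerate(elem, start=1):
        assess(child, lax, psvi, f"{path}/{child.tag}[{i}]")
    return psvi
```

Calling assess(ET.fromstring(text)) on a document with no xsi:type attributes assigns the ur-type throughout, while assess(ET.fromstring(text), lax=False) returns an empty dictionary. A real processor could also make the lax choice separately at each level of the recursion, which is exactly the mixing of options (3) and (4) that the REC does not rule out.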

3.2. Lax validation

Lax validation is never mandated; it always occurs at the processor's option. There are at least four ways in which it may arise:

(1) At the beginning of a validation episode (see above);
(2) Because an element in element or mixed content was allowed by a wildcard particle with processContents='lax';
(3) Because an element in element or mixed content was allowed by a wildcard particle with processContents='strict', but no global declaration for the element was found (an error will be recorded, but processors may continue validation);
(4) Because some property of a component involved in the local validity assessment of an element is absent because of a reference failure (e.g. reference to a missing type, group or identity constraint definition or element or attribute declaration).

The REC does not rule out an obvious fifth case, namely the one in which a processor validates laxly after detecting a validation failure. This would presumably be treated very similarly to case (3) above.

The PSVI outcome of case (1) was dealt with in the previous section. For the others, it is as follows (outcomes (1) and (2) below are not possible in case (4) above):

(1) A type is assigned to the element, because of an xsi:type declaration on the element which was satisfied by its contents and attributes. Validation proceeds as normal for the attributes and children, if any.
(2) No type is assigned and an error is noted, because either an xsi:type declaration is present on the element but its contents/attributes are not valid with respect to that declaration, or because the named type definition is not present in the schema. What happens to the element's children and attributes, if any, depends on the processor's error recovery policy.
(3) The ur-type is assigned to the element and the simple ur-type is assigned to all its attributes, because the processor chooses to assess laxly (see Recursion after lax validation). The children of the element, if any, will have the same four possibilities, recursively.
(4) No type is assigned to the element or to any of its children or attributes, because the processor chooses not to assess the element laxly (see Recursion after lax validation).

3.3. Skip validation

Skip validation may arise in two ways: because an element in element or mixed content was allowed by a wildcard particle with processContents='skip', or because one of the circumstances in which lax validation might arise occurs, but the processor declines the option. The results are indistinguishable: no type information is present in the PSVI, that is, there is no [type definition] property, or any of its alternatives and associates, on the relevant element information item.
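For an element admitted by a wildcard particle, the interplay between the three processContents values and the processor's choice might be summarised as follows. This is a hedged sketch: the function name, its parameters and the returned tuple are hypothetical, and the effect of an xsi:type attribute on the element is deliberately left out.

```python
def wildcard_child(process_contents, declaration_found,
                   processor_opts_for_lax=True):
    """Return (mode, error_recorded) for an element matched by a wildcard.

    mode is 'strict', 'lax' or 'none'; xsi:type is ignored in this sketch.
    """
    if process_contents not in ("strict", "lax", "skip"):
        raise ValueError(f"unknown processContents: {process_contents!r}")
    if process_contents == "skip":
        return ("none", False)          # skip: no type information at all
    if declaration_found:
        return ("strict", False)        # a global declaration was found
    # No declaration: under 'strict' an error is recorded, and in either
    # case lax assessment remains the processor's option.
    error = process_contents == "strict"
    return ("lax" if processor_opts_for_lax else "none", error)
```

The mode 'none' is returned both for processContents='skip' and for a declined lax option, which mirrors the point that the two are indistinguishable as far as type information in the PSVI is concerned.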

4. PSVI validation outcomes

For each validation episode, the PSVI contains one rooted, well-connected set of attribute and element information items that share a single value for the [validation context] property, namely the root of the set. If skip validation occurs at any point, the lower frontier of this set may not coincide with the frontier of the original Infoset. Every node in the set has values for the [validity] and [validation attempted] properties, which together with the value of the [type definition] property contain sufficient information to diagnose exactly what happened at each step in the validation episode.
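As an illustration of how these properties summarise an episode, the sketch below computes [validation attempted] bottom-up. It assumes the roll-up rule given in the REC under Schema Information Set Contribution: Assessment Outcome (Element): full if the item was strictly assessed and everything below it is full, none if it was not strictly assessed and everything below it is none, and partial otherwise. The Item class and its field names are hypothetical, and attributes and child elements are kept in one list for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Item:
    """Hypothetical stand-in for an element or attribute information item."""
    name: str
    strictly_assessed: bool
    type_definition: Optional[str] = None
    # Child elements and attributes together, for brevity.
    children: List["Item"] = field(default_factory=list)

def validation_attempted(item):
    """Return 'full', 'partial' or 'none' for the given item."""
    below = {validation_attempted(c) for c in item.children}
    if item.strictly_assessed and below <= {"full"}:
        return "full"
    if not item.strictly_assessed and below <= {"none"}:
        return "none"
    return "partial"
```

For example, a laxly assessed root whose only child was strictly assessed (say, through an xsi:type attribute) comes out as partial, while the child itself is full.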

C. M. Sperberg-McQueen transcribed and elaborated a table of possible outcomes which we and Richard Tobin constructed last June on a whiteboard in Edinburgh.