This document describes the results of
schema validation as described in the W3C Recommendation
XML Schema 1.0, in particular the ways those
results differ from the results of (DTD-based) validation
as described in ISO 8879 and the XML 1.0 specification.
It is based on email sent by the author to the XML Query
Working Group in June 2001.
1. Introduction
At 2001-05-02 10:37, C. M. Sperberg-McQueen wrote: (full text at
http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001May/0028.html)
Section 3.3 [of the 27 April data model] has a paragraph which reads
A "schema-invalid document" is an XML document that has
a corresponding schema but whose schema-validity assessment
has resulted in one or more element/attribute information
items being assigned values other than 'valid' for the
[validity] property in the PSVI.
I think the concept outlined here is going to be important, but I am
uncomfortable with the term used for it (without being able to
propose a better one at the moment).
...
XML Schema distinguishes valid nodes (on which strict assessment of
validity was attempted, and found the node valid), invalid nodes (on
which strict assessment was attempted, and found an error), and nodes
for which the validity status is 'notKnown'. Our definition covers
documents within which any element or attribute is 'invalid' or
'notKnown', which means it also covers documents within which any
node was processed as a 'black box' (i.e. skipped during validation),
in addition to documents within which there is some detected error.
The term 'schema-invalid' will tend to suggest to the non-paranoid
reader a meaning narrower than that given by the definition.
We discussed this at the XML Query face to face in May and Mary
suggested substituting the term "incompletely validated". Although
I had suggested this term myself in my note of 2 May, I find that I
am as troubled by it as by the term "invalid".
Upon consideration, I think I now know why. The fact of the matter is
that XML Schema (a) provides more information about schema-validity
than a single bit (valid/invalid), and (b) provides schema-validity
information not just about the document as a whole but about each
element and attribute. If we want the data model to cover more than
only the set of schema-valid documents, I think we need (a) to make
more than a binary distinction ourselves, and (b) to consider validity
as a property of (and validation an operation on) elements / subtrees,
not solely documents.
At the very least, if we want to go beyond fully-validated
schema-valid documents, we need to come to grips with which set of
documents, other than the schema-valid documents, we actually want to
cover, and how.
The purpose of the rest of this note is to provide a list of the cases
I think can usefully be distinguished, and note where we have
decisions to make. If people agree that we need something more than
a binary switch, I will be willing to attempt formulating specific
language for the document-model document.
A conforming XML Schema processor provides information on (inter alia)
- the ancestor element at which schema-validation ('assessment')
started
- whether this particular element and its descendants were
schema-validated or not
- the result of the assessment
- the type associated with the element
The various combinations of values of the [validation attempted],
[validity], and [type definition] properties can usefully distinguish
several cases: eight by my count. A diagram showing the various
combinations of [validation attempted] and [validity] is at
http://www.w3.org/XML/2001/06/validity-outcomes if
that helps. [Note 2003-03-26: this table is now reproduced below,
since I was tired of not finding it here. -CMSMcQ]
(Pedantic note: conforming XML Schema processors are allowed not to
provide the [type definition] property, if instead they provide a
bundle of properties including the [type definition name], [type
definition namespace], and [type definition anonymous] properties. I
ignore such light-weight processors here because I assume that XML
Query will require access to the type definition components
themselves. Anyone not sharing that assumption may implicitly insert
the phrase "(or [type definition name] and related properties)"
wherever I mention the [type definition] property, as long as you
adjust quantifiers and negation properly.)
The outcome of XML-Schema-conformant validity assessment
is conveyed by two properties, [validation attempted] and
[validity], each of which has three possible values. In the
resulting three-by-three matrix, not all combinations are
possible; those that are possible distinguish several
different states of affairs.
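The constraints on the matrix can be sketched in a few lines of Python. This is only an illustration of the two rules given in this note (full assessment implies a definite outcome; no assessment implies no validity information); the function name `is_possible` and the constants are assumptions made for the example, not part of XML Schema.

```python
# Values of the two PSVI properties discussed in this note.
VALIDATION_ATTEMPTED = ("full", "partial", "none")
VALIDITY = ("valid", "notKnown", "invalid")


def is_possible(validation_attempted: str, validity: str) -> bool:
    """Return True if the combination can occur, per the rules in this note."""
    if validity not in VALIDITY:
        raise ValueError(f"unknown validity: {validity!r}")
    if validation_attempted == "full":
        # Full assessment is strict, so the outcome is definite.
        return validity in ("valid", "invalid")
    if validation_attempted == "none":
        # Nothing was strictly assessed, so nothing is known.
        return validity == "notKnown"
    if validation_attempted == "partial":
        # Partial assessment is compatible with all three outcomes.
        return True
    raise ValueError(f"unknown validation attempted: {validation_attempted!r}")


# Enumerate the cells of the three-by-three matrix that can actually occur.
possible = [
    (va, v)
    for va in VALIDATION_ATTEMPTED
    for v in VALIDITY
    if is_possible(va, v)
]
```

Six of the nine cells survive; two of them (full/invalid and partial/invalid) each split further into two cases, which gives the eight cases counted above.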
In the following table, the different
cases outlined below are labeled by number. Shorthand codes
are used to indicate whether the node (N), its children (C), and its
descendants (D) were assessed, whether they are valid, and whether
the [type definition] (td) and [schema error code] (sec) properties
are present or not. A plus sign means yes or present; a minus sign
means no or absent.
The table is organized by the value of [validation attempted]; under
each value, the three entries give the outcome for [validity] = valid,
notKnown, and invalid.

[validation attempted] = full
(this node and all descendants were fully assessed)

- [validity] = valid
  OK. Entire subtree from here down has been strictly assessed and is valid.
  Case 1
    Assess: +N +C +D
    Valid: +N +C +D
    Props: +td -sec
- [validity] = notKnown
  Not possible: validation-attempted="full" implies strict assessment,
  and strict assessment implies validity is either "valid" or "invalid".
- [validity] = invalid
  OK. Entire subtree from here down has been strictly assessed and has
  an error, either here or at some descendant.
  Case 2
    There is a problem with the current element: it's not locally valid.
    Assess: +N +C +D
    Valid: -N
    Props: -td +sec
  Case 3
    The current element is locally valid, but a child is invalid
    or missing a required declaration.
    Assess: +N +C +D
    Valid: +N -C
    Props: +td -sec

[validation attempted] = partial
(either this node was fully assessed but some descendant was
not, or vice versa)

- [validity] = valid
  OK. This node was strictly assessed and was valid;
  none of its attributes or children was invalid, and none was missing
  any required declaration.
  Case 4
    Assess: +N
    Valid: +N
    Valid-or-notKnown: +C
    Required-decls-found: +C
    Props: +td -sec
- [validity] = notKnown
  OK. This node was not strictly assessed (but
  at least one of its descendants was).
  Case 7
    Assess: -N +D
    Props: -td -sec
- [validity] = invalid
  OK. This node was strictly assessed and was invalid,
  or else at least one of its direct descendants was invalid.
  Also, some descendant was not strictly assessed.
  Case 5
    This node is locally invalid.
    Assess: +N -D
    Valid: -N
    Props: -td +sec
  Case 6
    A descendant is invalid.
    Assess: +N
    Valid: +N -D
    Props: +td -sec

[validation attempted] = none
(neither this node nor any descendant was strictly assessed)

- [validity] = valid
  Not possible: validation-attempted="none" implies
  no strict assessment,
  and validity="valid" implies strict assessment.
- [validity] = notKnown
  OK. This subtree was skipped: no strict assessment, no
  validity information.
  Case 8
    A skipped subtree.
    Assess: -N -C -D
    Props: -td -sec
- [validity] = invalid
  Not possible: validation-attempted="none" implies
  no strict assessment,
  and validity="invalid" implies strict assessment.
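For readers who find code easier to scan than a matrix, the eight cases can be sketched as a lookup keyed on the three properties. This is an illustrative sketch only: the names `CASES` and `classify`, and the reduction of the [type definition] property to a boolean "present or absent" flag, are assumptions made for the example.

```python
# Key: ([validation attempted], [validity], [type definition] present?)
# Value: the case number used in this note.
CASES = {
    ("full", "valid", True): 1,
    ("full", "invalid", False): 2,
    ("full", "invalid", True): 3,
    ("partial", "valid", True): 4,
    ("partial", "invalid", False): 5,
    ("partial", "invalid", True): 6,
    ("partial", "notKnown", False): 7,
    ("none", "notKnown", False): 8,
}


def classify(validation_attempted: str, validity: str,
             has_type_definition: bool) -> int:
    """Return the case number (1-8) for one element's PSVI properties.

    Raises ValueError for combinations this note does not describe,
    for example validation attempted "full" with validity "notKnown".
    """
    key = (validation_attempted, validity, has_type_definition)
    try:
        return CASES[key]
    except KeyError:
        raise ValueError(f"combination not covered by the eight cases: {key}") from None
```

For example, `classify("partial", "invalid", True)` yields case 6: the element is locally valid and typed, but some descendant is invalid.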
2. When the entire subtree has been schema-validated
2.1. Full validation, valid
First, we can distinguish three cases in which the entire subtree has
been schema-validated:
Case 1: This element, and all of its descendants, have been checked and are
schema-valid. This is the rough equivalent of DTD-based validation:
everything has a declaration, and everything conforms to the
declaration.
[validation attempted] = "full"
[validity] = "valid"
[type definition] property is present
The Query/XPath data model has to cover these elements.
2.2. Full validation, invalid
Case 2: This element, and all of its descendants, have been checked and
there is a problem right here at this element (and possibly also with
some descendant).
[validation attempted] = "full"
[validity] = "invalid"
[type definition] property is not present
We need to decide whether the Query/XPath data model should cover
these elements and/or their descendants. It seems plausible to want
to cover at least all fully-assessed schema-valid descendants
(i.e. descendants in class 1). We can also cover the element with the
problem by treating it as if it had the urType.
2.3. Full validation, locally valid
Case 3: This element, and all of its descendants, have been checked and
while this element is 'locally valid', some descendant is invalid.
[validation attempted] = "full"
[validity] = "invalid"
[type definition] property is present
This describes, for example, the top-level element of a database in which
one attribute in one record is out of bounds. It seems plausible
to want to cover at least these elements, and probably at least some
of their descendants (i.e. at least those descendants which are also
in this class).
3. Partial schema-validation
Second, there are four cases in which part of the subtree has been
schema-validated and part not.
Schema-validity will not be assessed on elements or attributes if they
or some ancestor matches an ANY wildcard which prescribes "skip"
processing. Skip processing forbids schema-validity assessment and
creates a 'black-box' location in a document in which any well-formed
XML is legal. Schema-validity will also not be assessed for elements
and attributes if (a) they or some ancestor matches a wildcard
which prescribes "lax" processing and (b) no declaration was available
for some descendant. Lax processing calls for schema-validity to be
assessed for elements and attributes if matching declarations are
available, and skipped if declarations are not available; it creates a
'white box' in which undeclared elements and attributes are allowed,
but in which all elements and attributes are schema-validated if
declarations are available for them.
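The difference between the 'black box' and the 'white box' can be sketched as a small decision function. This is a simplified illustration under stated assumptions: the function name `is_assessed` and its parameters are hypothetical, and only the "skip" and "lax" modes described above are modelled.

```python
def is_assessed(mode: str, has_declaration: bool,
                inside_skipped_subtree: bool = False) -> bool:
    """Return True if an item matched by a wildcard is schema-validated.

    mode: the processing the matching wildcard prescribes ("skip" or "lax").
    has_declaration: whether a matching declaration is available.
    inside_skipped_subtree: True if some ancestor matched a "skip" wildcard.
    """
    if inside_skipped_subtree or mode == "skip":
        # Black box: assessment is forbidden; any well-formed XML is legal.
        return False
    if mode == "lax":
        # White box: assess if and only if a declaration is available.
        return has_declaration
    raise ValueError("only 'skip' and 'lax' are modelled in this sketch")
```

An undeclared element under a lax wildcard is therefore allowed but left unassessed, which is one way a subtree ends up only partially validated.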
3.1. Partial validation, valid item
Case 4: This element has been schema-validated, and is schema-valid (which
means also that none of its attributes or children is invalid or
missing a required declaration), but some descendant is not marked
"valid".
[validation attempted] = "partial"
[validity] = "valid"
[type definition] property is present
I believe we want to cover these elements in our data model.
3.2. Partial validation, (locally) invalid
Case 5: This element has been schema-validated, and is invalid because there
is a problem right here at this element.
[validation attempted] = "partial"
[validity] = "invalid"
[type definition] property is not present
I believe we need to decide whether we want to cover these elements in
our data model. If we do wish to cover them, we can do so (I think) by
assigning them the urType.
3.3. Partial validation, locally valid
Case 6: This element has been schema-validated, and is invalid because
although it's OK 'locally', it has some invalid descendant.
[validation attempted] = "partial"
[validity] = "invalid"
[type definition] property is present
I believe we do want to cover these elements in our data model.
3.4. Partial validation, locally unvalidated
Case 7: This element has not been schema-validated, but at least one of
its descendants has been.
[validation attempted] = "partial"
[validity] = "notKnown"
[type definition] property is not present
I believe we need to decide whether we want to cover these elements in
our data model. I believe we do, and that we can do so by assigning
them the urType.
4. When the subtree was skipped
Finally, there is one case in which no part of the subtree has been
schema-validated.
4.1. Unvalidated
Case 8: Neither this element nor any of its descendants has been
schema-validated.
[validation attempted] = "none"
[validity] = "notKnown"
[type definition] and [type definition name] properties are not present
The Query/XPath data model can easily cover these elements by
assigning the urSimpleType to all attributes and the urType to
all elements.
I think we can cover all these cases if we simply assign
the urType and urSimpleType to items which have no [type definition]
property. The question is, do we wish to do so? (If we do, we need
to be careful to distinguish between elements associated with the urType by
the schema validator and those for which the association with the
urType came from the query system, not the schema validator.)
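One way to keep that distinction is to record the origin of each type association alongside the type itself. The following is a minimal sketch under stated assumptions: the class `TypeAnnotation`, the function `annotate`, and the representation of types as plain strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

UR_TYPE = "urType"               # fallback for elements
UR_SIMPLE_TYPE = "urSimpleType"  # fallback for attributes


@dataclass(frozen=True)
class TypeAnnotation:
    type_name: str
    source: str  # "validator" or "query-system"


def annotate(kind: str, type_definition: Optional[str]) -> TypeAnnotation:
    """Return a type annotation for an element or attribute.

    kind: "element" or "attribute".
    type_definition: the type reported by the schema validator, or None
    if the item has no [type definition] property.
    """
    if type_definition is not None:
        # The validator supplied a type; keep it and record its origin.
        return TypeAnnotation(type_definition, "validator")
    if kind == "element":
        return TypeAnnotation(UR_TYPE, "query-system")
    if kind == "attribute":
        return TypeAnnotation(UR_SIMPLE_TYPE, "query-system")
    raise ValueError(f"unknown item kind: {kind!r}")
```

With this shape, an element the validator itself associated with the urType and an element that received the urType as a fallback carry the same type name but compare unequal, because their `source` fields differ.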