Canonical XML forms

for post-schema-validation infosets

A preliminary reconnaissance

C. M. Sperberg-McQueen

24 April 2002

1. Introduction
2. Infoset enhancements
- 2.1. A simple approach
- 2.2. Names of types
- 2.3. Attaching information to attributes
- 2.4. Full list of properties and information items
  - 2.4.1. For core information items
  - 2.4.2. Additional information items
3. Schema components
- 3.1. Dump format based on existing transfer syntax
- 3.2. Dump format unlike transfer syntax
- 3.3. A possible enhancement
4. Special considerations
5. Conclusion

A. References

This document describes one possible way to provide access to the post-schema-validation information set: by defining canonical XML forms both (1) for the original document together with the `infoset enhancements' (type attribution, validity information, etc.) and (2) for the schema components themselves.

This document is written with the intention of submitting it to the XML Schema WG as a working paper; it has not been approved by that WG or anyone else and has no official status of any kind. It has benefitted from the author's conversations with Dave Hollander, Philippe Le Hégaret, and Peter Fankhauser, to whom thanks.

1. Introduction

There may be many things we expect from schema processors in practice, but the W3C XML Schema 1.0 specification [Thompson et al. 2001] specifies only a single process, namely schema-validity assessment, which is defined as a mapping from an input information set (infoset) to a richer output information set.

The input information set is an instance of the standard infoset defined by the XML Information Set specification [Cowan/Tobin 2001]. We will refer to the infoset defined by the XML Information Set specification as the core infoset.

The output information set contains the input information set as a proper subset, and adds further information about the datatypes and validity of elements and attributes, and so on.

The information in the post-schema-validation information set may usefully be partitioned into three sets of information items and properties:

the input information set, which we assume we wish to make available without loss of information
`enhancements' to the input information set; these take the form of
- new properties on existing information items
- new information items of core infoset types (for example, if the schema declares a default value for an element not present in the input infoset, a new attribute information item will be constructed)
- new information items of new types not defined in the XML Information Set specification [Cowan/Tobin 2001]; see below for further discussion.
schema components, represented as new kinds of information items; N.B. processors are not required to provide these; they may instead provide the properties [type definition name], [type definition namespace], and [type definition anonymous]. We expect that lightweight processors may prefer to provide just the type names instead of the types themselves.

This document considers design questions associated with making this information accessible in canonical XML forms.[1] It first discusses possible methods of providing the infoset enhancements, then methods of providing access to schema components, and in conclusion considers some issues related to streaming processors.

2. Infoset enhancements

2.1. A simple approach

One simple approach to providing an XML form for infoset enhancements is shown by the use of xsi:type as already defined in the XML Schema specification.

If we used the xsi:type attribute to record the type associated by the schema with each element[2] in the document, part of the sample purchase order in the XML Schema tutorial might look like this:

 <shipTo country="US"
         xsi:type="po:USAddress">
  <name   xsi:type="xsd:string">Alice Smith</name>
  <street xsi:type="xsd:string">123 Maple Street</street>
  <city   xsi:type="xsd:string">Mill Valley</city>
  <state  xsi:type="xsd:string">CA</state>
  <zip    xsi:type="xsd:decimal">90952</zip>
 </shipTo>

The basic approach outlined here is just an extension and modification of this basic idea.

First, in order to avoid any confusion between any xsi:type attributes in the input and the attributes supplied as part of an XML serialization of the PSVI, we might not want to use the xsi:type attribute itself.[3] In the examples in this document, all the attributes added as part of re-serializing the PSVI will be in the http://www.example.com/psvi namespace, with the prefix psvi. Our example now might look like this:

 <shipTo country="US"
         psvi:type="po:USAddress">
  <name   psvi:type="xsd:string">Alice Smith</name>
  <street psvi:type="xsd:string">123 Maple Street</street>
  <city   psvi:type="xsd:string">Mill Valley</city>
  <state  psvi:type="xsd:string">CA</state>
  <zip    psvi:type="xsd:decimal">90952</zip>
 </shipTo>

Next, note that the XML Schema specification identifies a number of information set contributions to be attached to element information items, not just the [type definition name] property conveyed by xsi:type and psvi:type. For the sake of simplicity in exposition, for now let us consider just a few such properties, namely the properties

[member type definition] (for an element valid against a union type, the member type definition against which it was successfully validated),
[validation attempted] ("full", "none", or "partial")[4]
[validation context] (the element information item at which validation started), and
[validity] ("valid", "invalid", or "notKnown")[5]

If we use attributes similar to psvi:type to record these properties of elements in the document, and change one of the values to be illegal, for variety's sake, our sample might look like this:

 <shipTo country="US"
         psvi:type="po:USAddress"
         psvi:validation-attempted="full"
         psvi:validation-context="e1"
         psvi:validity="valid">
  <name psvi:type="xsd:string"
        psvi:validation-attempted="full"
        psvi:validation-context="e1"
        psvi:validity="valid">Alice Smith</name>
  <street psvi:type="xsd:string"
          psvi:validation-attempted="full"
          psvi:validation-context="e1"
          psvi:validity="valid">123 Maple Street</street>
  <city psvi:type="xsd:string"
        psvi:validation-attempted="full"
        psvi:validation-context="e1"
        psvi:validity="valid">Mill Valley</city>
  <state psvi:type="xsd:string"
         psvi:validation-attempted="full"
         psvi:validation-context="e1"
         psvi:validity="valid">CA</state>
  <zip psvi:type="xsd:decimal"
       psvi:validation-attempted="full"
       psvi:validation-context="e1"
       psvi:validity="invalid">90952-9765</zip>
 </shipTo>

An XML document to which special attributes have been added to record PSVI-related information I will call a PSVI-decorated document.

A full working out of this approach to serializing PSVI infoset enhancements will differ from the example just given in two ways. First, there will be additional attributes to cover the other properties of elements in the PSVI. Second, the approach needs to be extended in order to solve a couple of technical problems. The next few sections describe these technical problems and sketch solutions for them.

2.2. Names of types

What happens when a schema defines a complex type locally, within an element declaration? The schema will specify no name for that type: it's illegal to use the name attribute on local type declarations.

There are at least three easy solutions to this problem:

generate a unique name for the anonymous types, as schema processors are allowed to do
use an XPath expression which points into the canonical XML form of the schema (in this case we'd probably want to do the same for all types including the named ones)
use the syntax for NUNs (normalized universal names) specified in the current draft of the Formal Description document

Using generated names, we might get something like the following for another part of the sample purchase order:

 <items psvi:type="po:Items">
  <item partNum="872-AA" 
	psvi:type="_anon001">
   <productName psvi:type="xsd:string">Lawnmower</productName>
   <quantity psvi:type="_anon002">1</quantity>
   <USPrice psvi:type="xsd:decimal">148.95</USPrice>
   <comment psvi:type="xsd:string">Confirm this is electric</comment>
  </item>
  <item partNum="926-AA" 
	psvi:type="_anon001">
   <productName psvi:type="xsd:string">Baby Monitor</productName>
   <quantity psvi:type="_anon002">1</quantity>
   <USPrice psvi:type="xsd:decimal">39.98</USPrice>
   <shipDate psvi:type="xsd:date">1999-05-21</shipDate>
  </item>
 </items>

For brevity I've omitted the properties other than psvi:type.

If we use XPath expressions pointing in a canonical form of the schema, and if we assume we'll want to use XPath expressions for the named types as well, we might get something like the following:

 <items psvi:type="//xsd:complexType[@ns='http://www.example.com/PO1'
                   and @name='Items']">
  <item partNum="872-AA" 
	psvi:type="//xsd:complexType[@ns='http://www.example.com/PO1'
                   and @name='Items']
                  /xsd:sequence
                  /xsd:element[@name="item"]
                  /xsd:complexType">
   <productName psvi:type="//xsd:simpleType[
                           @ns='http://www.w3.org/2001/XMLSchema'
                           and name='string']">Lawnmower</productName>
   <quantity psvi:type="//xsd:complexType[
                   @ns='http://www.example.com/PO1'
                   and @name='Items']
                  /xsd:sequence
                  /xsd:element[@name="item"]
                  /xsd:complexType
                  /xsd:element[@name="quantity"]
                  /xsd:simpleType">1</quantity>
   <USPrice psvi:type="//xsd:simpleType[
                           @ns='http://www.w3.org/2001/XMLSchema'
                           and name='decimal']">148.95</USPrice>
   <comment psvi:type="//xsd:simpleType[
                           @ns='http://www.w3.org/2001/XMLSchema'
                           and name='string']"
   >Confirm this is electric</comment>
  </item>
  ...
 </items>

If we use a syntax for NUNs something like the long syntax of XPath, we might get something like the following:

 <items psvi:type="/type::po:Items">
  <item partNum="872-AA" 
	psvi:type="/type::po:Items/element::item/type::*">
   <productName psvi:type="/type::xsd:string">Lawnmower</productName>
   <quantity psvi:type="/type:po:Items/element::item
                        /element::quantity/type::*">1</quantity>
   <USPrice psvi:type="/type::xsd:decimal">148.95</USPrice>
   <comment psvi:type="/type:xsd:string">Confirm this is electric</comment>
  </item>
  ...
 </items>

If we use a syntax for NUNs something like that in the current draft of FD, we might get something like the following:

 <items psvi:type="/%po:Items">
  <item partNum="872-AA" 
	psvi:type="/%po:Items/item/%*">
   <productName psvi:type="/%xsd:string">Lawnmower</productName>
   <quantity psvi:type="/type:po:Items/item/quantity/%*">1</quantity>
   <USPrice psvi:type="/%xsd:decimal">148.95</USPrice>
   <comment psvi:type="/%xsd:string">Confirm this is electric</comment>
  </item>
  ...
 </items>

The choice among these solutions or variations upon them is a design decision the WG will have to make if we go in this direction. Whatever approach is chosen should (I think) work both in cases where the PSVI-decorated document is transmitted and used in isolation and in cases where it is accompanied by an XML representation of the schema components.

2.3. Attaching information to attributes

The second major technical problem with the simple approach outlined above is that xsi:type only works for elements. What about types for attributes?

One possible solution is to use the basic idea of the W3C submission on datatypes for DTD notation [Buck et al. 2000], and provide a single attribute, which I'll call psvi:atttypes which contains a sequence of (attribute-name, attribute-type-name) pairs for all of the `normal' attributes on the original element.[6]

Decorated with type information, the beginning of the purchase order might look like this:

<purchaseOrder orderDate="1999-10-20"
	       psvi:type="po:purchaseOrderType" 
	       psvi:atttypes="orderDate xsd:date"
	       xmlns="http://www.example.com/PO1"
	       xmlns:po="http://www.example.com/PO1"
	       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	       xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>
 <shipTo country="US"
	 psvi:type="po:USAddress"
	 psvi:atttypes="country xsd:NMTOKEN">
  <name psvi:type="xsd:string">Alice Smith</name>
 ...

It should be noted that I am fully aware that this solution is ugly. Long meditation on this problem has convinced me that every available solution to this problem is ugly: attributes were designed to have atomic or simple list values, not to have attributes of their own, and I no longer expect to find a pretty way to go against the grain of XML here. The notation described is simple to understand, is parallel to the notation used for xsi:schemaLocation, has already been defined in a publicly accessible document [Buck et al. 2000], and is at least as good as any alternative I have thought of.

2.4. Full list of properties and information items

As noted above, the PSVI includes a lot of properties beyond those illustrated above. I won't try to work out a full proposal for dealing with all of them here, but it does seem convenient to reproduce the relevant part of the list provided in Appendix C.2 of the Structures spec.

2.4.1. For core information items

The full list of additional properties supplied for core information items by a schema processor (and the information-set contribution rules that define them) is as follows. For each of these items, the definition of a canonical XML form will need to define an attribute to carry the information and rules for conveying the information by means of the attribute value. For example,

For element information items, the additional properties are:

[element declaration] (Element Declaration)
[ID/IDREF table] (ID/IDREF Table)
[identity-constraint table] (Identity-constraint Table)
[member type definition] (Element Validated by Type)
[member type definition anonymous] (Element Validated by Type)
[member type definition name] (Element Validated by Type)
[member type definition namespace] (Element Validated by Type)
[nil] (Element Declaration)
[notation] (Validated with Notation)
[notation public] (Validated with Notation)
[notation system] (Validated with Notation)
[schema default] (Element Validated by Type)
[schema error code] (Validation Failure (Element))
[schema information] (Schema Information)
[schema normalized value] (Element Validated by Type)
[schema specified] (Element Default Value)
[type definition] (Element Validated by Type)
[type definition anonymous] (Element Validated by Type)
[type definition name] (Element Validated by Type)
[type definition namespace] (Element Validated by Type)
[type definition type] (Element Validated by Type)
[validation attempted] (Assessment Outcome (Element))
[validation context] (Assessment Outcome (Element))
[validity] (Assessment Outcome (Element))

For attribute information items:

[attribute declaration] (Attribute Declaration)
[member type definition] (Attribute Validated by Type)
[member type definition anonymous] (Attribute Validated by Type)
[member type definition name] (Attribute Validated by Type)
[member type definition namespace] (Attribute Validated by Type)
[schema default] (Attribute Validated by Type)
[schema error code] (Validation Failure (Attribute))
[schema normalized value] (Attribute Validated by Type)
[schema specified] (Assessment Outcome (Attribute))
[type definition] (Attribute Validated by Type)
[type definition anonymous] (Attribute Validated by Type)
[type definition name] (Attribute Validated by Type)
[type definition namespace] (Attribute Validated by Type)
[type definition type] (Attribute Validated by Type)
[validation attempted] (Assessment Outcome (Attribute))
[validation context] (Assessment Outcome (Attribute))
[validity] (Assessment Outcome (Attribute))

2.4.2. Additional information items

Some properties in the infoset enhancement point to information items which are not defined in the core XML infoset. Deciding on the best way of providing XML access to these as part of a PSVI-decorated instance is an open problem. I have no proposals at the moment.

For ID/IDREF binding information items:

[binding] (ID/IDREF Table)
[id] (ID/IDREF Table)

For Identity-constraint Binding information items:

[definition] (Identity-constraint Table) [node table] (Identity-constraint Table)

For namespace schema information information item properties

[schema components] (Schema Information) [schema documents] (Schema Information) [schema namespace] (Schema Information)

For the schema document information item:

[document location] (Schema Information)
[document] (Schema Information)

3. Schema components

Full access to the PSVI by means of XML representations requires an XML representation for schema components. To distinguish this XML representation from the XML representation already defined in the XML Schema Recommendation, I will refer to it as a schema component dump format.

I do not propose here to work out a full design for a dump format. It seems clear that several very different designs are possible; my hope here is only to suggest some of the axes of variation and to identify some of the basic design questions.

I assume that a canonical XML form for schema components should have the following characteristics:

There should be a single XML document for the entire schema, rather than separate documents for each namespace which contributes components to the schema.[7]
It should be easy (within limits) to construct an XPath which points to any given schema component, without knowledge of the document structure of the transfer-syntax schema documents from which it may have been constructed. It should be easy for us to define a `canonical' XPath expression to be used to identify any construct, thus making schema components `first-class objects' on the Web.[8]
The canonical form must provide information about the type derivation tree and other semantically important information from the original schema documents, even if such information is no longer essential for validation.

If these are accepted as desiderata or requirements, it is clear that the existing XML transfer syntax is not as it stands a suitable dump format. It does not commingle declarations for components in different namespaces. It has many features for reducing redundancy and improving maintainability, but which make it hard to provide simple rules for constructing XPath expressions to denote arbitrary constructs. The xsd:attribute element which declares an attribute, for example, may be a child of the relevant xsd:complexType element, or may be very distant from it in the schema document, connected only by a long chain of inheritance and group reference pointers.

I conclude that if we want a dump format with the qualities described, we need to design one; we don't already have one in the spec.

One obvious design question is: should the dump format be made as similar as possible to the existing transfer syntax, or made more similar to the components?

3.1. Dump format based on existing transfer syntax

This approach to the dump format is derived more or less directly from a proposal made by Lee Buck in October of 1999 at our first meeting in Reston. We did not take it up then because it seemed likely to delay the completion of XML Schema 1.0. As I remember it, Lee's proposal was simple: to define a subset of the transfer syntax that eliminated everything that contributed most to making it hard to use XPointers conveniently and reliably to identify schema constructs.

If we wish to make the dump format resemble the transfer syntax, I think the salient characteristics are that the dump syntax

should not include inclusions, imports, or redefines
should not include named groups
should not include local declarations of elements or types[9]
must change some of the rules for linking references to their corresponding definitions
should otherwise be as close to the existing transfer syntax as reasonable

If we stay close to the transfer syntax where possible, while still making all components top-level objects, one of the complex type declarations in the Primer example might turn into something like this:


 <xsd:complexType id="_ct_Items" name="Items">
  <xsd:sequence>
   <xsd:element ref="_el_item" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:complexType id="_ct_anon01">
  <xsd:sequence>
   <xsd:element ref="_el_productName"/>
   <xsd:element ref="_el_quantity"/>
   <xsd:element ref="_el_USPrice"/>
   <xsd:element ref="_el_comment"  minOccurs="0"/>
   <xsd:element ref="_el_shipDate" minOccurs="0"/>
  </xsd:sequence>
  <xsd:attribute name="partNum" type="SKU" use="required"/>
 </xsd:complexType>

 <xsd:simpleType id="_st_anon02">
  <xsd:restriction base="xsd:positiveInteger">
   <xsd:maxExclusive value="100"/>
  </xsd:restriction>
 </xsd:simpleType>

 <xsd:element id="_el_item" name="item" type="_ct_anon01"/>
 <xsd:element name="productName" type="xsd:string"/>
 <xsd:element name="quantity" type="_st_anon02"/>
 <xsd:element name="USPrice"  type="xsd:decimal"/>
 <xsd:element name="shipDate" type="xsd:date"/>

I have supplied IDs for anonymous items as needed -- in fact I have supplied IDs for everything, and used ID references to link all references to their components.[10]

3.2. Dump format unlike transfer syntax

It might be preferable to make the dump syntax intentionally unlike the transfer syntax.[11] An XML syntax drawn more or less directly from the existing component + properties description of the component set might produce declarations like these:

 <df:element-declaration name="item" 
			 targetns="http://www.example.com/PO1"
			 typedef="_ct_anon01"
			 scope="_ct_Items"
			 nillable="false"
			 identity-constraint-defs=""
			 substitution-group-affiliation=""
			 disallowed-substitutions=""
			 abstract="false"
			 >
  <df:annotation> ... </df:annotation>
 </df:element-declaration>

 <df:complexTypeDefinition id="_ct_anon01"
			   name=""
			   derivation-method="restriction"
			   final=""
			   abstract="false"
			   attributeUses=""
			   attributeWildcard=""
			   prohibitedSubstitutions=""
			   >
  <!--* no name="..." on anonymous types *-->
  <df:element-only/>
  <df:sequence>
   <df:element ref="_el_productName"/>
   <df:element ref="_el_quantity"/>
   <df:element ref="_el_USPrice"/>
   <df:element ref="_el_comment"  minOccurs="0"/>
   <df:element ref="_el_shipDate" minOccurs="0"/>
  </df:sequence>
  <df:attribute name="partNum" type="SKU" use="required"/>
 </df:complexTypeDefinition>

Some WG members might prefer a dump syntax that more closely resembled the notation used to display component structure in the current Formal Description draft.

3.3. A possible enhancement

Depending on the internal structure eventually chosen for the dump format, XPointer expressions or even NUNs for individual constructs may prove unwieldy. We might well prefer to assign unique local names to every construct, so that they can be referred to using simple QNames. One way to do this would be to include a small table in the dump format which gives the QName of each construct together with an XPath expression that (when applied to the dump-format document itself) will locate the component so named.

4. Special considerations

It is easiest to think about the canonical XML forms for the PSVI as if the infoset enhancements would be in one document, and the schema components in a separate XML document. In some applications, it will also be easiest to work with things this way.

In some applications, however, it will be most convenient if the PSVI-decorated instance and the dump of its schema could be transmitted in a single stream. For this, we may choose to use methods similar to those used to embed a schema in the same document as the material it is describing:

<!DOCTYPE envelope:doc-schema-pair [
<!ATTLIST envelope:schema id ID #REQUIRED>
]>
<envelope:doc-schema-pair xmlns:envelope="...">
 <envelope:schema id="schema"> 
  <!--* schema in (transfer or) dump format goes here *-->
 </envelope:schema>
 <envelope:instance xsi:schemaLocation="#schema">
  <!--* original instance doc goes here *-->
 </envelope:instance>
</envelope:doc-schema-pair>

Note, however, that this requires us to emit the schema in its entirety before we have emitted any of the document instance. It's hard to see how a streaming processor using just-in-time schema component construction could emit such a document.

If we are simply writing to files, this need not be a problem. But if we are sending information from the schema processor to a downstream application in a single datastream, we may need to find a way to interleave instance and schema.

So we may wish to explore the possibility of interleaving parts of the schema components in dump format with the decorated document instance. The decorated document instance comes in this way to resemble the input XML document less and less: we have added a large number of new attributes to the instance already, and interleaving would require that we also add a certain number of elements in the dump-format namespace.

Ideally, we would interleave the two in such a way that any process which correctly handles the input infoset can also handle the single-streamed PSVI; I don't know how to achieve that ideal, but some designs come closer to it than others.

Both the additional attributes and the additional elements may get in the way of normal processing of the instance document (which means they may defeat the purpose of serializing the infoset as a decorated instance rather than using one of the normal forms for dumping graphs into XML). It may be noted, however, that if we add only attributes, then any process which refers to attributes only by name will produce exactly the same results from a document and from its PSVI. Only if the process refers to attributes using a wildcard of some kind will the results differ.

Similarly, if we add schema component elements like cpsvi:complexType to the output stream, any process which refers to elements only by name will produce exactly the same results. Any process which refers to wildcards (e.g. a stylesheet which includes an unrestricted apply-templates) will encounter unexpected declarations and produce divergent results.

We can thus formulate a simple rule for writing processes which will produce the same results even if a document has been decorated or interleaved with PSVI information: avoid the use of XSLT wildcards (including xsl:apply-templates without a select attribute) or their equivalents in the implementation language. Alternatively, include special rules to catch attributes and elements in the psvi namespace, and handle them appropriately. (Also need to handle annotation correctly.)

5. Conclusion

This document identifies some design questions we must face if we wish to specify canonical XML forms for the post-schema-validation information set. I think it establishes fairly clearly that such canonical XML forms can be specified in such a way as to make them useful; it also identifies some design questions which can usefully be answered by a design. This document thus can be used to argue that defining canonical XML forms for the PSVI is feasible.

The next step is to explore other possibilities for specifying such canonical forms and decide on an overall design direction.

A. References

[Boyer 2001] Boyer, John. 2001. Canonical XML Version 1.0. W3C Recommendation 15 March 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/xml-c14n

[Buck et al. 2000] Buck, Lee, Charles F. Goldfarb, and Paul Prescod, ed. Datatypes for DTDs (DT4DTD) 1.0. W3C Note 13 January 2000. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/dt4dtd

[Cowan/Tobin 2001] Cowan, John, and Richard Tobin, ed. 2001. "XML Information Set." W3C Recommendation 24 October 2001. [Cambridge, Sophia-Antipolis, Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/xml-infoset/

[Thompson et al. 2001] Thompson, Henry S., David Beech, Murray Maloney, and Noah Mendelsohn. 2001. XML Schema Part 1: Structures. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/