This has ended up reading like a W3C spec, which the TAG doesn't do, but it's the way it turned out. We'll have to discuss what we do about that.

The elaborated infoset: A proposal

Henry S. Thompson
30 Jan 2007

1. Publication state

This is a TAG working document; no decision has yet been taken on its eventual disposition.

2. The default processing model

TAG issue xmlFunctions-34 represents the TAG's commitment to consider the question of whether there is a 'default' XML processing model, and if so what it looks like. That is, aside from the obligations imposed by the XML (and XML Namespace) recommendations themselves, what, if anything, ought to be done with a document whose media type tells you it's an XML document, before any application-specific processing is attempted? Or, to put it another way, if an author takes responsibility for the information in an XML document, exactly what is s/he taking responsibility for?

3. The infoset

The XML Information Set specification defines a vocabulary for referring to the information content of an XML document, in the form of an abstract data model. It identifies XML parsers as the most likely source of such information, but acknowledges that other sources are possible, and several subsequent W3C specs (e.g. XInclude, XML Schema) are defined in terms of mappings from infosets to infosets.

The default processing model question can be rephrased as "Is there an infoset other than the one produced by a conformant XML parser which can and should be defined?" Indeed, the infoset of an XML document is already somewhat under-determined, in that a well-formed XML document as processed by a conformant processor may yield two distinct infosets, depending on whether that processor processes all the external parameter entities in the document's DTD.

4. Generic operations and the elaborated infoset

Just as applications today can express the requirement that certain minimal processing has been done and/or that certain information must be available from the XML documents they take as input, by simply referring to the Infoset, we propose to define a more extended form of processing whose results, in information terms, can then be simply identified as the starting point for applications. Since the specification of XML and the XML information set, a number of generic XML applications have been specified, in terms of functions from infosets to infosets, which arguably should (almost) always be implemented before any more specific processing is attempted. By 'generic' I mean that their elements and/or attributes may usefully appear in almost any XML document, and are coherently interpretable without reference to the syntax or semantics of the surrounding XML (but see quoting below). Furthermore, the resulting infoset is consistent with the media type of the original XML document.

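The inventory of such 'generic' applications is small, and identifying its membership correctly is likely to be one of the hard parts of this project, but here are three candidates: inclusion (XInclude), decryption (XML Encryption) and signature checking (XML Signature), each discussed under Elaborating namespaces below.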

5. Elaboration defined: top-down treewalk, signals and namespaces

The basic idea is that the elaborated infoset is constructed by a top-down traversal of the original infoset, replacing each element information item which signals that it is an elaborating element, typically by being in an elaborating namespace, e.g. the XInclude namespace, with its elaboration as determined by the appropriate specification. This process applies to its own output, that is, for example, if the result of XInclude processing of an element is a sequence of elements, one of which is itself in the XML Encryption namespace, that element will in turn be elaborated.

More formally, the elaborated infoset of a document is defined by a function E from information items ('II' for short) to (sequences of) information items, by cases over the kind of information item, applied to its basic or ordinary infoset. In each case we refer to the original information item as o and to the result of a single elaboration, that is E(o), as e. We refer to the values of properties of information items using a '.' and the property name, e.g. o.local name.

An elaboration signal is either a [namespace name] property whose value is an elaborating namespace, or an [attributes] property containing an attribute II whose [namespace name] is an elaborating namespace.

element II
If o's properties contain an elaboration signal then
  • e is a (possibly empty) sequence of element, processing instruction, unexpanded entity reference, character, and comment information items, the result of processing o according to the specification governing the namespace of the elaboration signal;
otherwise
  • e is an element II with the same properties and values as o except for the [children] property, whose value is the concatenation of E*(c) for each child c in o.children, in order. By E* is meant the result of repeated applications of E to (the members of) its own value until a fixed-point is reached.
document II
e is a document II with the same properties and values as o except for the [document element] and [children] properties: e.document element is F(E*(o.document element)), which also becomes the single element II among e.children. It is an error if E*(o.document element) is not a single element II. F is defined in Infoset fixup, below.
all other kinds of II
E is the identity, that is, e is an information item of the same kind as o, with the same properties and values.

The elaboration process as a whole fails if any individual elaboration fails with an error.
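
The case analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the proposal: the Element and Attribute classes are a toy stand-in for the infoset (strings stand in for character IIs), the handlers table, the urn:example:toy-include namespace and the toy_include function are all hypothetical, and infoset fixup F is left as a comment. The sketch computes E* directly by recursing on handler output rather than by literally iterating E to a fixed point, which gives the same result provided handlers eventually stop producing elaboration signals.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Attribute:
    namespace_name: Optional[str]
    local_name: str
    value: str


@dataclass(frozen=True)
class Element:
    namespace_name: Optional[str]
    local_name: str
    attributes: Tuple[Attribute, ...] = ()
    children: Tuple["Item", ...] = ()


Item = Union[Element, str]  # str is a stand-in for character IIs
Handler = Callable[[Element], List[Item]]


class ElaborationError(Exception):
    """Raised when an individual elaboration fails."""


def elaboration_signal(o: Element, handlers: Dict[str, Handler]) -> Optional[str]:
    """Return the elaborating namespace signalled by o, if any."""
    if o.namespace_name in handlers:
        return o.namespace_name
    for a in o.attributes:
        if a.namespace_name in handlers:
            return a.namespace_name
    return None


def elaborate_star(o: Item, handlers: Dict[str, Handler]) -> List[Item]:
    """E*: elaborate o, then keep elaborating whatever comes back."""
    if not isinstance(o, Element):
        return [o]  # identity for all other kinds of II
    ns = elaboration_signal(o, handlers)
    if ns is not None:
        result: List[Item] = []
        for item in handlers[ns](o):  # the process applies to its own output
            result.extend(elaborate_star(item, handlers))
        return result
    children: List[Item] = []
    for c in o.children:
        children.extend(elaborate_star(c, handlers))
    return [replace(o, children=tuple(children))]


def elaborate_document(root: Element, handlers: Dict[str, Handler]) -> Element:
    """Document case: the document element must elaborate to one element."""
    result = elaborate_star(root, handlers)
    if len(result) != 1 or not isinstance(result[0], Element):
        raise ElaborationError("document element must elaborate to a single element II")
    return result[0]  # infoset fixup F would be applied here


# Hypothetical elaborating namespace, standing in for something like XInclude.
TOY_NS = "urn:example:toy-include"
FRAGMENTS = {"greeting": [Element(None, "p", children=("hello",))]}


def toy_include(o: Element) -> List[Item]:
    ref = next(a.value for a in o.attributes if a.local_name == "ref")
    return list(FRAGMENTS[ref])


doc = Element(None, "doc", children=(
    Element(TOY_NS, "include", attributes=(Attribute(None, "ref", "greeting"),)),
    "tail",
))
elaborated = elaborate_document(doc, {TOY_NS: toy_include})
print(elaborated)
```

Keeping the handlers in a table keyed by namespace name mirrors the idea that each elaborating namespace is governed by its own specification; an exception raised by any handler propagates, so the whole elaboration fails.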

6. Elaborating namespaces

We need to establish just what the elaborating namespaces are, that is, what specs define one or more generic processes which it's useful to include in the definition of elaboration as a whole. Just what fits that description (which itself begs a question with the word 'useful') is an open question, but as suggested above we start with three candidates:

inclusion
The http://www.w3.org/2001/XInclude namespace is an elaborating namespace, and elements in it should be elaborated by reference to the XInclude specification.
decryption
The http://www.w3.org/2001/04/xmlenc# namespace is an elaborating namespace, and the EncryptedData element II in it should be elaborated by reference to the XML Encryption specification. It is always an error if a decryption fails because a key is supplied but is not accepted. There are roughly three non-error cases:
No key
no change, that is, the EncryptedData element II itself;
XML data
That is, there is a CipherValue or CipherReference element II with Type 'element' or 'content'. In this case the result is the [children] of the document II which results from parsing the decrypted octet sequence as a stream of UTF-8 encoded characters;
other data
That is, Type is unspecified or is neither 'element' nor 'content'. It is not clear what to do here; this case is mostly in the spec to support decrypting keys, which we won't elaborate anyway.
signature checking
The http://www.w3.org/2000/09/xmldsig# namespace is an elaborating namespace, and the Signature element II in it should be elaborated by reference to the XML Signature specification. This is not a clear or simple case, as XML Signature provides for at least three distinct kinds of signing (enveloped, enveloping and detached), and supports signing of multiple objects. As a starting point, elaboration of a signature should always fail if the signature is not valid, and its value when the signature is valid should be as follows:
Multiple References
That is, more than one thing is signed. No change, that is, the Signature element II itself;
enveloped
That is, the thing signed is the enclosing document. In this case the result should be the empty sequence;
enveloping
That is, the thing signed is an Object within the signature. In this case the result should be the signed subtree within the Object, as processed by any specified Transformations;
detached, local
That is, the thing signed is in the same document as the signature, but not inside it. In this case the result should be the empty sequence;
detached, remote
That is, the thing signed is elsewhere, identified by a URI. We treat this as a signed XInclude, and the result is the referenced external subtree, as processed by any specified Transformations.
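
The signature cases above amount to a small decision table, which the following sketch spells out. It assumes that validation and classification of each Reference have already been done elsewhere; the SignedThing enumeration, the elaborate_signature function and its parameters are hypothetical names introduced only for illustration, and nothing here calls a real XML Signature library.

```python
from enum import Enum
from typing import Any, List, Sequence


class SignedThing(Enum):
    ENVELOPED = "enveloped"
    ENVELOPING = "enveloping"
    DETACHED_LOCAL = "detached, local"
    DETACHED_REMOTE = "detached, remote"


class ElaborationError(Exception):
    """Raised when elaboration of a signature fails."""


def elaborate_signature(
    signature: Any,
    valid: bool,
    kinds: Sequence[SignedThing],
    signed_content: Sequence[Any],
) -> List[Any]:
    """Return the elaboration of a Signature element II.

    signature      -- the Signature element II itself (opaque here)
    valid          -- outcome of signature validation, computed elsewhere
    kinds          -- one classification per Reference in the signature
    signed_content -- the signed subtree after any Transformations,
                      used only for the enveloping and remote cases
    """
    if not valid:
        raise ElaborationError("signature is not valid")
    if not kinds:
        # Assumption: a signature with no Reference is treated as an error.
        raise ElaborationError("signature has no Reference")
    if len(kinds) > 1:
        return [signature]  # multiple References: no change
    kind = kinds[0]
    if kind in (SignedThing.ENVELOPED, SignedThing.DETACHED_LOCAL):
        return []  # the signed material is already in the document
    # Enveloping, or detached remote (treated as a signed inclusion).
    return list(signed_content)
```

For example, elaborate_signature(sig, True, [SignedThing.ENVELOPED], []) yields the empty sequence, so a valid enveloped signature simply disappears from the elaborated infoset.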

6.1. Extensibility

This spec. identifies three elaborating namespaces. It should be possible for W3C specs published subsequently to identify one or more additional elaborating namespaces, by specifying what elaboration means for them.

6.2. Quoting

In special circumstances, the author of a document may wish to prevent the operation of elaboration within certain sub-trees of a document. Accordingly, we define http://www.example.org/quote as an elaborating namespace, designed to be used on an eq:quote attribute. The elaboration of an element II with this attribute is defined to be an otherwise identical element II with the attribute removed, and with the special property that it short-circuits further applications of E in search of a fixed-point.
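
As a rough illustration of how quoting might interact with the treewalk, the sketch below checks for the eq:quote attribute before anything else and returns the element with that attribute removed, without descending into its children. That reading of "short-circuits" (the quoted subtree is left entirely untouched) is an assumption, as is ignoring the attribute's value. The toy Element and Attribute classes, the handlers table and the urn:example:toy-include namespace are hypothetical, and for brevity only the element's own namespace is checked as an elaboration signal.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Tuple, Union

QUOTE_NS = "http://www.example.org/quote"
TOY_NS = "urn:example:toy-include"  # hypothetical elaborating namespace


@dataclass(frozen=True)
class Attribute:
    namespace_name: Optional[str]
    local_name: str
    value: str


@dataclass(frozen=True)
class Element:
    namespace_name: Optional[str]
    local_name: str
    attributes: Tuple[Attribute, ...] = ()
    children: Tuple["Item", ...] = ()


Item = Union[Element, str]
Handler = Callable[[Element], List[Item]]


def is_quote(a: Attribute) -> bool:
    return a.namespace_name == QUOTE_NS and a.local_name == "quote"


def elaborate_star(o: Item, handlers: Dict[str, Handler]) -> List[Item]:
    if not isinstance(o, Element):
        return [o]
    if any(is_quote(a) for a in o.attributes):
        # Quoting: drop the attribute and stop; children are not visited.
        kept = tuple(a for a in o.attributes if not is_quote(a))
        return [replace(o, attributes=kept)]
    if o.namespace_name in handlers:
        out: List[Item] = []
        for item in handlers[o.namespace_name](o):
            out.extend(elaborate_star(item, handlers))
        return out
    kids: List[Item] = []
    for c in o.children:
        kids.extend(elaborate_star(c, handlers))
    return [replace(o, children=tuple(kids))]


handlers: Dict[str, Handler] = {TOY_NS: lambda o: ["EXPANDED"]}
inner = Element(TOY_NS, "include")
quoted = Element(None, "example",
                 attributes=(Attribute(QUOTE_NS, "quote", "true"),),
                 children=(inner,))
plain = Element(None, "example", children=(inner,))

print(elaborate_star(quoted, handlers))  # inner include survives untouched
print(elaborate_star(plain, handlers))   # inner include is elaborated
```

A document that shows XInclude markup as an example, rather than using it, is the kind of case this is meant to support.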

7. Infoset fixup

The infoset as defined in the Infoset spec. has several properties whose values are non-local, that is, they cannot be determined or checked for consistency solely by reference to the subtree rooted at their host II. These are

As recognized by the XInclude spec. (see References Property Fixup and subsequent sections), it follows that some fixup may be required after constructing an infoset by replacing some subtrees within an original infoset with subtrees from elsewhere. In some cases fixup means adding new attribute information items, in others a combination of that and changing the values of some infoset properties. It is conjectured that fixup can be done once, on the entire result infoset, after all elaborations have been carried out.

8. Issues