The elaborated infoset: A proposal

This has ended up reading like a W3C spec., which the TAG doesn't do, but it's the way it turned out. . . We'll have to discuss what we do about that. . .

Henry S. Thompson

27 November 2007

Publication state

This is a TAG working document; no decision has yet been taken on its eventual disposition.

This version: http://www.w3.org/2001/tag/doc/elabInfoset-20071127/

Latest version: http://www.w3.org/2001/tag/doc/elabInfoset/

Previous version: http://www.w3.org/2001/tag/doc/elabInfoset-20070130/

The main change in this version is a substantial expansion of the discussion of quotation, see section Quoting. This also involved a high-level re-ordering of sections. The rhetoric was also changed to eliminate references to elaborating namespaces.

Still needs a section on the overall model -- relation of elaborated infoset to document interpretation, overall control flow, the role of the application

The default processing model

TAG issue xmlFunctions-34 represents the TAG's commitment to consider the question of whether there is a 'default' XML processing model, and if so what it looks like. That is, aside from the obligations imposed by the XML (and XML Namespace) recommendations themselves, what, if anything, ought to be done with a document whose media type tells you it's an XML document, before any application-specific processing is attempted? Or, to put it another way, if an author takes responsibility for the information in an XML document, exactly what is s/he taking responsibility for?

The infoset

The XML Information Set specification defines a vocabulary for referring to the information content of an XML document, in the form of an abstract data model. It identifies XML parsers as the most likely source of such information, but acknowledges that other sources are possible, and several subsequent W3C specs (e.g. XInclude, XML Schema) are defined in terms of mappings from infosets to infosets.

The default processing model question can be rephrased as "Is there an infoset other than the one produced by a conformant XML parser which can and should be defined?" Indeed, exactly what the infoset of an XML document is remains somewhat under-determined even now: a well-formed XML document as processed by a conformant processor may yield two distinct infosets, depending on whether that processor processes all the external parameter entities in the document's DTD.

Generic operations and the elaborated infoset

Just as applications today can express the requirement that certain minimal processing has been done and/or that certain information must be available from the XML documents they take as input, by simply referring to the Infoset, we propose to define a more extended form of processing whose results, in information terms, can then be simply identified as the starting point for applications. Since the specification of XML and the XML information set, a number of generic XML applications have been specified, in terms of functions from infosets to infosets, which arguably should (almost) always be implemented before any more specific processing is attempted. By 'generic' I mean that their elements and/or attributes may usefully appear in almost any XML document, and are coherently interpretable without reference to the syntax or semantics of the surrounding XML (but see quoting below). Furthermore, the resulting infoset is consistent with the media type of the original XML document.

The inventory of such 'generic' applications is small, and identifying its membership correctly is likely to be one of the hard parts of this project, but here are three candidates:

- XInclude
- XML Encryption
- XML Signature

Quoting

There are three different ways in which the process of elaboration can be avoided, so that the unelaborated infoset is preserved: opting out, implicit quotation and explicit quotation.

Opting out is trivial: nothing in the definition of elaborated infosets requires a specification or processor to use it. So, for example, the next edition of XSLT probably should not mandate the elaboration of stylesheets, since on balance an xi:include element appearing in a stylesheet is most likely to be specifying a literal result element, which should not be elaborated.

In the context of an application which does call for elaboration of (some parts of) its input, two distinct kinds of quotation may be needed:

Implicit quotation

Implicit quotation provides for quotation of some parts of all documents in a particular namespace. The semantics of some parts of a particular application namespace may be best handled by blocking elaboration. Even different kinds of processing of a particular namespace may require different choices with respect to elaboration. Consider SOAP, for example. SOAP intermediaries might best be specified as elaborating down as far as the SOAP body, but no further, whereas SOAP recipients would elaborate the body. Constructors of SOAP messages might take yet a different approach. This means that both specifications and implementations may need to go into considerable detail with respect to what parts of an infoset are not elaborated. This in turn means that implementations of elaboration must provide controls which allow applications to specify which domains (subtrees) are to be treated as quoted.
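One way such a control might look is sketched below. This is only an illustration under stated assumptions: the function name elements_open_to_elaboration is hypothetical, element names are written in the '{namespace}local' notation used by Python's xml.etree.ElementTree, the SOAP 1.2 envelope namespace is used for the example message, and the two policies shown are made up for the purpose. The caller hands the traversal a set of quoted element names, and the traversal does not enter a subtree whose root has one of those names.

```python
import xml.etree.ElementTree as ET

SOAP = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace

def elements_open_to_elaboration(root, quoted_names):
    """Return the names of the elements a processor would be free to elaborate.

    quoted_names is a set of element names in '{namespace}local' form; an
    element with such a name, and everything below it, is treated as quoted.
    """
    if root.tag in quoted_names:
        return []  # quoted: neither this element nor its subtree is visited
    visited = [root.tag]
    for child in root:
        visited.extend(elements_open_to_elaboration(child, quoted_names))
    return visited

message = ET.fromstring(
    f'<env:Envelope xmlns:env="{SOAP}">'
    '<env:Header><hop/></env:Header>'
    '<env:Body><payload><item/></payload></env:Body>'
    '</env:Envelope>'
)

# Hypothetical intermediary policy: the SOAP body is a quoted domain.
intermediary = elements_open_to_elaboration(message, {f"{{{SOAP}}}Body"})
# Hypothetical recipient policy: nothing is implicitly quoted.
recipient = elements_open_to_elaboration(message, set())
```

With the intermediary policy only the envelope, the header and its hop child are visited; with the recipient policy all six elements are.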

Explicit quotation

Explicit quotation provides for quotation of parts of individual documents. In special circumstances, the author of a document may wish to prevent the operation of elaboration within certain sub-trees of a document. Accordingly, we define http://www.example.org/quote as an elaborating namespace (written here with the prefix eq), specified for use only on an eq:quote attribute, which quotes the subtree at whose root it appears.

The elaboration of an element II with this attribute is defined to be an otherwise identical element II with the attribute removed, and with the special property that it short-circuits further applications of E in search of a fixed-point.
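The effect of eq:quote on a single element can be sketched as follows, assuming Python's xml.etree.ElementTree as a stand-in for an infoset. The function name unquote, the returned flag and the attribute value "true" are assumptions made for the illustration; the text above fixes only the attribute's name and namespace.

```python
import copy
import xml.etree.ElementTree as ET

EQ_NS = "http://www.example.org/quote"   # namespace given in the text
EQ_QUOTE = f"{{{EQ_NS}}}quote"           # eq:quote in '{namespace}local' form

def unquote(element):
    """Return (result, done) for one elaboration step on an element.

    If the element carries eq:quote, the result is a copy without that
    attribute and done is True, meaning no further elaboration is to be
    attempted on the result or anything inside it. Otherwise the element
    is returned as is and done is False.
    """
    if EQ_QUOTE not in element.attrib:
        return element, False
    result = copy.deepcopy(element)   # leave the original infoset untouched
    del result.attrib[EQ_QUOTE]
    return result, True

doc = ET.fromstring(
    f'<doc xmlns:eq="{EQ_NS}" xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<sample eq:quote="true"><xi:include href="chapter.xml"/></sample>'
    '</doc>'
)
sample = doc[0]
result, done = unquote(sample)
```

The xi:include inside sample survives untouched in result, which is the point of quoting it.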

Elaboration signals

We need to establish just what the elaboration signals are, that is, what specs define one or more generic processes which it's useful to include in the definition of elaboration as a whole. Just what fits that description (which itself begs a question with the word 'useful') is an open question, but as suggested above we start with three candidates:

- The include EII in the http://www.w3.org/2001/XInclude namespace is an elaboration signal, and it should be elaborated by reference to the XInclude specification.
- The EncryptedData EII in the http://www.w3.org/2001/04/xmlenc# namespace is an elaboration signal, and it should be elaborated by reference to the XML Encryption specification. It is always an error if a decryption fails because a key is supplied but is not accepted. There are roughly three non-error cases:
  - No change, that is, the result is the EncryptedData element II itself;
  - There is a CipherValue or CipherReference element II with Type 'element' or 'content'. In this case the result is the [children] of the document II which results from parsing the decrypted octet sequence as a stream of UTF-8 encoded characters;
  - Type is unspecified or is not 'element' or 'content'. Not clear what to do here -- this is mostly in the spec to support decrypting keys, which we won't elaborate anyway. . .
- The Signature EII in the http://www.w3.org/2000/09/xmldsig# namespace is an elaboration signal, and it should be elaborated by reference to the XML Signature specification. This is not a clear or simple case, as XML Signature provides for at least three distinct kinds of signing (enveloped, enveloping and detached), and supports signing of multiple objects. As a starting point, elaboration of signing should always fail if the signature is not valid, and its value when the signature is valid should be as follows:
  - More than one thing is signed. No change, that is, the result is the Signature element II itself;
  - The thing signed is the enclosing document. In this case the result should be the empty sequence;
  - The thing signed is an Object within the signature. In this case the result should be the signed subtree within the Object, as processed by any specified Transformations;
  - The thing signed is in the same document as the signature, but not inside it. In this case the result should be the empty sequence;
  - The thing signed is elsewhere, identified by a URI. We treat this as a signed XInclude, and the result is the referenced external subtree, as processed by any specified Transformations.
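Recognising the three candidate signals amounts to a lookup on an element's namespace name and local name. The sketch below assumes Python's xml.etree.ElementTree, which reports element names as '{namespace}local'; the table and function names are hypothetical, and nothing here performs inclusion, decryption or signature validation.

```python
import xml.etree.ElementTree as ET

# (namespace name, local name) pairs for the three candidate signals,
# mapped to the specification that would govern their elaboration.
ELABORATION_SIGNALS = {
    ("http://www.w3.org/2001/XInclude", "include"): "XInclude",
    ("http://www.w3.org/2001/04/xmlenc#", "EncryptedData"): "XML Encryption",
    ("http://www.w3.org/2000/09/xmldsig#", "Signature"): "XML Signature",
}

def split_name(tag):
    """Split ElementTree's '{namespace}local' name into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag  # element in no namespace

def governing_spec(element):
    """Name the governing spec if element is an elaboration signal, else None."""
    return ELABORATION_SIGNALS.get(split_name(element.tag))

doc = ET.fromstring(
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude"'
    ' xmlns:enc="http://www.w3.org/2001/04/xmlenc#">'
    '<xi:include href="part.xml"/><enc:EncryptedData/><include/></doc>'
)
found = [governing_spec(child) for child in doc]
```

Note that the last child, an include element in no namespace, is not a signal: both the local name and the namespace name have to match.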

Extensibility

This spec. identifies three elaboration signals. It should be possible for W3C specs published subsequently to identify one or more additional elaboration signals, by specifying what elaboration means for them.

Elaboration defined: top-down treewalk and signals

The basic idea is that the elaborated infoset is constructed by a top-down traversal of the original infoset, replacing each elaborating element with its elaboration. An element information item signals that it is an elaborating element either by itself being an elaboration signal, or by being the owner of an attribute II which is an elaboration signal. For example, an EII whose name is include in the XInclude namespace is an elaborating element, with its elaboration as determined by the XInclude spec. The elaboration process applies to its own output: for example, if the result of XInclude processing of an element is a sequence of elements, one of which is itself named EncryptedData in the XML Encryption namespace, that element will in turn be elaborated.

More formally, the elaborated infoset of an infoitem is defined by a function E from information items ('II' for short) and a set of implicit quotation element names (IQNs) to (sequences of) information items, by cases over the kind of information item. In each case we refer to the original information item as o and the result of a single elaboration, that is E(o,IQNs), as e, and to the values of properties of information items using a '.' and the property name, e.g. o.local name.

The elaboration of an II o is F(E(o,IQNs)), where F is defined in Infoset fixup below and E is defined as follows:

For an element II:

- if o was named as an implicit quotation element, that is, its name is a member of IQNs, then e is an infoitem of the same kind as o, with the same properties and values;
- otherwise, if o.attributes contains an AII whose name is quote in the elaboration quotation namespace, then e is an element II with the same properties and values as o except for the [attributes] property, from which the eq:quote attribute is removed;
- otherwise, if o is an elaboration signal or o.attributes contains an elaboration signal, then e is a (possibly empty) sequence of element, processing instruction, unexpanded entity reference, character, and comment information items, the result of processing o according to the specification governing the elaboration signal;
- otherwise e is an element II with the same properties and values as o except for the [children] property, whose value is the concatenation of E*(c,IQNs) for each child c in o.children, in order.

By E* is meant the result of repeated applications of E to (the members of) its own value until a fixed-point is reached.

For a document II: e is a document II with the same properties and values as o except for the [document element] and [children] properties: e.document element is E*(o.document element,IQNs), which also becomes the single element II among e.children. It is an error if E*(o.document element,IQNs) is not a single element II.

For any other II: E is the identity, that is, e is an infoitem of the same kind as o, with the same properties and values.

The elaboration process as a whole fails if any individual elaboration fails with an error.
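The definition of E and E* can be sketched in Python over xml.etree.ElementTree elements. This is a simplified illustration, not a conformant implementation, and its assumptions are several: only elements can be signals (attribute signals are ignored), a signal's result is a list of elements only, the handlers table and all function names are hypothetical, and toy_include is a stand-in that looks up in-memory strings rather than performing real XInclude processing. Each step also returns a flag saying whether a fixed-point has been reached, which is how quotation short-circuits E*.

```python
import copy
import xml.etree.ElementTree as ET

EQ_QUOTE = "{http://www.example.org/quote}quote"   # eq:quote
XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

def append_text(parent, text):
    """Add character data after whatever parent currently ends with."""
    if text:
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + text
        else:
            parent.text = (parent.text or "") + text

def elaborate_once(o, iqns, handlers):
    """E(o, IQNs) for an element: returns (items, settled)."""
    if o.tag in iqns:                        # implicit quotation: unchanged
        return [copy.deepcopy(o)], True
    if EQ_QUOTE in o.attrib:                 # explicit quotation: drop eq:quote
        e = copy.deepcopy(o)
        del e.attrib[EQ_QUOTE]
        return [e], True
    if o.tag in handlers:                    # o is an elaboration signal
        return handlers[o.tag](o), False     # the output is elaborated again
    e = ET.Element(o.tag, dict(o.attrib))    # ordinary element: rebuild children
    e.text = o.text
    for c in o:
        for item in elaborate_star(c, iqns, handlers):
            item.tail = None
            e.append(item)
        append_text(e, c.tail)               # text that followed c stays in place
    return [e], True

def elaborate_star(o, iqns, handlers):
    """E*(o, IQNs): reapply E to the members of its value until settled."""
    items, settled = elaborate_once(o, iqns, handlers)
    if settled:
        return items
    out = []
    for item in items:
        out.extend(elaborate_star(item, iqns, handlers))
    return out

def elaborate_document(root, iqns, handlers):
    """Document case: E* of the document element must be a single element."""
    items = elaborate_star(root, iqns, handlers)
    if len(items) != 1:
        raise ValueError("E* of the document element is not a single element")
    return items[0]

# Toy stand-in for inclusion: hypothetical in-memory resources, no fetching.
RESOURCES = {
    "a.xml": '<a><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="b.xml"/></a>',
    "b.xml": "<b/>",
}

def toy_include(o):
    return [ET.fromstring(RESOURCES[o.get("href")])]

DOC = (
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude"'
    ' xmlns:eq="http://www.example.org/quote">'
    '<xi:include href="a.xml"/>'
    '<keep eq:quote="true"><xi:include href="b.xml"/></keep>'
    '</doc>'
)
result = elaborate_document(ET.fromstring(DOC), set(), {XI_INCLUDE: toy_include})
```

In the result, the first include has been replaced by a, whose own nested include has in turn been replaced by b, while the include inside the quoted keep element is left as it was and keep has lost its eq:quote attribute. The sketch has no guard against inclusion loops, and it performs no infoset fixup.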

Infoset fixup

The infoset as defined in the Infoset spec. has several properties whose values are non-local, that is, they cannot be determined or checked for consistency solely by reference to the subtree rooted at their host II. These are:

- the [references] property of attribute IIs, whose value when its sibling [attribute type] property is IDREF or IDREFS is the set of referenced element IIs, which may be anywhere in the surrounding document;
- the [in-scope namespaces] property of element IIs, whose value should be consistent with the impact of the values not only of its sibling [namespace attributes] property, but also the values of that property up the [parent] chain;
- the [base URI] property of element and processing instruction IIs, whose value should be consistent with the values of the xml:base attribute in the sibling [attributes] property and up the [parent] chain;
- the [language] property of element IIs, whose value should be consistent with the values of the xml:lang attribute in the sibling [attributes] property and up the [parent] chain.

As recognized by the XInclude spec. (see References Property Fixup and subsequent sections), it follows that some fixup may be required after constructing an infoset by replacing some subtrees within an original infoset with subtrees from elsewhere. In some cases fixup means adding new attribute information items, in others a combination of that and changing the values of some infoset properties. It is conjectured that fixup can be done once, on the entire result infoset, after all elaborations have been carried out.
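To make the non-locality concrete, here is a small sketch for the [language] property only, again assuming Python's xml.etree.ElementTree. The function names are hypothetical and the approach is loosely modelled on the idea of adding an attribute during fixup, not taken from any specification's algorithm: the language in effect for an element is computed by walking down from the root, and a subtree moved under a parent with a different language is given an explicit xml:lang so that its language does not silently change.

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"   # xml:lang

def language_map(root, inherited=""):
    """Map each element to the language in effect for it, walking top-down."""
    lang = root.get(XML_LANG, inherited)
    result = {root: lang}
    for child in root:
        result.update(language_map(child, lang))
    return result

def append_preserving_language(parent, parent_language, subtree, subtree_language):
    """Append subtree to parent, adding xml:lang if its language would change."""
    if XML_LANG not in subtree.attrib and subtree_language != parent_language:
        subtree.set(XML_LANG, subtree_language)
    parent.append(subtree)

source = ET.fromstring('<book xml:lang="en"><title>Hello</title></book>')
target = ET.fromstring('<livre xml:lang="fr"><chapitre/></livre>')

src_lang = language_map(source)[source[0]]    # language of title in its own document
dst_lang = language_map(target)[target[0]]    # language in effect at the new parent

moved = copy.deepcopy(source[0])
append_preserving_language(target[0], dst_lang, moved, src_lang)
after = language_map(target)[moved]
```

Without the added attribute, the moved title would inherit the language of its new ancestors. The other three properties listed above would need analogous treatment, which this sketch does not attempt.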

Issues

- Should we allow attributes to be elaboration signals? If not, do we use eq:quote to wrap quoted elements? If we do, what do we do about multiple signals on a single EII? Quoting clearly takes precedence, but how do we order the others?
- Is doing fixup only at the end good enough?
- Presumably we should do fixup on notation references and unparsed entity references as well as ID ones.
- Should we require all external parameter entity references to be processed?
- What do we do when we hit encrypted non-XML data. . .
- The whole Signature thing is very complicated, and I'm not sure almost any of it is right. . .