<?xml version='1.0'?>
<?xml-stylesheet type="text/xsl" href="../../../../../lib/xml/doc.xsl" ?>
<!DOCTYPE doc SYSTEM "../../../../../lib/xml/doc.dtd" >
<doc>
 <head>
  <note>This has ended up reading like a W3C spec., which the TAG doesn't do,
but it's the way it turned out. . .  We'll have to discuss what we do about
that. . .</note>
  <title>The elaborated infoset:  A proposal</title>
  <author>Henry S. Thompson</author>
  <date>27 November 2007</date>
 </head>
 <body>
  <div>
   <title>Publication state</title>
   <p>This is a TAG working document; no decision has yet been taken on its
eventual disposition.</p>
   <list type="1defn">
    <item term="This version"><link href="http://www.w3.org/2001/tag/doc/elabInfoset-20071127/">http://www.w3.org/2001/tag/doc/elabInfoset-20071127/</link></item>
    <item term="Latest version"><link href="http://www.w3.org/2001/tag/doc/elabInfoset/">http://www.w3.org/2001/tag/doc/elabInfoset/</link></item>
    <item term="Previous version"><link href="http://www.w3.org/2001/tag/doc/elabInfoset-20070130/">http://www.w3.org/2001/tag/doc/elabInfoset-20070130/</link></item>
   </list>
   <p>The main change in this version is a substantial expansion of
the discussion of quotation; see section <link href="#quoting">Quoting</link>. 
This also involved a high-level re-ordering of sections.  The rhetoric was also
changed to eliminate references to elaborating <emph>namespaces</emph>.</p>
   <note>Still needs a section on the overall model -- relation of elaborated
infoset to document interpretation, overall control flow, the role of the application</note>
  </div>
  <div id="default">
   <title>The default processing model</title>
   <p>TAG issue <link href="http://www.w3.org/2001/tag/issues.html?type=1#xmlFunctions-34">xmlFunctions-34</link> represents the TAG's commitment to consider the question of whether there is a 'default' XML processing model, and if so what it looks like.  That is, aside from the obligations imposed by the XML (and XML Namespace) recommendations themselves, what, if anything, <emph>ought</emph> to be done with a document whose media type tells you it's an XML document, before any application-specific processing is attempted?  Or, to put it another way, if an author takes responsibility for the information in an XML document, exactly <emph>what</emph> is s/he taking responsibility for?</p>
  </div>
  <div id="infoset">
   <title>The infoset</title>
   <p>The <link href="http://www.w3.org/TR/xml-infoset/">XML Information
Set</link> specification defines a vocabulary for referring to the information
content of an XML document, in the form of an abstract data model.  It
identifies XML parsers as the most likely source of such information, but
acknowledges that other sources are possible, and several subsequent W3C specs
(e.g. <link href="http://www.w3.org/TR/xinclude/">XInclude</link>, <link href="http://www.w3.org/TR/xmlschema-1/">XML Schema</link>) are defined in terms of mappings <emph>from</emph> infosets <emph>to</emph> infosets.</p>
   <p>The default processing model question can be rephrased as "Is there an
infoset other than the one produced by a conformant XML parser which can and
should be defined?"  Indeed, exactly what counts as <emph>the</emph> infoset of an XML
document is already somewhat under-determined, in that a well-formed XML
document as processed by a conformant processor may yield two distinct
infosets, depending on whether that processor processes all the external
parameter entities in the document's DTD.</p>
  </div>
  <div id="generic">
   <title>Generic operations and the elaborated infoset</title>
   <p>Just as applications today can express the requirement that certain
minimal processing has been done and/or that certain information must be
available from the XML documents they take as input, by simply referring to the
Infoset, we propose to define a more extended form of processing whose results,
in information terms, can then be simply identified as the starting point for
applications.  Since the XML and XML Information Set specifications were published, a
number of <emph>generic</emph> XML applications have been defined, in terms
of functions from infosets to infosets, which
arguably should (almost) always be applied before any more specific
processing is attempted.  By 'generic' I mean that their elements and/or
attributes may usefully appear in almost any XML document, and are coherently
interpretable without reference to the syntax or semantics of the surrounding
XML (but see <link href="#quoting">quoting</link> below).  Furthermore, the
resulting infoset is consistent with the media type of the original XML document.</p>
   <p>The inventory of such 'generic' applications is small, and identifying
its membership correctly is likely to be one of the hard parts of this project,
but here are three candidates:</p>
   <list>
    <item><link href="http://www.w3.org/TR/xinclude/">XInclude</link></item>
    <item><link href="http://www.w3.org/TR/xmlenc-core/">XML Encryption</link></item>
    <item><link href="http://www.w3.org/TR/xmldsig-core/">XML Signature</link></item>
   </list>
  </div>
  <div id="quoting">
    <title>Quoting</title>
    <p>There are three different ways in which the process of elaboration can
be avoided, so that the unelaborated infoset is preserved: opting out, implicit
quotation and explicit quotation.  Opting out is trivial:  Nothing in the
definition of elaborated infosets <emph>requires</emph> a specification or
processor to use it.  So, for example, the next edition of XSLT probably should
<emph>not</emph> mandate the elaboration of stylesheets, since on balance the
presence therein of e.g. an <code>xi:include</code> element is most likely to
be specifying a literal result element, and should not be elaborated.</p>
    <p>In the context of an application which does call for elaboration
of (some parts of) its input, two distinct kinds of quotation may be needed:</p>
    <div id="implicit">     
     <title>Implicit quotation</title>
     <p>Implicit quotation provides for quotation of some parts of all
documents in a particular namespace.  The semantics of some parts of a particular application namespace may
be best handled by blocking elaboration.  Even different kinds of processing of
a particular namespace may require different choices with respect to
elaboration.  Consider SOAP, for example.  SOAP intermediaries might best be
specified as
elaborating down as far as the SOAP body, but no further, whereas SOAP
recipients would elaborate the body.  <emph>Constructors</emph> of SOAP
messages might take yet a different approach.  This means that both
specifications and implementations may need to go into considerable detail with
respect to what parts of an infoset are not elaborated.  This in turn means
that implementations <emph>of elaboration</emph> must provide controls which
allow applications to specify which domains (subtrees) are to be treated as quoted.</p>
    </div>
    <div id="explicit">
     <title>Explicit quotation</title>
     <p>Explicit quotation provides for quotation of parts of individual documents.  In special circumstances, the author of a document may wish to
prevent the operation of elaboration within certain sub-trees
of a document.  Accordingly, we define
<code>http://www.example.org/quote</code> as the <emph>elaboration quotation namespace</emph>, specified for use only on an <code>eq:quote</code> attribute, which quotes any subtree it appears at the root of.</p>
     <p>The elaboration of an element II with this attribute is defined to be an otherwise identical element II with the attribute removed, <emph>and</emph> with the special property that it short-circuits further applications of <name>E</name> in search of a fixed point.</p>
    </div>
   </div>
  <div id="elab_sig">
   <title>Elaboration signals</title>
   <p>We need to establish just what the elaboration signals are, that is,
what specs define one or more generic processes which it's useful to include in
the definition of elaboration as a whole.  Just what fits that description
(which itself begs a question with the word 'useful') is an open question, but
as suggested above we start with three candidates:</p>
   <list type="defn">
    <item term="inclusion">The <code>include</code> element information item (EII) in the <code>http://www.w3.org/2001/XInclude</code>
namespace is an <emph>elaboration signal</emph>, and it should
be elaborated by reference to the <link href="http://www.w3.org/TR/xinclude/">XInclude</link> specification.</item>
    <item term="decryption">The <code>EncryptedData</code> EII in the
<code>http://www.w3.org/2001/04/xmlenc#</code> namespace is an
<emph>elaboration signal</emph>, and it should
be elaborated by reference to the <link href="http://www.w3.org/TR/xmlenc-core/">XML Encryption</link> specification.  It is always an error if a decryption fails because a key is supplied but is not accepted.  There are roughly three non-error cases:
     <list type="defn">
      <item term="No key">no change, that is, the
<code>EncryptedData</code> element II itself;</item>
      <item term="XML data">That is, there is a CipherValue or CipherReference element II with
Type 'element' or 'content'.  In
this case the result is the <name>[children]</name> of the document II which
results from parsing the decrypted octet sequence as a stream of UTF-8 encoded characters;</item>
      <item term="other data">That is, Type is unspecified or is neither 'element'
nor 'content'.  Not clear what to do here -- this is <emph>mostly</emph>
in the spec to support decrypting keys, which we won't elaborate anyway. . .</item>
     </list>
    </item>
    <item term="signature checking">The <code>Signature</code> EII in the
<code>http://www.w3.org/2000/09/xmldsig#</code> namespace is an
<emph>elaboration signal</emph>, and it should
be elaborated by reference to the <link href="http://www.w3.org/TR/xmldsig-core/">XML Signature</link> specification.  This is not a clear or simple case, as XML Signature provides for at least three distinct kinds of signing (enveloped, enveloping and detached), and supports signing of multiple objects.  As a starting point, elaboration of signing should always fail if the signature is not valid, and its value when the signature <emph>is</emph> valid should be as follows:
     <list type="defn">
      <item term="Multiple References">That is, more than one thing is signed.
 No change, that is, the
<code>Signature</code> element II itself;</item>
      <item term="enveloped">That is, the thing signed is the enclosing
document.  In this case the result should be the empty sequence;</item>
      <item term="enveloping">That is, the thing signed is an Object within the
signature.  In this case the result should be the signed subtree within the
Object, as processed by any specified Transformations;</item>
      <item term="detached, local">That is, the thing signed is in the same
document as the signature, but not inside it.  In this case the result should be the empty sequence;</item>
      <item term="detached, remote">That is, the thing signed is elsewhere,
identified by a URI.  We treat this as a signed XInclude, and the result is the
referenced external subtree, as processed by any specified Transformations.</item>
     </list>
    </item>
   </list>
   <div id="extend">
    <title>Extensibility</title>
    <p>This spec. identifies three elaboration signals.  It should be
possible for W3C specs published subsequently to identify one or more
additional elaboration signals, by specifying what elaboration means for them.</p>
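    <p>As a rough illustration only, the three signals named above, together with this extension point, can be sketched as a lookup table in Python.  The names <code>SIGNALS</code>, <code>register_signal</code> and <code>is_elaboration_signal</code> are hypothetical and not part of this proposal; the table simply maps the expanded name of an EII to the specification which governs its elaboration.</p>
    <p><code><![CDATA[
```python
# Hypothetical registry: (namespace name, local name) of an element
# information item -> URI of the specification governing its elaboration.
SIGNALS = {
    ("http://www.w3.org/2001/XInclude", "include"):
        "http://www.w3.org/TR/xinclude/",
    ("http://www.w3.org/2001/04/xmlenc#", "EncryptedData"):
        "http://www.w3.org/TR/xmlenc-core/",
    ("http://www.w3.org/2000/09/xmldsig#", "Signature"):
        "http://www.w3.org/TR/xmldsig-core/",
}


def register_signal(namespace, local_name, spec_uri, registry=SIGNALS):
    """Record an additional elaboration signal defined by a later spec."""
    key = (namespace, local_name)
    if key in registry:
        raise ValueError("already an elaboration signal: %s %s" % key)
    registry[key] = spec_uri


def is_elaboration_signal(namespace, local_name, registry=SIGNALS):
    """True if an element with this expanded name is an elaboration signal."""
    return (namespace, local_name) in registry
```
]]></code></p>
    <p>Under these assumptions a later specification would add its own entry with a single call to <code>register_signal</code>, giving a namespace name, a local name and the URI of the governing specification.</p>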
   </div>
  </div>
  <div id="definition">
   <title>Elaboration defined: top-down treewalk and signals</title>
   <p>The basic idea is that the elaborated infoset is constructed by a
top-down traversal of the original infoset, replacing with its elaboration each element information item which
signals that it is an <emph>elaborating element</emph>, either by itself being
an <emph>elaboration signal</emph>, or by being the owner of an attribute II
which is an <emph>elaboration signal</emph>.  For example,
an EII whose name is <code>include</code> in the XInclude namespace is an
<emph>elaborating element</emph>, with its
elaboration as determined by the XInclude spec.  The elaboration process
applies to its own output, that is, for example, if the result of XInclude
processing of an element is a sequence of elements, one of which is itself
named <code>EncryptedData</code> in
the XML Encryption namespace, <emph>that</emph> element will in turn be elaborated.</p>
   <p>More formally, the elaborated infoset of an infoitem is defined by a function <name>E</name> from
information items ('II' for short) and a set of <link href="#implicit">implicit
quotation</link> element names (<name>IQNs</name>) to (sequences of) information items, by cases over the kind
of information item.  In each case we
refer to the original information item as <name>o</name> and the result of a
single elaboration, that is <name>E(o,IQNs)</name>, as <name>e</name>, and to the values of
properties of information items using a '.' and the property name, e.g.
<name>o.local name</name>.</p>
   <p>The
elaboration of an II <name>o</name> is <name>F(E(o,IQNs))</name>, where
<name>F</name> is defined in <link href="#fixup">Infoset fixup</link> below and <name>E</name> is defined as follows:</p>
   <list type="defn">
    <item term="element II">If <name>o</name> is named as an <link href="#implicit">implicit
quotation</link> element, that is, its name is a member of <name>IQNs</name>, then <list type="naked">
<item><name>e</name> is an infoitem of the same kind as <name>o</name>, with the same
properties and values</item>
</list>
otherwise if <name>o.attributes</name> contains an AII whose
name is <code>quote</code> in the <link href="#explicit">elaboration quotation
namespace</link>, then <list type="naked">
<item><name>e</name> is an element II with the same properties and values as
<name>o</name> except for the <name>[attributes]</name> property, from which
the <code>eq:quote</code> attribute is removed</item>
</list> 
otherwise if <name>o</name> is an
<link href="#elab_sig">elaboration
signal</link> or <name>o.attributes</name> contains an
<link href="#elab_sig">elaboration
signal</link> then <list type="naked" indent="1em">
<item><name>e</name> is a (possibly empty) sequence of element, processing
instruction, unexpanded entity reference, character, and comment information
items, the result of processing <name>o</name> according to the specification
governing the <emph>elaboration signal</emph>;</item>
</list> otherwise
     <list type="naked" indent="1em">
      <item><name>e</name> is an element II with the same properties and values
as <name>o</name> except for the <name>[children]</name> property, whose value
is the concatenation of <name>E*(c,IQNs)</name> for each child <name>c</name> in
<name>o.children</name>, in order.  By <name>E*</name> is meant the result of repeated applications of <name>E</name> to
(the members of) its own value until a fixed-point is reached.</item>
     </list>
       </item>
    <item term="document II"><name>e</name> is a document II with the same
properties and values as <name>o</name> except for the <name>[document
element]</name> and <name>[children]</name> properties: <name>e.document
element</name> is <name>E*(o.document element,IQNs)</name>, which also becomes the single element II among
<name>e.children</name>.  It is an error if <name>E*(o.document element,IQNs)</name> is not a
single element II.</item>
    <item term="all other kinds of II"><name>E</name> is the identity, that is
<name>e</name> is an infoitem of the same kind as <name>o</name>, with the same
properties and values.</item>
   </list>
   <p>The elaboration process as a whole fails if any individual elaboration
fails with an error.</p>
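   <p>The element case of <name>E</name> and the fixed-point operator <name>E*</name> can be sketched in Python as follows.  This is a minimal illustration under stated assumptions, not a definitive implementation: the <code>Elem</code> class is a toy stand-in for an element II, the <code>quoted</code> flag is one possible way to model the short-circuiting effect of explicit quotation, only element names (not attributes) act as elaboration signals, <code>toy_include</code> is a placeholder for real XInclude processing, and both the document II case and the fixup function <name>F</name> are left out.</p>
   <p><code><![CDATA[
```python
from dataclasses import dataclass, field

# Expanded name of the eq:quote attribute: (namespace name, local name).
QUOTE = ("http://www.example.org/quote", "quote")


@dataclass
class Elem:
    """Toy stand-in for an element II; strings stand in for character IIs."""
    name: tuple
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    quoted: bool = False  # assumption: marks the result of explicit quotation


def E(o, iqns, signals):
    """One elaboration step; always returns a list of items."""
    if not isinstance(o, Elem) or o.quoted:
        return [o]            # identity: not an element, or already quoted
    if o.name in iqns:
        return [o]            # implicit quotation: left untouched
    if QUOTE in o.attrs:      # explicit quotation: drop eq:quote and stop
        attrs = {k: v for k, v in o.attrs.items() if k != QUOTE}
        return [Elem(o.name, attrs, o.children, quoted=True)]
    if o.name in signals:     # elaboration signal: defer to its handler
        return list(signals[o.name](o))
    children = [r for c in o.children for r in E_star(c, iqns, signals)]
    return [Elem(o.name, o.attrs, children)]


def E_star(o, iqns, signals):
    """Apply E to its own output until nothing changes (naive fixed point)."""
    items = [o]
    while True:
        result = [r for i in items for r in E(i, iqns, signals)]
        if result == items:
            return result
        items = result


XI_INCLUDE = ("http://www.w3.org/2001/XInclude", "include")


def toy_include(o):
    """Placeholder for real XInclude processing: returns a canned element."""
    return [Elem(("", "included"), {}, ["some text"])]


doc = Elem(("", "doc"), {}, [
    Elem(XI_INCLUDE, {("", "href"): "a.xml"}),
    Elem(("", "sample"), {QUOTE: "true"},
         [Elem(XI_INCLUDE, {("", "href"): "b.xml"})]),
])
elaborated = E_star(doc, set(), {XI_INCLUDE: toy_include})[0]
```
]]></code></p>
   <p>In this sketch the first <code>include</code> child is replaced by the handler's output, while the one inside the element carrying <code>eq:quote</code> survives unelaborated and the attribute itself is removed.  The repeated re-application in <code>E_star</code> is deliberately naive and revisits subtrees more often than a real implementation would need to.</p>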
  </div>
  <div id="fixup">
   <title>Infoset fixup</title>
   <p>The infoset as defined in the Infoset spec. has several properties whose
values are non-local, that is, they cannot be determined or checked for consistency solely by reference to
the subtree rooted at their host II.  These are</p>
   <list>
    <item>the <name>[references]</name>
property of attribute IIs, whose value when its sibling
<name>[attribute type]</name> property is <code>IDREF</code> or
<code>IDREFS</code> is the set of referenced element IIs, which may be anywhere
in the surrounding document;</item>
    <item>the <name>[in-scope namespaces]</name>
property of element IIs, whose value should be consistent with the impact of
the values not only of its sibling <name>[namespace attributes]</name> property, but
also the values of that property up the <name>[parent]</name> chain;</item>
    <item>the <name>[base URI]</name> property of element and
processing instruction IIs, whose value should be consistent with the values of
the <name>xml:base</name> attribute in the sibling <name>[attributes]</name>
property and
up the <name>[parent]</name> chain;</item>
    <item>the <name>[language]</name> property of element IIs, whose value should be consistent with the values of
the <name>xml:lang</name> attribute in the sibling <name>[attributes]</name>
property and
up the <name>[parent]</name> chain.</item>
   </list>
   <p>As recognized by the XInclude spec. (see <link href="http://www.w3.org/TR/xinclude/#references-property">references Property Fixup</link> and subsequent sections), it follows that some fixup may be required after constructing an infoset by replacing some subtrees within an original infoset with subtrees from elsewhere.  In some cases fixup means adding new attribute information items, in others a combination of that and changing the values of some infoset properties.  It is conjectured that fixup can be done once, on the entire result infoset, after all elaborations have been carried out.</p>
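   <p>To make the non-locality concrete, here is a small Python sketch, using the standard <code>xml.etree.ElementTree</code> module, of a single top-down pass which recomputes a <name>[language]</name>-like value for every element after a subtree has been grafted in.  The function name <code>language_map</code> and the sample documents are hypothetical, and the sketch covers only the case where property values are changed; the alternative of adding attributes to preserve the original values is not shown.</p>
   <p><code><![CDATA[
```python
import xml.etree.ElementTree as ET

# ElementTree's expanded-name spelling of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_map(element, inherited=None):
    """Recompute a [language]-like value for every element, top-down."""
    lang = element.get(XML_LANG, inherited)
    result = {element: lang}
    for child in element:
        result.update(language_map(child, lang))
    return result


# A subtree authored elsewhere, grafted in as if by some elaboration step.
host = ET.fromstring(
    '<doc xml:lang="en"><slot/><p xml:lang="fr"><q/></p></doc>')
graft = ET.fromstring('<note><para/></note>')
slot = host.find("slot")
host.insert(list(host).index(slot), graft)
host.remove(slot)

# One pass over the whole result, after all replacements have been made.
langs = language_map(host)
```
]]></code></p>
   <p>Because the value for each element depends only on its own attributes and on the value computed for its parent, one pass over the final tree is enough in this simplified setting, which is in the spirit of the conjecture above.</p>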
  </div>
  <div id="issues">
   <title>Issues</title>
   <list>
    <item>Should we allow attributes to be <emph>elaboration signals</emph>? If
not, do we use <code>eq:quote</code> to wrap quoted elements?</item>
    <item>If we do, what do we do about multiple signals on a single EII? 
Quoting clearly takes precedence, but how do we order the others?</item>
    <item>Is doing fixup only at the end good enough?  Presumably we should do
fixup on notation references and
unparsed entity references as well as ID ones.</item>
    <item>Should we require all external parameter entity references to be processed?</item>
    <item>What do we do when we hit encrypted non-XML data?</item>
    <item>The whole Signature thing is very complicated, and I'm not sure
that much of it is right. . .</item>
   </list>
  </div>
 </body>
</doc>
