The XML Processing Model

Author: Philippe Le Hégaret, W3C

Status of this document

This document is a position paper for the XML Processing Model Workshop. It does not represent the position of the W3C or of the DOM Working Group.

Introduction

The XML Processing Model defines how an XML document is interpreted by an application. The document could then be rendered on a screen using different views, used to process XML data, etc. The model can be separated into two phases: the XML pipeline, intended to define the Data Model, and the post XML pipeline, intended to use the Data Model.

XML Pipeline

The XML Pipeline is the specification of each step involved in the XML Processor: how and when specifications (defined by Unicode, IETF, W3C, or other organisations) take place in the XML Processor.

Case studies

Unstable Infoset: XML 1.0: Depending on whether a DTD is present, you end up with a different Infoset: the normalization of attribute values will not be the same, and attribute type information might be missing. A generic application cannot rely on this information if it does not control the XML Pipeline.
Modification of the Infoset: XML+Namespaces and XML Inclusions: Depending on whether your implementation supports XML Inclusions, the resulting Infoset will be different. If you intend to develop a generic XML processor, mixing XML applications (such as SVG or MathML) with and without XInclude support is not possible.
Incompatibilities with the Infoset: XHTML 1.0: XHTML 1.0 was released with its own base URI resolution mechanism, and thus this specification cannot rely entirely on the Infoset (see also Proposed behaviors of baseURI in document).
Modification in the Infoset: XML Schemas: XML Schemas introduce locally typed elements. Until now, the type of a DOM node was fixed at creation time and could not be changed afterwards; moving a DOM node in the tree can now change its type.
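The first two case studies can be sketched with Python's standard library. This is only an illustration under stated assumptions: the element and attribute names, the in-memory PARTS table and the load_part helper are hypothetical, and the behaviour shown is that of the expat-based parser and the ElementInclude module shipped with Python, not of XML processors in general. The same attribute should come out with a different value and a different set of neighbours depending on the presence of the DTD, and the same document should yield different children depending on whether XInclude processing is applied.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Case 1: the same element, with and without an internal DTD subset.
# Names and values are made up for this illustration.
WITH_DTD = (
    '<!DOCTYPE doc [<!ATTLIST doc tokens NMTOKENS #IMPLIED lang CDATA "en">]>'
    '<doc tokens="  a   b  "/>'
)
WITHOUT_DTD = '<doc tokens="  a   b  "/>'

# With the DTD, the parser knows "tokens" is tokenized and "lang" has a default.
attrs_with_dtd = dict(ET.fromstring(WITH_DTD).attrib)
# Without it, every attribute is treated as CDATA and no default is supplied.
attrs_without_dtd = dict(ET.fromstring(WITHOUT_DTD).attrib)

# Case 2: the same document, with and without XInclude processing.
XI_NS = "http://www.w3.org/2001/XInclude"
HOST = f'<doc xmlns:xi="{XI_NS}"><xi:include href="part.xml"/></doc>'
PARTS = {"part.xml": "<part>included</part>"}  # hypothetical in-memory resources


def load_part(href, parse, encoding=None):
    """Resolve an include from PARTS instead of the file system."""
    text = PARTS[href]
    return ET.fromstring(text) if parse == "xml" else text


def child_tags(root):
    """Return the expanded names of the direct children of root."""
    return [child.tag for child in root]


without_xinclude = ET.fromstring(HOST)
with_xinclude = ET.fromstring(HOST)
ElementInclude.include(with_xinclude, loader=load_part)

if __name__ == "__main__":
    print("with DTD:        ", attrs_with_dtd)
    print("without DTD:     ", attrs_without_dtd)
    print("without XInclude:", child_tags(without_xinclude))
    print("with XInclude:   ", child_tags(with_xinclude))
```

A generic application receiving either tree has no way to tell which pipeline produced it, which is the point made by the case studies.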

Data Model

What are the expectations from an XML application? Should it be based on a defined set of specifications: Unicode 3.1.0, RFC 2396, XML 1.0, Namespaces, XML Base, XInclude, XML Schemas, and XLink/XPointer? Or should we continue our current approach (<7!), i.e. each XML application defines its own set? Of course, the answer is not easy but, hopefully, the Infoset will reduce the number of specifications involved in the XML Pipeline: Infoset, XInclude, XML Schemas, XLink/XPointer (<4!). The PSV Infoset reduces this number further: PSVI, XInclude, XLink/XPointer.

This leads us to a common data model. For historical reasons, several data models have been developed in the W3C: DOM, XPath 1.0, Infoset, PSV Infoset, XML Query, etc. Each of them adds information to, or removes information from, the previous one. For example, the recent XQuery 1.0 and XPath 2.0 Data Model adds reference node information items on top of the PSV Infoset.

The DOM Data Model adds more information, such as CDATA sections or entity references. It would be difficult to change the DOM Data Model for backward compatibility reasons but, using the Load and Save model, the DOM is able to address requirements from the Infoset without breaking backward compatibility. We are also able to represent the PSV Infoset using the Abstract Schemas/PSVI Object Model. In my opinion, each new XML application should be defined against the PSVI, including XInclude.
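The difference between data models can be made concrete with a small sketch. It assumes Python's standard library, where xml.dom.minidom stands in for a DOM implementation and xml.etree.ElementTree for a simpler, Infoset-like view; both choices, and the helper names, are assumptions of this illustration only. The two input documents carry the same character data, once in a CDATA section and once with an escaped character.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node
from xml.dom.minidom import parseString

# Same character content, two different lexical forms.
WITH_CDATA = "<a><![CDATA[x<y]]></a>"
WITH_ESCAPE = "<a>x&lt;y</a>"


def dom_child_type(xml_text):
    """Node type of the first child of the root, as seen through the DOM."""
    return parseString(xml_text).documentElement.firstChild.nodeType


def plain_text(xml_text):
    """Character content of the root, as seen through ElementTree."""
    return ET.fromstring(xml_text).text


if __name__ == "__main__":
    # The DOM view keeps the CDATA section as a distinct kind of node.
    print(dom_child_type(WITH_CDATA) == Node.CDATA_SECTION_NODE)
    print(dom_child_type(WITH_ESCAPE) == Node.TEXT_NODE)
    # The simpler view does not distinguish the two documents.
    print(plain_text(WITH_CDATA) == plain_text(WITH_ESCAPE))
```

An application written against the DOM view can observe a distinction that an application written against the simpler view cannot, which is why the choice of a common data model matters.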

Post XML pipeline

Each XML application defines its own XML Processing Model. The MathML 2.0 specification defines how to read and interpret a MathML 2.0 document, ditto for SVG 1.0. Each XML application is (almost) well defined in its own space. Problems appear when you start using the major property of the XML Namespaces recommendation: mixing XML applications in the same XML document. What does it mean to put an SVG graphic in a MathML document? Or an XML Query in a MathML document? For the latter, the user might want to run the query and render the resulting Infoset, or might want to render the XML Query itself.
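As a minimal sketch of what a generic processor sees in such a mixed document, the following groups elements by namespace, which is the only information available to decide which XML application should handle which subtree. The sample document is made up for illustration and is not claimed to be valid against either specification; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative compound document: an SVG fragment inside MathML markup.
MIXED = f"""<math xmlns="{MATHML_NS}">
  <mrow>
    <mi>x</mi>
    <svg xmlns="{SVG_NS}" width="10" height="10"><circle r="5"/></svg>
  </mrow>
</math>"""


def namespace_of(tag):
    """Extract the namespace URI from an ElementTree '{uri}local' tag."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def elements_per_namespace(xml_text):
    """Count the elements of a document by namespace URI."""
    root = ET.fromstring(xml_text)
    return Counter(namespace_of(element.tag) for element in root.iter())


if __name__ == "__main__":
    for uri, count in elements_per_namespace(MIXED).items():
        print(uri, count)
```

Splitting the tree by namespace is the easy part; nothing in it says whether the SVG subtree should be rendered in place, ignored, or handed to another component, which is the question raised above.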

Adding timing constraints (using SMIL animation or other advanced SMIL features) is also a new concept and, for the moment, only has implications for the styling model.

Note: The upcoming HyperText CG face-to-face on the Plug-In API is also relevant to the Post XML pipeline. When more than one XML application is handling an XML document, how can they cooperate?

DOM

The DOM Working Group is facing problems that other groups generally don't have: the PSV Infoset is a dynamic model from the DOM point of view. Information items can move from one place to another, attribute values can be changed, etc. A static XML Processor cannot ensure the integrity of the PSV Infoset in a DOM tree. A current opinion is that it is not possible to guarantee the integrity of the derived PSV Infoset without fixing the entire tree ("equivalent to a save+load operation"). The DOM must be able to support the XML applications developed in the W3C. XML applications must not be defined using the DOM Data Model. To achieve this goal, the Core platform must be well defined in terms of a PSV Infoset, and no incompatible extensions should be used. It might be necessary to break backward compatibility in the DOM, but it is not reasonable to do so for each new Level.

Creating a DOM tree is highly dependent on your XML processor. This remark might be obvious but the impact is important: you cannot rely on different XML processors to produce the same DOM tree in memory. If the getElementById method doesn't give the appropriate result, the user will blame the DOM implementation, not the XML specification. Some DOM implementations, such as Xerces from xml.apache.org, address these problems by supporting "as much as possible", but this becomes a real challenge when you start mixing DTDs and XML Schemas (not to mention that this mix is not defined by any W3C specification).
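The getElementById remark can be illustrated with a short sketch. It assumes Python's xml.dom.minidom as the DOM implementation, which is a choice made for this example and not one discussed in this paper; the document content and the lookup helper are hypothetical. The same element is looked up twice, once without any DTD and once with an internal subset declaring the attribute as an ID.

```python
from xml.dom.minidom import parseString

# Illustrative content; "id" is only an ID if something declares it as one.
BODY = '<list><item id="a1">first</item></list>'
DTD = "<!DOCTYPE list [<!ATTLIST item id ID #IMPLIED>]>"


def lookup(xml_text, identifier="a1"):
    """Parse xml_text and look the identifier up through the DOM."""
    return parseString(xml_text).getElementById(identifier)


# No DTD: the processor has no attribute type information to offer.
without_dtd = lookup(BODY)
# Internal subset present: the attribute is known to be of type ID.
with_dtd = lookup(DTD + BODY)

if __name__ == "__main__":
    print("without DTD:", without_dtd)
    print("with DTD:   ", with_dtd)
```

From the user's point of view the call is identical in both cases, so a missing result looks like a DOM bug even though it follows from what the XML processor was given.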

Expectations

Clearly, the W3C cannot revise all its specifications for each new Core technology. The Core platform should be stabilized in the future, and new specifications should be defined against this Core platform. Relying on a negotiation mechanism such as CC/PP would only work for standalone XML applications; it cannot work for cross-namespace ones, contrary to Eric Prud'hommeaux's opinion. I don't expect the Core platform to stabilize before XML Schemas is simplified (XML Schemas 2.0). Hopefully, the TAG will be able to help with this stabilization.

$Id: ProcessingModel-plh.html,v 1.6 2001/06/28 13:34:35 plehegar Exp $