Toward a Simple XML Processing Model

The XML 1.0 specification is good, but not perfect. That's as it should be: after all, the perfect is the enemy of the good. We can only spend so much time perfecting a specification before it becomes too late to be any good.

One of the imperfections in XML 1.0 is the lack of a data model. The W3C nevertheless made it a Recommendation, opting to specify the data model after the fact. Meanwhile, the designers of the DOM and XPath were able to complete their tasks. A careful reading reveals a satisfactory level of consistency between the three specs, but the consistency is not explicitly stated, and it is somewhat subtle: a DOM element node is not the same thing as an XPath element node, and of course, neither is identical to an XML 1.0 element.

The XML Information Set is the post-hoc specification of the data model of XML 1.0 (plus namespaces). It was expected to converge with the data models of DOM and XPath; future work was expected to be based on this Infoset specification. The XML Schema specification is tightly integrated with the Infoset specification, but the XML Query specifications are only indirectly integrated (as of this writing), and the designers of the Canonical XML specification judged XPath to be a more suitable basis.

As a reviewer of many of these specifications, I find the lack of explicit consistency between them increasingly costly:

element information item is a mouthful. But don't confuse it with an XML 1.0 element, nor with either of the two senses (DOM, XPath) of element node.
per the Infoset spec, <aDoc>abc</aDoc> has 3 children -- character information items -- while in the XPath model, it has just one child, a text node. Sometimes I remember why... something about whether whitespace is significant or not... but sometimes I don't remember, and I'm not sure I could find the justification if I looked it up.
namespace declarations are attribute nodes in the DOM but not in XPath.
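
The last two points can be observed with Python's standard library. This is only a sketch under stated assumptions: xml.dom.minidom stands in for the DOM, and xml.etree.ElementTree is used as a rough stand-in for a model that, like XPath, does not report namespace declarations as attributes (it is not an implementation of the XPath data model). The Infoset view is approximated by splitting the text into individual characters; the namespace URI is made up for the example.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# DOM view: the content "abc" is a single text node child.
dom_root = minidom.parseString("<aDoc>abc</aDoc>").documentElement
dom_child_count = len(dom_root.childNodes)

# Infoset-like view (approximation): one item per character.
infoset_like_children = list(dom_root.firstChild.data)

# Namespace declarations: the URI below is a made-up example.
NS_SOURCE = '<aDoc xmlns:x="http://ns.example/x"/>'

# In the DOM, the declaration shows up as an attribute node.
dom_has_decl = minidom.parseString(NS_SOURCE).documentElement.hasAttribute("xmlns:x")

# ElementTree, like XPath, does not list it among the attributes.
et_attributes = dict(ET.fromstring(NS_SOURCE).attrib)

print(dom_child_count, infoset_like_children, dom_has_decl, et_attributes)
```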

The XML Query data model and algebra specifications seemed to introduce a whole new set of terminology, but more recently, it seems that the XML Query data model will be integrated with the next version of the XPath data model. This is promising for consistency and integration, but not necessarily for modularity: the combined data model includes much (all?) of the XML Schema type system. One of my goals for any XML processing model is that at its core, there will be a simple model of an XML document as a tree of elements, attributes, and character data, rooted in the Web of URI space. Perhaps we can factor this part of the XPath/XML Query data model out as a separate specification, or a separate part of the XML Query specification. The Infoset specification would seem to play this role; it is a reasonably compact explanation of the structure of an XML document. But it lacks a specification of constructors, that is, formal mechanisms for building an (abstract) XML document up from its constituent parts.
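
To make the idea of constructors concrete, here is a minimal sketch in Python of such a core model: a document is a base URI plus a tree of elements, attributes, and character data, and it is built up from its parts by plain functions. Every name here (Element, Document, element, serialize) is hypothetical and chosen for illustration; none of them comes from the Infoset, XPath, or XML Query specifications, and namespaces are left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Element:
    """Hypothetical element value: name, attributes, ordered children."""
    name: str
    attributes: Tuple[Tuple[str, str], ...]
    children: Tuple[Union["Element", str], ...]


@dataclass(frozen=True)
class Document:
    """Hypothetical document value: a base URI plus a root element."""
    base_uri: str
    root: Element


def element(name: str, attributes: Optional[Dict[str, str]] = None,
            *children: Union[Element, str]) -> Element:
    """Constructor: build an element from its constituent parts."""
    attrs = tuple(sorted((attributes or {}).items()))
    return Element(name, attrs, tuple(children))


def serialize(node: Union[Element, str]) -> str:
    """Write the abstract tree back out as characters."""
    if isinstance(node, str):
        return escape(node)
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attributes)
    body = "".join(serialize(child) for child in node.children)
    return f"<{node.name}{attrs}>{body}</{node.name}>"


doc = Document("http://oneplace.example/aDoc.xml", element("aDoc", None, "abc"))
text = serialize(doc.root)
print(text)
```

Because the values are immutable and compared structurally, two trees built from the same parts are equal, which is one way (not the only way) to read "abstract" in the paragraph above.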

What is an XML document?

The lack of explicit, formal consistency in XML specifications is also costly in design discussions. A number of design discussions on unrelated topics languished for weeks, suffering from undisclosed differences in understanding of the following questions:

If I write some <stuff/> to a file, then delete the file, then write it again, is that the same XML document, or two different XML documents with the same content? Some regard an XML document as an abstraction, like an integer or a sequence of bytes; by this view, there's just one <stuff/> document, stored in a file twice. Others regard an XML document as a historical/physical artifact, like a protocol message; by this view, there are two XML documents that have the same characters in their document entity.

I haven't found anything in the specs that rules out either view. In the abstract document view, whether something is an XML document is a formally decidable question, whereas in the artifact view, one must perform some experiment to observe the various properties of the physical/historical artifact, and judge the results of that experiment w.r.t. the XML specification. Perhaps we can achieve an interoperable XML processing model without resolving this issue, but I find it annoying and costly.
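
The two views can be contrasted in a few lines of Python. This is a sketch, not a definition: in the abstract view a document is a value, so being an XML document is a predicate on a character sequence (approximated here by a well-formedness check with the non-validating expat parser), and two writes of the same characters are equal. In the artifact view each write is a distinct thing; the Artifact class below is a hypothetical stand-in that compares by identity.

```python
from dataclasses import dataclass
import xml.parsers.expat


def is_well_formed(entity_text: str) -> bool:
    """Abstract view: a decidable predicate on a sequence of characters."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(entity_text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


@dataclass(eq=False)  # eq=False keeps identity comparison
class Artifact:
    """Artifact view (hypothetical): each written file is its own object."""
    entity_text: str


first_write = "<stuff/>"
second_write = "<stuff/>"

# Abstract view: one document, stored twice.
abstract_same = first_write == second_write

# Artifact view: two documents with the same characters.
artifact_same = Artifact(first_write) == Artifact(second_write)

print(is_well_formed(first_write), abstract_same, artifact_same)
```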

The second question is answered in the Infoset and XPath specs, but it does not seem to be widely understood: If I take some <stuff/> from http://oneplace.example/aDoc.xml and copy it to http://anotherplace.example/aDoc.xml, some regard this as the same document in two places, but it's not: it's two documents that have the same characters in the document entity. The base URI is an intrinsic property of an XML document.
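
A small Python sketch shows why the base URI has to count as part of the document. The Document class is hypothetical, and the relative reference "other.xml" is made up for the example; only the two document URIs come from the text above.

```python
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class Document:
    """Hypothetical model: the base URI is part of the document's value."""
    base_uri: str
    entity_text: str


original = Document("http://oneplace.example/aDoc.xml", "<stuff/>")
copied = Document("http://anotherplace.example/aDoc.xml", "<stuff/>")

# Same characters in the document entity...
same_characters = original.entity_text == copied.entity_text

# ...but not the same document, because the base URIs differ.
same_document = original == copied

# The difference is observable: a relative reference inside the
# document resolves to a different resource in each case.
resolved_original = urljoin(original.base_uri, "other.xml")
resolved_copied = urljoin(copied.base_uri, "other.xml")

print(same_characters, same_document)
print(resolved_original)
print(resolved_copied)
```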

Integration: XML Base, XInclude, and XML Schema

An XML processing model should, ideally, address outstanding issues regarding the ordering and dependencies among XML Base, XInclude, and XML Schema.

Conformance to XML Base is specified by reference from various specifications to the XML Base specification. This is clear enough for isolated specifications, but not for integrated, extensible specifications: suppose specification S1 cites XML Base and S2 does not, and S1 has some extensibility hooks where S2-style content is allowed. What do we make of a URI reference from the S2 portion of the document where xml:base appears in the containing S1-style markup? I am not sure that there is a better solution; getting rid of XML Base altogether would resolve this issue, but at an evidently unacceptable cost in the usability of XLink and other XML specifications.
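
The ambiguity can be made concrete with a sketch in Python. The namespaces, element names, the href attribute, and all URIs below are invented for illustration. The helper effective_base is a simplified reading of XML Base (it walks the ancestors and applies each xml:base in turn); it is not a complete implementation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# S1-style wrapper carrying xml:base, with S2-style content inside it.
SOURCE = """<s1:wrapper xmlns:s1="http://s1.example/ns"
            xmlns:s2="http://s2.example/ns"
            xml:base="http://elsewhere.example/dir/">
  <s2:link href="target.xml"/>
</s1:wrapper>"""

DOCUMENT_URI = "http://oneplace.example/aDoc.xml"


def effective_base(elem, parents, document_uri):
    """Apply every xml:base from the root down to elem (simplified)."""
    chain = []
    node = elem
    while node is not None:
        chain.append(node)
        node = parents.get(node)
    base = document_uri
    for node in reversed(chain):
        value = node.get(XML_BASE)
        if value is not None:
            base = urljoin(base, value)
    return base


root = ET.fromstring(SOURCE)
# ElementTree has no parent pointers, so build a child-to-parent map.
parents = {child: parent for parent in root.iter() for child in parent}
link = root.find("{http://s2.example/ns}link")

# Reading 1: the S2 processor honors the xml:base set in S1 markup.
honoring = urljoin(effective_base(link, parents, DOCUMENT_URI), link.get("href"))

# Reading 2: S2 never cited XML Base, so only the document URI counts.
ignoring = urljoin(DOCUMENT_URI, link.get("href"))

print(honoring)
print(ignoring)
```

The two readings name different resources, so two conforming processors could disagree about what the S2 reference points to.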

The order of XInclude and XML Schema processing is currently unspecified; each consumes and produces an XML Infoset; conceivably, they could be applied in either order. It seems most straightforward, technically, to put XInclude processing "before" the rest of XML Schema validation, much the way XML 1.0 entity resolution goes "before" content model validation. This seems to make XInclude deployment dependent on a revision of the XML Schema specification.
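
The effect of the ordering can be sketched with Python's xml.etree.ElementInclude module. Two things here are assumptions made for the example: the documents and the in-memory loader are invented, and toy_validate is only a stand-in for schema validation (the standard library has no XML Schema validator), checking a made-up content model in which doc may contain only para children.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

MAIN = """<doc xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="part.xml"/>
</doc>"""
PART = "<para>included text</para>"


def loader(href, parse, encoding=None):
    """Hypothetical loader: serve the included part from memory."""
    if href == "part.xml" and parse == "xml":
        return ET.fromstring(PART)
    raise OSError(f"cannot load {href!r}")


def toy_validate(root):
    """Stand-in for validation: doc may contain only para children."""
    return root.tag == "doc" and all(child.tag == "para" for child in root)


# Order A: validate first; the validator sees the xi:include element.
before = ET.fromstring(MAIN)
valid_before_inclusion = toy_validate(before)

# Order B: include first, then validate the resulting tree.
after = ET.fromstring(MAIN)
ElementInclude.include(after, loader=loader)
valid_after_inclusion = toy_validate(after)

print(valid_before_inclusion, valid_after_inclusion)
```

With this content model the document is acceptable only when inclusion runs first, which is the analogue of entity resolution going "before" content model validation.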

Dan Connolly, W3C