Why we need a default processing model

and why the Infoset is the key to defining this processing

Daniel Veillard

Daniel Veillard is XML Linking co-chair, a member of the XML Core WG, author and maintainer of the libxml and libxslt XML and XSLT toolkits, and ex-W3C staff now working for Red Hat. http://veillard.com/


Basically what we call XML has grown from the version 1.0 REC with a single optional part — validation — into a relatively complex set of modules, some of them defined as information encoding (namespaces, XML languages, canonical form) and others as information processing (XSLT, XInclude), and some mixing both (Schemas, XPointer).

There is little confusion about the information encoding modules once the details are clarified (basically the role of the Infoset), but there is far more uncertainty when it comes to specifications defined in terms of processing.

Where is the confusion coming from?

Basically, as toolkits come to implement multiple processing APIs, the way the associated specifications intermix is not well defined. Let's take a few specific examples:

XInclude and XSLT:

XInclude and Schemas:

In both cases the specifications, taken in isolation, look complete and can be implemented. But grey areas are discovered when trying to mix them to produce a given processing path (one possible XInclude-then-XSLT ordering is sketched below).
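To make the XInclude/XSLT case concrete, here is a minimal sketch using the libxml2 and libxslt C APIs, in which XInclude is resolved before the stylesheet is applied. The file names are purely illustrative, and nothing in the specifications mandates this particular ordering; it is one possible path among several.

    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/xinclude.h>
    #include <libxslt/xslt.h>
    #include <libxslt/xsltInternals.h>
    #include <libxslt/transform.h>
    #include <libxslt/xsltutils.h>

    int main(void) {
        /* Parse the document, then resolve XInclude before transforming:
         * one possible processing path, not the only conceivable one. */
        xmlDocPtr doc = xmlParseFile("document.xml");
        if (doc == NULL)
            return 1;
        if (xmlXIncludeProcess(doc) < 0) {
            xmlFreeDoc(doc);
            return 1;
        }

        /* Apply the stylesheet to the already-included tree. */
        xsltStylesheetPtr style =
            xsltParseStylesheetFile((const xmlChar *) "style.xsl");
        if (style == NULL) {
            xmlFreeDoc(doc);
            return 1;
        }
        xmlDocPtr result = xsltApplyStylesheet(style, doc, NULL);
        if (result != NULL) {
            xsltSaveResultToFile(stdout, result, style);
            xmlFreeDoc(result);
        }

        xsltFreeStylesheet(style);
        xmlFreeDoc(doc);
        xsltCleanupGlobals();
        xmlCleanupParser();
        return 0;
    }

Swapping the two steps (transform first, include afterwards) is just as implementable, which is exactly where the grey area lies.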

Other examples include:

Restricting the processing path is not enough

The XML 1.0 REC tried to be clear in terms of processing model; it defined two canonical types of processing, well-formedness-checking parsers and validating parsers, and I guess the intent was to have only two classes of processors whose processing model was clearly defined. What happened instead is that the number of classes of XML processors quickly became larger. Some did the minimal processing, some were fully validating, but a number of implementations diverged: some fetched external entities but did not validate, and sometimes only local (i.e. on-disk) resources were fetched.

Also, even within the strictly defined processing model of a validating processor, it is reasonable to break the processing rules to satisfy specific needs. For example, an XML editor needs to be able to check the conformance of a document against a given DTD even if the document does not carry a DOCTYPE, or if the document, having been loaded and modified, needs to be (re)checked. Providing those extra steps, which are not specified formally, is actually very important for acceptance as a reasonable toolkit.
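As an illustration, libxml2 already exposes an entry point for this kind of out-of-band check. The sketch below, with purely placeholder file names, validates an already-parsed document against a DTD that the document itself never references:

    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/tree.h>
    #include <libxml/valid.h>

    int main(void) {
        /* Parse the instance and the DTD separately: the document
         * carries no DOCTYPE pointing at this DTD. */
        xmlDocPtr doc = xmlParseFile("edited-document.xml");
        xmlDtdPtr dtd = xmlParseDTD(NULL, (const xmlChar *) "rules.dtd");
        if (doc == NULL || dtd == NULL)
            return 1;

        /* Run the validator over the in-memory tree, outside the
         * "validating parser" processing model of the XML 1.0 REC. */
        xmlValidCtxtPtr ctxt = xmlNewValidCtxt();
        int valid = (ctxt != NULL) && xmlValidateDtd(ctxt, doc, dtd);
        printf("document is %s\n", valid ? "valid" : "invalid");

        xmlFreeValidCtxt(ctxt);
        xmlFreeDtd(dtd);
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return valid ? 0 : 1;
    }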

It is also telling that I have received very few queries about problems related to the canonical processing model. Most use cases can live with a predefined path, and it is a good thing that one exists: a common ground is precisely what allows us to explain what an extension is, and in which specific cases it should be used.

So even if a formal processing model is defined, at the software level it is hard to always comply with it. But having a clear description of the "canonical" processing helps a lot in building a common understanding, making the set of specifications more useful to general users.

Is the "stackable" model really the solution?

The solution usually suggested is to make processing module interfaces generic both for input and output, allowing the user to plug modules together to build arbitrary paths. This mechanism has been used with more or less success in various areas:

So globally it seems that stackable components based on predefined input/output APIs are one possible solution, but once defined, such a solution puts serious constraints on evolution and on the ability to change implementations. The cost of defining this API in a language-neutral way must not be underestimated, either.
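To be explicit about what "stackable" means here, the following is a purely hypothetical sketch, not an existing API, of the kind of generic input/output contract every module would have to honour; all the names are invented for illustration:

    #include <stdio.h>

    /* Hypothetical stackable stage: every module consumes an abstract
     * stream of infoset-like events and forwards it downstream, so
     * stages (XInclude, validation, XSLT, ...) could be chained in an
     * arbitrary order.  None of these names are a real library API. */
    typedef struct Event { const char *kind; const char *data; } Event;
    typedef struct Stage Stage;
    struct Stage {
        const char *name;
        void (*handle)(Stage *self, Event *ev); /* process, then forward */
        Stage *next;                            /* downstream stage, or NULL */
    };

    /* A trivial pass-through stage: a real stage would rewrite the
     * stream instead of merely logging it. */
    static void pass_through(Stage *self, Event *ev) {
        printf("%s: %s '%s'\n", self->name, ev->kind, ev->data);
        if (self->next != NULL)
            self->next->handle(self->next, ev);
    }

    int main(void) {
        Stage validate = { "validate", pass_through, NULL };
        Stage xinclude = { "xinclude", pass_through, &validate };

        Event ev = { "element", "doc" };
        xinclude.handle(&xinclude, &ev); /* events flow xinclude -> validate */
        return 0;
    }

The constraint mentioned above is visible even in this toy version: every future module is locked into whatever the Event structure can carry.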

The addition of XML namespaces increased the variety of XML processing models; this is a good indication of the kind of trouble we may have to face with a given solution.

Defining the processing model in terms of data transformation

This is the approach taken by the XInclude and Schemas specs. It has the good property of avoiding the need for a programming interface definition; basically, it uses the Infoset abstract data model to express the changes performed. However, as one of the examples above showed, this does not in itself define everything needed for interfacing with other modules.

The key point is the extensibility of the data model. The Infoset has been changed from a nearly canonical definition of the information present in a parsed document to a predefined set of data that may be provided as a result. From this point of view, the Infoset acknowledges that the representation given may not be complete.

The Infoset also defines only the core items that may be found in the data model; both Schemas and the XML Style task force provide or suggest possible extensions to this core set. At that point, if the Infoset is to be used to define processing, the relevant specification will have to specify the handling in a very complete way:

The corollary of specifying the processing model this way is that when defining non-core Infoset items one must be careful about the scope of those items of information. One must realize that adding them to the information set means they will potentially be passed through other layers which do not recognize them.
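A rough sketch of what that implies in practice: if each information item carries, next to its core properties, an open-ended list of named extension properties, then a layer that does not recognize a property (say, a schema-assigned type) still has to decide whether to carry it through or drop it. The structure and names below are illustrative only, not a real Infoset API:

    #include <stdio.h>

    /* Illustrative only: an information item with fixed core properties
     * plus an open-ended list of extension properties that downstream
     * layers may or may not recognize. */
    typedef struct ExtProperty {
        const char *name;                /* e.g. a schema-assigned type */
        const char *value;
        struct ExtProperty *next;
    } ExtProperty;

    typedef struct InfoItem {
        const char *local_name;          /* core Infoset property */
        const char *namespace_name;      /* core Infoset property */
        ExtProperty *extensions;         /* non-core, possibly unknown */
    } InfoItem;

    /* A downstream layer that only understands the core properties: it
     * must choose a policy for extensions it does not recognize; here
     * it simply passes them through untouched. */
    static void process_item(const InfoItem *item) {
        printf("element {%s}%s\n", item->namespace_name, item->local_name);
        for (const ExtProperty *p = item->extensions; p != NULL; p = p->next)
            printf("  unrecognized property %s=%s (passed through)\n",
                   p->name, p->value);
    }

    int main(void) {
        ExtProperty psvi = { "type definition", "xs:date", NULL };
        InfoItem item = { "birthdate", "http://example.org/ns", &psvi };
        process_item(&item);
        return 0;
    }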

Conclusion

First, not everything defined at the XML Activity level defines a processing model. This should remain so; it is fine to define a grammar carrying concepts without needing to explain how those concepts should be processed.

Second, it seems to me that a canonical Processing Model for XML should be provided. But I would not make it a fully normative piece; a NOTE would be fine. Since I expect that deviations from this canonical path will be needed, I would not treat them as violations of the standard.

Third, expressing this canonical processing model should not rely on a programming interface, neither an existing one nor one to be defined. It should be based on the data model, and the Infoset is the right tool for this. What needs to be made clear is that the processing must be defined not only on the core Infoset properties but also on a possibly larger or smaller set. We do not know today all the Information Set properties that the tools implemented now will have to process five years from now. We can build a long-standing model, but it must be ready for extensions, otherwise it is not worth the effort.

Last but not least, I will not try to push for a given canonical processing path. There are some obvious things (for example, since XInclude is a replacement for entities, it should be processed before XPath), but this should be discussed broadly. And we should never lose track of the point that it would be a guideline, not a conformance requirement.

Daniel Veillard