Why we need a default processing model

and Why Infoset is the key of defining this processing

Daniel Veillard

Daniel Veillard is XML Linking co-chair, member of the XML Core WG, author and maintainer of the libxml and libxslt XML and XSLT toolkits, ex W3C staff now working for Red Hat. http://veillard.com/

Basically what we call XML has grown from the version 1.0 REC with a single optional part — validation — into a relatively complex set of modules, some of them defined as information encoding (namespaces, XML languages, canonical form) and others as information processing (XSLT, XInclude), and some mixing both (Schemas, XPointer).

There is little confusion about information encoding modules once the details are clarified (basically the role of the Infoset), but there is far more uncertainties when it comes to specfications defined in terms of processing.

Where is the confusion coming from?

Basically as toolkits are implementing multiple processing APIs, the way the associated specifications intermix is not well defined. Let's take a few specific examples:

XInclude and XSLT:

I added a --xinclude flag to my XSLT processor interface to allow running XInclude on the document being transformed before the transformation phase (this is an equivalent of the entity substitution required by the XPath processing model). After a while I got a bug report because when including other documents using the XSLT document() function, the Xinclude processing was not done on those.
=> the point here is that the apparently clean model "do XInclude first, then do XSLT" was only one part of the picture.

XInclude and Schemas:

the problem has been raised recently about what should be done with PSVI infoset addition resulting from a Schemas processing when running XInclude on the resulting Infoset. Some of these additions were making sense in the original document but may not be correct in the document resulting from the XInclude processing.

In both cases the specifications taken in isolation look full, complete and can be implemented. But grey areas are discovered when trying to mix them to produce a given processing path.

Other examples include:

mixing XLink and XSLT transformations
doing DTD validation on an Infoset
mixing DTD validation and Xinclude processing
should Xinclude be applied on resources before XPointer computations?
since XInclude can use XPointer should the recursive behaviour be applied before the XPointer computation ?
etc.

Restricting the processing path is not enough

The XML-1.0 REC tried to be clear in terms of processing model; it defined two canonical types of processing — well-formed based parser and validating parsers —, and I guess the intent was to have only two classes of processors whose processing model was clearly defined. What happened instead is that the number of classes of XML processors quickly became larger. Some were doing the minimal processing, some were fully validating, but a number of implementations diverged, some were fetching external entities but not validating, sometime only local (i.e. on-disk) resources were fetched.

Also, even within the strictly defined processing model of a validating processor, it is reasonnable to break the processing rule to satisfy some specific needs, for example an XML editor will need to be able to check the conformance of the document w.r.t. a given DTD even if the document does not carry a DOCTYPE, or if the document, having been loaded and modified, needs to be (re)checked. Providing those extra steps, which are not specified formally, is actually very important to get acceptance as a reasonable toolkit.

It is also clear that I got very few queries about problems related to the canonical processing model. Most use cases can live with a predefined path and this is a good thing that it exists, the existence of a common ground is precisely what allow to explain what is an extensions, and in what specific cases it should be used.

So even if there is a formal processing model defined, at the software level it is hard always to comply with it. But having a clear description of the "canonical" processing helps a lot in building a common understanding, making the set of specifications more useful to the general users.

Is the "stackable" model really the solution?

The solution usually suggested is to make processing module interfaces generic both for input and output, allowing the user to plug modules to build arbitrary paths. This mechanism has been used with more or less success in various areas:

Unix files and pipes interfaces — well, the concept is still around and in wide use: it should be considered a successful example. The only restriction to the picture is that it works well only if the kind of information exchanged pertains to the core ASCII characters, if one starts diverging from this character set then one can expect troubles.
Network protocol stacks, maybe a less conclusive example, though there have been serious implementations (STREAMS), and quite a lot of research done in this area. Having rigid API interface between modules allowed people to actually build complex protocols easilly, but this was an uncommon enough use that it wasn't worth paying the price (especially in term of performances) for general use.
Closer to our topic, the SAX interface is widely successful, this is a callback based API, lots of tools use it and it is in very wide use. However, this is an API; it is specified for Java but if you start using a different languages, well, suddenly a lot of problems arise, and in these cases the APIs don't match perfectly. The second problem is that the API needs to be maintained and had to evolve; this is a relatively costly solution.

So globally it seems that stackable components based on predefined input/output APIs are one possible solution, but once defined such a solution puts serious constraints on evolution or the ability to change the implementations. The cost of defining this API in a language-neutral way must not be underestimated, either.

The addition of XML namespaces increased the variety of XML processing model; this is a good indication of the kind of troubles we may have to face with a given solution.

Defining the processing model in term of data transformation

This is the approach taken by the XInclude and Schemas specs. It has the good property of avoiding the need for an programming interface definition, basically it uses the Infoset abstract data model to express the changes operated. However as one of the examples provided before, this doesn't in itself defines everything needed for interfacing with other modules.

The key point is the extensibility of the data model. The Infoset has been changed from a nearly canonical definition of the information present in a parsed document to a predefined set of data that may be provided as the result. From this point of view the Infoset acknowledges the point that the representation given may not be complete.

The Infoset also defines only the core items that may be found in the data model; both Schemas and the XML Style task force provide or suggest possible extensions to this core set. At that point if the Infoset is to be used to define processing, the relevant specification will have to specify the handling in a very complete way:

What items are required to be present in the provided infoset? For example, XInclude requires attributes to be exposed, in order to be able to detect the include elements and build the processing.
How are items and properties in the Core Infoset processed (this is the basic part)?
How are items and properties outside the Core Infoset processed if found and recognized? The definition of those will be done though a normative reference to the relevant specification.
How are unrecognized items and properties processed? In "conservative" transformations it may make sense to carry them along. In other cases it may make sense to filter them out or raise an error.

The addendum of such a way to specify the processing model is that when defining non Core Infoset one must be careful about the scope of those items of information. One must realize that adding then to the information set means that they will potentially be processed by other layers which will process them unrecognized.

Conclusion

First, not everything defined at the XML activity level is defining a processing. This should be kept, it's fine to define a grammar carrying concept without the need to explain how those should be processed.

Second it sounds to me that a canonical Processing Model for XML should be provided. But I would not take it as a fully normative piece, a NOTE would be fine, as I expect that deviations to this canonical path will be needed, I would not treat them as a violation to the standard.

Third expressing this canonical processing model should not rely on a programming interface, neither an existing one nor one to be defined. It should be based on the data model — and the Infoset is the right tool for this. What needs to be made clear is that the processing must be defined not only on the Core Infoset properties but also on a possibly larger and smaller set. We don't know now all the Information Set properties that the tools implemented now will have to process five years from now. We can build a long-standing model but it must be ready for extensions, otherwise it's not worth the effort.

Last but not least, I won't try to push for a given canonical processing path, there is some obvious things (like XInclude being an replacement for entities it should be processed before XPath), but this should be discussed broadly. And we should never lose track of the point that it would be a guideline, not a conformance requirement.

Daniel Veillard