Many XML workflows involve the "chaining" of XML technologies. Indeed, some specifications, such as XInclude, were developed with the intention of serving as pre-processing steps or layers upon which other XML specifications can rely. In all these cases, a number of technical issues must be resolved before such chains of processes can work together beneficially. Still, we envision a world in which XML processes are chained together in this way, where some manipulations result from W3C standard operations, some from operations standardized elsewhere, and some from operations specific to a particular application--all working together to produce the final, correct result.
In this paper we outline the issues surrounding chained XML processing that involves transclusion, transformation, and validation against schemata and document infosets. We do not intend to propose solutions to these issues, nor is this paper meant to be a definitive list of all of them. Rather, we concentrate on the more fundamental issues in order to lay a foundation for chained XML processing.
In certain cases the issues represent a philosophical question of which idiom one chooses. Where we have formed an opinion, we have stated it as "Our Position" to set it off from the issue in general.
It may be tempting to resolve questions about the interaction of chained XML technologies by declaring a fixed ordering: for example, that XInclude must be processed before XML Schema and after XPointer resolution. This order would then be defined in some all-encompassing, governing specification.
In contrast, one could say that this order is defined by the using application or user. This gives the application the choice of order without violating any specification. A consequence is that the original author of an XML instance may not get the desired result within some receiving application.
We believe any such fixing of the processing order would be a mistake. One of the main justifications for this position is that the advantage of XML technologies, and in particular of defining certain XML technologies using XML, lies in their flexibility. A fixed order would sacrifice that flexibility and install a rigid processing framework in its place. For example, if XInclude were fixed in the processing sequence between XML parsing and schema validation, there would be no compelling reason to adopt it over external entities beyond a preference of syntax.
When XML processing steps are chained in flexible ways, infosets flow from one step to the next, each step building on what happened before it. Flexibility of chaining leads, as a matter of reasonable software engineering practice, to permitting the components for each step to operate as independently of global context as possible.
A general consideration is that, given an infoset and an XML processing step, it must be possible either (a) to perform the process, or (b) to decline because the infoset lacks what the step needs, and, in either case, (c) to tell the difference easily.
There ought to be a way to determine immediately, and definitively, whether an infoset carries the extensions a step needs: for example, a step that requires the PSVI must be able to check that it is present. It would be best if there were a standardized way of inquiring which infoset extensions are available in a particular infoset. In the current draft of the XInclude specification, for example, XInclude processing may damage the PSVI, but it leaves no clue that it has done so, or even that it has occurred; an independent software component engaging in further processing therefore has no way of knowing whether it will get consistent results or whether re-validation is necessary.
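Such a feature test might look like the following minimal sketch, in which an infoset carries an explicit record of its extensions and a step either proceeds or declines with a definitive answer. The class and function names here are hypothetical, not part of any specification:

```python
class Infoset:
    """A parsed document plus a record of which extensions it carries."""
    def __init__(self, root, extensions=None):
        self.root = root                         # e.g. a parsed element tree
        self.extensions = set(extensions or [])  # e.g. {"psvi", "xml-base"}

class MissingExtensionError(Exception):
    pass

def require_extensions(infoset, needed):
    """Fail fast, with a definitive answer, if a required extension is absent."""
    missing = set(needed) - infoset.extensions
    if missing:
        raise MissingExtensionError("infoset lacks: %s" % sorted(missing))

# A step that needs the PSVI checks for it before doing anything:
doc = Infoset(root=None, extensions={"xml-base"})
try:
    require_extensions(doc, {"psvi"})
except MissingExtensionError as e:
    print(e)  # the step can decline cleanly instead of misbehaving
```

Under this model, a process such as XInclude that damages the PSVI would also remove "psvi" from the extension record, so a later step could detect that re-validation is needed.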
Reasonable software engineering practice demands that components implementing particular XML processing steps that might be chained together not be required to know about global context in order to operate. Maintaining independence means putting enough into the infoset to ensure that each component has what it needs to operate or fail reliably.
Steps that modify an infoset should either (a) add to the infoset (leaving the original intact and adding new material), or (b) modify it in a way that is consistent with the infoset one would get by directly parsing the equivalent XML file. The point is that you want to be able to put an infoset through a subprocess without having to do different things depending on where it came from; each step should be treatable as independently as possible.
The text in the Infoset specification that countenances synthetic infosets that violate consistency constraints is actively harmful and should be removed.
Choosing option (b) means stripping out any extension properties in the infoset that you do not understand; otherwise you risk violating the extension's consistency constraints. Choice (a) is better, because it allows for greater independence and avoids repeating steps needlessly.
Note: we distinguish here between creating a new infoset and modifying an existing one in situ. XInclude and XML Schema modify; XSLT creates new ones.
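Option (a) can be illustrated with a minimal sketch in which a step records its results in a separate property layer keyed by node, leaving both the parsed tree and any earlier extension properties untouched. The property names here are imagined for illustration:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<order><item>widget</item></order>")

# An earlier step's extension properties (imagined PSVI-like annotation):
properties = {id(doc): {"validity": "valid"}}

def annotate_languages(root, props):
    """Additive step: attach a new property without disturbing existing ones."""
    for node in root.iter():
        props.setdefault(id(node), {})["inherited-lang"] = "en"
    return props

annotate_languages(doc, properties)
# The earlier annotation survives alongside the new one:
print(properties[id(doc)])  # {'validity': 'valid', 'inherited-lang': 'en'}
```

Because the step only adds, a downstream component that understands neither extension can still process the tree, and neither extension's consistency constraints are violated.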
In general, we envision a world where there are chains of infoset manipulations of various sorts, some of which are W3C standards and some of which are not. Means of accessing these extended infosets are necessary, as are means of referring to them as whole units for feature testing. The DOM provides (or will provide) for accessing specific extended infosets, such as the PSVI, but not for infoset extensions in general. Where many extensions may be applied, name collisions are inevitable.
We believe it is useful to have a generic extended infoset API in the DOM. Steps should be taken to provide the means for avoiding name collisions, by providing a standard way of "namespacing" property names and whole bundles of properties (extensions, such as the PSVI). Whether these "namespaces" are XML namespaces, or something more akin to the identification of extensions in XSL or SAX, matters less than having them at all, although giving them some relation to XML namespaces opens interesting possibilities with respect to infoset reflection into XML.
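The idea can be sketched by qualifying extension property names with a URI, much as element names are qualified by XML namespaces, so that independently developed extensions cannot collide. The URIs and property names below are assumptions for illustration only:

```python
PSVI_NS = "http://www.w3.org/2001/XMLSchema"   # assumed bundle identifier
ACME_NS = "http://example.com/acme-extension"  # hypothetical extension

def prop(ns, name):
    """Qualify a property name in Clark notation, {uri}local."""
    return "{%s}%s" % (ns, name)

node_properties = {
    prop(PSVI_NS, "validity"): "valid",
    prop(ACME_NS, "validity"): "checked-by-acme",  # same local name, no clash
}

def extensions_present(props):
    """Feature-test by namespace: which whole bundles does this node carry?"""
    return {key.split("}")[0].lstrip("{") for key in props}

print(sorted(extensions_present(node_properties)))
```

Namespacing whole bundles in this way also supports the feature testing argued for above: a component can ask for a bundle by its identifier rather than probing individual properties.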
In this vision of chained infoset processing, the output infoset from one step becomes the input infoset to the next. Yet many specifications insist that the input be identified by a URI. What is the URI of a partially processed infoset? High performance processing demands that intermediate results not have to be written to a file or accessed through intermediate dereferencing of meaningless URIs.
Let infosets be infosets. We do not believe a partially processed infoset has a URI or needs one. Specifications need to acknowledge and accommodate infosets as input that never came from a web server or a file, or that did not come from one recently (for performance reasons), and infosets as output that will never hit a file system. The XML Schema specification was carefully constructed so that such applications are possible; we consider it a good model to follow.
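As a concrete illustration, Python's standard-library XInclude processor accepts a caller-supplied loader, so inclusion can chain entirely in memory: the included "documents" below are intermediate results keyed by arbitrary tokens, and nothing is dereferenced from a file system or web server. This is a sketch of the idea, not a recommendation of any particular API:

```python
import xml.etree.ElementTree as ET
import xml.etree.ElementInclude as EI

# Outputs of earlier steps in the chain, keyed by arbitrary tokens (no URIs):
intermediates = {
    "chapter1": ET.fromstring("<chapter>Generated earlier in the chain</chapter>")
}

def memory_loader(href, parse, encoding=None):
    """Resolve xi:include references against in-memory intermediate results."""
    if parse == "xml":
        return intermediates[href]
    raise KeyError(href)

doc = ET.fromstring(
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter1"/></book>'
)
EI.include(doc, loader=memory_loader)
print(ET.tostring(doc, encoding="unicode"))
```

The output tree can be handed directly to the next step in the chain; at no point does an intermediate result need a URI or a trip through the file system.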
As steps in a chained process are completed, the traceability of the origin of certain generated results may become important (i.e., what caused the result to be generated: which xinclude element? where?). For example, if an xinclude element causes an inappropriate element to be included in the resulting document, and that element causes an error further down the chain, there should be enough remnants in the resulting infoset to understand what originated the error--essentially, identifying which process applied and where. While processing steps should in general be as independent as possible, handling exceptional conditions requires traceability to the source of the problem. Patching up infosets after applying processes that did not understand a certain extension requires knowing what those processes were and what they did.
Ensure that W3C processes that manipulate infosets leave behind sufficient traces of their actions that traceability is possible. We cannot make arbitrary non-standard infoset processing abide by such rules, but we can show leadership and provide the tools to do so.
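One minimal sketch of such traces: each step appends a record to a processing history carried with the infoset, so that an error discovered later can be attributed to the step and location that introduced the offending content. The step names and locations below are invented for illustration:

```python
processing_history = []

def record_trace(step, detail):
    """Leave a remnant of a processing action in the infoset's history."""
    processing_history.append({"step": step, "detail": detail})

# An imagined XInclude step notes what it pulled in and where:
record_trace("xinclude",
             "replaced xi:include at /book/chapter[2] with intermediate 'chapter1'")

# A later validation step can now report the origin of a bad element:
origin = processing_history[0]
record_trace("schema-validation",
             "element 'chapter' invalid; content introduced by step '%s'"
             % origin["step"])
print(processing_history[-1]["detail"])
```

The same history also tells a repair step which earlier processes did not understand a given extension, so their effects can be patched up.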
When schema validation outcomes are involved in a chain of processes, a number of tricky issues arise. Not only does the availability of a schema change the resulting infoset; user options as to what should and should not be validated affect it as well.
The following sections are a non-exhaustive list of issues to be considered:
When a document is processed (or an infoset is re-processed), how does one control processing with respect to the availability of a schema? There may be many places from which a schema is available. Considerations of what is appropriate given the source, the locale, and system or user options must be taken into account.
The XML Schema recommendation defines the infoset properties that are added when schema processing is applied. But it remains unclear what happens if schema processing is chained, applying different schemata covering the same namespaces. It is by no means obvious that the correct answer is to completely replace one PSVI with another. For example, consider the case where, at one point in the life cycle of the infoset, a schema validation episode applied a schema in which the contents of the userExtension element were left as a wildcard with validation skipped. Later in the history of that infoset, processing determines which specific user extension schema to apply in that spot (a schema with no constraints on anything else), and another schema validation episode ensues. A plausible argument could be made that the PSVI for the rest of the document is still useful and should be retained.
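The alternative to wholesale replacement can be sketched as a per-node merge, in which a later validation episode overrides only the nodes it actually assessed. The paths and property values below are imagined stand-ins for real PSVI properties:

```python
# Imagined per-node PSVI from the first episode: userExtension was skipped.
first_episode = {
    "/doc":               {"validity": "valid"},
    "/doc/userExtension": {"validity": "notKnown"},  # wildcard, skip
}

# The second episode applied only the user-extension schema:
second_episode = {
    "/doc/userExtension": {"validity": "valid"},
}

# Merge: the later episode wins only where it made an assessment.
merged = dict(first_episode)
for path, psvi in second_episode.items():
    merged[path] = psvi

print(merged["/doc"], merged["/doc/userExtension"])
```

Whether such merging is legitimate depends, of course, on whether the two episodes' schemata are consistent for the nodes they share; the sketch only shows that retention is mechanically straightforward.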
When an XML document is processed against a schema and the PSVI is produced, one can conceive of a step in a chain of processes in which the simple-typed information is manipulated or produced and must then be reproduced in a lexical form within an XML document. In this case, there should be control over the formatting of these types.
NOTE: While a minor point, this is important for readability and, in some cases, for the efficiency of subsequent applications.
XML processing chains frequently involve transformations that accomplish aggregation and disaggregation. In many such cases, at any point in the chain, there is the issue of multiple input and output documents. How will these be handled within a chain of processes?
A number of other issues then arise. The main issue, in the case of disaggregation, is what happens at the next step in the chain: are sub-chains initiated, or does the next step receive all the resulting documents?
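The two strategies can be sketched side by side; the step names and documents here are purely illustrative:

```python
def disaggregate(doc):
    """An imagined step that splits one document into several."""
    return ["%s-part%d" % (doc, i) for i in range(3)]

def next_step(doc):
    """An imagined per-document follow-on step."""
    return doc.upper()

results = disaggregate("report")

# Strategy 1: initiate a sub-chain per resulting document.
per_document = [next_step(d) for d in results]

# Strategy 2: the next step receives the whole set of results at once.
def next_step_on_set(docs):
    return [next_step(d) for d in docs]

whole_set = next_step_on_set(results)
print(per_document == whole_set)
```

The two strategies coincide here only because the follow-on step has no cross-document behavior; a step that aggregates, sorts, or cross-references its inputs would force the second strategy.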
Also, in general, what happens when a transformation result is not necessarily an XML result? This is certainly quite possible with XSLT.
The issue of referring to intermediate results comes into play here as well: The document() function in XSLT requires a URI, but we envision situations where the XSLT step is being applied to a set of intermediate results, none of which has a URI.
A chain of processes must "know" what to do when errors occur. While the most draconian position is simply to stop, there are clearly many other ways of dealing with errors. Some conditions, in fact, may be errors to the overall application but a valid result for the chained processes; this is the position taken by XML Schema processing.
It is also important to distinguish between the errors and error handling that happen inside a process being chained and the error handling of the chain itself. That is, there are two questions: what does the specific process (or standard) say about errors and error handling, and what does the chaining language itself say? For example, what should XInclude say about error processing, and what should the chaining process do about those error conditions once they occur?
For the process itself, the simplest question is: what is an error? Is it invalid input or output at some point in the chain? Or is it an invalid state within the chain? These must all be defined before error handling can be discussed.
Subsequently, for the governing chain, what happens when there is an error? Some errors may be classified as necessitating draconian measures because of conformance requirements, while others may merely need to be reported somewhere. One may also want to cascade to an alternative process based on the error. For example, a web server should probably always respond with a message regardless of the success of the process; thus any error should be caught and, at least, result in the generation of an error message to be returned by the service.
This leads to the questions of to whom errors are reported and at what level. It may be sufficient for some errors to be reported to the hosting system. In other cases it may be necessary that they be under user control, so that the error can be corrected or processed within the chain to produce an alternative result.
We believe that we need both proper constraints on what a process should dictate about errors and controls at the chaining level for errors. That is, a single process should not dictate that processing should stop. Instead, it should signal the error and continue, if possible, letting the controls within the chain dictate what happens next.
With that said, the implication is that specifications governing particular processes (e.g., XInclude) should not dictate that processing stop. They must be able to signal a severe error (or an error in general) and let the chain's controls decide what to do. In the simplest of cases, halting the chain would result.
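The division of responsibility can be sketched as follows: a step signals errors rather than halting, and a chain-level policy decides whether to stop, substitute a fallback, or merely record the problem. All names here are hypothetical:

```python
class StepError(Exception):
    """Signalled by a step; the step itself never decides the chain's fate."""
    def __init__(self, step, message):
        super().__init__("%s: %s" % (step, message))
        self.step = step

def run_chain(doc, steps, policy):
    """Run named steps in order; the chain's policy decides what errors mean."""
    errors = []
    for name, step in steps:
        try:
            doc = step(doc)
        except StepError as e:
            errors.append(e)
            if policy(e) == "halt":   # the chain controls, not the step
                break
            # otherwise continue with the document as it stood before the step
    return doc, errors

def failing_include(doc):
    raise StepError("xinclude", "resource not found")

def upper(doc):
    return doc.upper()

# A lenient policy: record the failure and keep going.
doc, errors = run_chain("text",
                        [("inc", failing_include), ("up", upper)],
                        policy=lambda e: "continue")
print(doc, len(errors))  # TEXT 1
```

Swapping in `policy=lambda e: "halt"` gives the draconian behavior, but as a decision of the chain, not a mandate of the failing step's specification.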
In this paper we have outlined a number of areas where serious issues need to be addressed in the architecture of chained XML processing. Were we to forge ahead with some seemingly simple solution, we would probably tie our hands when dealing with more complex situations in the future. Thus, it is important to discuss these issues and develop a solid foundation before settling on a solution.