XML Processing Pipeline Model

Author: Eric Prud'hommeaux, W3C

Status of this document

This document is an individual's position paper for the XML Processing Model Workshop and does not necessarily represent the position of the W3C.

Abstract

emergent web of interdependence
task-specific module interactions are defined with an eye to technical feasibility
need a model to express the processing parameters
DTD validation, schema validation, XInclude, XSLT and XBase all manipulate some form of the infoset and therefore must interact in ways that are predictable and consistent between any parties expecting interoperability
all the operations may even occur more than once.

This document describes the motivation for an XML Processing Pipeline Model (henceforth XPPM) to describe the interrelations and interdependencies of protocols and markup languages (henceforth modules).

As individuals and organizations define new XML modules, the interdependencies of the modules become exponentially harder to enumerate within the module definitions. Further, as system architects make use of these modules, they are motivated to exploit the tools with more of an eye to functionality than to adherence to the normative definitions for a module. While a module like SVG defines its use of the XBase grammar, it is unrealistic to assume that XInclude will not be used in processing some SVG documents. It is also unrealistic to assume that system engineers will heed the words of the XML 1.0 recommendation and apply DTD validation before any transformations have been made to the document. Such document pipelines are currently "outside the law" and at risk of being non-interoperable.

A formal description of model processing dependencies will free module architects from enumerating module interactions, eliminating a form of overconstraint that threatens to alienate many real-world document manipulation processes. This model must describe all manipulations to the infoset that may occur from processes like DTD validation, schema validation, XInclude and XSLT. Additionally, this infoset may not be in final form just because it is delivered to an application like HTML or MathML. The model must make it feasible to control how generic XML processors navigate a growing web of module interdependence.

Status Quo

I will verify the accuracy of the assertions below and am very interested in feedback.

module architects are expected to define other permissible modules
- SVG uses XBase
- SMIL does not
- XHTML sort of does (uses HTML transliteration analogous to XBase)
- MathML does not
- nobody uses XInclude (perhaps too new or too hard to define without breaking out into separate pipeline specification).

XInclude and XBase are two very different modules that make changes to the infoset that will affect the behavior of modules later visiting that infoset. As it is currently the responsibility of the module architect to enumerate interactions with compatible modules, none of SVG, SMIL, XHTML or MathML has defined interactions with the relatively new XInclude. Of the list, only SVG has a defined use of XBase. XHTML 1.0 uses a transliteration of the HTML base mechanism which will probably evolve towards XBase. It may be prohibitively difficult for any of these modules to describe the potential interactions with XInclude without delving into a model a bit more specialized than XPPM.

Heterogeneous Documents

heterogeneous documents will tend to import the functionality of all of the component modules.

A document that consists of multiple modules that would appear to be "final" will need support from a processor that can handle all of the modules currently defined as interacting with any of the "final" modules. For instance, an XHTML document with embedded SVG will require a processor that can support XBase. The temptation on the part of document designers will be to leverage off these auxiliary modules even in the "final" modules for which the behavior is undefined. Thus, by including a token bit of SVG, a document provider can leverage off XBase even in the XHTML portion of the document. It is not in the processor implementor's best commercial interest to specifically disable these features according to the portion of the document. This further supports the argument to move the interactions out of the module definitions and into a separate, orthogonal model.

Packaging Applications

XML Signatures and XML Protocol may, in the future, leverage off infoset-manipulating modules and may wish to protect their payload
one solution: define equivalence and act on an alias name in the wrapper.

Modules specifically designed to enclose arbitrary XML data, for instance XML Signatures or XML Protocol, may wish to leverage off modules to manipulate the portion of the infoset germane to the packaging module without affecting the payload (the packaged XML elements). One option is to "alias" the desired transformation elements in a "safe" space in the packaging module's namespace. For instance, XML Signatures could provide a portion of its namespace, "http://www.w3.org/2000/09/xmldsig#", for use by transformation modules. A document provider could then leverage off this mapping, using an element like <xlink href="includeMe2" xmlns="http://www.w3.org/2000/09/xmldsig-imports#http%3A%2F%2Fwww.w3.org%2F2001%2FXInclude" foo="replace"> to have a downstream processor invoke an XInclude transformation on just the elements in the "http://www.w3.org/2000/09/xmldsig-imports#http%3A%2F%2Fwww.w3.org%2F2001%2FXInclude" namespace. Such an invocation requires no knowledge of XInclude on the part of XML Signatures, only that the "http://www.w3.org/2000/09/xmldsig-imports#" namespace be defined for XML Signatures.
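The alias namespace in the example above appears to be the "imports" base followed by the percent-encoded native namespace. A minimal sketch of that mapping, assuming this reading is right, could look as follows. The function names alias_namespace and native_namespace are hypothetical; only the two namespace strings come from the text.

```python
from urllib.parse import quote, unquote

# Namespace strings taken from the example in the text.
IMPORTS_BASE = "http://www.w3.org/2000/09/xmldsig-imports#"
XINCLUDE_NS = "http://www.w3.org/2001/XInclude"


def alias_namespace(native_ns, base=IMPORTS_BASE):
    """Build the aliased namespace: base plus the fully percent-encoded native namespace."""
    # safe="" forces ':' and '/' to be encoded as well.
    return base + quote(native_ns, safe="")


def native_namespace(alias_ns, base=IMPORTS_BASE):
    """Recover the native namespace from an alias, or None if it is not an alias."""
    if not alias_ns.startswith(base):
        return None
    return unquote(alias_ns[len(base):])


alias = alias_namespace(XINCLUDE_NS)
print(alias)
print(native_namespace(alias))
```

A downstream processor could use the reverse mapping to decide which transformation module to invoke for an aliased element, without the packaging module knowing anything about that transformation.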

Strawman XPPM

The following proposes a strawman XML Processing Pipeline Model for illustrative purposes. This may evolve into a serious proposal but has not currently been given enough thought to warrant coding or planning resources.

module interactions are generally linear - ergo processing pipeline
ordered set of operations with parameters
with "()" and "or" and "equiv"
procedural, i.e., conjunctions are ordered

The scenarios discussed in this document are based on manipulation of the infoset by a series of modules. While more complicated manipulation may occur, involving forking and merging document processing paths, it is not within the scope of this proposal to define a model for such actions. A procedural model expressing a series of module invocations, a pipeline akin to common Unix tools, expresses the simplest case. Expanding the model to include "or", a notion of alternate processing paths, would also require grouping operators like the "()" used in conditionals in common programming languages. Successful processing of documents described by such a model would require that the processor invoke one branch of each alternative and every one of the conjunctives. The Packaging Applications section described a further requirement, an "equivalence" operator to bind an "alias" to its native namespace and element. The resulting model could look something like this:

Pipeline :: Conjunctive *
Conjunctive :: Reference | Alternative
Alternative :: '(' Conjunctive [ '|' Conjunctive ] * ')'
Reference :: Invocation | Alias
Invocation :: Name
Alias :: 'alias' Name NewName
Name :: QName
NewName :: QName

Note: I attempted to avoid any literal characters in the "model", but found it too abstract to discuss. I would appreciate assistance with this.
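To make the strawman grammar concrete, here is a minimal recursive-descent parser sketch for it. The whitespace-separated token syntax, the tuple shapes ("invoke", "alias", "or") and the module names in the usage line are assumptions made for illustration; the grammar itself does not fix a concrete serialization.

```python
import re

# Tokens are '(', ')', '|' or any run of other non-space characters (a QName or 'alias').
TOKEN = re.compile(r"[()|]|[^\s()|]+")
PUNCT = ("(", ")", "|")


def parse_pipeline(text):
    """Pipeline :: Conjunctive *  -> list of parsed steps, in order."""
    tokens = TOKEN.findall(text)
    steps, pos = [], 0
    while pos < len(tokens):
        step, pos = parse_conjunctive(tokens, pos)
        steps.append(step)
    return steps


def parse_conjunctive(tokens, pos):
    """Conjunctive :: Reference | Alternative"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of pipeline")
    tok = tokens[pos]
    if tok == "(":
        return parse_alternative(tokens, pos)
    if tok in PUNCT:
        raise SyntaxError(f"unexpected {tok!r} at token {pos}")
    if tok == "alias":
        # Alias :: 'alias' Name NewName
        names = tokens[pos + 1:pos + 3]
        if len(names) < 2 or any(n in PUNCT for n in names):
            raise SyntaxError("'alias' must be followed by Name and NewName")
        return ("alias", names[0], names[1]), pos + 3
    # Invocation :: Name
    return ("invoke", tok), pos + 1


def parse_alternative(tokens, pos):
    """Alternative :: '(' Conjunctive [ '|' Conjunctive ] * ')'"""
    choice, pos = parse_conjunctive(tokens, pos + 1)  # skip '('
    choices = [choice]
    while pos < len(tokens) and tokens[pos] == "|":
        choice, pos = parse_conjunctive(tokens, pos + 1)
        choices.append(choice)
    if pos >= len(tokens) or tokens[pos] != ")":
        raise SyntaxError("expected ')' to close alternative")
    return ("or", choices), pos + 1


print(parse_pipeline("dtd-validate (xinclude | xslt) alias xi:include ds:include"))
```

Because the grammar is ordered, the parser returns a list rather than a set: the position of each step in the list is the position at which a processor would be expected to invoke it.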

canonical ordering of processing may be defined in terms of the model
model need not be serialized yet - parameterized vs. fixed (named) model(s)
different technologies, MIME headers, CC/PP, RNodes may be used to express this model
further standardization may be needed after these technologies have time to evolve
interoperability will be enhanced between all solutions that define a mapping to the XML Processing Pipeline Model.

Description of an encompassing data model may seem too complicated to standardize and implement, but it is not necessary to implement it to benefit from its definition. One short-term use of this pipeline model may be to define a canonical pipeline that is assumed in the absence of a pipeline description. This would meet the simplicity goals outlined in Buddy's position paper and provide a less-ambitious implementation goal. Alternatively, more than one "standard" pipeline may be defined and identified by URI.
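A minimal sketch of this short-term use, assuming a processor keeps a small table of "standard" pipelines keyed by URI and falls back to the canonical one when no description is supplied. The URIs, the module names and the particular ordering are all hypothetical; this paper does not define a canonical order.

```python
# Hypothetical identifiers: neither the URIs nor the orderings are defined anywhere.
CANONICAL_URI = "http://example.org/xppm/canonical"

STANDARD_PIPELINES = {
    CANONICAL_URI: ["dtd-validation", "xinclude", "schema-validation", "xslt"],
    "http://example.org/xppm/validate-only": ["dtd-validation"],
}


def resolve_pipeline(pipeline_uri=None):
    """Return the ordered module list for a named pipeline.

    With no URI, the canonical pipeline is assumed, as suggested in the text.
    """
    if pipeline_uri is None:
        pipeline_uri = CANONICAL_URI
    try:
        # Return a copy so callers cannot alter the shared definition.
        return list(STANDARD_PIPELINES[pipeline_uri])
    except KeyError:
        raise LookupError(f"unknown pipeline: {pipeline_uri}") from None


print(resolve_pipeline())
print(resolve_pipeline("http://example.org/xppm/validate-only"))
```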

As XML applications develop, individuals or groups may develop technologies to pass pipeline information over protocols like MIME or CC/PP, or to store it in HTTP-gettable resources like RNodes. After these technologies develop, it may be time to enter the second round of pipeline model standardization: serialization. In the meantime, applications benefit from having a model to express the common needs of XML processors. Gateways between different emergent serializations will be trivial to implement and FUD (Fear, Uncertainty and Doubt) will be reduced.

Serializations

Perhaps these should be called encodings?

Following are some possible approaches to serializing pipeline processing directives.

S-Expressions

The XML pipeline processing directives may be serialized pretty much directly from the model description to s-expressions. The resulting data could be transported over MIME headers in HTTP or mail, or written in a comment or PI at the top of the document being described.
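As an illustration of how direct that mapping could be, the sketch below serializes an in-memory pipeline to an s-expression. The tuple representation, the "pipeline", "or" and "alias" keywords and the module names are assumptions; no concrete s-expression syntax is defined here.

```python
def step_to_sexpr(step):
    """Serialize one step: ("invoke", name), ("alias", name, new_name) or ("or", [steps])."""
    kind = step[0]
    if kind == "invoke":
        return step[1]
    if kind == "alias":
        return f"(alias {step[1]} {step[2]})"
    if kind == "or":
        return "(or " + " ".join(step_to_sexpr(s) for s in step[1]) + ")"
    raise ValueError(f"unknown step kind: {kind!r}")


def pipeline_to_sexpr(steps):
    """Serialize an ordered list of steps as a single (pipeline ...) expression."""
    return "(pipeline" + "".join(" " + step_to_sexpr(s) for s in steps) + ")"


pipeline = [
    ("invoke", "dtd-validate"),
    ("or", [("invoke", "xinclude"), ("invoke", "xslt")]),
    ("alias", "xi:include", "ds:include"),
]
print(pipeline_to_sexpr(pipeline))
# prints: (pipeline dtd-validate (or xinclude xslt) (alias xi:include ds:include))
```

The result is a single line, which is what makes it a plausible candidate for a MIME header value.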

Simple XML

Anything expressed as an s-expression may also be encoded in XML. The XML used to represent the processing directives for a document must not, in turn, require any processing itself. This XML may be transported over MIME headers in HTTP or mail, or entity-encoded in a comment or PI at the top of the document being described.
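A minimal sketch of such an encoding, using plain elements and attributes with no namespaces, entities or inclusions, so that the directives themselves need no further processing. The element and attribute names (pipeline, invoke, or, alias, name, newName) and the tuple representation of a pipeline are hypothetical.

```python
import xml.etree.ElementTree as ET


def step_to_element(step):
    """Convert ("invoke", name), ("alias", name, new_name) or ("or", [steps]) to an element."""
    kind = step[0]
    if kind == "invoke":
        return ET.Element("invoke", {"name": step[1]})
    if kind == "alias":
        return ET.Element("alias", {"name": step[1], "newName": step[2]})
    if kind == "or":
        element = ET.Element("or")
        for choice in step[1]:
            element.append(step_to_element(choice))
        return element
    raise ValueError(f"unknown step kind: {kind!r}")


def pipeline_to_xml(steps):
    """Serialize an ordered list of steps as a <pipeline> document string."""
    root = ET.Element("pipeline")
    for step in steps:
        root.append(step_to_element(step))
    return ET.tostring(root, encoding="unicode")


pipeline = [
    ("invoke", "dtd-validate"),
    ("or", [("invoke", "xinclude"), ("invoke", "xslt")]),
    ("alias", "xi:include", "ds:include"),
]
print(pipeline_to_xml(pipeline))
```

Document order of the child elements carries the processing order, so no extra sequencing attribute is needed.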

RDF Schema

Mechanisms like RNodes are used to store meta-information about web resources. The meta-information retrieved for a document could include the processing directives required to process that document.

XML Schema

Perhaps the most interesting (read peculiar) re-use of a mechanism to express the XPPM is the sequence and alternates model of XML Schema.

Appendix

Scenario: XInclude and Schema or DTD Validation

The following document requires entity expansion before the XInclude module is applied:

<!ENTITY importantXIncludeDirectives '<xlink href="includeMe1" foo="replace" />' >
<myRoot>
&importantXIncludeDirectives;
</myRoot>

with the auxiliary includeMe1 file:

see you in the root

vs.

<!ENTITY appContext "see you in the root" >
<xlink href="includeMe2" foo="replace" />

with the auxiliary includeMe2 file:

<myRoot>
&appContext;
</myRoot>

The order of DTD validation and XInclude invocation is critical to generating these documents. Other scenarios include documents that are schema or DTD validated before and/or after XInclusions.
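The order dependence can be seen with a toy simulation of the first scenario. The two functions below are string-level stand-ins written for this illustration, not real DTD or XInclude processors, and their names are hypothetical.

```python
import re


def expand_entities(doc, entities):
    """Toy entity expansion: replace each &name; with its declared text, if declared."""
    return re.sub(r"&(\w+);", lambda m: entities.get(m.group(1), m.group(0)), doc)


def apply_xinclude(doc, files):
    """Toy inclusion: replace each <xlink href="name" ... /> with the named file's content."""
    return re.sub(r'<xlink href="(\w+)"[^>]*/>', lambda m: files[m.group(1)], doc)


doc = "<myRoot>&importantXIncludeDirectives;</myRoot>"
entities = {"importantXIncludeDirectives": '<xlink href="includeMe1" foo="replace" />'}
files = {"includeMe1": "see you in the root"}

# Entity expansion first: the inclusion directive becomes visible and is resolved.
entities_first = apply_xinclude(expand_entities(doc, entities), files)
# Inclusion first: nothing to include yet, so the directive survives unresolved.
xinclude_first = expand_entities(apply_xinclude(doc, files), entities)

print(entities_first)  # <myRoot>see you in the root</myRoot>
print(xinclude_first)  # <myRoot><xlink href="includeMe1" foo="replace" /></myRoot>
```

The same two modules applied to the same input produce different infosets depending only on their order, which is the information a pipeline description would need to carry.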

Related Initiatives

Many protocols provide a mechanism for requiring or negotiating extensions or functionality. In XML, SOAP attempts to do this by specifying orthogonal modules with mustUnderstand and actor attributes. This would allow a SOAP document provider to indicate what modules are invoked on the payload and the rest of the SOAP envelope.

Richard Tobin characterized XML parser functionality, implying that the conventional solution is to pick the processor that will do what you need for your specific application.

The XML Core Working Group has a task to provide standard classifications for non-validating XML processors. This characterization could include machine-readable advertisement of functionality à la Web Services, and document processing requirements to be included in a document.

Eric Prud'hommeaux

Last modified: Thu Jun 28 11:27:53 EST 2001