W3C XML Processing Model Workshop
Position paper
C. M. Sperberg-McQueen
Let me begin by listing some basic assumptions; on the basis of
those assumptions, I can then list some obvious problems which
may need solution.
Assumptions
- We care about XML and SGML because they allow us better access to
data we care about.
- Data we care about needs to be accessible in XML form, marked
up according to a DTD we find useful for our purposes.
- The only reasons to deal with other formats are
- The data comes to us from others in non-XML form (in which
case our first task will be to put it into XML form; the non-XML
form we receive is a read-only format).
- We are producing read-only output for some specialized
device, such as a printer data stream for a printer, a binary format for
an ebook reader, or the like (in which case for us the format is
write-only).
- Some parts of the data are best managed in a relational
database system or other specialized software package (in which
case we will want two-way translations between the XML we use
as our standard internal format and the DBMS or specialized package).
- Unless there are compelling reasons to the contrary, all our data
processing is best regarded as XML-to-XML (or SGML-to-SGML)
transformation. (N.B. I am taking the term XML broadly here as
referring to any data structure or format which can be serialized as
XML without information loss. Others might prefer to refer to these
as infoset-to-infoset transformations.)
- XInclude processing, XLink processing, XML Schema validation
and similar kinds of work are just special cases of XML-to-XML
transformations (a minimal sketch follows this list).
- In general, the user needs to control the order in which
tasks are undertaken.
- Where tasks cannot be undertaken in an arbitrary order, the
dependencies need to be made very clear all round. Some
dependencies follow logically from the work being done; others
follow from the ways specs interact. As far as possible, specs
should avoid, or at least minimize, the latter. That is, the specs
should avoid making unnecessary assumptions about where they come
in the pipeline.
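To make the claim about XInclude concrete: a minimal sketch in
Python, using the standard library's ElementTree and ElementInclude
modules (the file names are placeholders), shows XInclude resolution
as just one tree-to-tree transformation among others, taking an XML
document in and handing an XML document on:

    import xml.etree.ElementTree as ET
    from xml.etree import ElementInclude

    def expand_xincludes(tree):
        # One XML-to-XML step: resolve xi:include elements in place
        # and hand the (modified) tree on to the next process.
        ElementInclude.include(tree.getroot())
        return tree

    tree = ET.parse("doc.xml")         # placeholder input document
    tree = expand_xincludes(tree)      # tree in, tree out
    tree.write("doc-expanded.xml")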
The first person I heard talk about this general problem was
Lloyd Harding, in a presentation at SGML '93. Owing in part to
scheduling conflicts, he has declined an invitation to attend the
workshop, but it may be worthwhile summarizing
some of his main points, if only because they still shape parts of
my view of the problem.
- Information fabrication can learn from manufacturing systems.
- Manufacturing systems
- take input from many sources of variable quality,
- use multiple assembly lines to prepare parts of the final assembly,
- apply many kinds of process (sheet-metal presses, lathes, welders,
paint booths, ...),
- have a specific target product.
- Information fabrication systems need to
- take input from many sources of variable quality,
- use multiple assembly lines to prepare parts of the final assembly,
- apply many kinds of process (markup manipulation, structural
reordering, document merger, document splitting, ...),
- have a specific target product.
- Early manufacturing systems required large upfront investment
because each machine required different attachments to the assembly
line; the machines (e.g. lathes) were complex and required long setup times;
many machines had to be custom built; many required highly skilled
labor.
- Modern manufacturing systems, by contrast, use standard machine
attachments, so new units can be ‘plugged in’ to the
assembly line; the machines are increasingly standardized in ways
which reduce setup times; adjustable multi-task machines available
off the shelf minimize the need for bespoke machines; required
skill levels have been reduced.
- A key step in information fabrication is automated addition of
markup not provided manually by the author / information creator.
- Standard APIs and formats solve part of the problem.
- Standardized DTDs, architectural forms, and link process definitions
are all non-solutions. Multiple (many!) DTDs are a fact of life.
(N.B. I am using DTD here in its original broad sense: the
set of rules governing the application of markup to a type of document.
Those who find this confusing may say "Many markup languages, many
tag sets, many schemas are a fact of life." -MSM)
- Making information fabrication work requires two kinds of specification:
- one for specifying how processes are attached to the assembly
line and how to tell them just what they are to do (in general
terms, the API for a single process)
- one for specifying the sequence of processes to be applied
and how (in general terms, similar to a Unix shell one-liner which
pipes data through fifteen processes in a specified sequence)
These should be in XML, so that they can be validated before
use (thus minimizing downtime). A sketch of both kinds of
specification follows.
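To make both kinds of specification concrete, here is a minimal
sketch in Python. The <pipeline>/<step> vocabulary, the process
names, and the registry are invented for illustration and are not
taken from any existing specification. The registry plays the role
of the standard machine attachment (the per-process API); the XML
document plays the role of the assembly-line sequence, and because
it is XML it can itself be validated before the line is started:

    import xml.etree.ElementTree as ET

    # Invented pipeline vocabulary: a <pipeline> is a sequence of
    # <step> elements, each naming the process to be applied.
    PIPELINE_SPEC = """
    <pipeline>
      <step process="expand-xincludes"/>
      <step process="renumber-sections"/>
      <step process="schema-validate"/>
    </pipeline>"""

    # The per-process API: each process is a function from document
    # tree to document tree, registered under a name so a new machine
    # can be plugged in to the line without changing the line itself.
    PROCESSES = {
        "expand-xincludes":  lambda tree: tree,   # stand-in transforms
        "renumber-sections": lambda tree: tree,
        "schema-validate":   lambda tree: tree,
    }

    def run_pipeline(spec_xml, tree):
        # Apply each named process in the order the user specified.
        spec = ET.fromstring(spec_xml)
        for step in spec.findall("step"):
            tree = PROCESSES[step.get("process")](tree)
        return tree

Reordering or extending the line is then an edit to the XML
specification, not to any code; this is also one way to give the
user the control over task order called for above.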
Problems
- One goal must be to specify what conditions are necessary and sufficient
to allow XML-to-XML processes to be run in an arbitrary order; this is
possible if no process invalidates infoset items or properties generated
by some other process.
- A second goal must be to specify ways in which a processor can
know which infoset properties and items it has invalidated, so that
they can be marked invalid or removed.
- Some cases can be hard-coded (any process which changes
properties validated by XML-Schema-based validation should
know how to mark the schema-validity properties of the
post-schema-validation infoset as no longer valid, or to strip them
out of the infoset; a sketch of this appears at the end of this paper)
- Some cases probably cannot be hard-coded (can a process which
changes the infoset know whether its changes will affect
validation according to some future validation language? The
example of DTDs and XML Schema suggests not).
- Can we make useful general classes which can describe the state
of the data? ("Since process X was run, the following kinds of things
have happened ...")
- Of course, the fundamental task of the workshop is to decide
whether there is anything useful the W3C can do in this area, and
if so how we should go about it.
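By way of a closing illustration of the hard-coded case above, here
is a minimal sketch in Python. It assumes, purely for illustration,
that schema-validity is recorded as an attribute in a made-up
namespace on each element; real post-schema-validation infoset
properties are not attributes, so this is a stand-in for whatever
representation an implementation actually uses. The wrapper
guarantees that a process leaves no stale validity annotations
behind, which is also one way to satisfy the condition for
arbitrary-order processing stated in the first problem:

    import xml.etree.ElementTree as ET

    # Invented convention: schema-validity recorded as an attribute
    # in a made-up namespace (real PSVI properties are not attributes).
    PSVI_VALIDITY = "{http://example.org/psvi}validity"

    def invalidates_psvi(process):
        # Wrap a tree-to-tree process so that the schema-validity
        # annotations, which the process may have made stale, are
        # stripped from the tree after the process runs.
        def wrapped(tree):
            tree = process(tree)
            for elem in tree.getroot().iter():
                elem.attrib.pop(PSVI_VALIDITY, None)
            return tree
        return wrapped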