W3C XML Processing Model Workshop
Position paper
C. M. Sperberg-McQueen
Let me begin by listing some basic assumptions; on the basis of
those assumptions, I can then list some obvious problems which
may need solution.
Assumptions
- We care about XML and SGML because they allow us better access to
data we care about.
- Data we care about needs to be accessible in XML form, marked
up according to a DTD we find useful for our purposes.
- The only reasons to deal with other formats are
- The data comes to us from others in non-XML form (in which
case our first task will be to put it into XML form; the non-XML
form we receive is a read-only format).
- We are producing read-only output for some specialized
device, such as a printer data stream for a printer, a binary format for
an ebook reader, or the like (in which case for us the format is
write-only).
- Some parts of the data are best managed in a relational
database system or other specialized software package (in which
case we will want two-way translations between the XML we use
as our standard internal format and the DBMS or specialized package).
- Unless there are compelling reasons to the contrary, all our data
processing is best regarded as XML-to-XML (or SGML-to-SGML)
transformation. (N.B. I am taking the term XML broadly here as
referring to any data structure or format which can be serialized as
XML without information loss. Others might prefer to refer to these
as infoset-to-infoset transformations.)
- XInclude processing, XLink processing, XML Schema validation
and similar kinds of work are just special cases of XML-to-XML
transformations (a minimal sketch follows this list).
- In general, the user needs to control the order in which
tasks are undertaken.
- Where tasks cannot be undertaken in an arbitrary order, the
dependencies need to be made very clear all round. Some
dependencies follow logically from the work being done; others
follow from the ways specs interact. As far as possible, specs
should avoid, or at least minimize, the latter. That is, the specs
should avoid making unnecessary assumptions about where they come
in the pipeline.
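To make the claim about XInclude concrete: a minimal sketch in
Python, using the standard library's ElementTree and ElementInclude
modules (the file names are placeholders), shows XInclude resolution
as just one tree-to-tree transformation among others, taking an XML
document in and handing an XML document on:

    import xml.etree.ElementTree as ET
    from xml.etree import ElementInclude

    def expand_xincludes(tree):
        # One XML-to-XML step: resolve xi:include elements in place
        # and hand the (modified) tree on to the next process.
        ElementInclude.include(tree.getroot())
        return tree

    tree = ET.parse("doc.xml")         # placeholder input document
    tree = expand_xincludes(tree)      # tree in, tree out
    tree.write("doc-expanded.xml")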
The first person I heard talk about this general problem was
Lloyd Harding, in a presentation at SGML '93. Owing in part to
scheduling conflicts, he has declined an invitation to attend the
workshop, but it may be worthwhile summarizing
some of his main points, if only because they still shape parts of
my view of the problem.
- Information fabrication can learn from manufacturing systems.
- Manufacturing systems
- take input from many sources of variable quality,
- use multiple assembly lines to prepare parts of the final assembly,
- apply many kinds of process (sheet-metal presses, lathes, welders,
paint booths, ...),
- have a specific target product.
- Information fabrication systems need to
- take input from many sources of variable quality,
- use multiple assembly lines to prepare parts of the final assembly,
- apply many kinds of process (markup manipulation, structural
reordering, document merger, document splitting, ...),
- have a specific target product.
- Early manufacturing systems required large upfront investment
because each machine required different attachments to the assembly
line; the machines (e.g. lathes) were complex and required long setup times;
many machines had to be custom built; many required highly skilled
labor.
- Modern manufacturing systems, by contrast, use standard machine
attachments, so new units can be ‘plugged in’ to the
assembly line; the machines are increasingly standardized in ways
which reduce setup times; adjustable multi-task machines available
off the shelf minimize the need for bespoke machines; required
skill levels have been reduced.
- A key step in information fabrication is automated addition of
markup not provided manually by the author / information creator.
- Standard APIs and formats solve part of the problem.
- Standardized DTDs, architectural forms, and link process definitions
are all non-solutions. Multiple (many!) DTDs are a fact of life.
(N.B. I am using DTD here in its original broad sense: the
set of rules governing the application of markup to a type of document.
Those who find this confusing may say "Many markup languages, many
tag sets, many schemas are a fact of life." -MSM)
- Making information fabrication work requires two kinds of specification:
- one for specifying how processes are attached to the assembly
line and how to tell them just what they are to do (in general
terms, the API for a single process)
- one for specifying the sequence of processes to be applied
and how (in general terms, similar to a Unix shell one-liner which
pipes data through fifteen processes in a specified sequence)
These should be in XML, so that they can be validated before
use (thus minimizing downtime). A sketch of both kinds of
specification follows.
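To make both kinds of specification concrete, here is a minimal
sketch in Python. The <pipeline>/<step> vocabulary, the process
names, and the registry are invented for illustration and are not
taken from any existing specification. The registry plays the role
of the standard machine attachment (the per-process API); the XML
document plays the role of the assembly-line sequence, and because
it is XML it can itself be validated before the line is started:

    import xml.etree.ElementTree as ET

    # Invented pipeline vocabulary: a <pipeline> is a sequence of
    # <step> elements, each naming the process to be applied.
    PIPELINE_SPEC = """
    <pipeline>
      <step process="expand-xincludes"/>
      <step process="renumber-sections"/>
      <step process="schema-validate"/>
    </pipeline>"""

    # The per-process API: each process is a function from document
    # tree to document tree, registered under a name so a new machine
    # can be plugged in to the line without changing the line itself.
    PROCESSES = {
        "expand-xincludes":  lambda tree: tree,   # stand-in transforms
        "renumber-sections": lambda tree: tree,
        "schema-validate":   lambda tree: tree,
    }

    def run_pipeline(spec_xml, tree):
        # Apply each named process in the order the user specified.
        spec = ET.fromstring(spec_xml)
        for step in spec.findall("step"):
            tree = PROCESSES[step.get("process")](tree)
        return tree

Reordering or extending the line is then an edit to the XML
specification, not to any code; this is also one way to give the
user the control over task order called for above.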
Problems
- One goal must be to specify what conditions are necessary and sufficient
to allow XML-to-XML processes to be run in an arbitrary order; this is
possible if no process invalidates infoset items or properties generated
by some other process.
- A second goal must be to specify ways in which a processor can
know which infoset properties and items it has invalidated, so that
they can be marked invalid or removed.
- Some cases can be hard-coded (any process which changes
properties validated by XML-Schema-based validation should
know how to mark the schema-validity properties of the
post-schema-validation infoset as no longer valid, or to strip them
out of the infoset; a sketch of this appears at the end of this paper)
- Some cases probably cannot be hard-coded (can a process which
changes the infoset know whether its changes will affect
validation according to some future validation language? The
example of DTDs and XML Schema suggests not).
- Can we make useful general classes which can describe the state
of the data? ("Since process X was run, the following kinds of things
have happened ...")
- Of course, the fundamental task of the workshop is to decide
whether there is anything useful the W3C can do in this area, and
if so how we should go about it.
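By way of a closing illustration of the hard-coded case above, here
is a minimal sketch in Python. It assumes, purely for illustration,
that schema-validity is recorded as an attribute in a made-up
namespace on each element; real post-schema-validation infoset
properties are not attributes, so this is a stand-in for whatever
representation an implementation actually uses. The wrapper
guarantees that a process leaves no stale validity annotations
behind, which is also one way to satisfy the condition for
arbitrary-order processing stated in the first problem:

    import xml.etree.ElementTree as ET

    # Invented convention: schema-validity recorded as an attribute
    # in a made-up namespace (real PSVI properties are not attributes).
    PSVI_VALIDITY = "{http://example.org/psvi}validity"

    def invalidates_psvi(process):
        # Wrap a tree-to-tree process so that the schema-validity
        # annotations, which the process may have made stale, are
        # stripped from the tree after the process runs.
        def wrapped(tree):
            tree = process(tree)
            for elem in tree.getroot().iter():
                elem.attrib.pop(PSVI_VALIDITY, None)
            return tree
        return wrapped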