Some Alloy models for XProc

Some Alloy models for XProc A working paper for the W3C XProc Working Group C. M. Sperberg-McQueen November 2006 - January 2007 Last revised &date.last.revised; This paper is unfinished. If you are not collaborating with the author on it and have run across it by accident, please ignore it.

This paper presents several formal models for some aspects of XProc , an XML pipeline language being defined by the W3C XML Processing Model Working Group; these models are formulated using the notation of Alloy . The goal is threefold: to help clarify design issues, to use Alloy to perform simple sanity checks on the design, and to give the author some practice using Alloy.

The current design of XProc postulates a layer of abstractions (pipelines, pipeline steps, etc.) which are independent of and unrelated to the XML vocabulary used to describe specific instances of the abstractions. The goal is to have a complete description in abstract terms, which owes nothing to the properties of the XML documents, and which thus is not unduly influenced by accidents of syntax. The current version of this paper concentrates on providing a formal version of the abstract model, as described in sections 1-3 of the specification, and on identifying places in which the abstraction layer is not fully described in the current draft and cannot be understood independently of the XML representation.

Several kinds of comment which address rather different readerships are intermixed here. The first three are what the reader of a paper on formal models for XProc might most reasonably expect: descriptions of what's in the XProc spec descriptions of the Alloy model(s) of XProc remarks (currently few and far between) on Alloy syntax, for the use of readers who don't know Alloy Part of the goal, however, is to use Alloy modeling to help find aspects of XProc which could be improved, which leads to:

meta-discussion about the design of XProc and what the XProc spec ought to be doing (as opposed to what it does); at the limit, these bleed into straightforward editorial comments which should just be sent to the editor (I tried initially to keep purely editorial issues out of this working paper, but in the end I just gave up on the attempt)

As an experiment, some such meta-discussions on XProc design issues are here rendered with a distinctive background color and font-style, like this paragraph. Readers without an interest in the details of the XProc Working Group's design process may skim these paragraphs.

And part of the goal is to improve the author's facility with Alloy, which gives rise to:

meta-discussion on alternative methods of modeling XProc in Alloy and speculation on which is preferable

As an experiment, some such meta-discussions on Alloy choices are here rendered with a distinctive background color and font-style, like this paragraph. Readers not concerned with questions of Alloy usage may skip these paragraphs without losing any essential exposition.

In the ideal case, I expect that future versions of this document will limit themselves to the first three of these (descriptions of XProc and its Alloy models); for the foreseeable future, however, it seems likely that meta-commentary is essential.

This paper provides no systematic introduction to Alloy notation. Those not familiar with Alloy should be able to follow at least the broad outlines of the discussion: Alloy is well designed to feel accessible to anyone familiar with common object-oriented programming languages, and I have made an effort to present everything important both formally, in Alloy notation, and informally, in prose.Some authorities recommend eliminating redundancy from technical specifications as far as possible, but this is I think almost always a large mistake. The redundancy of presenting material both formally and informally makes it easier to follow dense or unfamiliar material and thus easier to detect errors. Some specification languages, such as Z, are explicitly designed to be used in a mixture of prose and formal notation, and the usual stylistic advice is to exploit the duality of formal and informal presentation to provide a controlled but fully intentional redundancy. See e.g. , , and . Since Alloy is not specified as an intermixture of formal notation and prose, this paper uses a literate programming notation to make the intermixture possible . To follow the models in more detail, any reader not familiar with Alloy will find it useful to read through the Alloy tutorial. Good online documentation can be found at http://alloy.mit.edu as well as in the book .

Sections , , and of this paper present aspects of the model which emerge from the Introduction, Pipeline Concepts, and Language constructs sections, respectively, of . For each section, first parts of the spec are quoted, and the salient parts of the model (as I understand it) are identified in a sequence of numbered propositions. The salient parts, for purposes of this paper, are those which (appear to) lend themselves to modeling using a tool like Alloy. No claim is made that paragraphs not transcribed here are not essential to the spec or of lesser value, only that they contain no (new) propositions I can imagine modeling usefully using a system like Alloy.

Items in brackets reflect propositions not stated explicitly in the spec and may go slightly beyond what is intended, but seem useful to capture as possible rules. Some double-bracketed items ([[ ... ]]) just relate (what I take to be) different ways of saying the same thing. Items marked with ¿...? may or may not capture what is meant by the spec; the paragraphs in question may need editorial attention to make them clearer. Items marked Q: ... are questions about pipelines which may occur to a reader but which the spec appears not to have answered at the time they occurred to this reader.

The Alloy models provided in the course of the document formalize parts of the design as captured in the numbered propositions.

In its current form, both this paper and the models it presents are incomplete.

The paper provides no introduction to XProc other than that implicit in the quotations from the spec and the paraphrase in the form of numbered propositions. In particular, there is no motivation of the design features of the language. At times I have considered embedding this material in the text of the spec, but for the moment I have chosen merely to quote from the spec extensively. I keep having second thoughts about this, and returning to the XML source of the spec to see about inserting the models there. Each time I get a little closer to being able to add material conveniently. But so far I am not in a position to edit the XProc spec conveniently.

For the moment, the reader not already familiar with XProc may do well to read this document together with the XProc spec . The draft of XProc used is that of 17 November 2006.

Concepts in the introduction

Basic classes of objects

Section 1, paragraph 1 of reads:

An XML Pipeline specifies a sequence of operations to be performed on a collection of input documents. Pipelines take zero or more XML documents as their input and produce zero or more XML documents as their output. Steps in the pipeline may read or write non-XML resources as well.

From this, the following propositions appear to follow. Pipelines exist. (1p1; cf. prop. )The notation (1p1) is a reference to the paragraph from which this proposition was derived. Such paragraph references are used in the propositions and elsewhere; they indicate the section number (here 1) and, after a p, the paragraph number within the section (here also 1). Propositions restated later sometimes have the later location also noted, but this is not always so. In some cases, as here, reference is made to other propositions with related content. Operations exist. (1p1) Pipelines take zero or more XML documents as input. (1p1) Pipelines produce zero or more XML documents as output. (1p1) Pipeline inputs are XML documents. (1p1) Pipeline outputs are XML documents. (1p1) Steps exist. (1p1, 1p3)

There is a forward reference to steps here; remove?

Steps may be in pipelines. (1p1) Steps may read non-XML resources. (1p1) Steps may write non-XML resources. (1p1)

Section 1 paragraph 2 reads in part:

A pipeline consists of components. ...

From this, we can infer: Components exist. (1p2) Pipelines consist of components. (1p2)

Section 1 paragraph 3 elaborates on components:

There are two kinds of components: steps and (language) constructs. Steps carry out single operations and have no substructure as far as the pipeline is concerned, whereas constructs can include components within themselves.

Constructs exist. (1p3) Every step is a component. (1p3) Every construct is a component. (1p3) Every component is a step or a construct. (1p3) Steps are treated as atomic (no internal structure to model). (1p3, 2.1p1) Constructs include components within themselves. (1p3, 2.1p2, 2.1p3) [X contains Y iff X includes Y within itself.] (1p3) Constructs contain components. (1p3)

If we begin by declaring pipelines, XML documents, etc. as signatures in Alloy, and make clear the relation between components (as the general class) and steps and constructs (as its subtypes), we have something like the following. (This model is available as a stand-alone Alloy file as xproc01.als in the same directory as this document.) module xproc01 // Pipelines sig Pipeline { inputs: set XMLDoc, outputs: set XMLDoc, components: set Component } // Pipelines read and write mostly XML documents, // but also nonXML sig XMLDoc {} sig nonXML {} // Components are of two kinds: Steps and Constructs abstract sig Component {} abstract sig Step extends Component {} abstract sig Construct extends Component { components: set Component }

This model reflects most of propositions through ; it makes no attempt to model the relation between pipelines and operations, or the reading and writing of non-XML resources, so propositions , , and are excluded.

From these declarations, Alloy can generate a metamodel showing the relations of the classes:

Class relations in model xproc01

The built-in set univ is extended by Pipeline, Component, XMLDoc, and nonXML. Component, in turn, is extended by Step and Construct. The relations input and output connect Pipeline instances to XMLDoc instances, while components connects pipelines to components, and components to components.

Of course, this set of types is not yet complete, and their interrelations are incorrectly specified: Pipelines, it will become clear later in the spec, are a subclass of Constructs. And each component can have inputs and outputs.

I define Component as abstract, to force the conclusion that every component is either a Step or a Construct. To prevent other models from importing this one and adding further subtypes of component, one could add: fact component_completeness { Component = Step + Construct } Defining Step and Construct as abstract I'm less certain about; my idea is that the types of steps supported can be declared as extensions of Step, thus: sig Identity extends Step {} sig XSLT extends Step {} sig XInclude extends Step {} sig Serialize extends Step {} sig Parse extends Step {} sig Load extends Step {} sig Store extends Step {} sig ExtensionStep extends Step { type : Extension } Making the Step signature be abstract has as a consequence the invariant that every step is of a known kind; if it's not of a predefined kind then it's an ExtensionStep. Implementations that provide other step types can be modeled by adding signature declarations for the new step types. Alternatively, we could forbid such additional declarations by imposing the explicit constraint that every Step must be a member of one of the sub-signatures just named: fact step_completeness { Step = Identity + XSLT + XInclude + Serialize + Parse + Load + Store + ExtensionStep } This may or may not be the right way to model the relation among required step types and others.

Declaring Construct as abstract is intended to work the same way; I'll declare concrete sets which extend Construct and can be instantiated.

The reader of the XProc spec may be uncertain whether the correct declarations of Identity, XSLT, and so on will be as shown above (which allows the world to contain multiple Identity Steps and multiple XSLT Steps, etc.), or with a specification that there is really only one Identity step, only one XSLT step, etc.: one sig Identity extends Step {} one sig XSLT extends Step {} ... It becomes clearer at the end of the introduction that what the standard library contains are types of steps, rather than steps. See also proposition below.

Data flows, first cut

Section 1 paragraph 2 reads:

A pipeline consists of components. Like pipelines, components take zero or more XML documents as their input and produce zero or more XML documents as their output. The inputs to a component come from the web, from the pipeline document, from the inputs to the pipeline itself, or from the outputs of other components in the pipeline. The outputs from a component are consumed by other components, are outputs of the pipeline as a whole, or are discarded.

Components take zero or more XML documents as input. (1p2) Components produce zero or more XML documents as output. (1p2) Component inputs and outputs are XML documents. (1p2) Component inputs may come from the Web. (1p2) Component inputs may come from the pipeline document. (1p2) Component inputs may come from pipeline inputs. (1p2) Component inputs may come from outputs of other components. (1p2) ¿Component inputs must come from one of: the Web, pipeline document, pipeline inputs, component outputs? (Not certain whether implication intended or not.) (1p2) [No single XML document comes from more than one of: the Web, pipeline document, pipeline inputs, component outputs.] (1p2) [Components may consume XML documents.] (1p2) [[If a component consumes an XML document, then it takes that XML document as an input, and vice versa.]] (1p2) Components may consume component outputs. (1p2) Pipeline outputs may consume component outputs. (1p2) [[If a pipeline output consumes a component output, then the component output flows into the pipeline output. And conversely, if component output flows into a pipeline output, then the pipeline output consumes the component output.]] (1p2) Component outputs may be discarded. (1p2) [An XML document can both be consumed by a component and consumed by a pipeline output.] (1p2) [No XML document can both (1) be consumed by a component or by a pipeline output, and (2) be discarded.] (1p2)

Proposition interprets the or of the spec as an inclusive or as it relates to data being consumed by a pipeline output or by other components; proposition interprets the same or as exclusive when it relates to data being consumed or discarded. This only became puzzling to me when I thought about it carefully.

Components can read and write both XML documents and non-XML resources, so the reader may wonder why the inputs and outputs of components are described as being XML documents only. Section 2.2 of the spec makes clear (it might perhaps be clearer in section 1, too) that non-XML resources never flow from one component to another, and thus never flow through the pipeline itself. The terms input and output are used only for the named ports of a component, not for other data the process might read or write.

To the extent that the inputs and outputs of a component are restricted to XML documents, the terms input and output are technical terms with a specialized meaning. The discussion which follows assumes in some places that the terms are not consistently restricted in this way; it may need revision.

Pipelines take their inputs from the outside world, and their outputs flow into the outside world; how a processor manages that is implementation-dependent.

Note that the Introduction's description is asymmetric; as it describes things, a component can deal directly with the outside world only when it comes to inputs, not for outputs. For now, we'll focus on modeling the data flows within a pipeline, and will ignore any direct access to the outside world by steps within a pipeline.

The data flows of a pipeline will clearly be of great interest for a formal model.

At the moment, however, sections 1-3 of the spec don't say enough about data flows and the interconnection of pipeline components to allow a satisfactory model to be built. The remarks that follow rely on knowledge acquired through other channels, and on speculation about how XProc ought to work.

Several approaches to modeling data flows can be imagined. Based on what is said in the introduction, we can simply say that components have inputs and outputs, which can be XML documents or non-XML data: ... abstract sig Component { ins: set (XMLDoc + nonXML), outs: set (XMLDoc + nonXML) } ...

Connections between the components of a pipeline can then be modeled as a mapping from pipeline inputs and component outputs to component inputs and pipeline outputs. ... sig Pipeline { ins: set XMLDoc, outs: set XMLDoc, components: set Component, descendants: set Component, flows: (ins + descendants.@outs) -> (descendants.@ins + outs) }{ // The 'descendants' of a pipeline are all of the // components in its 'components' set, or in the // 'components' set of any descendant. descendants = univ.^components // Flows only involve XML documents. // I.e. the domain of flows is XMLDoc flows.univ in XMLDoc // And so is the range univ.flows in XMLDoc } ...

Some notes on Alloy notation may be in order here.

The operator -> denotes a mapping from its left-hand argument to its right-hand argument. Here, flows: (ins + descendants.@outs) -> (descendants.@ins + outs) says that for any given instance of the Pipeline signature, flows is a mapping from (ins + descendants.@outs) to (descendants.@ins + outs).

Viewed in isolation, of course, any field in a signature is already a relation from members of the signature to values of the field; when a field is itself of type relation of arity &n;, its name denotes a relation of arity &n; + 1. Here, flows is a set of triples, with each triple consisting of a Pipeline instance, an XML document, and and XML document.

The operator + denotes set union.

Here, the expression ins + descendants.@outs denotes the union of the set ins defined earlier in the signature and the set descendants.@outs.

The operator @ acts as an escape character; it allows the names ins and outs to be used to refer not only to the ins and outs of a particular pipeline but to those of other components. Here, the expression descendants.@ins denotes the union of the values of ins for all descendants.

For any relation &R;, the expression ^R denotes the transitive closure of &R;; here ^components denotes the transitive closure of the relation components. (Not shown here is the related operator *: *R denotes the reflexive transitive closure of &R;.

The expression flows.univ denotes the domain of the relation flows; univ.flows denotes its range. The utility functions dom and ran could have been used instead, but the use of univ in this way is a common Alloy idiom. For any binary relation &R;, R.univ and univ.R work in this way. For relations of arity n > 2, they denote the relations of arity n - 1 which result when the first, or the last, atom in each tuple of the original relation is dropped.

The . operator performs a equi-join on two sets of tuples using the last column of its left-hand argument and the first column of its right-hand argument, and then projects every column of the resulting tuples except the one on which the join was performed. The name univ denotes the universal set. So univ.flows performs an equi-join on the universal set and the first column of the set of flows (resulting in a set containing all the flows), and then discards the column used for the join, leaving only the range of flows. Fuller information on the dot operator is in the Alloy documentation.

The second set of braces in the declaration of Pipeline contains facts which must hold of every instance of the signature: writing a proposition P there is equivalent to writing For all x in Pipeline, P holds. Or in Alloy notation, sig Pipeline { ins: set XMLDoc, outs: set XMLDoc, components: set Component, descendants: set Component, flows: (ins + descendants.@outs) -> (descendants.@ins + outs) } // The 'descendants' of a pipeline are all of the // components in its 'components' set, or in the // 'components' set of any descendant. fact descendants { all p: Pipeline | p.descendants = univ.(p.^components) } // Flows only involve XML documents. // I.e. the domain of flows is XMLDoc fact flow_domain { all p: Pipeline | p.flows.univ in XMLDoc } // And so is the range fact flow_range { all p: Pipeline | univ.(p.flows) in XMLDoc }

This first formulation of data flows begins to get the point across, but it's not satisfactory for several reasons:

It suggests that flows is a mapping from data sources to data sinks; in reality, what we want is not for the ouput of one component to be mapped to the input of another, but to be identical to it.

This first cut doesn't enunciate any of the obvious sanity checks. To name just one: it doesn't forbid mapping the output of a component to an input of the same component. See proposition below.

It treats the inputs and outputs of a component as simple undifferentiated sets of XML documents, which is implausible: in reality, if a component takes two input documents they will typically have quite distinct roles (such as: Document and Schema, as shown in Figure 1 of the spec) and be bound to named ports.

Data flows, second cut

The propositions seen so far ( through ) do not provide enough information to model the named ports usefully, but if we jump ahead a little and assume some information which does not actually occur until later, we can provide a slightly more plausible model. The input and output ports of a pipeline are, in this model, mappings from names to XML documents.

Overall structure

The overall structure of this model is shown in the code snippet below: Module declaration and imports Primitives / unanalysed signatures Components Pipelines Specific component types Sanity checks

Module declaration

For reference, let's call this model xproc02. We import the library file util/relation in order to make the dom and ran functions available. module xproc02 open util/relation as R

Primitive unanalysed types

As before, we treat names, XML documents, and non-XML resources as primitive, unanalysed types: // Some primitive, unanalysed signatures: // names, documents, resources sig Name {} sig XMLDoc {} sig nonXML {}

Components and their structure

Components all have inputs and outputs, which are functional mappings from names to XML documents: // Any pipeline component has inputs and outputs. abstract sig Component { ins: Name -> lone XMLDoc, outs: Name -> lone XMLDoc }{ // The names of input and output ports are disjoint. // (At least I think they are; is this true? See 2.1p6.) no dom[ins] & dom[outs] // No document is simultaneously an input and an output // for the same component. See 2p1. // (Actually this is weaker than 2p1. In fact, // data flows are acyclic.) no ran[ins] & ran[outs] } The keyword lone in the expression ins: Name -> lone XMLDoc means that any Name maps to at most one XMLDoc.

In the scrap above, signature facts are provided (in the second pair of braces) to specify relevant invariants. First, no name can be used both for an input and for an output port. (I don't know if this is actually true in the current design of XProc, but it seemed useful in avoiding confusion.) The expression no dom[ins] & dom[outs] could also be expressed in a style more like ordinary predicate calculus: all n: Name | n in dom[ins] => n not in dom[outs] and n in dom[outs] => n not in dom[ins] or all n: Name | not (n in dom[ins] and n in dom[outs]) The expression no dom[ins] & dom[outs] has the general form no Expr, which means the set described by expression Expr has cardinality 0. Here, that set is the intersection of dom[ins] and dom[outs].

Second, since no component can consume its own output, no document is both an input and an output for the component. (If we want this really to be true, we will have an interesting time explaining what the Identity step which all XProc implementations must support is actually doing. For now, I'll suppose that the so-called Identity step actually produces an XML document equivalent and isomorphic to, but not technically identical to, its input. This is not the place to work out notions of XML document identity. To the extent that we wish to use the XMLDoc signature to represent data flows from one component to another, it has more stringent identity conditions than do XML documents.)

Steps

There are two (disjoint) kinds of component, in the current spec called steps and constructs.

It is about as much as I can do to force myself to use this terminology, instead of using the terms step (for any component), atomic step, and compound step. This would free the term construct to refer to syntactic objects, which is its natural interpretation, instead of whatever we conceive components to be. In some cases, I am not sure I have succeeded in forcing myself to use the standard terms. A future revision of this paper may abandon this attempt to align with the current terminology and adopt a Superior and Cleaner Terminology throughout.

Steps have no interesting substructure (although different kinds of steps will have different input/output signatures; see below). The following scrap models propositions , , and . // Steps (atomic components) have no further internal // structure, just inputs and outputs. abstract sig Step extends Component {}

Constructs

Compound steps, however (i.e. constructs), contain other components. We define the relation components for the components directly contains, and the auxiliary relation descendants (you would think it would not be hard to find a better name than that) for all components contained directly or indirectly within a particular construct. The following scrap models propositions , , and through . // Constructs (compound components), however, have // nested components abstract sig Construct extends Component { components: set Component, descendants: set Component }{ descendants = ran[^@components] } It is the expression descendants = ran[^@components] which gives meaning to the descendants relation.

If there is a way to initialize the descendants relation as part of its declaration, instead of in a separate signature fact, I don't currently see it.

Pipelines

Finally, we distinguish a Pipeline as a special kind of construct. sig Pipeline extends Construct {}

Specific kinds of step

As in the first model, we have declared Step as abstract, in order to specify individual types of steps separately in more detail.

An XInclude step, for example, takes one input and produces one output: sig XInclude extends Step {}{ some document : Name | some X : XMLDoc | ins = ( document -> X ) some result : Name | some X : XMLDoc | outs = ( result -> X ) }

A Validate step, by contrast, takes two inputs, and produces one output. sig Validate extends Step {}{ some document, schema : Name | some X1, X2 : XMLDoc | disj[document, schema] // N.B. the Names are disjoint, not necessarily the documents and ins = ( document -> X1 + schema -> X2 ) some result : Name | some X : XMLDoc | outs = ( result -> X ) }

The other step types are left for later elaboration; for now, we use the following place-holders. sig Identity extends Step {} sig XSLT extends Step {} sig Serialize extends Step {} sig Parse extends Step {} sig Load extends Step {} sig Store extends Step {} sig ExtensionStep extends Step { type : Name }

The show predicate

Finally, we define a show predicate which can be used to find instances of the model described. (Inability to find an instance may indicate a contradiction in the model.)In some cases, however, inability to find an instance turns out only to indicate that the default scope on the run command is too small to meet the constraints imposed by a predicate. This is particularly true when a predicate asks for four disjoint instances of the same signature; there will be no instances of such a predicate in the default scope of three. In an attempt to make the instance more interesting, we specify that the inputs and outputs of the pipeline should not be empty. pred show (p: Pipeline) { some p.ins some p.outs Component = p + p.^components } run show for 3

The predicate cyclic generates an instance with a cyclic containment relation (a component which contains itself directly or indirectly). pred cyclic (p: Pipeline) { some p.ins some p.outs some c: Component | c in c.^components } run cyclic for 3 The predicate acyclic generates an instance in which no component contains itself directly or indirectly. pred acyclic (p: Pipeline) { some p.ins some p.outs no c: Component | c in c.^components } run acyclic for 3

To do: Define further predicates to generate particular kinds of instances of this model: empty pipelines, pipelines with only simple components (steps, no constructs), pipelines with some inputs and outputs, universes in which every component is in some pipeline, universes in which some components exist outside pipelines, pipelines with three children, pipelines with six children.

Example 1

Figure 1 of the spec shows a simple pipeline with two steps: an XInclude step whose output goes to a Validate step. Each is represented by a box, as is the entire pipeline. The boxes representing the individual steps are decorated with male and female connectors indicating input (female) and output (male) ports. Inputs labeled Document and Schema flow from outside the outer box to input connectors on the two steps; the output of the XInclude step flows to the other input of the Validate step. The output of the Validate step flows out of the pipeline.

Figure 1 of the spec.

The diagram does not show connectors (ports) on the pipeline, so it may be read as indicating that the XInclude step takes its input direct from the Web, rather than from a named pipeline input, and similarly for the schema input to the Validation step. The text accompanying the figure makes clear, however, as does the XML version of the pipeline in appendix E.1 of the spec, that Schema and Document are pipeline inputs and that the result document of the Validate step is the result document of the pipeline.

The diagram should perhaps be redrawn to make the ports on the pipeline explicit.

It would be useful to ensure that the figure as shown is consistent with the model we have defined. We can do this, by defining an Alloy predicate which describes an instance of the model, and by imposing constraints on that instance which correspond to the properties of the pipeline shown in the figure.

Describing a specific instance in this way is not a focus of any Alloy documentation I have seen. It may count as software abuse.

In the usual mode of operation, the modeler just describes the model, not particular instances, and lets Alloy handle the business of generating instances. The generation of instances is indeed one of Alloy's particular strengths compared with other modeling tools. But it's not uncommon to require, in an Alloy predicate, some particular properties that an instance of the model may possess, so as to be able to focus for a while on instances with those properties. As a side effect, of course, the generation of model instances which satisfy the constraints of a given predicate also establishes that the constraints are consistent with the model; otherwise, no such instance could exist. In attempting to induce Alloy to re-draw the pipeline of Figure 1, we are simply pushing a little harder in that direction than is usual in exploratory modeling: we describe constraints which should result in a pipeline isomorphic to that of Figure 1, and we ask Alloy to show us some examples. If they don't look like Figure 1, then something is wrong, and probably some constraints are missing either from our general model or from the statement of the salient properties of the pipeline in the figure.

We begin by declaring the module and importing the relation utility. module xproc_figure01 open util/relation as R Then we import the xproc02 model. There's no particular reason we need to have that model and this work on figure 1 in separate files, but it feels cleaner this way. This way the xproc02 file just focuses on the model proper, and not on the attempt to replicate Figure 1. open xproc02 as XP

As an experiment, we'll define some subtypes of Name. This isn't strictly necessary, but since the Alloy visualizer uses type names as labels in various places, this will help make the mapping from names to documents clearer. (Individual mappings will be labeled ins[Document] and ins[Schema], for instance, instead of ins[Name.0] and ins[Name.1].) // First, define some specific names (these aren't // essential, but they are convenient) one sig Document, Schema, Input, Result extends Name {} As the images below show, the experiment seems to have worked; the diagrams are somewhat easier to read with these labels.

It may be possible, I now realize, to get the same effect by tweaking the visualizer parameters. That is probably cleaner.

Next, we define an Alloy predicate named Figure_1. Within this predicate, we will need to refer to the pipeline itself, to the two steps in the pipeline, and to various port names. After the definition of the predicate, we include the command run Figure_1 for 3 but 4 XMLDoc, which stipulates that the instance generated by Alloy may have at most three atoms for any signatureIn the case of the subtypes of Name, the instances of Document, Schema, Input, and Result appear not to count against the budget for Name. How the scope rules count atoms which can be members of overlapping signatures I do not know., except for XMLDoc, which may have four. (We want the two inputs and the two outputs to be distinct documents.) // Now define the figure itself. // It requires a pipeline, two components, and four names // for specifying the data flows. pred Figure_1 (p: Pipeline, x: XInclude, v: Validate, document: Document, schema: Schema, input: Input, result: Result ) { Constrain component structure Describe the port wiring } run Figure_1 for 3 but 4 XMLDoc

Within the body of the predicate, we specify first that the pipeline should contain exactly the two components x and v: // The XInclude and validation components are // in the pipeline. // Nothing else is in the pipeline p.components = (x + v)

Next we describe how the data flows. We declare four variables of type XMLDoc, and specify that the input and output ports of the components must map to those documents in a particular way. The input ports of the pipeline, for example, are named Document and SchemaThat's a slight oversimplification: the names are not actually specified here. What is specified is actually that one port has a name of type Document, and one of type Schema. Since Schema and Document are declared as singleton sets, each has just one member; it does no harm here if we imagine that the one member of the set named Document is the name Document, or speak as if we imagined it. and they map to two XML documents which we here call Doc and Sch. The outs field of the pipeline contains a single mapping, from the name Result to the XML document Res. And so on. The flow of data from component (or port) to another is expressed only by identity of the XML document involved.

As noted in the comment, it's not clear to me at this writing whether it's better to declare the four XML documents here, or in the list of declarations for the predicate. As far as I can tell, the only semantic difference is the scope of the declarations.

We also specify that the four variables of type XMLdoc must be disjoint — in other words, we want four documents, not just four variables. // There are four documents needed. (I don't currently // see whether it's better to declare these above, or here.) // This is one way to define the data flows. // There are others. some Doc, Sch, Tmp, Res: XMLDoc { // For simplicity, we specify that the four documents // pairwise disjoint. That's not actually required // by the figure or by the nature of the case. But // reasoning about it would require a better definition // of XML document identity than I want to bother // with now. disj[Doc, Sch, Tmp, Res] // How the connections work p.ins = ( document -> Doc + schema -> Sch ) and p.outs = ( result -> Res) and x.ins = ( document -> Doc ) and x.outs = ( result -> Tmp ) and v.ins = ( document -> Tmp + schema -> Sch) and v.outs = ( result -> Res ) }

The same effect can be had, without specifying names for the four documents, in the following slightly more compact formulation: // Ports on the pipeline // N.B. ports on the components are described in xproc02 dom[p.ins] = document + schema dom[p.outs] = result // Data flows. Here is another way to define the flows. x.ins[document] = p.ins[document] v.ins[document] = x.outs[result] v.ins[schema] = p.ins[schema] p.outs[result] = v.outs[result] disj[p.ins[document], p.ins[schema], p.outs[result], x.outs[result]] The first two lines specify the input and output ports on the pipeline. The next four lines show the flow from data sources to sinks, using the convention that sinks are on the left and sources on the right. The last line requires that there be four distinct XML documents.

Alloy generates a diagram to help visualize the instance it finds for the Figure_1 predicate. If a little effort is put into customizing the display, the result can be made fairly legible:

The pipeline of figure 1, as an instantiation of the xproc02 model. Some manual repositioning of the document boxes and process ovals in an image editor can make it slightly easier to see the flow of the data.The figure shown here was created by saving the Graphviz input generated by Alloy in a dot file, and then cleaning the image in an image editing program (OmniGraffle 3.2). It would appear that there may be some interoperability issues concerning colors and shape filling in the Graphviz notation. Fortunately, for our purposes the variation can be ignored. (If the type is too narrow to read, a wider browser window will allow it to become larger.)

The pipeline of figure 1, as an instantiation of the xproc02 model.

The graphic notation used here is, of course, different from that used in the XProc spec. But the information content is, I believe, much the same; we have succeeded in showing that the pipeline shown in Figure 1 is a legal instance of the pipeline model xproc02.

Example 2

Having succeeded with Figure 1, it is tempting to attempt a similar effort with Figure 2 of the XProc spec. A glance at the image makes clear, however, that the models developed so far are not nearly complete enough to generate any instance remotely like this one.

Figure 2 of the spec. The Document input splits into two data streams; the fork has no box in the diagram and is not a component. We currently have no words or concepts to describe what is happening here. (Another apparent difference is only illusory: the outputs of the two Validate steps flow together, and the point at which they join is not a component. But this can be described using the concepts already introduced: the two data flows go into the same data sink.)

We will come back to Figure 2 later.

The constructs of Figure 2 go not only beyond the Alloy models developed and sketched above; they also go beyond anything section 1 of the spec has prepared us for. In particular, the conditional direction of flow either into V1 validation or V2 validation, depending on a property of the input document, appears to involve entities of kinds not yet mentioned in the exposition. There is no reason they would need to be mentioned in section 1 of the spec, but nothing said in the other sections devoted to the abstract model touches on them, either. In the current draft of the spec, the abstract model of XProc does not seem to be defined in enough detail to make it possible to describe what is happening in Figure 2.

This is one reason the author believes that the XProc spec would do better to take a different approach to the definition of pipelines.

Pipeline concepts

Pipeline connections and evaluation

Section 2 paragraph 1:

[Definition: A pipeline is a set of components connected together, with outputs flowing into inputs, without any loops (no component can read its own output, directly or indirectly).] A pipeline is itself a construct and must satisfy the constraints on constructs.

Connections [or: data flows] exist. (2p1) Components may connect to components. (2p1) A pipeline is a set of components connected together. [Or: A pipeline is a pair (C, F), where C is a set of components and F is a relation C → C.] (2p1; cf. prop. ) [If a pipeline consists of components X, Y, ... Z, then that pipeline is the set {X, Y, ... Z} connected together, and vice versa.] (2p1; cf. prop. ) [A connection between components is a data flow.] (2p1) ?! In a data flow, an output flows to an input. (Or perhaps: ¿Data flows connect data sources to data sinks?) (2p1)

Note that as reflected in proposition , the spec as written does not cover the case of [pipeline] input flowing to [component] input or [component] output to [pipeline] output. The spec might be clearer if it consistently reserved the words input and output for named ports and used terms like (data) source for possible sources of data (pipeline inputs and component outputs) and (data) sink for possible consumers of data (pipeline outputs and component inputs).

Data flow, viewed as a relation on components, is acyclic. That is, no component ever reads or can read its own output, directly or indirectly. (2p1) Every pipeline is a construct. (2p1)

Section 2 paragraph 2:

The result of evaluating a pipeline is the result of evaluating the components that it contains, in the order determined by the connections between them. A pipeline must behave as if it evaluated each component each time it occurs. Unless otherwise indicated, implementations must not assume that components are functional (that is, that their outputs depend only on their explicit inputs and parameters) or side-effect free.

What does each time it occurs mean? How could it be made precise?

A pipeline may be evaluated. (2p2) [[The result of evaluating something is its value.]] (2p2) The value of a pipeline is the result of evaluating [its] components, in the order determined by the connections between them . (2p2) The connections between components determines an order. (2p2) Q: What does the order determined by the connections between [components] mean? Is it a total or partial order? (2p2) Components may produce output that depends on things other than their inputs and parameters (i.e., components need not be mathematical functions). (2p2) Components may produce side effects. (2p2) [[If X and Y produce side effects when evaluated, the order in which they are evaluated is detectable. Components may produce side effects. It follow that the order in which components are evaluated may be detectable.]] (2p2)

Steps, constructs, and subpipelines

Section 2.1 paragraph 4 reads:

Every construct contains zero or more components. [Definition: The components that occur directly inside a construct are called contained components.] [Definition: A construct which immediately contains a component is called its container.]

Each construct contains zero or more components. (2.1p4) [X can contain Y directly or indirectly.] (2.1p4) [X contains Y indirectly if there is some construct Z such that X directly contains Z and Z contains Y directly or indirectly.] (2.1p4) Containers exist. (2.2p4) Construct X is the container of component Y if and only if X directly contains component Y. (2.1p4) [Every container is a construct; every construct that contains at least one component is a container.] (2.1p4) Contained components exist. A component X is a contained component if there is some component Y which directly contains X. (2.1p4) ¿Every component in a pipeline, except possibly the pipeline itself, is a contained component? (2.1p4) Q: Do components exist outside of pipelines? Is every component that is not a pipeline a contained component? (2.1p4)

Section 2.1 paragraph 5:

[Definition: The components (and the connections between them) within a container form a subpipeline.] Each construct determines how and which, if any, of its subpiplines is evaluated.

Subpipelines exist. Q: Is a subpipeline a pipeline?

Note that section 3.1 of the spec makes clear that subpipelines are not pipelines: a pipeline must not itself be a contained component.

The set of components contained ¿directly or indirectly? by a construct, together with their connections, constitute a subpipeline. (2.1p5) [It follows that each construct contains one subpipeline.] (2.1p5) Each construct determines which of its subpipelines to evaluate. (2.1p5) [It follows that each construct may contain more than one subpipeline.] (2.1p5) When a construct contains more than one subpipeline, one or more of the subpipelines may remain unevaluated; one or more may be evaluated; (2.1p5)

Note that propositions and contradict each other. This leads to a question:

Q: can constructs contain just one subpipeline, or several? (2.1p5)

Section 2.1 paragraph 6:

Steps and constructs have “ports” into which inputs and outputs are connected. Each component has a number of input ports and a number of output ports, all with unique names. A component can have zero input ports and/or zero output ports. (All components have an implicit standard output port for reporting errors that must not be declared.)

Ports exist. (2.1p6) Names exist. (2.1p6) Every port has a name. (2.1p6) ?! All ports have unique names. (What does this mean?) (2.1p6) ¿No two names are identical. Therefore, every name is unique. Therefore, every port which has a name, has a unique name? (2.1p6) ¿No two ports in the universe have the same name? (2.1p6) ¿No two ports on the same component have the same name? (Is this what all with unique names means?) (2.1p6) ¿No two ports in the same pipeline have the same name? (2.1p6) Some ports are input ports. (2.1p6) Some ports are output ports. (2.1p6) ¿Every port is an input port or an output port? (2.1p6) ¿No port is both an input port and an output port? (2.1p6) Each component has zero or more input ports. (2.1p6) Each component has zero or more output ports. (2.1p6) Each component has a predefined output port for error reporting. (Section 3.6 of the spec supplies the name #error for this predefined port.) (2.1p6) [It follows that each component has one or more output ports.] (2.1p6)

Note that propositions and put different lower bounds on the number of outputs for a component. This leads to the question:

Q: is the implicit standard output port for reporting errors an output port as that term is intended to be used? (2.1p6)

Section 2.1 paragraph 7:

Components have any number of parameters, all with unique names. A component can have zero parameters.

Parameters exist. (2.1p7) Parameters have names. (2.1p7) ?! All parameters have unique names. (What does this mean?) (2.1p7) ¿No two names are identical. Therefore, every name is unique. Therefore, every parameter which has a name, has a unique name? (2.1p7) ¿No two parameters in the universe have the same name? (2.1p7) ¿No two parameters on the same component have the same name? (Is this what all with unique names means?) (2.1p7) ¿No two parameters in the same pipeline have the same name? (2.1p7) Every component has zero or more parameters. (2.1p7)

Inputs and outputs

Section 2.2 paragraph 2 reads:

It is a dynamic error if a non-XML resource is produced on a component output or arrives on a component input.

Errors exist. (2.2p2, cf. prop. ) Some errors are dynamic. (2.2p2) ¿An error may occur when a pipeline is evaluated? (2.2p2) A dynamic error occurs when non-XML input flows through / arrives at an input or output during pipeline evaluation. (2.2p2) Q: Who is responsible for raising errors: the component or the pipeline framework or ... ? (2.2p2)

Section 2.2 paragraph 4 reads:

Each component declares its input and output ports. [Definition: The input ports declared on a component are its declared inputs.] [Definition: The output ports declared on a component are its declared outputs.]

Each component? or each component type? Or both? 2.2p6 suggests that steps cannot declare their own ports, but must inherit their signature from their step type.

It's not clear from this whether the inputs and outputs of a component, as referred to from time to time in the spec, are only the declared inputs and outputs, or whether the terms are sometimes used both of the declared and of possibly additional undeclared data read or written.

The different is salient only if we wish to model interactions of a pipeline with the outside world beyond those identified by the declared inputs and outputs of the pipline or its components. If we do, then the following propositions may be salient:

A component (? or only a step?) may [or: must not] read XML resources other than its inputs. (2.2p4) A component (? or only a step?) may [or: must not] write XML resources other than its outputs. (2.2p4)

Section 2.2 paragraph 5 reads:

All of the declared inputs of a component must be connected to outputs in the pipeline. It is a static error if a component has an input port which is not connected. Unconnected output ports are allowed; any documents produced on those ports are simply discarded.

Every component input is connected (by a data flow) to a data source. (2.2p5) A static error occurs when [a pipeline is so constructed that] a component input is not connected by a data flow to a data source. (2.2p5) Every component output is a data source and may be connected through a data flow to a data sink. (2.2p5)

Section 2.2 paragraph 6 reads:

[Definition: The signature of a component is the set of inputs, outputs, and parameters that it is declared to accept.] Each type of step (e.g. XSLT or XInclude) has a fixed signature, declared globally or built-in, which all its instances share, whereas each instance of a construct has its own signature declared locally.

Signatures exist. (2.2p6) ¿Declarations exist? (2.2p6)

Proposition does not mean that any Alloy specification that models declarations will necessarily provide an Alloy signature for declarations; the effect of declarations may be implicit rather than explicit in the model. The proposition is here only to allow us to speak as needed about the properties of declarations without being guilty of Meinongianism and ascribing properties to things that do not exist.

[Component types exist.] (2.2p6) [Components are instances of component types.] (2.2p6)

Section 1 of the spec is a bit vague about whether the term component denotes a particular instance of a component type (e.g. one use of XInclude), or a multiply usable object which appears in an infinite number of pipelines. Here it becomes clear that XInclude, XSLT, etc., are to be taken as types of components, not as individual components.

Note, however, that the difference between components and component types is not simply that there is one of the latter, and one of the former per use of the type in any pipeline. The model as described in the spec does not require that a particular component use (or: instance) be part of only a single pipeline. Pipelines are sets of components (or more precisely, pipelines are (C,F) pairs with C a set of components). Any object can be a member of more than one set; it's rather difficult to prevent it. It's true that allowing components to appear in more than one unrelated pipeline seems likely to confuse some users, if only because there doesn't seem to be any principled way to tell when two pipelines should contain the same XInclude component, and when they should contain different ones. Some may regard this as an argument in favor of revising the design and abandoning the attempt to postulate an abstraction layer independent of and unrelated to the XML transfer syntax.

If pipelines and components are XML elements, then it follows trivially for the usual interpretation of XML element identity that every component has at most one (direct) container. If there is a separate abstraction layer, then either we must mirror that relation at the abstraction layer, although there doesn't seem to be any motivation for it, or we must explain when different XML elements in the transfer syntax correspond to the same component, and when they correspond to distinct components. Or else we must explain that the mapping from XML to components is indeterminate.

Every component has a signature. (2.2p6) Every step type has a signature. (2.2p6) ¿The signature of a component type is a triple (I,O,P) where I is a set of names of input ports. O is a set of names of output ports. P is a set of names of parameters. ? (2.2p6, cf. ) ¿The signature of a particular component (an instance of a component type) is a triple (F,T,V), where F is a set of declarations which connect named input ports to data flows [or: to other ports]. T is a set of declarations which connect named output ports to data flows [or: to other ports]. V is a set of bindings which associate values with (parameter) names. ? (2.2p6)

These two formulations ( and ) assume that when an input, output, or parameter is declared, what happens is that a name is reserved for the port or parameter. That seems a plausible assumption, but it appears to be too simple; paragraph 8 of section 2.2 (see below) indicates that each port is declared as accepting (or producing) a single document or a sequence of documents. There appears to be nothing in the spec that describes declarations in more detail at the abstract level.

The signature of a step instance is the same as the signature of its step type. (2.2p6) [It follows that for any steps S1 and S2, if S1 and S2 have the same step type, then S1 and S2 have the same signature.] (2.2p6)

Note that proposition is incompatible with proposition . Paragraph 2.2p7 seems to make clear that the problem lies with proposition .

The signature of a construct is declared locally. (2.2p6) [It follows that for any constructs C1 and C2, if C1 and C2 have the same component type (or if they do not), their signatures may be identical or different.] (2.2p6)

Section 2.2 paragraph 7 reads:

[Definition: A component matches its signature if and only if it specifies an input for each declared input and it specifies no inputs that are not declared; it specifies a parameter for each parameter that is declared to be required; and it specifies no parameters that are not declared.] In other words, every input and required parameter must be specified and only inputs, outputs, and parameters that are declared may be specified. Outputs and optional parameters do not have to be specified.

This paragraph raises a few questions. It appears incompatible with proposition (or to be more precise, if proposition is accepted, this paragraph appears to be incomprehensible). That suggests that proposition reflects a possibly serious misunderstanding of the text of the spec and will need reformulation or deletion.

A component may specify an input for a declared input [port]. (2.2p7) A component must not specify any inputs except for declared input [ports]. (2.2p7) Parameters may be required [or optional]. (2.2p7) ?! A component may specify a parameter for a declared parameter. (2.2p7) [[¿Perhaps this means: A copmonent may specify a value for / bind a value to a declared parameter.]]

Terminology problem: the phrase specifies a parameter for each parameter sounds nonsensical; although some sensible meanings for the phrase can be imagined, they make unreasonable demands on the reader's mental agility. Ideally, the term parameter would be used for one but not both of the senses used here; if that's not feasible, we should avoid using the term twice in different and complementary senses in a four-word span.

¿A component may specify an output for a declared output [port]? (2.2p7) ¿A component must not specify any outputs except for declared output [ports]? (2.2p7) A component may match a signature. A component matches a signature iff: The set of inputs specified by the component is a subset of the inputs declared in the signature, and a superset of the inputs declared in the signature as required. ¿The set of outputs specified by the component is a subset of the outputs declared in the signature? The set of parameters specified by the component is a subset of the parameters declared in the signature, and a superset of the parameters declared in the signature as required.

Note that the text explicitly entails the notion that a component not only can be (must be) connected (as specified in prop. ) but can specify an input. This appears to be either (1) a failure to enforce the boundary between the abstraction layer, in which input ports either are or are not connected to data flows, and the XML syntax layer, in which directions for how to connect things are given, or else (2) an attempt to define an abstraction layer not only for sets of connected components but also for the descriptions of connections. In the latter case, the abstraction layer will also need to include a definition of the process of constructing connections in accord with the descriptions (or not in accord with them, in the case of error conditions). Neither of these seems plausible.

Note also that the text provides two formulations linked by In other words, but the two formulations do not say the same thing. Thesecond entails propositions not included in the first, e.g. propositions and .

Section 2.2 paragraph 8 reads:

Each input and output is declared to accept or produce either a single document or a sequence of documents. It is not a static error to connect a port that is declared to produce a sequence of documents to a port that is declared to accept only a single document. It is, however, a dynamic error if the former component actually produces more than a single document at run time.

A declaration may identify an input or output port as a sequence, or as a single document. (2.2p8) Every input or output port is declared either as a sequence, or as a single document, not both. (2.2p8) [A single document is indistinguishable from a sequence of documents with length 1.] (2.2p8) [During evaluation, a port declared for a sequence of documents may produce or receive a single document.] (2.2p8) A dynamic error occurs when during evaluation of a pipeline a port declared for a single document receives a sequence of documents with length greater than one. (2.2p8) Q: Does a dynamic error occur when a port declared for a single document receives an empty sequence of documents (i.e. no document)? (2.2p8)

The information in 2.2p8 allows us to reformulate proposition more correctly:

The keywords one and sequence exist. The keywords required and optional exist. ¿The signature of a component type is a triple (I,O,P) where I is a functional mapping from names (of input ports) to the set {one, sequence}; the declared input ports are those with names in the domain of I. O is a functional mapping from names (of output ports) to the set {one, sequence}; the declared output ports are those with names in the domain of O. P is a functional mapping from names (of parameters) to the set {required, optional}; the declared parameters are those with names in the domain of P. ? (2.2p6 + 2.2p8, cf. prop. )

Section 2.2 paragraph 9 reads:

Steps may also produce error, warning, and informative messages. These messages appear on a special “error output” that is available in the catch clause of a try/catch.

Messages exist. (2.2p9, cf. prop. ) Some messages are error messages. (2.2p9) Some messages are warning messages. (2.2p9) Some messages are informative messages. (2.2p9) [Every message is an error message, a warning message, or an informative message.] (2.2p9) [No message is nore than one of: an error message, a warning message, an informative message.] (2.2p9) Messages flow through the implicit standard output port for reporting errors. (2.2p9, cf. prop. ) [It follows that messages are XML documents.] (2.2p9) [Every try/catch construct has an input port on its catch clause ¿which is connected to the implicit standard output port for reporting errors of the components (directly?) contained in the ¿try clause of the try/catch clause?] (2.2p9, cf. prop. )

Parameters

Section 2.3 paragraph 1 reads:

[Definition: A parameter is a QName/value pair.] The value of a parameter must be a string. If a document, node, or other value is given, its (XPath 1.0) string value is computed and that string is used.

Here once more, as in 2.2p7, the spec is either violating the abstraction boundary by bringing representation-level issues (the type of value denoted by an XPath expression before type coercion) into a discusison of the abstraction layer, or else postulating an abstraction layer which models not only correct pipelines but the process of mapping from the XML representation to the abstraction layer. For now, I assume that there is no need for any representation at the abstraction level of the coercion of XML documents or nodes into Unicode strings.

A parameter is a pair (Q,V), where Q is a QName and V (the value) is a Unicode string.The reader unfamiliar with XPath 1.0 may be puzzled by the formluation a Unicode string; the spec describes the value of a parameter as an (XPath 1.0) string value, and that entails that it is a Unicode string. It also entails that some Unicode characters not appear, namely those not legal in an XML document, but it seems unnecessary to model that fact or to mention it in the paraphrase of the spec. (2.3p1)

Connections

Section 2.4 paragraph 1 reads:

Components are connected together by their inputs and outputs. [Definition: Components A and B are connected if they are either directly or indirectly connected. Component A is directly connected to B if an output of A is associated with an input port of B. Component A is indirectly connected to B if there is a chain of directly connected components that allows traversal from A to B.]

Cf. paragraph 1p2 and propositions through .

Note that the relation directly-connected is not symmetric: A and B are connected if data flows from A to B, but not if data flows from B to A. And since data flow is acyclic (prop. ), the relation is not only not symmetric, but asymmetric: if A → B is in the relation, then B → A is not. The relation connected corresponds to reachability via a directed path in a directed graph, not to connectedness in the corresponding undirected graph.

A component A is connected to a component B iff A is directly connected to B or A is indirectly connected to B. (2.4.p1) Two components A and B are directly connected iff an output of A is connected to an input of B. (2.4.p1) Two components A and B are indirectly connected iff either A is directly connected to B, or A is directly connected to some third component C which is indirectly connected to B. (2.4.p1)

Since most treatments of graph theory define graph traversal in such a way that for any vertex V, V is reachable from V (by a path of length 0), the definition of indirect connection offered in 2.4p1 may be taken either as reflexive (if A and B are the same component, then there is a chain of length 0 connecting them) or as irreflexive (as in the preceding paragraph). Hence question .

Q: is A connected to A?

I take the intent to be that the relation should be irreflexive; the definition should probably be rephrased to say Two distinct components A and B are connected if and only if ... or traversal from A to B along a path with length > 0 or [Definition: Components A and B are connected if and only if they are either directly or indirectly connected. Component A is directly connected to B if and only if an output of A is associated with an input port of B and A &neq; B. Component A is indirectly connected to B if and only if a &neq; B and there is a chain of directly connected components that allows traversal from A to B.] I have also turned the conditionals into biconditionals on the theory that normally definitions should be biconditionals.

Note also that directly connected and indirectly connected are not complementary; the definition of indirectly connected makes it synonymous with connected, since if any components A and B are directly connected, then there is a chain of components (consisting of A and B in sequence) which allows traversal from A to B, and therefore A and B are indirectly connected.

Section 2.4 paragraph 2 reads:

With respect to connected components, we can speak of one component being either before or after another. [Definition: Component A is before component B if component B is a contained component of component A, either directly or indirectly, or if any output from component A is connected to any input of component B, either directly or indirectly.] [Definition: Component A is after component B if component B is the container for component A (or an ancestor of such a container) or if any output from component B is connected to any input of component A, either directly or indirectly.]

Several editorial problems here. The first sentence of the paragraph suggests that the terms before and after are relations on pairs of connected copmonents. But the terms are defined to apply also to some unconnected pairs. The first sentence could probably be deleted without tears. It's not clear why the definition of before and after speaks of inputs and outputs being connected (which is not, as far as I can tell, a defined term), instead of using the definition of connected components presented in the preceding paragraph. For purposes of this working paper I am going to assume that this is a simple editorial slip rather than a subtle trap, and that if any output from component A is connected to any input of component B, either directly or indirectly is intended to have the same meaning as if A is connected to B. I believe that it would be shorter, and come to the same thing, if the second definition were: Given two components A and B, A is after B if and only if B is before A. If the relation after is anything but the inverse of the relation before, then the text needs to highlight the difference. The term ancestor is not defined for abstract components; it appears to be an irruption from the XML level. The phrase a contained component of component A, either directly or indirectly would be unnecessary if at the appropriate point in 2.1 the spec defined containment as transitive (which would be natural). The definition of container could either be left unchanged or generalized; in the latter case those uses which need to be changed could be changed to say immediate container and the others could be simplified by deleting phrases like or an ancestor of such a container). The relevant part of paragraph 2.4p2 could then be recast as Component A is before a distinct component B if and only if either A contains B or A is connected to B. Component A is after B if and only if B is before A. Would it kill us to say that before defines a partial order? Or (equivalently) that given any two components A and B, either A is before B or B is before A or neither is before the other?

For any two components A and B, A precedes (is before) B iff either A contains B or A is connected to B. For any two components A and B, A follows (is after) B iff B precedes A. Q. Is a pipeline guaranteeed to be connected by the before relation?

Section 2.4 paragraph 3 reads:

It is a static error if a component is either before or after itself.

A static error occurs when [a pipeline is so constructed that] some component is before itself. (2.4p3)

If after is the inverse of before, then either before or after is redundant (and uses seven syllables when two will do — have you been reading the HyTime spec again? Stop that!), and makes this reader wostop to wonder whether he misread paragraph 2.4p2: If 2.4p3 says either before or after, then perhaps the two relations are not inverses after all? Better go back and reread 2.4p2 again, to make sure. Other readers, of course, may stumble if 2.4p3 just says before. In choosing the formulation, I lean toward serving the readers to take the definitions seriously. Also toward two syllables not seven.

Language constructs

Paragraph 2 of section 3 of the spec reads:

Every component in a pipeline has five parts: a set of inputs, a set of outputs, a set of parameters, a set of contained components, and a context inherited from its container. [Definition: The context is the the set of input ports, output ports, and parameter names that are visible.]

Every component has [at least?] five properties: a set of inputs, a set of outputs, a set of parameters, a set of contained components, and a context.

This can be captured conveniently in an Alloy signature, representing the properties of components as fields of the signature. If we model inputs and outputs as objects of signature Port (left undescribed for now), we can write: abstract sig Component { ins : set Port, outs : set Port, parms : set Parameter, components : set Component, context : Context } abstract sig Step {}{ components = none } abstract sig Construct {}

Contexts exist. Every context has three properties: a set of input ports, a set of ouput ports, and a set of parameter names.

Here, too, the Alloy expression for the structure of contexts is simple. In order to reduce confusion, I have named the first two fields of the context sources (ports which provide data to be consumed) and sinks (ports which consume data); the latter of these corresponds to the outputs of the spec, while the former might correspond to the inputs of the spec: sig Context { sources : set Port, sinks : set Port, cparms : set Parameter }

These declarations may become more complex, depending on how ports and parameters are modeled.

The subsections of part 3 of the spec describe the properties of each construct. Several generalizations are possible:

First of all, the inputs of each type of construct are as declared:

For every construct, the set of inputs is the set of declared inputs.

The spec does not describe how inputs are declared, and it is not clear to this reader whether declarations are objects which occur in the abstraction layer, or only in the XML layer.

The outputs of each type of construct are as declared:

For every construct, the set of outputs is the set of declared outputs.

For some construct types, the cardinality of inputs or outputs is fixed at one.

The parameters of each type of construct are as declared:

For every construct, the set of parameters is the set of declared parameters.

The spec does not describe parameter declarations in detail, and once more it's unclear whether they are visible in the abstraction layer or not.

The set of components directly contained by any constract are as declared.

For every construct, the set of contained components is the set of declared contained components.

In some cases (choose, try/catch), the number of subpipelines is fixed or otherwise described specially for the individual construct type. The spec does not describe how contained components are declared.

The context of each construct is given by a set of descriptions relating it to the context of its container. There are four algorithms, which I will call inherited_context (for the normal, unmarked case), choose_context (for use in Choose constructs), try_context (for use in Try clauses), and Catch_context (for use in Catch clauses). They are described below.

The normal case of inherited context is to add the declared inputs, declared outputs, and parameters of the construct to the sets inherited from the container. In the words of section 3.1 (substituting [component] for pipeline):

The context of a [component] is its inherited context modified as follows: All of the declared inputs of the [component] are added to the outputs in the context. The union of all the declared outputs of the contained components are added to the outputs in the context. All of the declared parameters of the [component] are added to the parameters in the context. This is the context used by the pipeline and inher[i]ted by its contained components.

fun inherited_context(c0: Context, i: set Port, o: set Port, p: set Parameter) : Context { one c : Context | c.sinks = c0.sinks and c.sources = c0.sources + i + o and c.cparms = c0.cparms + p }

Note that the definition just given assumes that ports from distinct components are distinct, so that when we add the input ports of this construct, they are all distinct from any other ports already in the context. Other parts of the model are simpler if we model ports just as names, in which case this definition needs to become more complex: what need to be recorded in the context are (component, port-name) pairs, not just port names.

This algorithm is used by the Pipeline, For-each, Viewport, and Group constructs. A construct that uses this algorithm for determining the context to be inherited by its containees will have a signature fact reading something like context = inherited_context(container.@context, ins, components.@outs, parms),The @-signs are required to prevent Alloy from interpreting the identifiers context and outs as short-hand for this.context and this.outs, i.e. as references to the context and outs fields of the particular atom whose conformance to the signature facts is being considered. where container denotes the construct which directly contains the construct whose context is being described. Without an inherited context (i.e. without a container), the algorithm does not determine a context.

Note that both the construct's declared inputs and the outputs of its containees are added to the outputs of the context. The outputs of the context do not appear here; they will have been among the containee outputs added to the context by the container.

The choose_context algorithm is used only by the Choose construct, but will be described here for easier comparison with the other algorithms. Section 3.4 of reads in part:

The context of a choose is its inherited context modified as follows: All of the declared inputs of the choose are added to the outputs in the context. The declared outputs of (any one of) the subpipelines are added to the outputs in the context. All of the declared parameters of the choose are added to the parameters in the context. This is the context used by the choose and inher[i]ted by its subpipelines.

The second bullet item here presumably does not mean that a pipeline processor chooses a subpipeline at random, calculates a context for it, and assigns that context to the components in each of its subpipelines as their inherited context. But that is what it says.

Here I will take the second bullet as requiring that each distinct subpipeline within the choose have its own inherited context, which adds the parameters and inputs of the choose and the outputs of the containees in that subpipeline to the context.

It's not quite clear how to reconcile this with the other things said in the specification. One way for this definition to be effective would be for a subpipeline construct to be regarded as the container for each subpipeline within the choose. The problem with this is that the spec does not define a subpipeline construct to attach the different contexts to. Another way would be to assume that the inherited context for the components within the different subpipelines is not simply the context field of the container. The major problem with this is that the spec doesn't say what the inherited context is, in that case; a secondary problem is that in that case, the context property of a Choose object does not seem to have much purpose or point.

In the absence of further information (either a clarification of the spec, or speculation on the part of the model builder), it's not clear how to model this algorithm as an Alloy function.

The try_context and catch_context algorithms similarly require clarification of the subpipeline concept before they can be defined in Alloy.

A few difficulties with these definitions of the context may be worth noting: The inputs of the context never change, because nothing in any of the function definitions changes them. Each definition requires having an inherited context to start from, but the spec does not say how the outermost context is initialized. As noted, the treatment of inheritance in Choose and Try/Catch is unclear; it seems to rely both on having an explicit subpipeline construct in the abstraction layer and on not having such a concept.

Pipeline

As described in section 3.1 of , pipelines have inputs, outputs, parameters, and containees as declared. They use the inherited_context algorithm for calculating their context. In Alloy terms: sig Pipeline { }{ context = inherited_context(container.@context, ins, components.@outs, parms) }

As noted above, the inherited_context algorithm requires that the pipeline have a container. This will not be true for the outermost pipeline, but it will be true for pipelines contained in other constructs, if section 3.1 of the spec is intended to describe nested subpipelines. The spec seems to contradict itself on this point. The first paragraph of 3.1 reads A pipeline encapsulates the behavior of a subpipeline. while the final paragraph reads There is one additional constraint imposed on pipelines: a pipeline must not itself be a contained component. It may be possible to reconcile these two statements in some way, but I cannot see a way to do so.

The consequence is that any formal model of constructs will require that we go somewhat beyond the design as expounded in the spec. If we assume that the Pipeline abstraction is indeed to model both the outermost pipeline and also subpipelines within other constructs, then we might represent it in Alloy with definitions something like the following.

First, we need to revise the definition in scrap , to provide a convenient way to refer to a component's container; we declare the field container : lone Component to indicate that some components (outermost pipelines) lack containers, while others (everything else) have exactly one. abstract sig Component { ins : set Port, outs : set Port, parms : set Parameter, container : lone Component, components : set Component, context : Context }

We can now define Pipeline so as to describe both outermost pipelines and subpipelines; for convenience, we define them as two distinct subtypes of Construct. For outermost pipelines, I conjecture a context whose inputs (data sinks) is the empty setNote that this conjecture has the consequence that for any context within the pipeline, the inputs are the empty set., whose outputs (data sources) are just the inputs (not the outputs) of the pipeline, and whose parameters are those of the pipeline: sig OuterPipeline extends Pipeline { }{ context.sources = ins, context.sinks = outs, context.cparms = parms, container = none }

Subpipelines have the same structure but different signature facts: sig Subpipeline extends Pipeline { }{ context = inherited_context(container.@context, ins, components.@outs, parms) #container = 1 // or: some container }

Whether outermost or nested, all pipelines share one invariant, captured in their common superclass: the pipeline serves as the container of all the contained components. abstract sig Pipeline extends Construct { }{ all c : components | c.@container = this }

For-Each

A For-each construct processes a series of documents, applying its subpipeline to each in turn. It is required to have exactly one input port, and contains exactly one sub-pipeline.

In Alloy terms, we can model this either with an explicit subpipeline construct, or without. First, with an explicit subpipeline: sig ForEach extends Construct { }{ context = inherited_context(container.@context, ins, components.@outs, parms) #container = 1 // or: some container } Modeled, in this way, a For-each must contain exactly one other component, namely a subpipeline: #components = 1 components in Subpipeline At the XML level, there is no nested pipeline element, and so the subpipeline cannot have any declared ports or parameters, in any pipeline created from a valid pipeline document. This constraint has no particular motivation at the abstract level, but if we wish to match the XML constraint, we can: components.ins = none components.outs = none components.parms = none

Finally, For-each constructs are required to have exactly one input port: #ins = 1 // or: one ins

It is perhaps more convenient to do without the nested subpipeline construct; it seems to serve little purpose. We can model For-each without an explicit subpipeline component, by defining the signature as an extension of Subpipeline. The declaration is then slightly more concise: sig ForEach extends Subpipeline { }{ }

Viewport

A Viewport resembles a For-each, in that it must have one input port, contains one subpipeline, and typically runs several times. but instead of running once for each document in an input sequence, a Viewport runs once on each set of nodes in the (or: in each) input document which match a selection expression. The Alloy descriptions of Viewport are virtually identical to those for For-each; the differences are (1) the selection, which is modeled here without any internal structure as an object of type XPathExpression, and (2) a constraint requiring the set of output ports to be a singleton.

We can model Viewport with an explicit contained subpipeline: sig XPathExpression{} sig Viewport extends Construct { selection : XPathExpression }{ context = inherited_context(container.@context, ins, components.@outs, parms) #container = 1 // or: some container #components = 1 components in Subpipeline components.ins = none components.outs = none components.parms = none #ins = 1 // or: one ins #outs = 1 // or: one outs }

Or we can model Viewport without any explicit contained subpipeline: sig Viewport extends Subpipeline { selection : XPathExpression }{ #ins = 1 #outs = 1 }

The select attribute on an input to a p:viewport seems to have a rather different meaning from the same attribute on an input to a different construct, e.g. p:pipeline. The description of the input and output of viewpors seems to contradict flatly the description of the semantics of the select attribute in section 4.2.2 of the XProc spec. Section 4.2.2 suggests that nodes not matching the select are discarded; the description of Viewport says they are not. Is it wise to represent such different semantics with the same syntax?

Choose

Structure and input of the Choose

The section of the XProc specification devoted to the abstration layer is not very informative about Choose. If, however, we assume that the abstraction should have the same properties as are described for the XML transfer syntax, we can describe it thus: a Choose contains a set

The spec actually requires that the subpipelines be ordered, but it will be more convenient to model them as a set. The sequence of subpipelines P1, P2, P3 with tests T1, T2, T3 can be transformed, in the translation from XML to the component layer, into a set of subpipelines {P1, P2, P3} with tests P1, P2 and not P1, P3 and not P1 and not P2.

The appeal to a sequence over the subpipelines (The choose considers each subpipeline in turn ...) contradicts the statement in section 3.1 of the spec that each construct has a set, not a list, of subcomponents.

and optionally an Otherwise subpipeline. Each When subpipeline has a test predicate containing a guard expression. A When subpipeline is evaluated only if its guard is true; the Otherwise subpipeline is evaluated if and only if none of the guards of the When constructs are true.

In addition, a Choose may specify a context for the evaluation of the guard expressions; this has nothing to do with the context described in section 3.1 of the spec. It appears, actually, to be a specialized input port (and is so modeled here). A Choose, When, or Otherwise construct may declare a context, and if the Choose does not, then the When and Otherwise constructs contained by it must do so.

In Alloy terms: sig Choose extends Construct { XPath_context : lone Port }{ }

The so-called context input is simply an input port; a Choose must have at most one. In the following Alloy transcription, it's named XPath_context to avoid a name collision with the context inherited from Component. XPath_context = ins

Alternatively, one could suppress the declaration of the field XPath_context and simply impose the constraint lone ins. I've included the declaration, for now, just to have something in the Alloy model that connects with the term context in the spec.

Using the term context both for the set of visible ports and parameters and also for the single input for the subpipelines of the Choose is an editorial oversight that should probably be fixed, because it's certain to lead to serious confusion.

The components contained by a Choose must be zero or more When subpipelines and optionally an Otherwise: components in When + Otherwise #(Otherwise & components) = 1

Outputs of Choose and its subpipelines

Every subpipeline contained by a Choose must declare the same outputs. Or, more precisely, as section 4.2.9 of the spec says: All of the p:when branches and the p:otherwise must declare the same number of output ports with the same names. It is a static error if they do not. If we model Ports just as sets of names, so that different subpipelines can declare the same ports, this constraint can be modeled simply: all c1, c2 : Component | if c1 + c2 in components then c1.@outs = c2.@outs Most port references will then need to be references to a particular port name on a particular components. If, on the other hand, ports are modeled as having, but not being, names, and if the ports of any two distinct components are necessarily distinct, then a more complex statement of the constraint will be necessary.

The details of the constraint will depend on the details of how ports are modeled, which will involve some experimentation I have not yet done. If we assume that each Port has a Name, as in: sig Name {} sig Port { name : Name } then the constraint might be expressed as: all c1, c2 : Component | if c1 + c2 in components then c1.@outs.name = c2.@outs.name A constraint will then be needed on Component, to ensure that the names of the ports on any given component are unique. But as noted elsewhere (see proposition and following), the precise constraint is not clear to me from the spec.

Logically speaking, the outputs of the Choose construct ought to be those of its containees, but the spec appears not to say this anywhere. But I'll model it anyway. If ports are just names, then: outs = components.@outs Since every subpipeline declares the same set of names, the set components.@outs is the same set as the outs field of any subcomponent, and the constraint outs = components.@outs requires that the Choose itself have exactly the same set of outputs.

If ports have identity beyond that of their name, then the Alloy constraint will have to require that the set of names on the output ports of the Choose and that on the output ports of each subpipeline are the same: outs.name = components.@outs.name

Parameters of Choose

The XML element p:choose is not allowed to contain p:parameter elements; if we wish to carry this constraint over to the abstraction layer, we can express the constraint in Alloy thus: parms = none

Calculating the context for a Choose

The calculation of the context for a Choose is described thus in the spec: The context of a choose is its inherited context modified as follows: All of the declared inputs of the choose are added to the outputs in the context. The declared outputs of (any one of) the subpipelines are added to the outputs in the context. All of the declared parameters of the choose are added to the parameters in the context. I have assumed above that the set of declared inputs of a Choose is the singleton set containing its declared context, or else the empty set, but the spec does not say this explicitly, as far as I can tell. And the set of declared parameters is guaranteed to be the empty set, if we adopt the invariant in code fragment . If we model ports as names, then the context rule for Choose is: context = inherited_context(container.@context, ins, components.@outs, parms) or equivalently: context = inherited_context(container.@context, ins, outs, none)

If ports are not just names, this will need to be more complex. The spec says that The declared outputs of (any one of) the subpipelines are added to the outputs in the context. But if a port is unique to a particular component, then this could lead to unacceptable results: if the subpipeline whose outputs are chosen is not the one evaluated, then the data flow will not be arranged correctly. The static pipeline checker appears to need an oracle of some kind to predict accurately which subpipeline will run.

It would seem simpler (at this point I am leaving the text of the spec behind and just thinking about the problem in general) to say that the context inherited by the subpipelines is just the same as the context inherited by the Choose, with the addition of the optional declared input to the set of data sources: context = inherited_context(container.@context, ins, none, none) And the context transmitted from the subpipelines down to their containees should contain the outputs declared for each subpipeline. (This description will work whether ports are names or non-name objects which have names.)

The wording in the spec seems to suggest that the design currently suffers from contradictory impulses: some things seem to imply that at the abstract level there are When and Otherwise subpipelines (and nothing else) directly contained by the Choose; other things (like the description of context for the Choose) seem to imply that no subpipelines are visible within the Choose, and that it's important, as a consequence, that the Choose (rather than the subpipeline) ensure that the output ports of the chosen subpipeline are visible among the data sources of that subpipeline.

When and Otherwise

The When subpipelines are normal subpipelines, except for having a guard expression and an XPath_context to evaluate it in (and no other inputs). Section 4.2.9 reads in part: If no context is specified on the p:when, the context specified on the p:choose is used. It is a static error if no context is specified in either place. It's simplest for now to model the static error by simply making it impossible. We constrain the field XPath_context to have a single Port as its value; if there no input, the XPath_context is that of the container. sig When extends Subpipeline { XPath_context : one Port test : XpathExpression }{ lone ins if #ins = 0 then XPath_context = container.XPath_context else XPath_context = ins }

The Otherwise subpipeline is simpler. Section 4.2.9 says that its XML representation must declare no inputs; we reproduce that constraint here. sig Otherwise extends Subpipeline { }{ ins = none }

Guard expressions

For the moment, I'll leave the Boolean evaluation of the guard expressions out of the model. The model thus has no representation of the constraint that:

In a Choose, at most one guard expression among the contained When subpipelines may evaluate to true. In a Choose, if no guard expression on any contained When subpipeline evaluates to true, and no Otherwise subpipeline is present, then it is a dynamic error.

Group

The 'group' construct encapsulates the behavior of its subpipeline, according to section 3.5 of the XProc specification. It's not clear that there is any need for a separate Group construct at the abstraction level, distinct from the Subpipeline abstraction.

Er, why is the group construct visible in the abstraction layer?

The inputs, outputs, parameters, and containees are all as declared, and the context is determined using the inherited_context algorithm.

In Alloy terms:

Try/Catch

...

In Alloy terms:

Other steps

...

In Alloy terms:

$Id: models.xml,v 1.16 2007/01/23 20:03:02 cmsmcq Exp $

Sweb Notation

Section on Sweb notation needed here.

Or better yet, external reference to a simple document on how to read Sweb notation.

For now, the reader puzzled by the presentation of the Alloy fragments in this document is probably best off reading any of the various introductions to literate programming easily found on the network; the system used here is SWeb, described in .

References Jackson, Daniel. Software abstractions: Logic, language, and analysis. Cambridge: MIT Press, 2006. McMorran, Mike, and Steve Powell. Z guide for beginners. Oxford: Blackwell Scientific, 1993. Potter, Ben, Jane Sinclair, and David Till. An introduction to formal specification and Z. New York: Prentice Hall, 1991. Sperberg-McQueen, C. M. SWEB: an SGML Tag Set for Literate Programming. Unpublished paper. Available on the Web at http://www.w3.org/People/cmsmcq/1993/sweb.html Walsh, Norman, and Alex Milowski, ed. XProc: an XML Pipeline Language. W3C Working Draft 17 November 2006. [Cambridge, Sophia-Antipolis, Tokyo]: World Wide Web Consortium, 2006. On the Web at: http://www.w3.org/TR/2006/WD-xproc-20061117/. Latest version: http://www.w3.org/TR/xproc/. Wordsworth, J. B. Software development with Z: A practical approach to formal method in software engineering.. Wokingham: Addison-Wesley, 1992.

To do

N.B. This list needs to be cleaned up; some items appear to be ghosts which have wandered in from another document.

Formulate propositions from the rest of section 2, and from section 3 (trying, though, to minimize repetition). Rework exposition of xproc01 and xproc02 to note which propositions it does and doesn't model. Formulate more Alloy models and generate more sample instances. Illustrate variant ways of solving design problems (like component connections: restrict to siblings? or allow arbitrary linkages?) Model data flows in various ways: Component -> Component ins : Name -> XMLDoc ins: set Name, ... sig Flow { from: Source, to: Sink } Model context construction and inheritance. Model correct construction and behavior. Model signature checking. Model error handling: static pipeline errors, dynamic pipeline errors, component failures. Model component identity several ways: Type/instance story for components, with unconstrained instances (as specified in current draft) 1:1 with XML of the pipeline document (no component ever is in more than one pipeline) No component is ever in more than one pipeline, that is, unless its XML element is in more than one XML document. That's a common view, but elements in shared external entities and material included via XInclude might be understood as an exception. No type/instance distinction, only a component/use (or component / occurrence) distinction. No component/use (or component / occurrence) distinction, either: put everything that has to be distinct into the pipeline, not into an occurrence. Can we assert that the flow graph of a legal pipeline is always connected? Or is it legal to have a component which has no input and produces no output that is used? If such a pipeline is legal, how is the ordering of the steps done? Model step uniqueness all three ways: step as token (how to formulate this?), step as type (only one XSLT), step as indeterminate (may or may not be distinct). Model composite structures (requires making a subpipeline construct, to make sense of the spec) Test the definition of ordering; I think it's correct but a naive translation rules out useful pipelines (see etudes/xproc04 comments) What ways are there to outlaw 'senseless' data flows in the model? E.g. step output = pipeline input, or step input = pipeline output (and nothing else). Model the input and output ports of steps as names, flows as mapping from name to name. Model ports as objects, and their names as a property on each step, mapping from ports to names. reduce number of uses of 'formal'. some comments on grapheme, type, token, ... Introduce t/t more formally. 'abstract object' is a problem. If abstract, then perhaps no object? article on type/token distinction go easier on Tanselle and Vander Meulen systematic ambiguity ... ?