XProc: An XML Pipeline Language

W3C Working Draft 17 November 2006

This Version:: http://www.w3.org/TR/2006/WD-xproc-20061117/
Latest Version:: http://www.w3.org/TR/xproc/
Previous version:: http://www.w3.org/TR/2006/WD-xproc-20060928/
Editors:: Norman Walsh, Sun Microsystems, Inc. <Norman.Walsh@Sun.COM>; Alex Milowski, Invited expert <alex@milowski.org>

This document is also available in these non-normative formats: XML, Revision markup

Abstract

This specification describes the syntax and semantics of XProc: An XML Pipeline Language, a language for describing operations to be performed on XML documents.

An XML Pipeline specifies a sequence of operations to be performed on one or more XML documents, producing one or more XML documents as output. Steps in the pipeline may read or write non-XML resources as well.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document was produced by the XML Processing Model Working Group which is part of the XML Activity. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This is a public Working Draft. While some details of the design remain incomplete, the Working Group has chosen to publish a new draft in order to show the direction we are heading and to encourage feedback from potential users.

Please send comments about this document to public-xml-processing-model-comments@w3.org (public archives are available).

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1 Introduction

2 Pipeline Concepts

2.1 Steps, Constructs, and Subpipelines
2.2 Inputs and Outputs
2.3 Parameters
2.4 Connections

3 Language Constructs

3.1 Pipeline
3.2 For-Each
3.3 Viewport
3.4 Choose
3.5 Group
3.6 Try/Catch
3.7 Other Steps

4 Syntax

4.1 Overview

4.1.1 Scoping of Names
4.1.2 Associating Documents with Ports
4.1.3 Extension attributes
4.1.4 Extension elements

4.2 Detailed Vocabulary

4.2.1 p:pipeline Element
4.2.2 p:input Element
4.2.3 p:output Element
4.2.4 p:parameter Element
4.2.5 p:step Element
4.2.6 p:import-parameter Element
4.2.7 p:for-each Element
4.2.8 p:viewport Element
4.2.9 p:choose/p:when/p:otherwise Elements
4.2.10 p:group Element
4.2.11 p:try/p:catch Elements
4.2.12 p:declare-step-type Element
4.2.13 p:pipeline-library Element
4.2.14 p:import Element

5 Errors

5.1 Static Errors
5.2 Dynamic Errors

Appendices

A References

B Glossary

C The Error Vocabulary

D Standard Component Library

1 Required Components

1.1 Identity
1.2 XSLT
1.3 XInclude
1.4 Serialize
1.5 Parse
1.6 Load
1.7 Store

2 Optional Components

3 Micro-Operations Components

3.1 Rename
3.2 Wrap
3.3 Insert
3.4 Set-attributes

4 Component Declarations

E Examples

1 Introduction

An XML Pipeline specifies a sequence of operations to be performed on a collection of input documents. Pipelines take zero or more XML documents as their input and produce zero or more XML documents as their output. Steps in the pipeline may read or write non-XML resources as well.

A pipeline consists of components. Like pipelines, components take zero or more XML documents as their input and produce zero or more XML documents as their output. The inputs to a component come from the web, from the pipeline document, from the inputs to the pipeline itself, or from the outputs of other components in the pipeline. The outputs from a component are consumed by other components, are outputs of the pipeline as a whole, or are discarded.

There are two kinds of components: steps and (language) constructs. Steps carry out single operations and have no substructure as far as the pipeline is concerned, whereas constructs can include components within themselves.

This specification defines a standard library, Appendix D, Standard Component Library, of steps. Pipeline implementations may support additional steps as well.

Figure 1, “A simple, linear XInclude/Validate pipeline” is a graphical representation of a simple pipeline that performs XInclude processing and validation on a document.

Figure 1. A simple, linear XInclude/Validate pipeline

This is a pipeline that consists of two steps, XInclude and Validate. The pipeline itself has two inputs, “Document” and “Schema”. How these inputs are connected to XML documents outside the pipeline is implementation-defined. The XInclude step reads the pipeline input “Document” and produces a result document. The Validate step reads the pipeline input “Schema” and the output from the XInclude step and produces a result document. The result of the validation, “Result Document”, is the result of the pipeline. How pipeline outputs are connected to XML documents outside the pipeline is implementation-defined.

Figure 2, “A validate and transform pipeline” is a more complex example: it performs schema validation with an appropriate schema and then styles the validated document.

Figure 2. A validate and transform pipeline

The heart of this example is the conditional. The “choose” construct evaluates an XPath expression over a test document. Based on the result of that expression, one or another branch is evaluated. In this example, each branch consists of a single validate component.

Pipelines for these two examples can be found in Appendix E, Examples.

2 Pipeline Concepts

[Definition: A pipeline is a set of components connected together, with outputs flowing into inputs, without any loops (no component can read its own output, directly or indirectly).] A pipeline is itself a construct and must satisfy the constraints on constructs.

The result of evaluating a pipeline is the result of evaluating the components that it contains, in the order determined by the connections between them. A pipeline must behave as if it evaluated each component each time it occurs. Unless otherwise indicated, implementations must not assume that components are functional (that is, that their outputs depend only on their explicit inputs and parameters) or side-effect free.

2.1 Steps, Constructs, and Subpipelines

Steps are the basic computational units of a pipeline. [Definition: A step is an atomic component that performs a unit of XML processing, such as XInclude or transformation.] Steps can perform arbitrary amounts of computation but they are indivisible from the point of view of the construct that contains them. Steps carry out fundamental XML operations. An XSLT step, for example, performs XSLT processing; a validation step validates one input with respect to some schema, etc.

Language constructs, on the other hand, control and organize the flow of documents through a pipeline, reconstructing familiar programming language functionality such as conditionals, iterators and exception handling. As such, they typically contain components, whose evaluation they control.

[Definition: A construct is a component that contains additional components. That is, a construct differs from a step in that its semantics are at least partially determined by the components that it contains.]

Every construct contains zero or more components. [Definition: The components that occur directly inside a construct are called contained components.] [Definition: A construct which immediately contains a component is called its container.]

[Definition: The components (and the connections between them) within a container form a subpipeline.] Each construct determines how and which, if any, of its subpiplines is evaluated.

Steps and constructs have “ports” into which inputs and outputs are connected. Each component has a number of input ports and a number of output ports, all with unique names. A component can have zero input ports and/or zero output ports. (All components have an implicit standard output port for reporting errors that must not be declared.)

Components have any number of parameters, all with unique names. A component can have zero parameters.

2.2 Inputs and Outputs

Although some kinds of components can read and write non-XML resources, what flows between components as inputs and outputs are exclusively XML documents or sequences of XML documents. Each XML document (or document in a sequence) must be an [Infoset] with a Document Information Item at its root. The inputs and outputs can be implemented as sequences of characters, events, or object models, or any other representation the implementation chooses.

It is a dynamic error if a non-XML resource is produced on a component output or arrives on a component input.

Editorial Note

What about the cases where it's impractical to test for this error?

An implementation may make it possible for a component to produce non-XML output—for example, writing a PDF document—but that output cannot flow through the pipeline. Similarly, one can imagine a component that takes no pipeline inputs, reads a non-XML file from a URI, and produces an XML output. But the non-XML file cannot be an input to a component or pipeline.

Each component declares its input and output ports. [Definition: The input ports declared on a component are its declared inputs.] [Definition: The output ports declared on a component are its declared outputs.]

All of the declared inputs of a component must be connected to outputs in the pipeline. It is a static error if a component has an input port which is not connected. Unconnected output ports are allowed; any documents produced on those ports are simply discarded.

[Definition: The signature of a component is the set of inputs, outputs, and parameters that it is declared to accept.] Each type of step (e.g. XSLT or XInclude) has a fixed signature, declared globally or built-in, which all its instances share, whereas each instance of a construct has its own signature declared locally.

[Definition: A component matches its signature if and only if it specifies an input for each declared input and it specifies no inputs that are not declared; it specifies a parameter for each parameter that is declared to be required; and it specifies no parameters that are not declared.] In other words, every input and required parameter must be specified and only inputs, outputs, and parameters that are declared may be specified. Outputs and optional parameters do not have to be specified.

Each input and output is declared to accept or produce either a single document or a sequence of documents. It is not a static error to connect a port that is declared to produce a sequence of documents to a port that is declared to accept only a single document. It is, however, a dynamic error if the former component actually produces more than a single document at run time.

Steps may also produce error, warning, and informative messages. These messages appear on a special “error output” that is available in the catch clause of a try/catch.

2.3 Parameters

[Definition: A parameter is a QName/value pair.] The value of a parameter must be a string. If a document, node, or other value is given, its (XPath 1.0) string value is computed and that string is used.

2.4 Connections

Components are connected together by their inputs and outputs. [Definition: Components A and B are connected if they are either directly or indirectly connected. Component A is directly connected to B if an output of A is associated with an input port of B. Component A is indirectly connected to B if there is a chain of directly connected components that allows traversal from A to B.]

With respect to connected components, we can speak of one component being either before or after another. [Definition: Component A is before component B if component B is a contained component of component A, either directly or indirectly, or if any output from component A is connected to any input of component B, either directly or indirectly.] [Definition: Component A is after component B if component B is the container for component A (or an ancestor of such a container) or if any output from component B is connected to any input of component A, either directly or indirectly.]

It is a static error if a component is either before or after itself.

3 Language Constructs

This section describes the core language constructs of XProc.

Every component in a pipeline has five parts: a set of inputs, a set of outputs, a set of parameters, a set of contained components, and a context inherited from its container. [Definition: The context is the the set of input ports, output ports, and parameter names that are visible.]

3.1 Pipeline

A pipeline encapsulates the behavior of a subpipeline.

Inputs: As declared.
Outputs: As declared.
Parameters: As declared.
Contained components: As declared.

The context of a pipeline is its inherited context modified as follows:

All of the declared inputs of the pipeline are added to the outputs in the context.
The union of all the declared outputs of the contained components are added to the outputs in the context.
All of the declared parameters of the pipeline are added to the parameters in the context.

This is the context used by the pipeline and inherted by its contained components.

Viewed from the outside, a pipeline is a black box which performs some calculation on its inputs and produces its outputs. From the pipeline author's perspective, the computation performed by the pipeline is described in terms of contained components which read the pipeline's inputs and produce the pipeline's outputs.

For example, a pipeline might accept a document and a stylesheet as input; perform XInclude, validation, and transformation; and produce a sequence of formatted documents as its output.

There is one additional constraint imposed on pipelines: a pipeline must not itself be a contained component.

3.2 For-Each

A for-each construct processes a sequence of documents, applying its subpipeline to each document in turn.

Inputs: Exactly one, as declared.
Outputs: As declared.
Parameters: As declared.
Contained components: As declared.

The context of a for-each is its inherited context modified as follows:

All of the declared inputs of the for-each are added to the outputs in the context.
The union of all the declared outputs of the contained components are added to the outputs in the context.
All of the declared parameters of the for-each are added to the parameters in the context.

This is the context used by the for-each and inherted by its contained components.

The for-each construct can be used in cases where a component requires a single document input but a pipeline needs to process a sequence of documents with that component.

The result of the for-each is a sequence of documents produced by processing each individual document in the input sequence. If the subpipeline is connected to one or more output ports on the for-each, what appears on each of those ports is the sequence of documents produced by each iteration of the loop.

For example, a for-each might accept a sequence of DocBook chapters as its input, process each chapter in turn with XSLT, and produce a sequence of formatted chapters as its output.

3.3 Viewport

A viewport construct processes a single document, applying its subpipeline to one or more subsections of the document.

Inputs: Exactly one, as declared.
Outputs: Exactly one, as declared.
Parameters: As declared.
Contained components: As declared.

The context of a viewport is its inherited context modified as follows:

All of the declared inputs of the viewport are added to the outputs in the context.
The union of all the declared outputs of the contained components are added to the outputs in the context.
All of the declared parameters of the viewport are added to the parameters in the context.

This is the context used by the viewport and inherted by its contained components.

The result of the viewport is a copy of the original document with the selected subsections replaced by the results of applying the subpipeline to them.

For example, a viewport might accept an XHTML document as its input, apply encryption to selected div elements within that document, and return an XHTML document that is the same as the original except that each selected div has been replaced by its encrypted result.

3.4 Choose

A choose construct selects exactly one of a list of alternative subpipelines based on the evaluation of XPath expressions.

Inputs: As declared.
Outputs: As declared.
Parameters: As declared.
Contained components: As declared. A choose construct contains several alternate subpipelines, exactly one of which will be evaluated.

The context of a choose is its inherited context modified as follows:

All of the declared inputs of the choose are added to the outputs in the context.
The declared outputs of (any one of) the subpipelines are added to the outputs in the context.
All of the declared parameters of the choose are added to the parameters in the context.

This is the context used by the choose and inherted by its subpipelines.

The list of alternative subpipelines consists of zero or more subpipelines, each guarded by an XPath expression (with an associated context document), followed optionally by a single default subpipeline.

The choose considers each subpipeline in turn and selects the first (and only the first) subpipeline for which the guard expression evaluates to true in the context of its context document. If there are no subpipelines for which the expression evaluates to true, the default subpipeline, if it was specified, is selected.

After a subpipeline is selected, it is evaluated as if only it had been present.

The context of the contained components in the selected subpipeline is the context of the choose with the union of all the declared outputs of the contained components of the selected subpipeline added to the outputs in the context.

The result of the choose is the result of the selected subpipeline.

For example, a choose might test a schema and apply XML Schema validation to an input document if the schema is an XML Schema document, apply RELAX NG validation if the schema is a RELAX NG grammar, or perform no validation otherwise.

In order to ensure that the result of the choose is consistent irrespective of the subpipeline chosen, each subpipeline must declare the same number of outputs with the same names. It is a static error if two subpipeline in a choose declare different outputs.

It is a dynamic error if no subpipeline is selected by the choose and no default is provided.

3.5 Group

A group construct encapsulates the behavior of its subpipeline.

Inputs: As declared.
Outputs: As declared.
Parameters: As declared.
Contained components: As declared.

The context of a group is its inherited context modified as follows:

All of the declared inputs of the group are added to the outputs in the context.
The union of all the declared outputs of the contained components are added to the outputs in the context.
All of the declared parameters of the group are added to the parameters in the context.

This is the context used by the group and inherted by its contained components.

A group is a convenience wrapper for a collection of components and can be used to perform parameter renaming to aid in reuse of sets of components.

3.6 Try/Catch

A try construct isolates a subpipeline, preventing any errors that arise within it from being exposed to the rest of the pipeline.

Inputs: As declared.
Outputs: As declared.
Parameters: As declared.
Contained components: As declared. A try contains two subpipelines, one or both of which will be evaulated.

A try construct contains two subpipelines: an initial subpipeline and a recovery (or “catch”) subpipeline.

The initial context of a try is its inherited context modified as follows:

All of the declared inputs of the try are added to the outputs in the context.
The union of all the declared outputs of the contained components in the initial pipeline are added to the outputs in the context.
All of the declared parameters of the try are added to the parameters in the context.

This is the context used by the try and inherted by the contained components in its initial pipeline.

Editorial Note

In the context of try/catch, “errors” refers to component failure which is not the same as a static or dynamic error in the pipeline itself. (Though perhaps it will be possible to recover from some dynamic errors.) The notion of component failure as a distinct class of error needs to be described.

The try construct evaluates the initial subpipeline and, if no errors occur, the results of that pipeline are the results of the construct. However, if any errors occur, it abandons the first subpipeline, discarding any output that it might have generated, and evaluates the recovery subpipeline.

In this case, it must recompute the context inherited by the contained components in the recovery pipeline. The context of the recovery subpipeline is the inherited context of the try modified as follows:

All of the declared inputs of the try are added to the outputs in the context.
The union of all the declared outputs of the contained components in the recovery pipeline are added to the outputs in the context.
All of the declared parameters of the try are added to the parameters in the context.

This is the context inherted by the contained components in the recovery pipeline.

The results of the recovery subpipeline are the results of the try construct. If the recovery subpipeline is evaluated and a component within that subpipeline fails, the try fails.

For example, a pipeline might attempt to process a document by dispatching it to some web service. If the web service succeeds, then those results are passed to the rest of the pipeline. However, if the web service cannot be contacted or reports an error, the catch construct can provide some sort of default for the rest of the pipeline.

In order to ensure that the result of the try is consistent irrespective of whether the initial subpipeline provides its output or the recovery subpipeline does, both subpipelines must declare the same number of outputs with the same names. It is a static error if the two subpipelines declare different outputs.

In order to support corrective action in the recovery subpipeline, components inside it have access to all the error output of the components that were in the initial subpipeline on a special port named “#error”.

Note

In evaluating the initial subpipeline, failure of one component can cause other components to fail. In addition, some components that fail might not produce output on their error ports and some components that succeeded might produce such output. This pipeline language places no constraints on the order of error messages provided to the recovery subpipeline, nor does it attempt to guarantee that such output will be available in all cases.

The error documents that appear should conform to Appendix C, The Error Vocabulary.

3.7 Other Steps

A pipeline document may declare additional steps. These can be implementation-defined steps or can be defined through some implementation-dependent extension mechanism. Each declared step must have a name and a signature. It is a static error if a pipeline contains an step that is not recognized by the processor.

4 Syntax

This section describes a set of XML syntactic elements sufficient to represent all the aspects of a pipeline, as set out in the preceding sections.

4.1 Overview

Elements in a pipeline document represent the pipeline, the components it contains, the connections between those components, the components and connections contained within them, and so on. Each component is represented by an element; a combination of elements and attributes specify how the inputs and outputs of each component are connected and how parameters are passed.

Conceptually, we can speak of components as objects that have inputs and outputs that are connected together and which may have contain additional components. Syntactically, we need a mechanism for specifying these relationships.

Containment is represented naturally using nesting of XML elements. The connections between components are expressed using names and references to those names.

Five kinds of things are named in XProc:

The types of components,
The components themselves,
Input ports,
Output ports, and
Parameters

4.1.1 Scoping of Names

The scope of the names of component types is the pipeline. Each pipeline processor has some number of built in component type and each pipelines may declare (directly, or by reference to an external library) additional component types.

The scope of the names of the components themselves is determined by the context of each component. In general, the name of a component, the names of its sibling components, the names of any components that it contains directly, the names of its ancestors; and the names of its ancestor's siblings are all in the same scope. It is a static error if two components with the same name appear in the same scope. All in-scope components must have unique names.

The scope of an input or output port name is the component on which it is defined. The names of all the ports on any component must be unique.

Taken together, these uniqueness constraints guarantee that the combination of a component name and a port name uniquely identifies exactly one in-scope port.

The scope of parameter names is essentially the same as the scope of component names, with the following caveat. Whereas component names must be unique, parameter names may be repeated. The declaration of a parameter on a component shadows any declaration that may already be in-scope.

4.1.2 Associating Documents with Ports

A document or a sequence of documents can be bound to a port in three ways: by source, by URI, or by providing it “here”, as the content of the element establishing the binding. A document must be specified in exactly one of these ways, otherwise a static error is raised.

Specified by URI

[Definition: A document is specified by URI if it refers to it with a URI.] The href attribute is used for this purpose.

In this example, the input to the Identity step named “otherstep” comes from “http://example.com/input.xml”.

<p:step name="otherstep" type="p:identity">
  <p:input port="document" href="http://example.com/input.xml"/>
</p:step>

It is a dynamic error if the processor attempts to retrieve the specified URI and fails. (For example, if the resource does not exist or is not accessible with the user's authentication credentials.)

Specified by source

[Definition: A document is specified by source if it refers to a specific port on another component.] The step and source attributes are used for this purpose. (The step attribute may refer to any kind of component, either step or construct, its name notwithstanding.)

There are constraints on what ports can be specified: the specified port must be in the output ports of the component's context.

In this example, the “document” input to the XInclude step named “expand” comes from the “result” port of the step named “otherstep”.

<p:step name="expand" type="p:xinclude">
  <p:input port="document" step="otherstep" source="result"/>
</p:step>

It is a static error if the specified port does not exist.

Specified by here document

[Definition: An document is specified by here document if it is contained in the body of the element that binds it.]

In this example, the “stylesheet” input to the XSLT step named “xform” comes from the content of the p:input element itself.

<p:step name="xform" type="p:xslt">
  <p:input port="document" step="expand" source="result"/>
  <p:input port="stylesheet">
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="1.0">
      ...
    </xsl:stylesheet>
  </p:input>
</p:step>

Here documents are considered “quoted”, they are not interpolated or available to the pipeline processor in any way except as documents flowing through the pipeline.

4.1.3 Extension attributes

[Definition: An element from the XProc namespace may have any attribute not from the XProc namespace, provided that the expanded-QName of the attribute has a non-null namespace URI. These attributes are called extension attributes.] The presence of an extension attribute must not cause the connections between components to differ from the connections that any other conformant XProc processor would produce. They must not cause the processor to fail to signal an error that a conformant processor is required to signal. This means that an extension attribute must not change the effect of any XProc element except to the extent that the effect is implementation-defined or implementation-dependent.

A processor which encounters an extension attribute that it does not recognize must behave as if the attribute was not present.

4.1.4 Extension elements

[Definition: Outside the context of a “here document”, any element not in the XProc namespace is an extension element.] The presence of an extension element must not cause the connections between components to differ from the connections that any other conformant XProc processor would produce. They must not cause the processor to fail to signal an error that a conformant processor is required to signal. This means that an extension element must not change the effect of any XProc element except to the extent that the effect is implementation-defined or implementation-dependent.

Inside the context of a here document, all content is considered quoted so neither extension elements nor XProc elements are said to occur.

A processor which encounters an extension element that it does not recognize must behave as if neither the element nor its attributes, nor any of its content was present.

4.2 Detailed Vocabulary

This section describes in detail the XML vocabulary that represents a pipeline.

4.2.1 p:pipeline Element

A p:pipeline represents a pipeline. Its children declare the inputs, outputs, and parameters that the pipeline exposes and represent its subpipeline.

<p:pipeline name? = QName> (p:input*, p:output*, p:parameter*, p:import*, p:declare-step-type*, subpipeline) </p:pipeline>

If specified, the name must be unique across all available pipelines. If a p:pipeline occurs as the child of a p:pipeline-library element, it must be named.

A pipeline can declare additional steps (e.g., ones that are provided by a particular implementation or in some implementation-defined way) and import steps from libraries.

Example 1. A Sample Pipeline Document

<p:pipeline name="buildspec">
  <p:input port="document"/>
  <p:input port="stylesheet"/>
  <p:output port="result"/>
  <p:parameter name="validate"/>
  …
</p:pipeline>

4.2.2 p:input Element

A p:input identifies input for a component, optionally declaring it, if necessary.

<p:input port = QName sequence? = yes|no />

The port attribute defines the name of the port. It is a static error to identify two ports with the same name on the same component. On atomic components, it is a static error if the port given does not match the name of an input port specified in the component's declaration.

On language constructs and declare-step-type, an input declaration can indicate if a sequence of documents is allowed to appear on the port. If sequence is specified with the value “yes”, then a sequence is allowed. If the sequence is not specified, or has the value “no”, then it is a dynamic error for a sequence of more than one document to appear on the declared port.

The declaration may be accompanied by a binding (or default binding) for the input. This binding can be accomplished by source:

<p:input port = QName step = step name source = port name select? = xpath expression sequence? = yes|no />

by URI:

<p:input port = QName href = URI select? = xpath expression sequence? = yes|no />

or by here document:

<p:input port = QName select? = xpath expression sequence? = yes|no> here document </p:input>

If a binding is provided, a select expression may also be provided. If provided, the specified XPath select expression is used to filter the document(s) that are read. Each matching node or set of nodes is wrapped in a document and provided to the input port.

It is a static error if more than one of source, uri or here document is specified.

The select expression, if specified, applies the specified XPath select expression to the document(s) that are read. Each matching node or set of nodes is wrapped in a document and provided to the input port. In other words,

<p:input port="document" href="http://example.org/input.html"/>

provides a single document, but

<p:input port="document" href="http://example.org/input.html" select="//html:div"/>

provides a sequence of zero or more documents, one for each matching html:div in http://example.org/input.html.

<p:input port="document" step="origin" source="portname" select="//html:div"/>

provides a sequence of zero or more documents, one for each matching html:div in the document (or each of the documents) that is read from the portname port of the component named origin.

4.2.3 p:output Element

A p:output identifies an output port, optionally declaring it, if necessary.

<p:output port = QName sequence? = yes|no />

The port attribute defines the name of the port. It is a static error to identify two ports with the same name on the same component.

An output declaration can indicate if a sequence of documents is allowed to appear on the declared port. If sequence is specified with the value “yes”, then a sequence is allowed. If the sequence is not specified, or has the value “no”, then it is a dynamic error if the component produces a sequence of more than one document on the declared port.

On a construct, the declaration must be accompanied by a binding for the output. This binding can be accomplished by source:

<p:output port = QName step = step name source = port name sequence? = yes|no />

by URI:

<p:output port = QName href = URI sequence? = yes|no />

or by here document:

<p:output port = QName sequence? = yes|no> here document </p:output>

It is a static error if more than one binding is specified.

4.2.4 p:parameter Element

The p:parameter element is used both to declare parameters and to establish values for them. When used on a p:declare-step-type or language construct, p:parameter declares the parameter and may associate a default value with it. Used elsewhere, p:parameter associates a value with the parameter.

4.2.4.1 Declaring Parameters

Parameters are declared on p:declare-step-type and language constructs with p:parameter:

<p:parameter name = token required? = yes | no />

The name attribute must be a QName, a single asterisk (*), or a string of the form *:NCName or NCName:*.

If the name is a QName, the parameter may be declared as required or it may be given a default value. It is a static error to specify that the parameter is required or that it has a default value if the name given is not a QName. It is also a static error to specify that the parameter is both required and has a default value.

If a parameter is required, it is a static error to invoke the component without specifying a value for that parameter.

4.2.4.2 Using Parameters

Parameters are used on p:step elements with p:parameter:

<p:parameter name = QName />

The parameter must be given a value when it is used.

4.2.4.3 Assigning Values to Parameters

When a parameter is declared, it can be given a default value. When it is used, it must be given a value. That value can be specified in several ways, by source:

<p:parameter name = QName step = step name source = port name select? = XPath expression />

by URI:

<p:parameter name = QName href = URI select? = XPath expression />

or by here document:

<p:parameter name = QName select? = XPath expression> here document </p:parameter>

Editorial Note

Is there really any value in allowing here documents on parameters? Doing so means that an empty parameter is always associated with an empty here document, which in turn makes it impossible to detect some kinds of errors statically. Might it not be better to disallow here documents and add a rule that any reference to the context is an error if a document is not explicitly provided?

If a select expression is given, it is evaluated against the document specified and the (XPath 1.0) string value of the expression becomes the default value of the parameter. If no select expression is given, the (XPath 1.0) string value of the document becomes the default value of the parameter. It is a dynamic error if a document sequence is specified.

The select expression may refer to the values of other in-scope parameters by variable reference. It is a static error if the variable reference uses a QName that is not the name of an in-scope parameter or if the reference is circular, either directly or indirectly.

The default value may also be specified with the value attribute:

<p:parameter name = QName value = string />

In this case, the default value of the parameter is the string specified in the value attribute.

4.2.5 p:step Element

A p:step represents an step in a pipeline.

<p:step type = QName name = QName> (p:input*, p:import-parameter*, p:parameter*) </p:step>

The type attribute identifies the component to be instantiated. It is a static error if the name is not unique in the current scope, if the specified type is not known to the processor, or if the specified inputs, outputs, and parameters do not match the signature for steps of that type.

4.2.6 p:import-parameter Element

An p:import-parameter provides a set of in-scope parameters to a component.

<p:import-parameter name = token />

All in-scope parameters which match the name are made available to the component as if they had been specified with individual p:parameter elements.

The name attribute must be a single asterisk (*), a QName, or a string of the form *:NCName or NCName:*.

4.2.7 p:for-each Element

A p:for-each represents a for-each.

<p:for-each name = NCName> (p:input, p:output*, p:parameter*, subpipeline) </p:for-each>

Exactly one input must be declared and it must include a binding for the port it declares. If outputs are declared, they must also include a binding.

The processor will provide each document read through the input binding to the subpipeline represented by the children of the p:for-each, one at a time. For each declared output, the processor will collect all the documents that are produced for that output from all the iterations, in order, into a sequence. The result of the p:for-each is that set of document sequences.

Example 2, “A Sample For-Each” shows an example of a p:for-each in action.

Example 2. A Sample For-Each

<p:for-each name="chapters">
  <p:input port="chap" href="http://example.org/docbook.xml" select="//chapter"/>
  <p:output port="html" step="xform-to-html source="result"/>
  <p:output port="fo" step="xform-to-fo" source="result"/>
  <p:step name="xform-to-fo" type="p:xslt">
    <p:input name="document" step="chapters" source="chap"/>
    <p:input name="stylesheet" href="fo/docbook.xsl"/>
  </p:step>
  <p:step name="xform-to-html" type="p:xslt">
    <p:input name="document" step="chapters" source="chap"/>
    <p:input name="stylesheet" href="html/docbook.xsl"/>
  </p:step>
</p:for-each>

The //chapters of the DocBook document are selected. Each chapter is transformed into HTML and XSL FO using an XSLT step. The resulting HTML and FO documents are aggregated together and appear on the html and fo ports, respectively, of the chapters construct itself.

It is a static error if there is not exactly one p:input child of p:for-each or if the declared input does not specify a binding.

It is a static error if any declared output does not specify a binding.

4.2.8 p:viewport Element

A p:viewport represents a viewport.

<p:viewport name = NCName> (p:input, p:output, p:parameter*, subpipeline) </p:viewport>

Exactly one input must be declared and it must include both a binding and a select expression. Exactly one output must be declared and it must include a binding. The processor will provide a document that contains each set of nodes that matches the specified select expression through the input binding to the subpipeline represented by the children of the p:viewport, one at a time. What appears on the output from the p:viewport will be a copy of the input document except that where each matching node or set of nodes appears, the result of applying the subpipeline to those nodes will be output.

It is a dynamic error if the input source is a sequence of more than one document or if the output from any iteration is a sequence of more than one document.

Example 3, “A Sample Viewport” shows an example of a p:viewport in action.

Example 3. A Sample Viewport

<p:viewport name="encdivs">
  <p:input port="div" step="step" source="port" select="//h:div[@class='enc']"/>
  <p:output port="html" step="encrypt" source="result"/>
  <p:step name="encrypt" type="p:encrypt-document">
    <p:input name="document" step="encdivs" source="div"/>
  </:step>
</p:viewport>

The //h:div[@class='enc']s of the document are selected. Each selected div is encrypted and the resulting encrypted version replaces the original div. The result of the whole construct is a copy of the input document with each selected div encrypted.

Editorial Note

The WG has decided that the semantics of the XPath expression on viewport should be “match” semantics (a la XSLT), not select semantics. That decision is not been reflected in this draft.

It is a static error if there is not exactly one p:input child and exactly one p:output child of p:viewport or if the declared ports do not specify a binding.

4.2.9 p:choose/p:when/p:otherwise Elements

A p:choose represents a choose.

<p:choose name = NCName> (p:when*, p:otherwise?) </p:choose>

The p:choose can specify a context in which the XPath expressions that occur on each branch are evaluated. If a context is specified, it must be specified in exactly one of two ways, by source:

<p:choose name = NCName step = step name source = port name> (p:when*, p:otherwise?) </p:choose>

or by URI:

<p:choose name = NCName href = URI> (p:when*, p:otherwise?) </p:choose>

Each conditional subpipeline is represented by a p:when element.

<p:when test = expression> (p:output*, p:parameter*, subpipeline) </p:when>

The p:when can specify a context in which its XPath expression is to be evaluated. If a context is specified, it must be specified in exactly one of two ways, by source:

<p:when test = expression step = step name source = port name> (p:output*, p:parameter*, subpipeline) </p:when>

or by URI:

<p:when test = expression href = URI> (p:output*, p:parameter*, subpipeline) </p:when>

If no context is specified on the p:when, the context specified on the p:choose is used. It is a static error if no context is specified in either place.

Editorial Note

The p:choose/p:when elements are inconsistent with p:for-each and p:viewport. It would be consistent to allow exactly one p:input to specify the context, instead of specifying the context directly on the p:choose and p:when elements. (Or conversely, to remove the p:input elements from p:for-each and p:viewport.)

Note, in particular, that without the p:input, there's no direct way to specify a here document for the context. The WG has not yet decided how to resolve this issue.

The test attribute provides the guard expression for the subpipeline.

The default branch is represented by a p:otherwise element.

<p:otherwise> (p:output*, p:parameter*, subpipeline) </p:otherwise>

All of the p:when branches and the p:otherwise must declare the same number of output ports with the same names. It is a static error if they do not.

The result of the p:choose is the result of the selected subpipeline. It is a dynamic error if no p:when is selected and no p:otherwise is specified.

4.2.10 p:group Element

A p:group is a wrapper for a subpipeline.

<p:group name = NCName> (p:output*, p:parameter*, subpipeline) </p:group>

The result of a p:group is its declared outputs.

4.2.11 p:try/p:catch Elements

A p:try represents a try/catch.

<p:try name = NCName> (p:group, p:catch) </p:try>

Where p:group represents the initial subpipeline. The recovery (or “catch”) pipeline is identified with a p:catch element:

<p:catch> (p:output*, p:parameter*, subpipeline) </p:catch>

Within the p:catch block, the special input port #error is defined. The document(s) on that port constitute the error messages received from the component which failed. Note that the order of the messages on that port is undefined. Note also that the failure of one component can cause others to fail and the component which signaled the error might not be the only or even the first component that failed.

Both the p:group and the p:catch must declare the same number of output ports with the same names. It is a static error if they do not.

4.2.12 p:declare-step-type Element

A p:declare-step-type provides the type and signature of an implementation-dependent type of step. It declares the inputs, outputs, and parameters for all steps of that type.

<p:declare-step-type type = QName> (p:input*, p:output*, p:parameter*) </p:declare-step-type>

Editorial Note

We need to make some provision for identifying the implementation of a declared step, even if it's no more than implementation-defined extension attributes. We'll need some sort of mechanism for declaring multiple implementations too.

It is a static error if the type attribute of a p:step is not recognized by the processor. It is not an error to declare such an step, only to use it.

Exactly one input declaration of a p:declare-step-type may use the name “*” to indicate that the step accepts an arbitrary number of inputs.

4.2.13 p:pipeline-library Element

A p:pipeline-library contains one or more step declarations and/or pipelines. It declares steps that pipelines can import.

<p:pipeline-library> (p:import*, p:declare-step-type*, p:pipeline*) </p:pipeline-library>

Example 4. A Sample Pipeline Library

<p:pipeline-library>
  <p:declare-step-type> name="extension-component">…</p:declare-steptype>
  <p:pipeline name="xinclude-and-validate">…</p:pipeline>
  <p:pipeline name="validate-and-transform">…</p:pipeline>
  …
</p:pipeline>

4.2.14 p:import Element

An p:import loads a pipeline or pipeline library, making it available in the current context.

<p:import href = URI />

An import statement loads the specified URI and makes any pipelines declared within it available to the current pipeline.

It is a dynamic error if the URI cannot be retrieved or if, once retrieved, it does not point to a p:pipeline-library or p:pipeline. If it points to a p:pipeline, it is a dynamic error if the pipeline does not have a name.

5 Errors

Errors in a pipeline can be divided into two classes: static errors and dynamic errors.

5.1 Static Errors

[Definition: A static error is one which can be detected before pipeline evaluation is even attempted.] Examples of static errors include cycles, incorrect specification of inputs and outputs, and reference to unknown components.

Static errors are fatal and must be detected before any components are evaluated.

5.2 Dynamic Errors

A [Definition: A dynamic error is one which occurs while a pipeline is being evaluated.] Examples of dynamic errors include references to URIs that cannot be resolved, components which fail, and pipelines that exhaust the capacity of an implementation (such as memory or disk space).

If a component fails due to a dynamic error, failure propagates upwards until either a try is encountered or the entire pipeline fails. In other words, outside of a try, component failure causes the entire pipeline to fail.

A References

[XML Core Req] XML Processing Model Requirements Dmitry Lenkov, Norman Walsh, editors. W3C Working Group Note 05 April 2004

[Infoset] XML Information Set (Second Edition) John Cowan, Richard Tobin, editors. W3C Working Group Note 04 February 2004.

[XML 1.0] Extensible Markup Language (XML) 1.0 (Fourth Edition) Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, et. al. editors. W3C Recommendation 16 August 2006.

[XML 1.1] Extensible Markup Language (XML) 1.1 (Second Edition) Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, et. al. editors. W3C Recommendation 16 August 2006.

B Glossary

after

Component A is after component B if component B is the container for component A (or an ancestor of such a container) or if any output from component B is connected to any input of component A, either directly or indirectly.

Note: defined but never referenced.

before

Component A is before component B if component B is a contained component of component A, either directly or indirectly, or if any output from component A is connected to any input of component B, either directly or indirectly.

Note: defined but never referenced.

by source

A document is specified by source if it refers to a specific port on another component.

by URI

A document is specified by URI if it refers to it with a URI.

connected

Components A and B are connected if they are either directly or indirectly connected. Component A is directly connected to B if an output of A is associated with an input port of B. Component A is indirectly connected to B if there is a chain of directly connected components that allows traversal from A to B.

construct

A construct is a component that contains additional components. That is, a construct differs from a step in that its semantics are at least partially determined by the components that it contains.

contained components

The components that occur directly inside a construct are called contained components.

container

A construct which immediately contains a component is called its container.

context

The context is the the set of input ports, output ports, and parameter names that are visible.

declared inputs

The input ports declared on a component are its declared inputs.

declared outputs

The output ports declared on a component are its declared outputs.

Note: defined but never referenced.

dynamic error

A dynamic error is one which occurs while a pipeline is being evaluated.

extension attributes

An element from the XProc namespace may have any attribute not from the XProc namespace, provided that the expanded-QName of the attribute has a non-null namespace URI. These attributes are called extension attributes.

Note: defined but never referenced.

extension element

Outside the context of a “here document”, any element not in the XProc namespace is an extension element.

Note: defined but never referenced.

here document

An document is specified by here document if it is contained in the body of the element that binds it.

matches

A component matches its signature if and only if it specifies an input for each declared input and it specifies no inputs that are not declared; it specifies a parameter for each parameter that is declared to be required; and it specifies no parameters that are not declared.

parameter

A parameter is a QName/value pair.

Note: defined but never referenced.

pipeline

A pipeline is a set of components connected together, with outputs flowing into inputs, without any loops (no component can read its own output, directly or indirectly).

Note: defined but never referenced.

signature

The signature of a component is the set of inputs, outputs, and parameters that it is declared to accept.

static error

A static error is one which can be detected before pipeline evaluation is even attempted.

step

A step is an atomic component that performs a unit of XML processing, such as XInclude or transformation.

Note: defined but never referenced.

subpipeline

The components (and the connections between them) within a container form a subpipeline.

C The Error Vocabulary

This appendix describes the XML vocabulary that components are expected to use to identify messages on their error ports.

To be described…

D Standard Component Library

This appendix describes the standard XProc components.

Note

The components described in this draft are intended mainly as a starting point for discussion and to present a flavor for the sorts of components envisioned. The WG has not yet discussed them in detail.

1 Required Components

This section describes standard components that must be supported by any conforming processor.

To be described…

1.1 Identity

The identity construct makes a verbatim copy of its input available on its output.

Inputs

input, a sequence of documents.

Outputs

result, a sequence of documents.

Parameters

None.

Note

The use of 'input' as a port name here probably isn't the best choice. Since the identity component must be able to take a sequence of documents, the port name 'document' isn't appropriate either.

1.2 XSLT

The xslt component applies an XSLT 1.0 transformation supplied by the input to the 'transform' port to the document provided on the 'document' port. It produces a sequence of documents on its 'result' port.

Inputs

document, a single document.
transform, a single document.

Outputs

result, a sequence of documents.

Parameters

Any

All of the specified parameters are made available to the XSLT processor. If the XSLT processor signals a fatal error, the component fails, otherwise the result of the transformation is produced on the result port.

Note, an XSLT 1.0 processor without any extensions can only produce a single XML document as its result. However, many XSLT 1.0 processors provide extensions which allow the processor to produce more than one result. In such cases, more than one document may appear in the result port. The principle result document will always appear last.

1.3 XInclude

The XIinclude component applies xinclude processing semantics to the document. The referenced documents are calculated against the base URI and are not provided as input to the component.

Inputs

document, a single document.

Outputs

result, a single document.

Parameters

None.

1.4 Serialize

The serialize component applies XML serialization to the children of the document element and replaces those children with their serialization. The outcome is a single element with text content that represents the "escaped" syntax of the children if they were serialized.

Inputs

document, a single document.

Outputs

result, a single document.

Parameters

None.

1.5 Parse

The parse component takes the text value of the document element and parses the content as if it was and unicode character stream containing XML. The outcome is a single element with children from the parsing of the XML content. This is the reverse of the serialize component.

When the text value is parsed, a document element wrapper should be assumed so that element siblings can be parsed back into XML. Further, if the 'namespace' parameter is specified, the default namespace is declared on that wrapper element. If a wrapper element name is specified, it is not returned in the result.

If the 'content-type' parameter is specified, an implementation can use a different parser to produce XML content. Such a behavior is implementation defined. For example, for the mime type 'text/html', an implementation might provide an HTML to XHTML parser (e.g. Tidy).

Inputs

document, a single document.

Outputs

result, a single document.

Parameters

namespace
content-type

1.6 Load

The load component has no inputs but takes a parameter that specifies a URI of an XML resource that should be loaded and provided as the result.

Inputs

None.

Outputs

result, a single document.

Parameters

href, required.

Load attempts to read an XML document from the specified URI. If the document does not exist, or is not well-formed, the component fails. Otherwise, the document read is produced on the result port.

Note

Should this component allow href to be a list of URIs and return a sequence of documents?

1.7 Store

The store component stores a serialized version of its input to a URI. The URI is either specified explicitly by the 'href' parameter or implicitly by the base URI of the document. This component has no output.

Note

Should this component allow sequences on its input?

Inputs

document, a single document.

Outputs

None.

Parameters

href

Note

A more direct “serialize-to-octet-stream” component may also be required. One, for example, that supports the XSLT 2.0/XQuery 1.0 Serialization specification.

2 Optional Components

T.B.D.

3 Micro-Operations Components

Note

No decisions have been made about whether these components will be optional or required.

3.1 Rename

The rename component renames elements or attributes in a document based on parameter values.

Inputs

document, a single document.

Outputs

result, a single document.

Parameters

select
name, required.

Each element, attribute, or processing-instruction identified by the XPath 1.0 expression specified in the 'select' parameter is renamed. The name of elements and attributes, and the target of processing-instructions, are changed to the value of the 'name' parameter.

The component fails if the specified name is not a valid name or if the renaming would introduce a syntactic error into the document (i.e., if it would create two attributes with the same name on the same element).

3.2 Wrap

The wrap component wraps the document element with a new document element.

Inputs

document, a single document.

Outputs

result, a single document.

Parameters

name, required.

3.3 Insert

The insert component inserts a document specified on the 'insertion' port as a child of the document element provided on the 'document' port. The position of this insert is governed by the parameters.

Inputs

document, a single document.
insertion, a single document.

Outputs

result, a single document.

Parameters

at-start, required.

If the at-start parameter is true, the insertion document will be inserted as the first child(ren) of the document, otherwise it will be inserted as the last child(ren). If the parameter is not specified, a value of true is assumed.

3.4 Set-attributes

The set-attributes component sets attribute values on the document element using the attribute values provided on the document element of the 'attribute' port's document.

Inputs

document, a single document.
attributes, a single document.

Outputs

result, a single document.

Parameters

None.

4 Component Declarations

T.B.D.

<p:pipeline-library name="standard">

  <p:declare-step-type type="p:validate">
    <p:input port="document" sequence="no"/>
    <p:input port="schema" sequence="yes"/>
    <p:output port="result" sequence="no"/>
  </p:declare-step-type>

  <p:declare-step-type type="p:xinclude">
    <p:input port="document" sequence="no"/>
    <p:output port="result" sequence="no"/>
  </p:declare-step-type>

  <p:declare-step-type type="p:xslt">
    <p:input port="document" sequence="no"/>
    <p:input port="stylesheet" sequence="no"/>
    <p:output port="result" sequence="yes"/>
  </p:declare-step-type>

</p:pipeline-library>

E Examples

This appendix contains some examples…

Consult Section 4, “Component Declarations” for a description of the signatures of the standard components.

Example E.1. Pipeline for Figure 1

<p:pipeline name="fig1"
            xmlns:p="http://example.org/PipelineNamespace">
  <p:input port="doc" sequence="no"/>
  <p:input port="schemaDoc" sequence="yes"/>
  <p:output port="out" step="s2" source="result"/>

  <p:step type="p:xinclude" name="s1">
    <p:input port="document" step="fig1" source="doc"/>
  </p:step>

  <p:step type="p:validate" name="s2">
    <p:input port="document" step="s1" source="result"/>
    <p:input port="schema" step="fig1" source="schemaDoc"/>
  </p:step>
</p:pipeline>

Example E.2. Pipeline for Figure 2

<p:pipeline name="fig2"
            xmlns:p="http://example.org/PipelineNamespace">
  <p:input port="doc" sequence="no"/>
  <p:output port="out" step="xform" source="result"/>

  <p:choose name="vcheck" step="fig2" source="doc">
    <p:when test="/*[@version &lt; 2.0]">
      <p:output name="valid" step="val1" source="result"/>
      <p:step type="p:validate" name="val1">
        <p:input port="document" step="fig2" source="doc"/>
        <p:input port="schema" href="v1schema.xsd"/>
      </p:step>
    </p:when>

    <p:otherwise>
      <p:output name="valid" step="val2" source="result"/>
      <p:step type="p:validate" name="val2">
        <p:input port="document" step="fig2" source="doc"/>
        <p:input port="schema" href="v2schema.xsd"/>
      </p:step>
    </p:otherwise>
  </p:choose>

  <p:step type="p:xslt" name="xform">
    <p:input port="document" step="vcheck" source="valid"/>
    <p:input port="stylesheet" href="stylesheet.xsl"/>
  </p:step>
</p:pipeline>

XProc: An XML Pipeline Language

W3C Working Draft 17 November 2006

Abstract

Status of this Document

Table of Contents

Appendices

1 Introduction

2 Pipeline Concepts

2.1 Steps, Constructs, and Subpipelines

2.2 Inputs and Outputs

2.3 Parameters

2.4 Connections

3 Language Constructs

3.1 Pipeline

3.2 For-Each

3.3 Viewport

3.4 Choose

3.5 Group

3.6 Try/Catch

3.7 Other Steps

4 Syntax

4.1 Overview

4.1.1 Scoping of Names

4.1.2 Associating Documents with Ports

4.1.3 Extension attributes

4.1.4 Extension elements

4.2 Detailed Vocabulary

4.2.1 p:pipeline Element

4.2.2 p:input Element

4.2.3 p:output Element

4.2.4 p:parameter Element

4.2.4.1 Declaring Parameters

4.2.4.2 Using Parameters

4.2.4.3 Assigning Values to Parameters

4.2.5 p:step Element

4.2.6 p:import-parameter Element

4.2.7 p:for-each Element

4.2.8 p:viewport Element

4.2.9 p:choose/p:when/p:otherwise Elements

4.2.10 p:group Element

4.2.11 p:try/p:catch Elements

4.2.12 p:declare-step-type Element

4.2.13 p:pipeline-library Element

4.2.14 p:import Element

5 Errors

5.1 Static Errors

5.2 Dynamic Errors

A References

B Glossary

C The Error Vocabulary

D Standard Component Library

1 Required Components

1.1 Identity

1.2 XSLT

1.3 XInclude

1.4 Serialize

1.5 Parse

1.6 Load

1.7 Store

2 Optional Components

3 Micro-Operations Components

3.1 Rename

3.2 Wrap

3.3 Insert

3.4 Set-attributes

4 Component Declarations

E Examples