XSLT 2.0 and XQuery 1.0 Serialization


W3C Working Draft

Latest version: http://www.w3.org/TR/xslt-xquery-serialization/
Previous versions: http://www.w3.org/TR/2004/WD-xslt-xquery-serialization-20040723/ http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/ http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20030502/
Editors:
Michael Kay, Saxonica (formerly of Software AG), http://www.saxonica.com, Michael.Kay@softwareag.com
Norman Walsh, Sun Microsystems, Norman.Walsh@Sun.COM
Henry Zongaro, IBM, zongaro@ca.ibm.com

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a Public Working Draft for review by W3C Members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document describes how XSLT 2.0, XQuery 1.0 and other related XML standards convert an instance of the XQuery 1.0 and XPath 2.0 Data Model into a sequence of octets. This material has been moved out of the XSLT draft and into a separate document so that it can be shared by both the named specifications and possibly other specifications as well.

This draft includes many corrections and changes based on member-only and public comments on the Last Call Working Draft (http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/). The XML Query and XSL WGs wish to thank the people who have sent in comments for their close reading of the document.

This draft reflects decisions taken up to and including the joint teleconference meeting 209 of the XSL and XML Query Working Groups of 21 September 2004. These decisions are recorded in the Last Call issues list (http://www.w3.org/2004/10/xquery-serialization-issues.html). However, some of these decisions may not yet be reflected in this document.

XSLT 2.0 and XQuery 1.0 Serialization has been defined jointly by the XSL Working Group and the XML Query Working Group (both part of the XML Activity).

Public comments on this document and its open issues are invited. Comments should be sent to the W3C XSLT/XPath/XQuery mailing list, public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/), with “[Serial]” at the beginning of the subject field.

The patent policy for this document is the 5 February 2004 W3C Patent Policy. Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page and the XSL Working Group's patent disclosure page. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

This document defines serialization for the XSLT 2.0 and XQuery 1.0 specifications and any other specifications that reference it.

Introduction

This document defines serialization of the W3C XQuery 1.0 and XPath 2.0 Data Model, which is the data model of at least XPath 2.0, XSLT 2.0, and XQuery 1.0, and any other specifications that reference it.

This material has been moved out of the XSLT draft and into a separate document. The Working Groups also considered moving this material directly into the Data Model document, but elected to keep it separate for the moment, principally in order to advance the Data Model to Last Call. In the future, this material may be moved into the Data Model. The Working Groups solicit public opinion about which alternative is superior.

Serialization is the process of converting an instance of the data model into a sequence of octets. Serialization is well-defined for most data model instances.

The document assumes the reader already knows generally what serialization is. A brief explanation will be added, especially to disabuse any reader who thinks it might mean Java (or .NET) serialization.

The editor has yet to align the description of serialization errors with the description of errors in related specifications. That will be done in a future public working draft.

Terminology

In this specification, where they appear in upper case, the words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", "MAY", "REQUIRED", and "RECOMMENDED" are to be interpreted as described in RFC 2119.

Conformance criteria for serialization are determined by other specifications that refer to this specification. A serializer is software that implements some or all of the requirements of this specification in accordance with such conformance criteria. A serializer is not REQUIRED to directly provide a programming interface that permits a user to set serialization parameters or to provide an input sequence for serialization.

Implementation-defined indicates an aspect that MAY differ between serializers, but whose actual behaviour MUST be specified either by another specification that sets conformance criteria for serialization or in documentation that accompanies the serializer.

Implementation-dependent indicates an aspect that MAY differ between serializers, and whose actual behaviour is not REQUIRED to be specified either by another specification that sets conformance criteria for serialization or in documentation that accompanies the serializer.

In some instances, the sequence that is input to serialization cannot be successfully converted into a sequence of octets given the set of serialization parameter values specified. A serialization error is said to occur in such an instance. In some cases, a serializer is REQUIRED to signal such an error. What it means to signal a serialization error is determined by the relevant conformance criteria to which the serializer conforms. In other cases, there is an implementation-defined choice between signalling a serialization error and performing a recovery action. Such a recovery action will allow a serializer to produce a sequence of octets that might not fully reflect the usual requirements of the parameter settings that are in effect.

Sequence Normalization

The XQuery 1.0 and XPath 2.0 Data Model is richer and less constrained than XML. There are valid instances of the data model that have no direct analog in XML. In particular, instances of the data model can contain typed values, sequences, and sequences of typed values. And whereas XML deals only with documents, instances of the data model can have as their root any node type, simple value, or sequence and may even be empty.

This section describes how to convert an arbitrary instance of the data model into one of several simplified forms. We then describe how these forms are serialized. This greatly simplifies the sections which follow. A serializer is not REQUIRED to implement serialization of arbitrary instances of the data model in this way, provided it produces the same results as this conceptual model.

If the instance of the data model contains any typed or untyped atomic values, or sequences that contain typed or untyped atomic values, convert them to strings: obtain the lexical representation of each value by casting it to an xs:string and replace the value with its string representation. If the value cannot be cast to xs:string, serialization of the instance of the data model is undefined.

If adjacent strings occur in a sequence, replace both values with their concatenation separated by a single space.

If empty sequences occur, replace them with the empty string.

To complete the simplification, perform the following steps iteratively until a simplest form is reached:

If the instance of the data model has as its root an attribute or namespace node, or a QName value, or if it has as its root a sequence which contains one of these items, serialization is undefined.

If the instance of the data model has as its root a single document node, or an element, processing instruction, comment, or text node, or a sequence of only element, processing instruction, comment, and text nodes, it is already in its simplest form.

If the instance of the data model has as its root a sequence of document nodes, or a sequence which contains document nodes, replace each document node with its children in document order.

If the instance of the data model has as its root a string value, or a sequence which contains one or more string values, replace each string value with a text node that contains the same string.

If there are any remaining string values among the children of elements in the instance of the data model, replace them with text nodes that contain the same string values and merge adjacent text nodes.

An instance of the data model that is input to the serialization process is a sequence. Prior to serializing a sequence using any of the output methods whose behavior is specified by this document, the serializer MUST first compute a normalized sequence for serialization; it is the normalized sequence that is actually serialized. The purpose of this sequence normalization step is to create a sequence that can be serialized as a well-formed XML document or external general parsed entity, and that also reflects the content of the input sequence to the extent possible.

The normalized sequence for serialization is constructed by applying all of the following rules in order, with the initial sequence being input to the first step, and the sequence that results from any step being used as input to the subsequent step. For any implementation-defined output method, it is implementation-defined whether this sequence normalization process takes place.

Where the process of converting the input sequence to a normalized sequence indicates that a value MUST be cast to xs:string, that operation is as defined in the XQuery 1.0 and XPath 2.0 Functions and Operators specification. The steps in computing the normalized sequence are:

If the sequence that is input to serialization is empty, create a sequence S₁ that consists of a zero-length string. Otherwise, copy each item in the sequence that is input to serialization to create the new sequence S₁.

For each item in S₁, if the item is atomic, obtain the lexical representation of the item by casting it to an xs:string and copy the string representation to the new sequence; otherwise, copy the item, which will be a node, to the new sequence. The new sequence is S₂.

For each subsequence of adjacent strings in S₂, copy a single string to the new sequence equal to the values of the strings in the subsequence concatenated in order, each separated by a single space. Copy all other items to the new sequence. The new sequence is S₃.

For each item in S₃, if the item is a string, create a text node in the new sequence whose string value is equal to the string; otherwise, copy the item to the new sequence. The new sequence is S₄.

For each item in S₄, if the item is a document node, copy its children to the new sequence; otherwise, copy the item to the new sequence. The new sequence is S₅.

It is a serialization error if an item in S₅ is an attribute node or a namespace node. Otherwise, construct a new sequence, S₆, that consists of a single document node and copy all the items in the sequence, which are all nodes, as children of that document node.

S₆ is the normalized sequence.

The tree rooted at the document node that is created by the final step of this sequence normalization process is the instance of the data model to which the rules of the appropriate output method are applied. If the sequence normalization process results in a serialization error, the serializer MUST signal the error.

The sequence normalization process for a sequence $seq is equivalent to constructing a document node using the XSLT instruction:

    <xsl:result-document>
      <xsl:copy-of select="$seq"/>
    </xsl:result-document>

or the XQuery expression:

    document {
      for $s in $seq return
        if ($s instance of document-node())
        then $s/child::node()
        else $s
    }

and then serializing the document node as described for the XML, XHTML, HTML or text output method, or in an implementation-defined manner.

This process results in a serialization error with certain sequences, for example sequences containing parentless attribute and namespace nodes, or atomic values of types that cannot be cast to a string, such as xs:QName and xs:NOTATION. The serializer MUST signal the error.
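The steps S₁ to S₆ can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a conforming serializer: the `Node` class, the `SerializationError` exception and the `cast_to_string` helper are hypothetical names introduced here, only Python `str`, `int` and `bool` values stand in for atomic values, and items are shared rather than copied.

```python
from dataclasses import dataclass, field
from typing import List, Union


class SerializationError(Exception):
    """Raised wherever the steps above call for a serialization error."""


@dataclass
class Node:
    # Hypothetical stand-in for a data model node (not defined by the spec).
    kind: str                      # "document", "element", "text", "attribute", ...
    name: str = ""
    value: str = ""                # string value, used here for text nodes
    children: List["Node"] = field(default_factory=list)


Item = Union[Node, str, int, bool]


def cast_to_string(item) -> str:
    # Simplified cast to xs:string; other atomic types are treated as uncastable.
    if isinstance(item, bool):
        return "true" if item else "false"
    if isinstance(item, (str, int)):
        return str(item)
    raise SerializationError(f"cannot cast {item!r} to xs:string")


def normalize(seq: List[Item]) -> List[Node]:
    # S1: an empty input becomes a sequence holding one zero-length string.
    s1 = list(seq) if seq else [""]
    # S2: atomic values become strings; nodes pass through unchanged.
    s2 = [i if isinstance(i, Node) else cast_to_string(i) for i in s1]
    # S3: each run of adjacent strings becomes one space-separated string.
    s3: list = []
    for i in s2:
        if isinstance(i, str) and s3 and isinstance(s3[-1], str):
            s3[-1] = s3[-1] + " " + i
        else:
            s3.append(i)
    # S4: every remaining string becomes a text node.
    s4 = [Node("text", value=i) if isinstance(i, str) else i for i in s3]
    # S5: every document node is replaced by its children.
    s5: List[Node] = []
    for n in s4:
        s5.extend(n.children if n.kind == "document" else [n])
    # S6: attribute and namespace nodes are errors; otherwise wrap in a document node.
    for n in s5:
        if n.kind in ("attribute", "namespace"):
            raise SerializationError(f"{n.kind} node cannot be serialized")
    return [Node("document", children=s5)]
```

For example, `normalize([1, "a", Node("element", name="e"), True])` yields a single document node whose children are a text node with value `1 a`, the element `e`, and a text node with value `true`.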

Serialization Parameters

There are a number of parameters that influence how serialization is performed. Host languages MAY allow users to specify any or all of these parameters, but they are not REQUIRED to be able to do so.

The following serialization parameters are defined:

Here and throughout the document, the distinction between "should" and "must" will be revisited. When serialization was described in the XSLT specification, use of "should" helped to clarify that the serialization process was optional. Now that it's described here in a standalone specification, many of those clauses should use "must".

Serialization parameter name	Permitted values for parameter
`byte-order-mark`	One of the enumerated values `yes` or `no`. This parameter indicates whether the serialized sequence of octets is to be preceded by a Byte Order Mark. (See Section 5.1 of .) The actual byte order used is implementation-dependent. If the concept of a Byte Order Mark is not meaningful in connection with the value of the `encoding` parameter, the `byte-order-mark` parameter is ignored.
`cdata-section-elements`	A list of expanded-QNames, possibly empty.
`doctype-public`	A string of Unicode characters. This parameter is optional.
`doctype-system`	A string of Unicode characters. This parameter is optional.
`encoding`	A string of Unicode characters in the range #x21 to #x7E (that is, printable ASCII characters); the value SHOULD be a charset registered with the Internet Assigned Numbers Authority, or begin with the characters `x-` or `X-`.
`escape-uri-attributes`	One of the enumerated values `yes` or `no`.
`include-content-type`	One of the enumerated values `yes` or `no`.
`indent`	One of the enumerated values `yes` or `no`.
`media-type`	A string of Unicode characters specifying the media type (MIME content type); the charset parameter of the media type MUST NOT be specified explicitly in the value of the `media-type` parameter. If the destination of the serialized output is annotated with a media type, this parameter MAY be used to provide such an annotation. For example, it MAY be used to set the media type in an HTTP header.
`method`	An expanded-QName with a null namespace URI, and the local part of the name equal to one of `xml`, `xhtml`, `html` or `text`, or having a non-null namespace URI. If the namespace URI is non-null, the parameter specifies an implementation-defined output method.
`normalization-form`	One of the enumerated values `NFC`, `NFD`, `NFKC`, `NFKD`, `fully-normalized`, `none` or an implementation-defined value.
`omit-xml-declaration`	One of the enumerated values `yes` or `no`.
`standalone`	One of the enumerated values `yes`, `no` or `none`.
`undeclare-namespaces`	One of the enumerated values `yes` or `no`.
`use-character-maps`	A list of pairs, possibly empty, with each pair consisting of a single Unicode character and a string of Unicode characters.
`version`	A string of Unicode characters.
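As an illustration of how the permitted values in the table might be checked, the following Python sketch validates a subset of the parameters. The function name `check_parameter`, the representation of an expanded-QName as a `(namespace_uri, local_name)` pair, and the decision to raise `ValueError` for parameters not covered are assumptions made for this example; the table itself does not prescribe any interface.

```python
import re

YES_NO = {"yes", "no"}

# Parameters whose permitted values are a fixed enumeration in the table.
ENUMERATED = {
    "byte-order-mark": YES_NO,
    "escape-uri-attributes": YES_NO,
    "include-content-type": YES_NO,
    "indent": YES_NO,
    "omit-xml-declaration": YES_NO,
    "undeclare-namespaces": YES_NO,
    "standalone": {"yes", "no", "none"},
}

# Parameters whose value is simply a string of Unicode characters.
PLAIN_STRINGS = {"doctype-public", "doctype-system", "media-type", "version"}

# One or more characters in the range #x21 to #x7E.
ENCODING_RE = re.compile(r"[\x21-\x7E]+")

STANDARD_METHODS = {"xml", "xhtml", "html", "text"}


def check_parameter(name: str, value) -> bool:
    """Return True if value is permitted for the named parameter."""
    if name in ENUMERATED:
        return value in ENUMERATED[name]
    if name in PLAIN_STRINGS:
        return isinstance(value, str)
    if name == "encoding":
        return isinstance(value, str) and ENCODING_RE.fullmatch(value) is not None
    if name == "method":
        # Assumed representation: (namespace_uri, local_name), with None for a null URI.
        uri, local = value
        return bool(uri) or local in STANDARD_METHODS
    raise ValueError(f"parameter not covered by this sketch: {name}")
```

A call such as `check_parameter("encoding", "UTF 8")` returns `False` because the space character (#x20) lies outside the permitted range, while a `method` value with a non-null namespace URI is accepted as naming an implementation-defined output method.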

encoding specifies the preferred character encoding that the serializer SHOULD use for encoding sequences of characters as sequences of bytes; the value of the parameter SHOULD be treated case-insensitively; the value MUST contain only characters in the range #x21 to #x7E (i.e. printable ASCII characters); the value SHOULD either be a charset registered with the Internet Assigned Numbers Authority, or start with X-

If this parameter is not specified, and the output method does not specify any additional requirements, the encoding used is implementation-defined.

cdata-section-elements specifies a list of the names of elements whose text node children are to be output using CDATA sections

If this parameter is not specified, no elements will be treated specially.

doctype-system specifies the system identifier to be used in the document type declaration

If this parameter is not specified, a system identifier MUST NOT be generated. For the XML and XHTML output methods, a public identifier MUST NOT be generated either, regardless of the setting of doctype-public.

doctype-public specifies the public identifier to be used in the document type declaration

If this parameter is not specified, a public identifier MUST NOT be generated.
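The interaction between doctype-system and doctype-public for the XML and XHTML output methods can be sketched as below. This is an illustrative assumption-laden example: the function name `doctype_declaration` and its `root_name` argument are hypothetical, `None` stands for a parameter that is not specified, and identifiers containing a double quote are not handled.

```python
from typing import Optional


def doctype_declaration(root_name: str,
                        doctype_system: Optional[str],
                        doctype_public: Optional[str]) -> str:
    """Build a document type declaration for the XML or XHTML output method."""
    # Without a system identifier nothing is generated, whatever doctype-public says.
    if doctype_system is None:
        return ""
    # Both identifiers present: emit the PUBLIC form.
    if doctype_public is not None:
        return f'<!DOCTYPE {root_name} PUBLIC "{doctype_public}" "{doctype_system}">'
    # Only the system identifier present: emit the SYSTEM form.
    return f'<!DOCTYPE {root_name} SYSTEM "{doctype_system}">'
```

With only `doctype_public` supplied the function returns an empty string, which mirrors the rule that a public identifier is not generated when doctype-system is absent.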

escape-uri-attributes specifies whether the serializer is to escape URI-valued attributes in HTML and XHTML output using the method RECOMMENDED in (section 2.4.1). The value MUST be yes or no.