XSLT 2.0 and XQuery 1.0 Serialization

&doc.prefix;-&doc.date;

W3C Working Draft

&date.day; &date.month; &date.year;

&url.this; XML
http://www.w3.org/TR/xslt-xquery-serialization/
http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/
http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20030502/

Michael Kay, Saxonica (formerly of Software AG), http://www.saxonica.com, Michael.Kay@softwareag.com
Norman Walsh, Sun Microsystems, Norman.Walsh@Sun.COM
Henry Zongaro, IBM, zongaro@ca.ibm.com

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a Public Working Draft for review by W3C Members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document describes how XSLT 2.0 and XQuery 1.0 convert an instance of the data model into a sequence of octets. This material has been moved out of the XSLT draft and into a separate document so that it can be shared by both of the named specifications, and possibly by other specifications as well.

This draft includes many corrections and changes based on member-only and public comments on the Last Call Working Draft (http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/). The XML Query and XSL WGs wish to thank the people who have sent in comments for their close reading of the document.

This draft reflects decisions taken up to and including the face-to-face meeting in Cambridge, MA during the week of 21 June 2004. These decisions are recorded in the Last Call issues list (http://www.w3.org/2004/07/xquery-serialization-issues.html). However, some of these decisions may not yet be reflected in this document.

XSLT 2.0 and XQuery 1.0 Serialization has been defined jointly by the XSL Working Group and the XML Query Working Group (both part of the XML Activity).

Public comments on this document and its open issues are invited. Comments should be sent to the W3C XSLT/XPath/XQuery mailing list, public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/), with “[Serial]” at the beginning of the subject field.

The patent policy for this document is expected to become the 5 February 2004 W3C Patent Policy, pending the Advisory Committee review of the renewal of the XML Query Working Group. Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page and the XSL Working Group's patent disclosure page. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

This document defines serialization for the XSLT 2.0 and XQuery 1.0 specifications and any other specifications that reference it.

English

See the CVS changelog.

Introduction

This document defines serialization of the W3C XQuery 1.0 and XPath 2.0 Data Model, which is the data model of at least XPath 2.0, XSLT 2.0, and XQuery 1.0, and any other specifications that reference it.

This material has been moved out of the XSLT draft and into a separate document. The Working Groups also considered moving this material directly into the Data Model document, but elected to keep it separate for the moment, principally in order to advance the Data Model to Last Call. In the future, this material may be moved into the Data Model. The Working Groups solicit public opinion about which alternative is superior.

Serialization is the process of converting an instance of the data model into a sequence of octets. Serialization is well-defined for most data model instances.

The document assumes the reader already knows generally what serialization is. A brief explanation will be added, especially to disabuse any reader who thinks it might mean Java (or .NET) serialization. The editor has yet to align the description of serialization errors with the description of errors in related specifications. That will be done in a future public working draft.

In this specification the words must, must not, should, should not, may, required, and recommended are to be interpreted as described in RFC 2119.

Serializing Arbitrary Instances of the Data Model

The XQuery 1.0 and XPath 2.0 Data Model is richer and less constrained than XML. There are valid instances of the data model that have no direct analog in XML. In particular, instances of the data model can contain typed values, sequences, and sequences of typed values. And whereas XML deals only with documents, instances of the data model can have as their root any node type, simple value, or sequence and may even be empty.

This section describes how to convert an arbitrary instance of the data model into one of several simplified forms. We then describe how these forms are serialized. This greatly simplifies the sections which follow. Implementations are not required to implement serialization of arbitrary instances of the data model in this way, provided that they produce the same results as this conceptual model.

If the instance of the data model contains any typed or untyped atomic values, or sequences that contain typed or untyped atomic values, convert them to strings: obtain the lexical representation of each value by casting it to an xs:string and replace the value with its string representation. If the value cannot be cast to xs:string, serialization of the instance of the data model is undefined.

If adjacent strings occur in a sequence, replace both values with their concatenation separated by a single space.

If empty sequences occur, replace them with the empty string.

To complete the simplification, perform the following steps iteratively until a simplest form is reached:

If the instance of the data model has as its root an attribute or namespace node, or a QName value, or if it has as its root a sequence which contains one of these items, serialization is undefined.

If the instance of the data model has as its root a single document node, or an element, processing instruction, comment, or text node, or a sequence of only element, processing instruction, comment, and text nodes, it is already in its simplest form.

If the instance of the data model has as its root a sequence of document nodes, or a sequence which contains document nodes, replace each document node with its children in document order.

If the instance of the data model has as its root a string value, or a sequence which contains one or more string values, replace each string value with a text node that contains the same string.

If there are any remaining string values among the children of elements in the instance of the data model, replace them with text nodes that contain the same string values and merge adjacent text nodes.

An instance of the data model that is input to the serialization process is a sequence. Prior to serializing a sequence using any of the output methods whose behavior is specified by this document, the serialization process must first place that input sequence into a normalized form for serialization; it is the normalized sequence that is actually serialized. The normalized form for serialization is constructed by applying all of the following rules in order, with the initial sequence being input to the first step, and the sequence that results from any step being used as input to the subsequent step. For any implementation-defined output method, it is implementation-defined whether this normalization process takes place.

Where the process of converting the input sequence to a normalized form indicates that a value must be cast to xs:string, that operation is as defined in of .

Replace an empty sequence with a zero-length string.

If the instance of the data model contains any atomic values, or sequences that contain atomic values, convert the atomic values to strings: obtain the lexical representation of each value by casting it to an xs:string and replace the value with its string representation. It is a serialization error if the value cannot be cast to xs:string.

Replace each run of adjacent strings in the sequence with a single string equal to the concatenation of those strings, with a single space between each pair.

Replace any string in the sequence with a text node whose string value is equal to the string.

Replace any document node in the sequence with its children.

It is a serialization error if an item in the sequence is an attribute node or a namespace node. Otherwise, create a new document node and make all the items in the sequence, which are all nodes, children of that document node.

The tree rooted in the document node that is created by the final step of this normalization process is the instance of the data model to which the rules of the appropriate output method are applied. If the normalization process results in a serialization error, the processor must signal the error.
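The six normalization rules above can be illustrated with a short, non-normative Python sketch. Everything in it is an assumption made for illustration: the `Node` class is a toy stand-in for data model nodes, `cast_to_string` handles only a few Python types in place of the real cast to xs:string, and the names `normalize` and `SerializationError` are hypothetical.

```python
from dataclasses import dataclass, field


class SerializationError(Exception):
    """Raised where the rules say 'it is a serialization error'."""


@dataclass
class Node:
    # Toy node: kind is "document", "element", "text", "attribute", ...
    kind: str
    name: str = ""
    text: str = ""
    children: list = field(default_factory=list)


def cast_to_string(value):
    # Stand-in for the cast to xs:string; only str, int and bool are handled.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (str, int)):
        return str(value)
    raise SerializationError(f"cannot cast {value!r} to xs:string")


def normalize(seq):
    # Rule 1: an empty sequence becomes a zero-length string.
    items = list(seq) or [""]
    # Rule 2: atomic values become strings.
    items = [i if isinstance(i, Node) else cast_to_string(i) for i in items]
    # Rule 3: runs of adjacent strings collapse into one, space-separated.
    merged = []
    for item in items:
        if isinstance(item, str) and merged and isinstance(merged[-1], str):
            merged[-1] = merged[-1] + " " + item
        else:
            merged.append(item)
    # Rule 4: each string becomes a text node.
    nodes = [Node("text", text=i) if isinstance(i, str) else i for i in merged]
    # Rule 5: each document node is replaced by its children.
    flat = []
    for node in nodes:
        flat.extend(node.children if node.kind == "document" else [node])
    # Rule 6: attribute and namespace nodes are errors; otherwise wrap the
    # remaining nodes in a new document node.
    for node in flat:
        if node.kind in ("attribute", "namespace"):
            raise SerializationError(f"{node.kind} node in sequence")
    return Node("document", children=flat)
```

For example, `normalize([1, "a", Node("element", name="e"), "b"])` yields a document node whose children are a text node "1 a", the element, and a text node "b".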

The normalization process for a sequence $seq is equivalent to constructing a document node using the XSLT instruction:

<xsl:result-document> <xsl:copy-of select="$seq"/> </xsl:result-document>

or the XQuery expression:

document { for $s in $seq return if ($s instance of document-node()) then $s/child::node() else $s }

and then serializing the document node using the xml, xhtml, html, or text output method, or in an implementation-defined manner.

This process results in a serialization error with certain sequences, for example sequences containing parentless attribute and namespace nodes, or atomic values of types that cannot be cast to a string, such as xs:QName and xs:NOTATION. The processor must signal the error.

Serialization Parameters

There are a number of parameters that influence how serialization is performed. Host languages may allow users to specify any or all of these parameters, but they are not required to be able to do so.

The following serialization parameters are defined:

Here and throughout the document, the distinction between "should" and "must" will be revisited. When serialization was described in the XSLT specification, use of "should" helped to clarify that the serialization process was optional. Now that it's described here in a standalone specification, many of those clauses should use "must".

Serialization parameter name	Permitted values for parameter
`cdata-section-elements`	A list of expanded-QNames, possibly empty.
`doctype-public`	A string of Unicode characters. This parameter is optional.
`doctype-system`	A string of Unicode characters. This parameter is optional.
`encoding`	A string of Unicode characters in the range #x21 to #x7E (that is, printable ASCII characters); the value should be a charset registered with the Internet Assigned Numbers Authority, or begin with the characters `x-` or `X-`.
`escape-uri-attributes`	One of the enumerated values `yes` or `no`.
`include-content-type`	One of the enumerated values `yes` or `no`.
`indent`	One of the enumerated values `yes` or `no`.
`media-type`	A string of Unicode characters specifying the media type (MIME content type); the charset parameter of the media type must not be specified explicitly in the value of the `media-type` parameter.
`method`	An expanded-QName with a null namespace URI, and the local part of the name equal to one of `xml`, `xhtml`, `html` or `text`, or having a non-null namespace URI. If the namespace URI is non-null, the parameter specifies an implementation-defined output method.
`normalization-form`	One of the enumerated values `NFC`, `NFD`, `NFKC`, `NFKD`, `fully-normalized`, `none` or an implementation-defined value.
`omit-xml-declaration`	One of the enumerated values `yes` or `no`.
`standalone`	One of the enumerated values `yes`, `no` or `none`.
`undeclare-namespaces`	One of the enumerated values `yes` or `no`.
`use-character-maps`	A list of pairs, possibly empty, with each pair consisting of a single Unicode character and a string of Unicode characters.
`version`	A string of Unicode characters.
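
As a non-normative illustration, the enumerated rows of the table could be checked as in the following Python sketch. The names `ENUMERATED` and `check_enumerated` are hypothetical, and `normalization-form` is deliberately left out because the table also permits implementation-defined values for it.

```python
YES_NO = {"yes", "no"}

# Parameters whose permitted values are a closed enumeration in the table.
ENUMERATED = {
    "escape-uri-attributes": YES_NO,
    "include-content-type": YES_NO,
    "indent": YES_NO,
    "omit-xml-declaration": YES_NO,
    "undeclare-namespaces": YES_NO,
    "standalone": {"yes", "no", "none"},
}


def check_enumerated(params):
    """Return the sorted names of parameters whose value is not permitted."""
    return sorted(
        name
        for name, value in params.items()
        if name in ENUMERATED and value not in ENUMERATED[name]
    )
```

Given `{"indent": "yes", "standalone": "omit", "version": "1.0"}`, the sketch reports only `standalone`, since `version` is not an enumerated parameter.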

encoding specifies the preferred character encoding that the processor should use for encoding sequences of characters as sequences of bytes. The value of the parameter should be treated case-insensitively, and must contain only characters in the range #x21 to #x7E (that is, printable ASCII characters). The value should either be a charset registered with the Internet Assigned Numbers Authority, or start with X-.

If this parameter is not specified, and the output method does not specify any additional requirements, the encoding used is implementation-defined.
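
The lexical constraints on the encoding parameter can be sketched in Python as follows. This is illustrative only: the function names are hypothetical, and the sketch does not consult the IANA registry, so it cannot tell whether a name is actually registered.

```python
def has_only_printable_ascii(value):
    # Every character must lie in the range #x21 to #x7E (space is excluded).
    return all(0x21 <= ord(ch) <= 0x7E for ch in value)


def is_private_name(value):
    # Values beginning with "x-" or "X-" are allowed without registration.
    return value[:2] in ("x-", "X-")


def same_encoding(a, b):
    # The parameter value is treated case-insensitively.
    return a.lower() == b.lower()
```

Under these checks "UTF-8" and "utf-8" name the same encoding, while "UTF 8" is rejected because it contains a space (#x20).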

cdata-section-elements specifies a list of the names of elements whose text node children are to be output using CDATA sections.

If this parameter is not specified, no elements will be treated specially.
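
The effect of cdata-section-elements can be illustrated with Python's standard `xml.dom.minidom` module. This is a rough sketch rather than a conforming implementation: the helper name `apply_cdata_section_elements` is hypothetical, and it matches plain tag names instead of the expanded-QNames that the parameter actually holds.

```python
from xml.dom import minidom


def apply_cdata_section_elements(doc, names):
    """Replace text node children of the named elements with CDATA sections."""
    for element in doc.getElementsByTagName("*"):
        if element.tagName not in names:
            continue
        # Copy the child list because it is modified during iteration.
        for child in list(element.childNodes):
            if child.nodeType == child.TEXT_NODE:
                cdata = doc.createCDATASection(child.data)
                element.replaceChild(cdata, child)
    return doc


doc = minidom.parseString("<r><code>a &lt; b</code><p>x &amp; y</p></r>")
apply_cdata_section_elements(doc, {"code"})
result = doc.documentElement.toxml()
```

Here the text of `code` is written inside a CDATA section with the `<` unescaped, while the text of `p`, which is not in the list, is still escaped in the usual way.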

doctype-system specifies the system identifier to be used in the document type declaration.

If this parameter is not specified, a system identifier must not be generated. For the XML and XHTML output methods, a public identifier must not be generated either, regardless of the setting of doctype-public.

doctype-public specifies the public identifier to be used in the document type declaration.

If this parameter is not specified, a public identifier must not be generated.
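
The interaction between doctype-system and doctype-public for the XML and XHTML output methods can be sketched in Python as follows. The function name `doctype_declaration` is hypothetical, and two assumptions go beyond the text above: when no identifier may be generated the sketch emits no declaration at all, and the identifiers are assumed to contain no double quote characters.

```python
def doctype_declaration(root_name, doctype_system=None, doctype_public=None):
    """Return a document type declaration, or '' if none is to be output."""
    if doctype_system is None:
        # Without a system identifier, no public identifier is generated
        # either, whatever doctype_public says (XML and XHTML methods only).
        return ""
    if doctype_public is None:
        return f'<!DOCTYPE {root_name} SYSTEM "{doctype_system}">'
    return f'<!DOCTYPE {root_name} PUBLIC "{doctype_public}" "{doctype_system}">'
```

A call with only `doctype_public` set therefore returns the empty string, which mirrors the rule that doctype-public is ignored for these methods unless doctype-system is also specified.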

escape-uri-attributes specifies whether the processor is to escape URI-valued attributes in HTML and XHTML output using the method recommended in (section 2.4.1). The value must be yes or no.