XSLT 2.0 and XQuery 1.0 Serialization

&doc.prefix;-&doc.date;

W3C Working Draft

&date.day; &date.month; &date.year;

&url.this;
XML
http://www.w3.org/TR/xslt-xquery-serialization/
http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20030502/

Michael Kay, Software AG, Michael.Kay@softwareag.com
Norman Walsh, Sun Microsystems, Norman.Walsh@Sun.COM
Henry Zongaro, IBM, zongaro@ca.ibm.com

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a Public Working Draft for review by W3C Members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document describes how XSLT 2.0 and XQuery 1.0 convert an instance of the XQuery 1.0 and XPath 2.0 Data Model into a sequence of octets. This material has been moved out of the XSLT draft and into a separate document so that it can be shared by both of the named specifications and possibly by other specifications as well.

XSLT 2.0 and XQuery 1.0 Serialization has been defined jointly by the XSL Working Group and the XML Query Working Group (both part of the XML Activity).

This is a Last Call Working Draft. Comments on this document are due on 15 February 2004. Comments should be sent to the W3C mailing list public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/) with [Serial] at the beginning of the Subject field.

Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page at http://www.w3.org/2002/08/xmlquery-IPR-statements and the XSL Working Group's patent disclosure page at http://www.w3.org/Style/XSL/Disclosures.html.

This document defines serialization for the XSLT 2.0 and XQuery 1.0 specifications and any other specifications that reference it.

Introduction

This document defines serialization of the W3C XQuery 1.0 and XPath 2.0 Data Model, which is the data model of at least XPath 2.0, XSLT 2.0, and XQuery 1.0, and any other specifications that reference it.

This material has been moved out of the XSLT draft and into a separate document. The Working Groups also considered moving this material directly into the Data Model document, but elected to keep it separate for the moment, principally in order to advance the Data Model to Last Call. In the future, this material may be moved into the Data Model. The Working Groups solicit public opinion about which alternative is superior.

Serialization is the process of converting an instance of the data model into a sequence of octets. Serialization is well-defined for most data model instances.

The document assumes the reader already knows generally what serialization is. A brief explanation will be added, especially to disabuse any reader who thinks it might mean Java (or .NET) serialization.

In this specification the words must, must not, should, should not, may, required, and recommended are to be interpreted as described in RFC 2119.

Serializing Arbitrary Data Models

The XQuery 1.0 and XPath 2.0 Data Model is richer and less constrained than XML. There are valid instances of the data model that have no direct analog in XML. In particular, data model instances can contain typed values, sequences, and sequences of typed values. And whereas XML deals only with documents, data model instances can have as their root any node type, simple value, or sequence and may even be empty.

This section describes how to convert an arbitrary data model instance into one of several simplified forms. We then describe how these forms are serialized. This greatly simplifies the sections which follow. Implementations are not required to implement serialization of arbitrary data model instances in this way, provided that they produce the same results as this conceptual model.

If the data model instance contains any typed or untyped atomic values, or sequences that contain typed or untyped atomic values, convert them to strings: obtain the lexical representation of each value by casting it to an xs:string and replace the value with its string representation. If the value cannot be cast to xs:string, serialization of the data model is undefined.

If adjacent strings occur in a sequence, replace both values with their concatenation separated by a single space.

If empty sequences occur, replace them with the empty string.

To complete the simplification, perform the following steps iteratively until a simplest form is reached:

If the data model instance has as its root an attribute or namespace node, or a QName value, or if it has as its root a sequence which contains one of these items, serialization is undefined.

If the data model instance has as its root a single document node, or an element, processing instruction, comment, or text node, or a sequence of only element, processing instruction, comment, and text nodes, it is already in its simplest form.

If the data model instance has as its root a sequence of document nodes, or a sequence which contains document nodes, replace each document node with its children in document order.

If the data model instance has as its root a string value, or a sequence which contains one or more string values, replace each string value with a text node that contains the same string.

If there are any remaining string values among the children of elements in the data model instance, replace them with text nodes that contain the same string values and merge adjacent text nodes.

An instance of the data model that is input to the serialization process is a sequence. The serialization process must first place that input sequence into a normalized form for serialization; it is the normalized sequence that is actually serialized. The normalized form for serialization is constructed by applying all of the following rules in order, with the initial sequence being input to the first step, and the sequence that results from any step being used as input to the subsequent step.

Replace an empty sequence with a zero-length string.

If the data model instance contains any atomic values, or sequences that contain atomic values, convert the atomic values to strings: obtain the lexical representation of each value by casting it to an xs:string and replace the value with its string representation. It is a serialization error if the value cannot be cast to xs:string.

Replace each run of adjacent strings in the sequence with a single string formed by concatenating the strings in order, with a single space between each pair.

Replace any string in the sequence with a text node whose string value is equal to the string.

Replace any document node in the sequence with its children.

It is a serialization error if an item in the sequence is an attribute node or a namespace node. Otherwise, create a new document node and make all the items in the sequence, which are all nodes, children of that document node.

The tree rooted in the document node that is created by the final step of this normalization process is the instance of the data model to which the rules of the appropriate output method are applied. If the normalization process results in a serialization error, the processor must signal the error.
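The six normalization rules above can be sketched in Python as follows. This is an illustrative model only, not part of the specification: the Node and QName classes, the SerializationError exception, and the cast_to_string helper are hypothetical names introduced for the sketch, and Python's str() is only an approximation of a cast to xs:string.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


class SerializationError(Exception):
    """Hypothetical exception for the cases the text calls a serialization error."""


@dataclass
class Node:
    # kind is one of: "document", "element", "text", "comment",
    # "processing-instruction", "attribute", "namespace"
    kind: str
    name: Optional[str] = None
    value: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class QName:
    """Hypothetical stand-in for an atomic value that cannot be cast to xs:string."""
    local: str


Item = Union[Node, QName, str, int, float, bool]


def cast_to_string(value) -> str:
    # Assumption: str() approximates the xs:string cast; booleans are
    # special-cased so that they come out as "true" / "false".
    if isinstance(value, QName):
        raise SerializationError("value cannot be cast to xs:string")
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def normalize(seq: List[Item]) -> Node:
    # Rule 1: replace an empty sequence with a zero-length string.
    items = list(seq) if seq else [""]
    # Rule 2: convert every atomic value to its string representation.
    items = [i if isinstance(i, Node) else cast_to_string(i) for i in items]
    # Rule 3: merge adjacent strings, separated by a single space.
    merged: list = []
    for item in items:
        if isinstance(item, str) and merged and isinstance(merged[-1], str):
            merged[-1] = merged[-1] + " " + item
        else:
            merged.append(item)
    # Rule 4: replace each string with a text node.
    nodes = [Node("text", value=i) if isinstance(i, str) else i for i in merged]
    # Rule 5: replace each document node with its children.
    flat: List[Node] = []
    for node in nodes:
        flat.extend(node.children if node.kind == "document" else [node])
    # Rule 6: attribute and namespace nodes are errors; otherwise wrap
    # all remaining nodes in a new document node.
    for node in flat:
        if node.kind in ("attribute", "namespace"):
            raise SerializationError(f"cannot serialize a {node.kind} node")
    return Node("document", children=flat)
```

For example, normalize([1, "a", Node("element", name="e"), True]) yields a document node whose children are a text node with value "1 a", the element e, and a text node with value "true".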

The normalization process for a sequence $seq is equivalent to constructing a document node using the XSLT instruction:

<xsl:result-document> <xsl:copy-of select="$seq"/> </xsl:result-document>

or the XQuery expression:

document { for $s in $seq return if ($s instance of document-node()) then $s/child::node() else $s }

and then serializing the document node as described for the applicable output method, or in an implementation-defined manner.

This process will fail with certain sequences, for example sequences containing parentless attribute and namespace nodes, or atomic values such as xs:QName and xs:NOTATION that cannot be cast to a string. Such a failure results in a serialization error; the processor must signal the error.

Serialization Parameters

There are a number of parameters that influence how serialization is performed. Host languages may allow users to specify any or all of these parameters, but they are not required to be able to do so.

The following serialization parameters are defined:

Here and throughout the document, the distinction between "should" and "must" will be revisited. When serialization was described in the XSLT specification, use of "should" helped to clarify that the serialization process was optional. Now that it's described here in a standalone specification, many of those clauses should use "must".

encoding specifies the preferred character encoding that the processor should use for encoding sequences of characters as sequences of bytes. The value of the parameter should be treated case-insensitively. The value must contain only characters in the range #x21 to #x7E (i.e. printable ASCII characters). The value should either be a charset registered with the Internet Assigned Numbers Authority, or start with X-.

If this parameter is not specified, and the output method does not specify any additional requirements, the encoding used is implementation defined.
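The constraints on the value of the encoding parameter can be checked mechanically. The following Python sketch is illustrative only: the function names are hypothetical, and it does not attempt to consult the IANA registry, so it covers only the character-range rule, the case-insensitive comparison, and the X- prefix.

```python
def check_encoding_name(value: str) -> str:
    """Validate an encoding parameter value and return it lower-cased.

    Lower-casing is one way to honor the case-insensitive treatment of
    the value; it is a choice made for this sketch.
    """
    for ch in value:
        # Only characters in the range #x21 to #x7E are permitted.
        if not 0x21 <= ord(ch) <= 0x7E:
            raise ValueError(f"illegal character {ch!r} in encoding name")
    return value.lower()


def same_encoding(a: str, b: str) -> bool:
    """Compare two encoding names case-insensitively."""
    return check_encoding_name(a) == check_encoding_name(b)


def is_private_name(value: str) -> bool:
    """True if the name starts with X- (in any case), i.e. is not expected
    to be registered with IANA."""
    return check_encoding_name(value).startswith("x-")
```

With these helpers, "UTF-8" and "utf-8" compare as equal, "X-Custom" is recognized as a private name, and a value containing a space is rejected because #x20 lies outside the permitted range.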

cdata-section-elements specifies a list of the names of elements whose text node children are to be output using CDATA sections.

If this parameter is not specified, no elements will be treated specially.
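As a rough illustration of the effect of cdata-section-elements, the Python sketch below writes a text node either as a CDATA section or as escaped character data, depending on the name of its parent element. The function name is hypothetical, element names are simplified to plain strings, and the handling of the sequence "]]>" (closing the section and opening a new one) is an assumption based on the syntax of XML CDATA sections rather than on the text above. Characters that cannot be represented in the output encoding are not considered.

```python
def serialize_text(parent_name: str, text: str,
                   cdata_section_elements: frozenset = frozenset()) -> str:
    """Serialize one text node whose parent element is parent_name."""
    if parent_name in cdata_section_elements:
        # A CDATA section cannot contain "]]>", so split it across two
        # sections: "]]" ends the first and ">" starts the second.
        body = text.replace("]]>", "]]]]><![CDATA[>")
        return "<![CDATA[" + body + "]]>"
    # Default treatment: escape markup-significant characters.
    # "&" must be replaced first so that later escapes are not re-escaped.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))
```

For example, with cdata_section_elements set to frozenset({"script"}), the text "a < b" inside a script element is written as "<![CDATA[a < b]]>", while the same text inside any other element is written as "a &lt; b".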

doctype-system specifies the system identifier to be used in the document type declaration.

If this parameter is not specified, a system identifier must not be generated. For XML and XHTML output methods, a public identifier must not be generated either, regardless of the setting of doctype-public.

doctype-public specifies the public identifier to be used in the document type declaration.

If this parameter is not specified, a public identifier must not be generated.
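The interaction between doctype-system and doctype-public for the XML and XHTML output methods can be sketched as below. This Python fragment is illustrative only: the function name and the root_name argument are hypothetical, the sketch assumes that no document type declaration is written at all when doctype-system is absent, and it assumes that neither identifier contains a double quote.

```python
from typing import Optional


def doctype_declaration(root_name: str,
                        doctype_system: Optional[str] = None,
                        doctype_public: Optional[str] = None) -> str:
    """Build the document type declaration for XML or XHTML output.

    Returns an empty string when no declaration is to be written.
    """
    if doctype_system is None:
        # Without a system identifier, neither identifier is generated,
        # regardless of the setting of doctype-public.
        return ""
    if doctype_public is not None:
        return (f'<!DOCTYPE {root_name} PUBLIC '
                f'"{doctype_public}" "{doctype_system}">')
    # Only a system identifier was supplied.
    return f'<!DOCTYPE {root_name} SYSTEM "{doctype_system}">'
```

Calling doctype_declaration("html", doctype_public="-//W3C//DTD XHTML 1.0 Strict//EN") returns an empty string, because the public identifier is ignored when no system identifier is given.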

escape-uri-attributes specifies whether the processor is to escape URI-valued attributes in HTML and XHTML output using the method recommended in (section 2.4.1). The value must be yes or no.