This document is also available in these non-normative formats: XML.
Copyright © 2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document defines serialization for the [XSLT 2.0] and [XQuery 1.0] specifications and any other specifications that reference it.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is a Public Working Draft for review by W3C Members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document describes how [XSLT 2.0] and [XQuery 1.0] convert an instance of the [Data Model] into a sequence of octets. This material has been moved out of the XSLT draft and into a separate document so that it can be shared by both the named specifications and possibly other specifications as well.
XSLT 2.0 and XQuery 1.0 Serialization has been defined jointly by the XSL Working Group and the XML Query Working Group (both part of the XML Activity).
This is a Last Call Working Draft. Comments on this document are due on 15 February 2004. Comments should be sent to the W3C mailing list public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/) with [Serial] at the beginning of the Subject field.
Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page at http://www.w3.org/2002/08/xmlquery-IPR-statements and the XSL Working Group's patent disclosure page at http://www.w3.org/Style/XSL/Disclosures.html.
1 Introduction
2 Serializing Arbitrary Data Models
3 Serialization Parameters
4 XML Output Method
4.1 XML Output Method: the version Parameter
4.2 XML Output Method: the encoding Parameter
4.3 XML Output Method: the indent Parameter
4.4 XML Output Method: the cdata-section-elements Parameter
4.5 XML Output Method: the omit-xml-declaration Parameter
4.6 XML Output Method: the doctype-system and doctype-public Parameters
4.7 XML Output Method: the undeclare-namespaces Parameter
4.8 XML Output Method: Other Parameters
5 XHTML Output Method
6 HTML Output Method
6.1 HTML Output Method: Markup for Elements
6.2 HTML Output Method: Writing Attributes
6.3 HTML Output Method: Indentation
6.4 HTML Output Method: Writing Character Data
6.5 HTML Output Method: Encoding
6.6 HTML Output Method: Document Type Declaration
6.7 HTML Output Method: Other Parameters
7 Text Output Method
8 Character Maps
This document defines serialization of the W3C XQuery 1.0 and XPath 2.0 Data Model, which is the data model of at least [XPath 2.0], [XSLT 2.0], and [XQuery 1.0], and any other specifications that reference it.
Ed. Note: This material has been moved out of the XSLT draft and into a separate document. The Working Groups also considered moving this material directly into the Data Model document, but elected to keep it separate for the moment, principally in order to advance the Data Model to Last Call. In the future, this material may be moved into the Data Model. The Working Groups solicit public opinion about which alternative is superior.
Serialization is the process of converting an instance of the [Data Model] into a sequence of octets. Serialization is well-defined for most data model instances.
Ed. Note: The document assumes the reader already knows generally what serialization is. A brief explanation will be added, especially to disabuse any reader who thinks it might mean Java (or .NET) serialization.
In this specification the words must, must not, should, should not, may, required, and recommended are to be interpreted as described in [RFC2119].
An instance of the data model that is input to the serialization process is a sequence. The serialization process must first place that input sequence into a normalized form for serialization; it is the normalized sequence that is actually serialized. The normalized form for serialization is constructed by applying all of the following rules in order, with the initial sequence being input to the first step, and the sequence that results from any step being used as input to the subsequent step.
Replace an empty sequence with a zero-length string.
If the data model instance contains any atomic values, or sequences that contain atomic values, convert the atomic values to strings: obtain the lexical representation of each value by casting it to an xs:string and replace the value with its string representation. It is a serialization error if the value cannot be cast to xs:string.
Replace each run of adjacent strings in the sequence with a single string equal to the values of those strings concatenated, each separated by a single space.
Replace any string in the sequence with a text node whose string value is equal to the string.
Replace any document node in the sequence with its children.
It is a serialization error if an item in the sequence is an attribute node or a namespace node. Otherwise, create a new document node and make all the items in the sequence, which are all nodes, children of that document node.
The tree rooted in the document node that is created by the final step of this normalization process is the instance of the data model to which the rules of the appropriate output method are applied. If the normalization process results in a serialization error, the processor must signal the error.
Note: The normalization process for a sequence $seq is equivalent to constructing a document node using the XSLT instruction:
<xsl:result-document> <xsl:copy-of select="$seq"/> </xsl:result-document>
or the XQuery expression:
document-node { for $s in $seq return if $s instance of document-node() then $s/child::node() else $s }
This process will fail with certain sequences, for example sequences containing parentless attribute and namespace nodes, or atomic values such as xs:QName and xs:NOTATION that cannot be cast to a string.
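The six normalization steps can be sketched in Python as follows. This is an illustrative sketch only, not part of the specification: the Node class, the SerializationError exception, and the cast_to_string helper are hypothetical names, and only strings, integers, and booleans are treated as atomic values.

```python
from dataclasses import dataclass, field
from typing import List


class SerializationError(Exception):
    """Raised when the input sequence cannot be normalized."""


@dataclass
class Node:
    kind: str                      # "document", "element", "text", "attribute", ...
    value: str = ""                # string value (used here for text nodes)
    children: List["Node"] = field(default_factory=list)


def cast_to_string(value):
    # Assumption: a tiny stand-in for casting an atomic value to xs:string.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (str, int)):
        return str(value)
    raise SerializationError(f"cannot cast {value!r} to xs:string")


def normalize(seq):
    # Step 1: replace an empty sequence with a zero-length string.
    items = list(seq) or [""]
    # Step 2: convert atomic values to strings.
    items = [it if isinstance(it, Node) else cast_to_string(it) for it in items]
    # Step 3: merge adjacent strings, separated by a single space.
    merged = []
    for it in items:
        if isinstance(it, str) and merged and isinstance(merged[-1], str):
            merged[-1] = merged[-1] + " " + it
        else:
            merged.append(it)
    # Step 4: replace each string with a text node.
    nodes = [Node("text", value=it) if isinstance(it, str) else it for it in merged]
    # Step 5: replace each document node with its children.
    flat = []
    for node in nodes:
        flat.extend(node.children if node.kind == "document" else [node])
    # Step 6: reject attribute and namespace nodes, then wrap in a document node.
    for node in flat:
        if node.kind in ("attribute", "namespace"):
            raise SerializationError(f"{node.kind} node cannot be serialized")
    return Node("document", children=flat)
```

For instance, under these assumptions normalize([1, "a", Node("element"), True]) yields a document node whose children are a text node "1 a", the element, and a text node "true".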
There are a number of parameters that influence how serialization is performed. Host languages may allow users to specify any or all of these parameters, but they are not required to be able to do so.
The following serialization parameters are defined:
encoding
specifies the preferred character encoding for encoding sequences of characters as sequences of bytes; the value of the parameter should be treated case-insensitively; the value must contain only characters in the range #x21 to #x7E (i.e. printable ASCII characters); the value should either be a charset registered with the Internet Assigned Numbers Authority [IANA], [RFC2278] or start with X-.
If this parameter is not specified, and the output method does not specify any additional requirements, the encoding used is implementation defined.
cdata-section-elements
specifies a list of the names of elements whose text node children are to be output using CDATA sections.
If this parameter is not specified, no elements will be treated specially.
doctype-system
specifies the system identifier to be used in the document type declaration.
doctype-public
specifies the public identifier to be used in the document type declaration.
escape-uri-attributes
specifies whether the processor is to escape URI-valued attributes in HTML and XHTML output using the method recommended in [RFC2396] (section 2.4.1). The value must be yes or no.
If this parameter is not specified, the value is implementation defined.
include-content-type
specifies whether the serialization process is to add a meta element in HTML and XHTML output. The value must be yes or no.
If this parameter is not specified, the value is implementation defined.
indent
specifies whether the processor may add additional whitespace when outputting the data model; the value must be yes or no.
If this parameter is not specified, the value is implementation defined.
media-type
specifies the media type (MIME content type) [RFC2376] of the data that results from outputting the data model; the charset parameter of the media type must not be specified explicitly.
If this parameter is not specified, the media type is implementation defined.
normalize-unicode
requests conversion of the serialized output to Unicode Normalization Form C as specified in [Unicode Normalization]. The value must be yes or no.
If this parameter is not specified, the value is implementation defined.
omit-xml-declaration
specifies whether the serialization process is to output an XML declaration. The value must be yes or no.
If this parameter is not specified, the value is implementation defined.
standalone
specifies whether the processor is to emit a standalone document declaration and the value of the declaration; the value of the parameter must be yes or no.
undeclare-namespaces
specifies whether namespaces are to be undeclared during serialization; the value must be yes or no.
If this parameter is not specified, the value is implementation defined.
This parameter only applies when the XML serialization method is used and the version is greater than 1.0.
use-character-maps
provides a list of character/string pairs that are used in serialization (see 8 Character Maps).
If this parameter is not specified, no character maps are used.
version
specifies the version of the output method.
If this parameter is not specified, the value is implementation defined.
The method parameter identifies the overall method that should be used for serializing. The value of the method parameter must be a valid QName. If the QName is in no namespace, then it identifies a method specified in this document and must be one of xml, html, xhtml, or text; in this case, the output method specified must be used for serializing. If the QName is in a namespace, then it identifies an implementation-defined output method; the behavior in this case is not specified by this document.
The detailed semantics of each parameter will be described separately for each output method for which it is applicable. If the semantics of a parameter are not described for an output method, then it is not applicable to that output method.
Serialization can be regarded as involving four phases of processing, carried out sequentially as follows:
Markup generation produces the representation of start and end tags for elements, and other constructs such as XML declarations, processing instructions, and so on. This is influenced by the parameters method, doctype-system, doctype-public, include-content-type, indent, omit-xml-declaration, standalone, and version.
Character expansion is concerned with the representation of characters appearing in text and attribute nodes in the data model. The substitution processes that may apply are listed below, in priority order: a character that is handled by one process in this list will be unaffected by processes appearing later in the list:
URI escaping (in the case of URI-valued attributes in the HTML and XHTML output methods), as determined by the escape-uri-attributes parameter.
Creation of CDATA sections, as determined by the cdata-section-elements parameter. Note that this is also affected by the encoding parameter, in that characters not present in the selected encoding cannot be represented in a CDATA section.
Character mapping, as determined by the use-character-maps parameter.
Escaping of special characters according to XML or HTML rules, for example replacing < by &lt;.
Unicode Normalization, if requested by the normalize-unicode parameter. Unicode normalization is applied to the character stream that results after all markup generation and character expansion has taken place.
Encoding, as controlled by the encoding parameter. This converts the character stream produced by the previous phases into a byte stream.
The xml output method outputs the data model as an XML entity that must satisfy the rules for either a well-formed XML document entity or a well-formed XML external general parsed entity, or both, unless the processor is unable to satisfy those rules due to either serialization errors or the requirements of the character expansion phase of serialization, as described in 3 Serialization Parameters. In all other circumstances, the serialized form must comply with the requirements described for the xml output method.
If the document node of the data model has a single element node child and no text node children, and the serialized output is a well-formed XML document entity, the serialized output must conform to the XML Namespaces Recommendation [XML Names]. If the data model does not take this form, and the serialized output is a well-formed XML external general parsed entity, then the serialized output must be an entity which, when referenced within a trivial XML document wrapper like this
<!DOCTYPE doc [ <!ENTITY e SYSTEM "entity-URI"> ]> <doc>&e;</doc>
where entity-URI is a URI for the entity, produces a document which must itself be a well-formed XML document conforming to the XML Namespaces Recommendation [XML Names].
In addition, the output must be such that if a new tree was constructed by parsing the XML document and converting it into a data model as specified in [Data Model], then the new data model would be the same as the starting data model, with the following possible exceptions:
If the document was produced by adding a document wrapper, as described above, then it will contain an extra doc element as the document element.
The order of attribute and namespace nodes in the two trees may be different.
The base URIs of nodes in the two trees may be different.
The new tree may contain additional attributes and text nodes resulting from the expansion of default and fixed values in its DTD or schema.
The type annotations of the nodes in the two trees may be different. Type annotations in a result tree are discarded when the tree is serialized. Any new type annotations obtained by parsing the document will depend on whether the serialized XML document is assessed against a schema, and this may result in type annotations that are either more or less precise than those in the original result tree.
Note: In order to permit such type annotations to be available in a data model that results from processing a serialized XML document, the process that creates the input data model could create it so that the serialized form uses mechanisms provided by [XML Schema], such as the xsi:type and xsi:schemaLocation attributes.
Additional namespace nodes may be present in the new tree if the serialization process undeclared namespaces, as described in 4.7 XML Output Method: the undeclare-namespaces Parameter, and the starting data model contained an element node with a namespace node that declared some prefix, but a child element of that node did not have any namespace node that declared the same prefix.
Additional nodes may be present in the new tree, and the values of attribute nodes and text nodes in the new tree may be different from those in the original tree, due to the character expansion phase of serialization.
A consequence of this rule is that certain whitespace characters must be output as character references, to ensure that they survive the round trip through serialization and parsing. Specifically, CR characters in text nodes must be written as &#xD; or an equivalent; while CR, NL, and TAB characters in attribute nodes must be output respectively as &#xD;, &#xA;, and &#x9;, or their equivalents.
For example, an attribute with the value "x" followed by "y" separated by a newline will result in the output "x&#xA;y" (or with any equivalent character reference). The XML output cannot be "x" followed by a literal newline followed by a "y" because after parsing, the attribute value would be "x y" as a consequence of the XML attribute normalization rules.
Note: To anticipate the proposed changes to end-of-line handling in XML 1.1, implementations may also output the characters x85 and x2028 as character references. This will not affect the way they are interpreted by an XML 1.0 parser.
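The whitespace rule above can be illustrated with a small Python sketch. The function names escape_text and escape_attribute are hypothetical, and the handling of the markup characters (&, <, > and the double quote) is an assumption added so that the example is complete; only the treatment of CR, NL, and TAB is taken from the text above.

```python
def escape_text(value: str) -> str:
    """Escape a text node value; CR must survive as a character reference."""
    out = value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return out.replace("\r", "&#xD;")


def escape_attribute(value: str) -> str:
    """Escape an attribute value delimited by double quotes."""
    out = value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    # CR, NL, and TAB would otherwise be normalized to spaces by the parser.
    for char, reference in (("\r", "&#xD;"), ("\n", "&#xA;"), ("\t", "&#x9;")):
        out = out.replace(char, reference)
    return out
```

Note that a newline in a text node is left untouched, because an XML parser preserves it, whereas the same character in an attribute value is written as a character reference.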
It is a serialization error to request the output of a document type declaration, or of a standalone parameter, if the data model contains text nodes or multiple element nodes as children of the root node. The processor must either signal the error, or recover by ignoring the request to output a document type declaration or standalone parameter.
The result of serialization using the XML output method is not guaranteed to be well-formed XML if character maps have been specified (see 8 Character Maps) or if nodes in the data model contain characters that are invalid in XML (introduced, perhaps, by calling a user-written extension function: this is an error but the processor is not required to signal it).
4.1 XML Output Method: the version Parameter
The version parameter specifies the version of XML to be used for outputting the data model. If the processor does not support this version of XML, it must use a version of XML that it does support. The version output in the XML declaration (if an XML declaration is output) must correspond to the version of XML that the processor used for outputting the data model. The value of the version parameter must match the VersionNum production of the XML Recommendation [XML].
4.2 XML Output Method: the encoding Parameter
The encoding parameter specifies the preferred encoding to use for outputting the data model. Processors are required to respect values of UTF-8 and UTF-16. A serialization error occurs when an output encoding other than UTF-8 or UTF-16 is requested and the implementation does not support that encoding. The processor must signal the error, or recover by using UTF-8 or UTF-16 instead. The processor must not use an encoding whose name does not match the EncName production of the XML Recommendation [XML]. If no encoding parameter is specified, then the processor must use either UTF-8 or UTF-16.
When outputting a newline character in the data model, the implementation is free to represent it using any character sequence that will be normalized to a newline character by an XML parser, unless a specific mapping for the newline character is provided in a character map: see 8 Character Maps.
When outputting any other character that is defined in the selected encoding, the character must be output using the correct representation of that character in the selected encoding.
It is possible that the data model will contain a character that cannot be represented in the encoding that the processor is using for output. In this case, if the character occurs in a context where XML recognizes character references (that is, in the value of an attribute node or text node), then the character must be output as a character reference. A serialization error occurs if such a character appears in a context where character references are not allowed (for example if the character occurs in the name of an element). The processor must signal the error.
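As a hedged illustration of this fallback, Python's standard str.encode method offers an xmlcharrefreplace error handler that writes unencodable characters as decimal character references. The function names and the SerializationError class below are hypothetical; the sketch assumes that character data has already been escaped and that names must be encodable as they stand.

```python
class SerializationError(Exception):
    """Raised when a character cannot be represented where it occurs."""


def encode_character_data(text: str, encoding: str) -> bytes:
    # Characters missing from the encoding become references such as &#8364;
    return text.encode(encoding, errors="xmlcharrefreplace")


def encode_name(name: str, encoding: str) -> bytes:
    # Character references are not recognized inside element or attribute
    # names, so an unencodable character here is a serialization error.
    try:
        return name.encode(encoding)
    except UnicodeEncodeError as exc:
        raise SerializationError(
            f"name {name!r} is not representable in {encoding}"
        ) from exc
```

With ISO-8859-1 as the output encoding, the euro sign (U+20AC) in a text node is written as the reference &#8364;, while the same character in an element name raises the error.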
4.3 XML Output Method: the indent Parameter
If the indent parameter has the value yes, then the xml output method may output whitespace in addition to the whitespace in the data model (possibly based on whitespace stripped from either the source document or the stylesheet, in the case of XSLT, or guided by other means that might depend on the host language, in the case of a data model created using some other process) in order to indent the result nicely; if the indent parameter has the value no, it must not output any additional whitespace. If the xml output method does output additional whitespace, it must use an algorithm to output additional whitespace that satisfies the following constraints:
Whitespace characters must not be added adjacent to a text node that contains non-whitespace characters.
Whitespace may only be added adjacent to an element node, that is, immediately before a start tag or immediately after an end tag.
The new whitespace characters may replace existing whitespace characters in the same position, for example a tab may be inserted as a replacement for existing spaces. However, existing whitespace must not be removed without such a replacement.
Whitespace characters must not be inserted in a part of the result document that is controlled by an xml:space="preserve" attribute.
Note: The effect of these rules is to ensure that whitespace is only added in places where (a) XSLT's <xsl:strip-space> declaration could cause it to be removed, and (b) it does not affect the string value of any element node with simple content. It is usually not safe to indent document types that include elements with mixed content.
4.4 XML Output Method: the cdata-section-elements Parameter
The cdata-section-elements parameter contains a list of expanded-QNames. If the expanded-QName of the parent of a text node is a member of the list, then the text node must be output as a CDATA section, except in those circumstances described below.
If the text node contains the sequence of characters ]]>, then the currently open CDATA section must be closed following the ]] and a new CDATA section opened before the >.
If the text node contains characters that are not representable in the character encoding being used to output the data model, then the currently open CDATA section must be closed before such characters, the characters must be output using character references or entity references, and a new CDATA section must be opened for any further characters in the text node.
CDATA sections must not be used except where they have been explicitly requested by the user, either by using the cdata-section-elements parameter, or by using some other implementation-defined mechanism.
Note: This is phrased to permit an implementor to provide an option that attempts to preserve CDATA sections present in the source document.
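The two exceptions to plain CDATA output (the ]]> sequence and characters missing from the output encoding) can be sketched in Python as below. The function name cdata_section is hypothetical, and the choice of hexadecimal character references for unrepresentable characters is one of the permitted options, picked here for illustration.

```python
def cdata_section(text: str, encoding: str = "utf-8") -> str:
    """Write a text node as one or more CDATA sections."""
    parts = []      # finished pieces of output
    buffer = []     # content of the currently open CDATA section

    def close_section():
        if buffer:
            parts.append("<![CDATA[" + "".join(buffer) + "]]>")
            buffer.clear()

    i = 0
    while i < len(text):
        if text.startswith("]]>", i):
            # Close after "]]" and reopen before ">".
            buffer.append("]]")
            close_section()
            buffer.append(">")
            i += 3
            continue
        char = text[i]
        try:
            char.encode(encoding)
            buffer.append(char)
        except UnicodeEncodeError:
            # Not representable: leave the CDATA section and use a reference.
            close_section()
            parts.append(f"&#x{ord(char):X};")
        i += 1
    close_section()
    return "".join(parts)
```

Under these assumptions, the text a]]>b is written as two adjacent CDATA sections, the first ending with ]] and the second starting with >.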
4.5 XML Output Method: the omit-xml-declaration Parameter
The xml output method must output an XML declaration unless the omit-xml-declaration parameter has the value yes. The XML declaration must include both version information and an encoding declaration. If the standalone parameter is specified, it must include a standalone document declaration with the same value as the value of the standalone parameter. Otherwise, it must not include a standalone document declaration; this ensures that it is both an XML declaration (allowed at the beginning of a document entity) and a text declaration (allowed at the beginning of an external general parsed entity).
The omit-xml-declaration parameter must be ignored if the standalone parameter is present, or if the encoding parameter specifies a value other than UTF-8 or UTF-16.
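The interaction between omit-xml-declaration, standalone, and encoding can be summarized in a short Python sketch. The function name xml_declaration and its keyword arguments are hypothetical, and the default values shown are assumptions made for the example rather than defaults defined by this document.

```python
from typing import Optional


def xml_declaration(version: str = "1.0",
                    encoding: str = "UTF-8",
                    standalone: Optional[str] = None,
                    omit: bool = False) -> str:
    """Return the XML declaration to output, or "" if it may be omitted."""
    is_utf = encoding.upper() in ("UTF-8", "UTF-16")
    # The request to omit is ignored when standalone is present or when the
    # encoding is something other than UTF-8 or UTF-16.
    if omit and standalone is None and is_utf:
        return ""
    declaration = f'<?xml version="{version}" encoding="{encoding}"'
    if standalone is not None:
        declaration += f' standalone="{standalone}"'
    return declaration + "?>"
```

For example, xml_declaration(encoding="ISO-8859-1", omit=True) still returns a declaration, because a parser could not otherwise detect that encoding.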
4.6 XML Output Method: the doctype-system and doctype-public Parameters
If the doctype-system parameter is specified, the xml output method must output a document type declaration immediately before the first element. The name following <!DOCTYPE must be the name of the first element, if any. If the doctype-public parameter is also specified, then the xml output method must output PUBLIC followed by the public identifier and then the system identifier; otherwise, it must output SYSTEM followed by the system identifier. The internal subset must be empty. The doctype-public parameter must be ignored unless the doctype-system parameter is specified.
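A minimal Python sketch of these rules follows. The function name doctype_declaration is hypothetical, and the sketch assumes that neither identifier contains a double quote, so that both can be written as double-quoted literals.

```python
from typing import Optional


def doctype_declaration(root_name: str,
                        doctype_system: Optional[str] = None,
                        doctype_public: Optional[str] = None) -> str:
    """Build the document type declaration, or "" if none is required."""
    if doctype_system is None:
        # doctype-public on its own is ignored.
        return ""
    if doctype_public is not None:
        return f'<!DOCTYPE {root_name} PUBLIC "{doctype_public}" "{doctype_system}">'
    return f'<!DOCTYPE {root_name} SYSTEM "{doctype_system}">'
```

The returned string has an empty internal subset, as required, and is intended to be written immediately before the first element.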
4.7 XML Output Method: the undeclare-namespaces Parameter
The Data Model allows an element to have fewer in-scope namespaces than its parent. In XML 1.1, this can be represented most accurately by undeclaring namespaces. If undeclare-namespaces is "yes" and the output method is XML and the version is greater than 1.0, serialization must undeclare namespaces.
Consider an element x:foo
with three in-scope namespaces:
<x:foo xmlns:x="http://example.org/x" xmlns:y="http://example.org/y" xmlns:z="http://example.org/z">
Suppose that it has a child element with two in-scope namespaces:
<x:bar xmlns:x="http://example.org/x" xmlns:y="http://example.org/y">...
If namespace undeclaration is in effect, it will be serialized this way:
<x:foo xmlns:x="http://example.org/x" xmlns:y="http://example.org/y" xmlns:z="http://example.org/z"> <x:bar xmlns:z="">...</x:bar> </x:foo>
In XML 1.0, namespace undeclaration is not possible. If the output method is xml and the value of the version parameter is 1.0, namespace undeclaration is not performed, and the undeclare-namespaces parameter is ignored.
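The example above can be reproduced with a small Python sketch that computes the namespace attributes to write on a child element. The function name namespace_attributes is hypothetical, in-scope namespaces are modelled as plain dictionaries from prefix to URI, and only prefixed namespaces are considered (the default namespace is left out to keep the sketch short).

```python
from typing import Dict


def namespace_attributes(parent_ns: Dict[str, str],
                         child_ns: Dict[str, str],
                         undeclare: bool,
                         version: str) -> Dict[str, str]:
    """Return the xmlns:* attributes needed on the child element."""
    attributes = {}
    # Declare prefixes that are new, or bound to a different URI.
    for prefix, uri in child_ns.items():
        if parent_ns.get(prefix) != uri:
            attributes[f"xmlns:{prefix}"] = uri
    # Undeclare prefixes that the child no longer has in scope; this is
    # only possible when the XML version is greater than 1.0.
    if undeclare and float(version) > 1.0:
        for prefix in parent_ns:
            if prefix not in child_ns:
                attributes[f"xmlns:{prefix}"] = ""
    return attributes
```

With the parent bindings for x, y, and z and the child bindings for x and y, the result for version 1.1 is the single attribute xmlns:z="", and for version 1.0 it is empty.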
The xhtml output method serializes the data model as XML, using the HTML compatibility guidelines defined in the XHTML specification.
It is entirely the responsibility of the person or process that creates the data model to ensure that the data model conforms to the [XHTML 1.0] or [XHTML 1.1] specification. It is not an error if the data model is invalid XHTML. Equally, it is entirely under the control of the person or process that creates the data model whether the output conforms to XHTML Strict, XHTML Transitional, XHTML Frameset, or XHTML Basic.
The serialization of the data model follows the same rules as for the xml output method, with the exceptions noted below. These differences are based on the HTML compatibility guidelines published in Appendix C of [XHTML 1.0], which are designed to ensure that as far as possible, XHTML is rendered correctly on user agents designed originally to handle HTML.
Given an empty instance of an XHTML element whose content model is not EMPTY (for example, an empty title or paragraph) the serializer must not use the minimized form. That is, it must output <p></p> and not <p />.
Given an XHTML element whose content model is EMPTY, the serializer must use the minimized tag syntax, for example <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents. The serializer must include a space before the trailing />, e.g. <br />, <hr /> and <img src="karen.jpg" alt="Karen" />.
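The two rules for childless XHTML elements can be sketched in Python as follows. The function name and the XHTML_EMPTY set are hypothetical; the set is assumed to match the list of empty elements that this document gives for HTML 4.0, and attribute values are assumed to be already escaped.

```python
from typing import Dict, Optional

# Assumption: elements whose content model is EMPTY in the XHTML DTDs.
XHTML_EMPTY = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


def serialize_childless_xhtml_element(name: str,
                                      attributes: Optional[Dict[str, str]] = None) -> str:
    """Serialize an element that has no children."""
    attrs = "".join(f' {key}="{value}"' for key, value in (attributes or {}).items())
    if name in XHTML_EMPTY:
        # Minimized form, with a space before the trailing "/>".
        return f"<{name}{attrs} />"
    # Non-EMPTY content model: never use the minimized form.
    return f"<{name}{attrs}></{name}>"
```

Unlike the HTML output method, the lookup here is case-sensitive, since XHTML element names are lower case.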
The serializer should avoid outputting line breaks and multiple whitespace characters within attribute values. These are handled inconsistently by user agents.
The serializer must not use the entity reference &apos; which, although legal in XML and therefore in XHTML, is not defined in HTML and is not recognized by all HTML user agents.
The serializer should output namespace declarations in a way that is consistent with the requirements of the XHTML DTD if this is possible. The DTD requires the declaration xmlns="http://www.w3.org/1999/xhtml" to appear on the html element, and only on the html element. The serializer must output namespace declarations that are consistent with the namespace nodes present in the result tree, but it should avoid outputting redundant namespace declarations on elements where the DTD would make them invalid.
Note: Where the process used to construct the input data model does not provide complete control over the prefix used for an element name in the data model or control of whether the element is in the default namespace (for instance, the XSLT namespace fixup process), implementors are encouraged to provide means or endeavor to preserve the obvious intent of a user to place the html element in the default namespace, wherever possible. For example, implementors of XSLT processors are encouraged to place the html element that results from a literal result element like the following in the default namespace.
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>
If the data model includes a head element in the XHTML namespace, then unless the include-content-type parameter has the value "no", the xhtml output method must add a meta element immediately after the start-tag of the head element specifying the character encoding actually used.
For example,
<head> <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/> ...
The content type should be set to the value given for the media-type parameter; the default value for XHTML is text/html. The value application/xhtml+xml, registered in [RFC3236], may also be used.
If the data model includes a head element that has a meta element child, the processor should replace any content attribute of the meta element, or add such an attribute, with the value as described above, rather than output a new meta element.
Unless the escape-uri-attributes parameter has the value no, the xhtml output method must escape non-ASCII characters in URI attribute values using the method recommended in [RFC2396] (section 2.4.1).
Note: This escaping is deliberately confined to non-ASCII characters, because escaping of ASCII characters is not always appropriate, for example when URIs or URI fragments are interpreted locally by the HTML user agent. Even in the case of non-ASCII characters, escaping can sometimes cause problems. More precise control of URI escaping is therefore available by setting escape-uri-attributes to no, and controlling the escaping of URIs by means of the fn:escape-uri function defined in [Functions and Operators].
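As a rough illustration of escaping that is confined to non-ASCII characters, the Python sketch below replaces each such character with the %HH escapes of its bytes. The function name escape_non_ascii is hypothetical, and the use of UTF-8 to obtain the bytes is an assumption of this sketch; ASCII characters, including spaces and reserved characters, are deliberately left untouched.

```python
def escape_non_ascii(uri: str) -> str:
    """Percent-escape only the non-ASCII characters of a URI attribute value."""
    pieces = []
    for char in uri:
        if ord(char) < 128:
            pieces.append(char)            # ASCII is never escaped here
        else:
            # Assumption: represent the character as UTF-8, then escape each byte.
            pieces.extend(f"%{byte:02X}" for byte in char.encode("utf-8"))
    return "".join(pieces)
```

A value such as café.html is therefore written as caf%C3%A9.html, while an ASCII-only value is returned unchanged.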
Note: As with the XML output method, the XHTML output method outputs an XML declaration unless it is suppressed using the omit-xml-declaration parameter. Appendix C.1 of [XHTML 1.0] provides advice on the consequences of including, or omitting, the XML declaration.
The html output method outputs the data model as HTML.
For example,
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="html"/> <xsl:template match="/"> <html> <xsl:apply-templates/> </html> </xsl:template> ... </xsl:stylesheet>
The version parameter indicates the version of the HTML. The default value is 4.0, which specifies that the result should be output as HTML conforming to the HTML 4.0 Recommendation [HTML].
The html output method must not output an element differently from the xml output method unless the expanded-QName of the element has a null namespace URI; an element whose expanded-QName has a non-null namespace URI must be output as XML. If the expanded-QName of the element has a null namespace URI, but the local part of the expanded-QName is not recognized as the name of an HTML element, the element must be output in the same way as a non-empty, inline element such as span. In particular:
If the result tree contains namespace nodes for namespaces other than the XML namespace, the HTML output method must represent these namespaces using attributes named xmlns or xmlns:prefix in the same way as the XML output method would represent them when the version parameter is set to 1.0.
If the result tree contains elements or attributes whose names have a non-null namespace URI, the HTML output method must generate namespace-prefixed QNames for these nodes in the same way as the XML output method would do when the version parameter is set to 1.0.
Where special rules are defined later in this section for serializing specific HTML elements and attributes, these rules must not be applied to an element or attribute whose name has a non-null namespace URI. However, the generic rules for the HTML output method that apply to all elements and attributes, for example the rules for escaping special characters in the text and the rules for indentation, must be used also for namespaced elements and attributes.
When serializing an element whose name is not defined in the HTML specification, but that is in the null namespace, the HTML output method must apply the same rules (for example, indentation rules) as when serializing a span element. The descendants of such an element must be serialized as if they were descendants of a span element.
When serializing an element whose name is in a non-null namespace, the HTML output method must apply the same rules (for example, indentation rules) as when serializing a div element. The descendants of such an element must be serialized as if they were descendants of a div element.
The html output method must not output an end-tag for empty elements. For HTML 4.0, the empty elements are area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta and param. For example, an element written as <br/> or <br></br> in an XSLT stylesheet must be output as <br>.
The html output method must recognize the names of HTML elements regardless of case. For example, elements named br, BR or Br must all be recognized as the HTML br element and output without an end-tag.
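The two rules above (no end-tag for empty elements, and case-insensitive recognition of element names) can be illustrated with a minimal Python sketch. The function name serialize_childless_element is hypothetical, and the sketch assumes an element in the null namespace with no attributes and no children; it is not a full serializer.

```python
# Empty elements of HTML 4.0, as listed in the text above.
HTML_EMPTY_ELEMENTS = frozenset({
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
})


def serialize_childless_element(name: str) -> str:
    """Serialize an attribute-less, childless element in the null namespace.

    Hypothetical helper: a real serializer would also handle attributes,
    content, indentation and namespaced names.
    """
    # Element names are recognized regardless of case.
    if name.lower() in HTML_EMPTY_ELEMENTS:
        return f"<{name}>"          # empty element: no end-tag
    return f"<{name}></{name}>"     # anything else keeps its end-tag
```

With this sketch, br, BR and Br are all written as a start-tag only, while an element such as span keeps its end-tag.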
The html output method must not perform escaping for the content of the script and style elements.
For example, a script element created by an XQuery direct element constructor or an XSLT literal result element, such as:
<script>if (a &lt; b) foo()</script>
or
<script><![CDATA[if (a < b) foo()]]></script>
must be output as
<script>if (a < b) foo()</script>
A common requirement is to output a script element as shown in the example below:
<script type="text/javascript"> document.write ("<em>This won't work</em>") </script>
This is illegal HTML, for the reasons explained in section B.3.2 of the HTML 4.01 specification. Nevertheless, it is possible to output this fragment, using either of the following constructs:
Firstly, by use of a script element created by an XQuery direct element constructor or an XSLT literal result element:
<script type="text/javascript"> document.write ("<em>This won't work</em>") </script>
Secondly, by constructing the markup from ordinary text characters:
<script type="text/javascript"> document.write ("<em>This won't work</em>") </script>
As the HTML specification points out, the correct way to write this is to use the escape conventions for the specific scripting language. For JavaScript, it can be written as:
<script type="text/javascript"> document.write ("<em>This will work<\/em>") </script>
The HTML 4.01 specification also shows examples of how to write this in various other scripting languages. The escaping must be done manually; the serializer will not do it.
The html output method must not escape "<" characters occurring in attribute values.
If the indent parameter has the value yes, then the html output method may add or remove whitespace as it outputs the data model, so long as it does not change how an HTML user agent would render the output.
Unless the escape-uri-attributes parameter is present and has the value no, the html output method must escape non-ASCII characters in URI attribute values using the method recommended in [RFC2396] (section 2.4.1).
Note: This escaping is deliberately confined to non-ASCII characters, because escaping of ASCII characters is not always appropriate, for example when URIs or URI fragments are interpreted locally by the HTML user agent. Even in the case of non-ASCII characters, escaping can sometimes cause problems. More precise control of URI escaping is therefore available by setting escape-uri-attributes to no, and controlling the escaping of URIs by means of the fn:escape-uri function defined in [Functions and Operators].
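As an illustration of escaping confined to non-ASCII characters, the following Python sketch replaces each non-ASCII character with %HH escapes and leaves every ASCII character untouched. The function name escape_non_ascii is hypothetical, and the sketch assumes that a character is first converted to its UTF-8 octets before escaping; the text above does not spell out that detail.

```python
def escape_non_ascii(uri: str) -> str:
    """Percent-escape only the non-ASCII characters of a URI attribute value.

    Assumption: each non-ASCII character is converted to UTF-8 and every
    resulting octet is written as %HH. ASCII characters, including spaces
    and reserved characters, are deliberately left as they are.
    """
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

Under these assumptions a value containing U+00E9 has that character written as the two escapes %C3%A9, while a query string made only of ASCII characters is returned unchanged.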
The html output method must output boolean attributes (that is, attributes with only a single allowed value that is equal to the name of the attribute) in minimized form.
For example, a start-tag created using the following XQuery direct element constructor or XSLT literal result element
<OPTION selected="selected">
must be output as
<OPTION selected>
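A minimal Python sketch of attribute minimization follows. The set BOOLEAN_ATTRIBUTES is an assumption based on the boolean attributes of HTML 4.01 and is not given in this document; the function name serialize_attribute is hypothetical, and comparing the value with the attribute name before minimizing is a choice made for this sketch. No escaping of the attribute value is attempted.

```python
# Assumed list of HTML 4.01 boolean attributes (not taken from this document).
BOOLEAN_ATTRIBUTES = frozenset({
    "checked", "compact", "declare", "defer", "disabled", "ismap",
    "multiple", "nohref", "noresize", "noshade", "nowrap", "readonly",
    "selected",
})


def serialize_attribute(name: str, value: str) -> str:
    """Serialize one attribute, minimizing boolean attributes.

    Hypothetical helper: value escaping is omitted to keep the sketch short.
    """
    is_boolean = name.lower() in BOOLEAN_ATTRIBUTES
    if is_boolean and value.lower() == name.lower():
        return name                  # minimized form, e.g. selected
    return f'{name}="{value}"'       # ordinary name="value" form
```

For the OPTION example above, the selected attribute is written as the bare name, and any non-boolean attribute keeps its quoted value.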
The html output method must not escape a & character occurring in an attribute value immediately followed by a { character (see Section B.7.1 of the HTML 4.0 Recommendation).
For example, a start-tag created using the following XQuery direct element constructor or XSLT literal result element
<BODY bgcolor='&{{randomrbg}};'>
must be output as
<BODY bgcolor='&{randomrbg};'>
If the indent parameter has the value yes, then the html output method may add or remove whitespace as it outputs the result tree, so long as it does not change the way that a conforming HTML user agent would render the output. The default value is yes.
Note: This rule can be satisfied by observing the following constraints:
Whitespace must only be added before or after an element, or adjacent to an existing whitespace character.
Whitespace must not be added or removed adjacent to an inline element. The inline elements are those included in the %inline category of any of the HTML 4.01 DTDs, as well as the INS and DEL elements if they are used as inline elements (i.e., if they do not contain element children).
Whitespace must not be added or removed inside a formatted element, the formatted elements being pre, script, style, and textarea.
Note that the HTML definition of whitespace is different from the XML definition: see section 9.1 of the HTML 4.01 specification.
The html output method may output a character using a character entity reference in preference to using a numeric character reference, if an entity is defined for the character in the version of HTML that the output method is using. Entity references and character references should be used only where the character is not present in the selected encoding, or where the visual representation of the character is unclear (as with &nbsp;, for example).
When outputting a sequence of whitespace characters in the data model, within an element where whitespace is treated normally (but not in elements such as pre and textarea), the html output method may represent it using any character sequence that will be treated as whitespace by an HTML user agent.
Certain characters, specifically the control characters #x7F-#x9F, are legal in XML but not in HTML. It is an error to use the HTML output method when such characters appear in the data model. The processor may signal the error, but is not required to do so. If it does not signal the error, it may copy the offending characters into the serialized output, creating invalid HTML.
The html output method must terminate processing instructions with > rather than ?>.
The encoding parameter specifies the preferred encoding to be used. If there is a HEAD element, then unless the include-content-type parameter is present and has the value "no", the html output method must add a META element immediately after the start-tag of the HEAD element specifying the character encoding actually used.
For example,
<HEAD> <META http-equiv="Content-Type" content="text/html; charset=EUC-JP"> ...
The content type must be set to the value given for the media-type parameter; the default value is text/html.
If the data model includes a head element that has a meta element child, the processor should replace any content attribute of the meta element, or add such an attribute, with the value as described above, rather than output a new meta element.
It is possible that the data model will contain a character that cannot be represented in the encoding that the processor is using for output. In this case, if the character occurs in a context where HTML recognizes character references, then the character must be output as a character entity reference or decimal numeric character reference; otherwise (for example, in a script or style element or in a comment), the processor must signal a serialization error.
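The handling of unrepresentable characters can be sketched in Python using the standard xmlcharrefreplace error handler, which produces decimal numeric character references. The names SerializationError and encode_text are hypothetical, and the boolean flag references_allowed stands in for the serializer's knowledge of the current context (ordinary content versus script, style or a comment).

```python
class SerializationError(Exception):
    """Hypothetical error type standing in for a serialization error."""


def encode_text(text: str, encoding: str, references_allowed: bool) -> bytes:
    """Encode text, falling back to character references where permitted."""
    if references_allowed:
        # Characters missing from the encoding become decimal references.
        return text.encode(encoding, errors="xmlcharrefreplace")
    try:
        # In script, style or comments no reference is recognized,
        # so an unrepresentable character is an error.
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        raise SerializationError(
            f"character not representable in {encoding}"
        ) from exc
```

Under these assumptions, U+00E9 encoded as ASCII in ordinary content is written as the reference &#233;, whereas the same character inside a script element raises the error.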
If the doctype-public or doctype-system parameters are specified, then the html output method must output a document type declaration immediately before the first element. The name following <!DOCTYPE must be HTML or html. If the doctype-public parameter is specified, then the output method must output PUBLIC followed by the specified public identifier; if the doctype-system parameter is also specified, it must also output the specified system identifier following the public identifier. If the doctype-system parameter is specified but the doctype-public parameter is not specified, then the output method must output SYSTEM followed by the specified system identifier.
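The decision logic for the document type declaration can be sketched in Python as follows. The function name doctype_declaration is hypothetical; the sketch assumes the name html is used after <!DOCTYPE and that both identifiers are delimited with double quotes, so identifiers that themselves contain a double quote are not handled.

```python
from typing import Optional


def doctype_declaration(doctype_public: Optional[str] = None,
                        doctype_system: Optional[str] = None) -> Optional[str]:
    """Build the DOCTYPE declaration, or return None if neither is given."""
    if doctype_public is None and doctype_system is None:
        return None  # no declaration is output
    if doctype_public is not None:
        # PUBLIC form; the system identifier follows only if specified.
        declaration = f'<!DOCTYPE html PUBLIC "{doctype_public}"'
        if doctype_system is not None:
            declaration += f' "{doctype_system}"'
    else:
        # Only doctype-system was specified: SYSTEM form.
        declaration = f'<!DOCTYPE html SYSTEM "{doctype_system}"'
    return declaration + ">"
```

The three branches correspond to the three cases in the paragraph above: public identifier alone, public and system identifiers together, and system identifier alone.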
The text output method outputs the data model by outputting the string-value of every text node in the data model in document order without any escaping.
A newline character in the data model may be output using any character sequence that is conventionally used to represent a line ending in the chosen system environment.
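The effect of the text output method can be approximated in Python with the standard xml.etree.ElementTree module, which is used here only as a convenient stand-in for an instance of the data model. The function name text_output is hypothetical; newline conversion and encoding are left out of the sketch.

```python
import xml.etree.ElementTree as ET


def text_output(xml_source: str) -> str:
    """Concatenate all text content in document order, with no escaping."""
    root = ET.fromstring(xml_source)
    # itertext() yields the text of the element and its descendants
    # in document order; markup is dropped and nothing is re-escaped.
    return "".join(root.itertext())
```

Note that characters such as < and & appear literally in the result, because the text output method performs no escaping.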
The media-type parameter is applicable for the text output method.
The encoding parameter identifies the encoding that the text output method must use to convert sequences of characters to sequences of bytes. The default is implementation-defined. A serialization error occurs if the implementation does not support the encoding specified by the encoding parameter. If the data model contains a character that cannot be represented in the encoding that the processor is using for output, the implementation must signal a serialization error.
The default encoding for the text output method is implementation-defined.
The unicode-normalization parameter is applicable for the text output method.
The use-character-maps parameter is applicable for the text output method.
The use-character-maps parameter is a list of characters and corresponding string substitutions.
Character maps allow a specific character appearing in a text or attribute node in the data model to be substituted by a specified string of characters during serialization. The string that is substituted is output "as is", and the serializer performs no checks that the resulting document is well-formed. This mechanism can therefore be used to introduce arbitrary markup in the serialized output.
Character mapping is applied to the characters that actually appear in a text or attribute node in the data model, before any other serialization operations such as escaping or Unicode normalization are applied. If a character is mapped, then it is not subjected to XML or HTML escaping, nor to Unicode normalization. The string that is substituted for a character is not validated or processed in any way by the serializer, except for translation into the target encoding. In particular, it is not subjected to XML or HTML escaping, it is not subjected to Unicode normalization, and it is not subjected to further character mapping. If the string cannot be represented using the target encoding, the serializer takes the same action as it would if the offending characters appeared directly in the data model.
Character mapping is not applied to characters in text nodes whose parent elements are listed in the cdata-section-elements parameter, nor to characters in attribute values that are subject to the URI escaping defined for the HTML and XHTML output methods, unless URI escaping has been disabled using the escape-uri-attributes parameter in the output definition.
On serialization, occurrences in text nodes and attribute values of a character specified in the use-character-maps parameter are replaced by the corresponding string from that parameter.
Note: Using a character map can result in non-well-formed documents if the string contains XML-significant characters. For example, it is possible to create documents containing unmatched start and end tags, references to entities that are not declared, or attributes that contain tags or unescaped quotation marks.
Character mapping is applied to the characters that actually appear in a text or attribute node in the data model, before any other serialization operations such as escaping or Unicode normalization are applied.
Character mapping is not applied to characters for which output escaping has been disabled (disabling output escaping is an [XSLT 2.0] feature), nor to characters in text nodes whose parent elements are listed in the cdata-section-elements parameter, nor to characters in attribute values that are subject to the URI escaping defined for the HTML and XHTML output methods, unless URI escaping has been disabled using the escape-uri-attributes parameter.
If a character is mapped, then it is not subjected to XML or HTML escaping.
A serialization error occurs if character mapping causes the output of a string containing a character that cannot be represented in the encoding that the processor is using for output. The processor must signal the error.
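The interaction between character mapping and escaping can be illustrated with a short Python sketch. The names XML_ESCAPES and serialize_text are hypothetical, the character map is modelled as a plain dictionary from single characters to replacement strings, and only text-node escaping of &, < and > is shown; encoding, Unicode normalization and the cdata-section-elements exception are out of scope.

```python
# Escapes applied to characters in a text node that are NOT mapped.
XML_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def serialize_text(text: str, character_map: dict) -> str:
    """Apply a character map first, then escape the remaining characters."""
    out = []
    for ch in text:
        if ch in character_map:
            # The substituted string is output "as is": it is neither
            # escaped nor passed through the character map again.
            out.append(character_map[ch])
        else:
            out.append(XML_ESCAPES.get(ch, ch))
    return "".join(out)
```

Mapping U+00A0 to the string &nbsp;, for instance, yields a literal entity reference in the output even though an ampersand appearing directly in the text node would still be escaped. The same mechanism lets a map introduce unmatched tags, which is why the serializer makes no well-formedness guarantee for mapped output.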