4 Phases of Serialization

Serialization comprises three phases of processing (preceded optionally by the sequence normalization process described in 2 Sequence Normalization).

For an implementation-defined output method, any of these phases MAY be skipped or MAY be performed in a different order than is specified here. For the output methods defined in this specification, these phases are carried out sequentially as follows:

  1. Markup generation produces the character representation of those parts of the serialized result that describe the structure of the normalized sequence. In the cases of the XML, HTML and XHTML output methods, this phase produces the character representations of the following:

    In the cases of the XML and XHTML output methods, this phase also produces the following:

    In the case of the text output method, this phase has no effect.

  2. Character expansion is concerned with the representation of characters appearing in text and attribute nodes in the normalized sequence. For each text and attribute node, the following rules are applied in sequence.

    1. If the node is an attribute that is a URI attribute value and the escape-uri-attributes parameter is set to require escaping of URI attributes, apply URI escaping as defined below, and skip rules 2-5. Otherwise, continue with rule 2.

      [Definition: URI escaping consists of the following three steps applied in sequence to the content of URI attribute values:

      1. normalize to NFC using the method defined in Section 7.4.6 fn:normalize-unicode [FO]

      2. percent-encode any special characters in the URI using the method defined in Section 7.4.12 fn:escape-html-uri [FO]

      3. escape according to HTML rules any characters (such as < and &) where HTML requires escaping, and any characters that cannot be represented in the selected encoding. For example, replace < with &lt;. (See also section 7.3 Writing Character Data)

      ]

      [Definition: The values of attributes listed in C List of URI Attributes are URI attribute values. Attributes are not considered to be URI attributes simply because they are namespace declaration attributes or have the type annotation xs:anyURI.]

    2. If the node is a text node whose parent element is selected by the rules of the cdata-section-elements parameter for the applicable output method, create CDATA sections as described below, and skip rules 3-5. Otherwise, continue with rule 3.

      Apply the following two processes in sequence to create CDATA sections:

      1. Apply Unicode Normalization if requested by the normalization-form parameter.

      2. Apply changes as detailed in the description of the cdata-section-elements parameter for the applicable output method.

    3. Apply character mapping as determined by the use-character-maps parameter for the applicable output method. For characters that were substituted by this process, skip rules 4 and 5. For the remaining characters that were not modified by character mapping, continue with rule 4.

    4. Apply Unicode Normalization if requested by the normalization-form parameter.

      [Definition: Unicode Normalization is the process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence, as specified in [UAX #15: Unicode Normalization Forms]. For specific recommendations for character normalization on the World Wide Web, see [Character Model for the World Wide Web 1.0: Normalization].]

      The meanings associated with the possible values of the normalization-form parameter are defined in section 5.1.8 XML Output Method: the normalization-form Parameter.

      Continue with rule 5.

    5. Escape according to XML or HTML rules, as determined by the applicable output method, any characters (such as < and &) where XML or HTML requires escaping, and any characters that cannot be represented in the selected encoding. For example, replace < with &lt;. (See also section 7.3 Writing Character Data). For characters such as > where XML defines a built-in entity but does not require its use in all circumstances, it is implementation-dependent whether the character is escaped.
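    The character expansion rules above can be illustrated with a minimal Python sketch. It is not part of this specification and makes several simplifying assumptions: the function names (escape_markup, uri_escape, expand_text) and the char_map dictionary are hypothetical, CDATA section handling is omitted, quote escaping inside attribute values is omitted, only < and & are escaped (escaping > is left out, as permitted), and Unicode Normalization is applied separately to each run of characters that character mapping left untouched.

```python
import unicodedata
from typing import Dict, List, Optional


def escape_markup(text: str, encoding: str) -> str:
    """Escape < and &, and replace unencodable characters with character references."""
    out: List[str] = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        else:
            try:
                ch.encode(encoding)
                out.append(ch)
            except UnicodeEncodeError:
                # Not representable in the selected encoding.
                out.append(f"&#x{ord(ch):X};")
    return "".join(out)


def uri_escape(value: str, encoding: str = "utf-8") -> str:
    """URI escaping: NFC, then percent-encoding, then markup escaping."""
    nfc = unicodedata.normalize("NFC", value)
    # Percent-encode (as UTF-8 octets) every character outside printable
    # US-ASCII (32..126), following the behaviour of fn:escape-html-uri.
    percent_encoded = "".join(
        ch if 32 <= ord(ch) <= 126
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in nfc
    )
    return escape_markup(percent_encoded, encoding)


def expand_text(
    text: str,
    char_map: Optional[Dict[str, str]] = None,
    normalization_form: Optional[str] = None,
    encoding: str = "utf-8",
) -> str:
    """Character mapping, optional normalization, then escaping."""
    char_map = char_map or {}
    out: List[str] = []
    pending: List[str] = []  # run of characters not touched by character mapping

    def flush() -> None:
        if pending:
            run = "".join(pending)
            if normalization_form:
                run = unicodedata.normalize(normalization_form, run)
            out.append(escape_markup(run, encoding))
            pending.clear()

    for ch in text:
        if ch in char_map:
            flush()
            # Substituted text skips normalization and escaping.
            out.append(char_map[ch])
        else:
            pending.append(ch)
    flush()
    return "".join(out)
```

    For instance, under these assumptions expand_text("x\u00a0<y", {"\u00a0": "&nbsp;"}) emits the mapped string &nbsp; unchanged while still escaping the following < as &lt;.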

  3. Encoding, as controlled by the encoding parameter, converts the character stream produced by the previous phases into an octet stream.

    Note:

    Serialization is only defined in terms of encoding the result as a stream of octets. However, a serializer may provide an option that allows the encoding phase to be skipped, so that the result of serialization is a stream of Unicode characters. The effect of any such option is implementation-defined, and a serializer is not required to support such an option.
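    As an informal illustration of the encoding phase and of the optional behaviour described in the note, the following Python sketch converts a character stream to octets. The function name encode_phase is hypothetical, and passing None for the encoding is only one assumed way a serializer might expose an implementation-defined "skip encoding" option. The sketch relies on the earlier phases having already replaced any character that the selected encoding cannot represent.

```python
from typing import Optional, Union


def encode_phase(chars: str, encoding: Optional[str] = "utf-8") -> Union[bytes, str]:
    """Convert the character stream from the earlier phases to an octet stream."""
    if encoding is None:
        # Assumed implementation-defined option: skip encoding and
        # return the stream of Unicode characters unchanged.
        return chars
    # Strict error handling: an unencodable character at this point would
    # indicate that the character expansion phase did not escape it.
    return chars.encode(encoding, errors="strict")
```

    With this sketch, the same character stream yields different octets depending on the encoding parameter: "caf\u00e9" becomes four octets under iso-8859-1 and five under utf-8.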