XML Normalization

Abstract

XML Normalization defines a means by which XML parsers can produce normalized output of any parsed document. This normalized form is similar to that produced by Canonical XML 1.1 [XML-C14N11], though the two are not interchangeable. Its intent also differs from that of Canonical XML 1.1: it exists primarily to assist clients of XML parser APIs such as SAX [SAX] by ensuring that they are provided XML data in a predefined representation, whether as events or DOM nodes.

Any XML document is part of a set of XML documents that are logically equivalent within an application context, but which vary in physical representation based on syntactic changes permitted by XML 1.0 [XML10] and Namespaces in XML 1.0 [XML-NAMES]. This specification describes a method by which parsers can generate XML events or DOM nodes according to a normalized form that accounts for the permissible changes. It also allows for external specification of certain attributes of this normalized form.

The aim of this standard is to define a means by which a low-overhead streaming XML parser can output events in a manner which can be anticipated by a client of the parser, thus reducing that client's need for additional logic to handle variations in representation. It also provides a supplemental guide to implementing the same algorithm for DOM parsers. It is not intended to provide a canonicalized form of a document as defined by Canonical XML 1.1 [XML-C14N11], and has some incompatibilities with that standard, though its output is frequently similar. However, two semantically equivalent documents will produce similar output when processed using the same normalization parameters and algorithm.

Normalization for Streaming XML Parsers is applicable to XML 1.0. It is not defined for XML 1.1.

1. Introduction

1.1 Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [RFC2119].

1.2 Terminology

See [XML-NAMES] for the definition of QName.

document subset: A document subset is a portion of an XML document that may not include all of the nodes in the document.
normalized form: The normalized form of an XML document is a physical representation of the document produced by the method described in this specification.
normalized XML: The term normalized XML refers to XML that is in normalized form. The XML normalization method is the algorithm defined by this specification that generates the normalized form of a given XML document or document subset. The term XML normalization refers to the process of applying the XML normalization method to an XML document or document subset.
subtree: Subtree refers to one XML element node, and all that it contains. In XPath terminology it is an element node and all its descendant nodes.
DOM: DOM, or Document Object Model, is a model for representing an XML document as a tree structure. The W3C DOM standard [DOM-LEVEL-2-CORE] is one such DOM, but this specification does not require this particular set of DOM APIs; any similar model can be used as long as it has a tree representation of the XML document whose root is a document node and whose descendants are element nodes, attribute nodes, text nodes, etc.
DOM parser: A software module that reads an XML document and constructs a DOM tree.
Event parser: A software module that reads an XML document and posts parsing events to an API client. SAX [SAX] is an example of an event parser.
Stream parser: A software module that reads an XML document and constructs a stream of XML events like "beginElement", "text", "endElement", and exposes an iterator-based API allowing clients to 'pull' these events. StAX [XML-PARSER-STAX] is an example of a stream parser.

1.3 Applications

Since the XML 1.0 Recommendation [XML10] and the Namespaces in XML 1.0 Recommendation [XML-NAMES] define multiple syntactic methods for expressing the same information, XML applications tend to take liberties with changes that have no impact on the information content of the document. XML normalization is designed to be useful to applications that wish to process an XML document with regard to a predetermined semantic representation, allowing clients of a stream or event parser to delegate the handling of differing representations of semantically identical XML documents to the parser itself.

For example, a representation may make use of a well-known XML namespace prefix or it may use one of its own devising. The algorithm defined in this specification can be used to translate those prefixes while parsing, such that the client API need not anticipate multiple prefixes, nor need to manually compare potentially long namespace URIs at every step. This also applies to any XPath or QName values contained within the document.

Another example allows a client to instruct the parser to ignore certain subtrees, or to only return certain subtrees, and whether to report them as DOM elements or as raw text. For example, an XML-RPC request might consist of a document fragment containing protocol information and a document fragment containing response data. This specification allows a stream or event parser client to request that only one of these fragments is parsed and reported; it may also request that the raw text content of the other fragment be reported as a single block of text which can then be fed into a less-able parser further back in the chain. This can provide a performant alternative to the use of XPath expressions in some simple use cases.

Note

Although not stated as a requirement on implementations, nor formally proved to be the case, it is the intent of this specification that if the output generated by normalizing a document according to this specification is itself parsed using the same normalization rules, the output generated by the second normalization will be the same as that generated by the first normalization.

1.4 Limitations

Two XML documents may have differing information content that is nonetheless logically equivalent within a given application context. Although two XML documents are equivalent (aside from limitations given in this section) if their normalized forms are identical, it is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their normalized forms are identical. Such a method is unachievable, in part due to application-specific rules such as those governing unimportant whitespace and equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies established by other W3C Recommendations and Working Drafts. Accounting for these additional equivalence rules is beyond the scope of this work. They can be applied by the application or become the subject of future specifications.

The normalized form of an XML document may not be completely operational within the application context, though the circumstances under which this occurs are unusual.

The difficulties arise due to the loss of the following information not available in the data model:

notations and external unparsed entity references
attribute types in the document type declaration

In the first case, the loss of external unparsed entity references and the notations that bind them to applications means that normalized forms cannot properly distinguish among XML documents that incorporate unparsed data via this mechanism. This is an unusual case precisely because most XML processors currently discard the document type declaration, which discards the notation, the entity's binding to a URI, and the attribute type that binds the attribute value to an entity name. For documents that must be subjected to more than one XML processor, the XML design typically indicates a reference to unparsed data using a URI in the attribute value.

In the second case, the loss of attribute types can affect the normalized form in different ways depending on the type. Attributes of type ID cease to be ID attributes. Hence, any XPath expressions that refer to the normalized form using the id() function cease to operate. The attribute types ENTITY and ENTITIES are not part of this case; they are covered in the first case above. Attributes of enumerated type and of type ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, and NOTATION fail to be appropriately constrained during future attempts to change the attribute value if the normalized form replaces the original document during application processing. Applications can avoid the difficulties of this case by ensuring that an appropriate document type declaration is prepended prior to using the normalized form in further XML processing. This is likely to be an easy task since attribute lists are usually acquired from a standard external DTD subset, and any entity and notation declarations not also in the external DTD subset are typically constructed from application configuration information and added to the internal DTD subset.

1.5 Requirements

Normalization for Streaming XML Parsers solves many of the major issues that have been identified by implementers with Canonical XML 1.0 [XML-C14N] and 1.1 [XML-C14N11]. It thus provides a better alternative to the use of canonicalization algorithms for the purposes outlined in this specification.

1.5.1 Performance

Canonicalization will be slow if the implementation uses the Canonical XML 1.1 specification as a formula without any attempt at optimization. This specification rectifies this problem by incorporating lessons learned from the implementation of that specification. Most mature canonicalization implementations solve the performance problem by inspecting the signature first, to see if it can be canonicalized using a simple tree walk algorithm whose performance is similar to regular XML serialization. If not they fall back to the expensive nodeset-based algorithm.

The use cases that cannot be addressed by the simple tree walk algorithm are mostly edge cases. This specification restricts the input to the normalization algorithm so that implementations can always use the simple tree walk algorithm. This restriction is what makes the specification suitable for direct use as part of a stream or event parser.

C14N 1.x uses an "XPath 1.0 Nodeset" to describe a document subset. This is the root cause of the performance problem and can be solved by not using a nodeset. This specification does not use a nodeset, visits each node exactly once, and only visits the nodes that are being normalized.

1.5.2 Streaming

A streaming implementation is required to be able to process very large documents without holding them all in memory; it should be able to process documents one chunk at a time.

1.5.3 Robustness

Whitespace handling in parser clients frequently means trimming all node contents. This specification provides a means for a parser to perform this duty internally, depending on input from the parser client, and to do so intelligently with regard to QNames and XPaths in content. Specifically, it uses three techniques for normalizing text content:

Optionally remove leading and trailing whitespace from text nodes,
Allow for QNames in content, particularly in the xsi:type attribute,
Optionally rewrite prefixes

1.5.4 Portability

It should be possible to normalize a sub-document in such a way that it may be moved into a completely different XML document while retaining its semantic meaning. This is the goal of Exclusive canonicalization [XML-EXC-C14N], which mostly satisfies this requirement except for the case of namespace prefixes embedded in content. This specification builds on exclusive canonicalization and solves the problem of namespaces in content, allowing parser clients to re-serialize sub-documents into larger documents without knowledge of the larger document's content or structure.

1.5.5 Simplicity

C14N 1.x algorithms are complex and depend on a full XPath library. This increases the work required for scripting languages to make use of them as an XML document pre-processing tool. This specification addresses this issue by not using the complex nodeset model, and therefore not relying completely on XPath.

1.6 Test Cases for Canonical XML 2.0

Test cases for Canonical XML 2.0 are documented in "Test Cases for Canonical XML 2.0" [C14N2-TestCases].

2. XML Normalization

2.1 Data Model

The input to the normalization algorithm consists of an XML document subset and a set of options. The XML document subset can be expressed in two ways: with a DOM model or with a Stream model.

2.1.1 Data Model for DOM Parsers

In the DOM model the XML subset is expressed as:

Inclusion List: Either the document Node D or a list of one or more element nodes E₁, E₂, … E_n.
(If out of this list, one element node E_i is a descendant of another E_j, then that element node E_i is ignored.)
Exclusion List (optional): A list of zero or more element nodes E₁, E₂, … E_m and a list of zero or more attribute nodes A₁, A₂, … A_M.
These attribute nodes should not be namespace declarations or attributes in the xml namespace.

The XML subset consists of all the nodes in the Inclusion list and their descendants, minus all the nodes that are in the Exclusion list and their descendants.

The element nodes in the Inclusion list are also referred to as apex nodes.

Note: This input model is a very limited form of the generic XPath Nodeset that was the input model for Canonical XML 1.x. It is designed to be simple and allow for a high performance algorithm, while still supporting the most essential use cases. Specifically:

This model does not support re-inclusion; i.e. all the exclusions are applied after all the inclusions. It is effectively a simplified form of the XPath Filter 2 model [XMLDSIG-XPATH-FILTER2] with one intersect followed by one optional subtract operation. Re-inclusion complicates the normalization algorithm, especially in the areas of namespace and XML attribute inheritance.
Exclusion is limited to complete subtrees and attribute nodes. Other kinds of nodes (text, comment, PI) cannot be excluded.
Attribute exclusion is also limited, such that namespace declarations and attributes from the XML namespace cannot be excluded.
Some examples of subsets that were permitted in Canonical XML 1.x, but are not permitted by this specification:
- A subset consisting of a single attribute all by itself.
- A subset consisting of an attribute without its owner element.
- A subset consisting of a text node all by itself.
- A subset consisting of a text node without its parent node.
- A subset consisting of an element without some of its text node children.

Note

The DOM model of XML Normalization does not support direct input of an octet stream; the Stream model exists for that purpose. The transformation of such a stream into the input model required for DOM processing by this specification is application-specific and should be defined in specifications that reference or make use of this one.

2.1.2 Data Model for Stream and Event Parsers

In the Stream model, the XML subset is again expressed as an Inclusion List and an Exclusion List. For streaming, however, nodes are identified using a set of simple XPath paths. An empty XPath in the Inclusion list SHALL be interpreted as referring to the document's root element as though its value were /. An empty XPath in the Exclusion list SHALL be ignored.

Specifically, only absolute XPaths are allowed, and only if they are comprised of element names and QNames. In addition, the following special characters and wildcards are permitted:

// to allow for selection of deeply-nested elements.
* to allow for any single unnamed element.

The parser MUST treat the inclusion of any other XPath components as an error, including:

Axes.
Context-node (.) and parent-node (..) references.
Expressions.
Functions.

The purpose of this is to limit the description of included/excluded nodes such that they can be easily compared against a stack of node names or QNames assembled by the parser to keep track of its current location in the document.
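The restricted path syntax and the stack comparison described above can be sketched as follows. This is a minimal illustration, not part of the specification: the function names `parse_path` and `matches`, the internal `**` marker used for `//`, and the simplified name pattern are all assumptions, and an empty path is treated here as matching any single top-level element. An empty path in the Exclusion list is expected to be dropped by the caller before parsing.

```python
import re

# Simplified name pattern (assumption); the full NCName production is wider.
_NAME = r"[A-Za-z_][\w.\-]*"
_STEP = re.compile(rf"\*|(?:{_NAME}:)?{_NAME}")


def parse_path(path):
    """Split a restricted absolute path into steps; '//' becomes the marker '**'."""
    if path in ("", "/"):
        return ["*"]  # assumption: the root element is "any element at depth 1"
    if not path.startswith("/"):
        raise ValueError(f"only absolute paths are allowed: {path!r}")
    steps = []
    i = 0
    while i < len(path):
        if path.startswith("//", i):
            steps.append("**")
            i += 2
        else:  # a single '/'
            i += 1
        j = path.find("/", i)
        j = len(path) if j == -1 else j
        step = path[i:j]
        # Axes, '.', '..', predicates, and functions all fail this check.
        if not _STEP.fullmatch(step):
            raise ValueError(f"unsupported XPath component: {step!r}")
        steps.append(step)
        i = j
    return steps


def matches(steps, stack):
    """True if the element-name stack (root first) is selected by the steps."""
    def walk(si, ki):
        if si == len(steps):
            return ki == len(stack)
        if steps[si] == "**":
            # '//' may skip zero or more intermediate elements.
            return any(walk(si + 1, k) for k in range(ki, len(stack) + 1))
        if ki < len(stack) and steps[si] in ("*", stack[ki]):
            return walk(si + 1, ki + 1)
        return False
    return walk(0, 0)
```

Because the steps are plain names and two wildcards, the comparison needs nothing more than the parser's own stack of QNames; no XPath engine is involved.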

Note

Since XPath 1.0 [XPATH] requires that any namespaced elements be identified by QName, and since the normalization algorithm provides a means to rewrite namespace prefixes, the XPaths used as input MUST use the rewritten prefix values.

2.2 Parameters

Instead of separate algorithms for each variant of normalization, this specification takes the approach of a single algorithm subject to a variety of parameters that change its behavior to address specific use cases.

The following dictionaries define the logical parameters supported by this algorithm. The actual serialization that expresses the parameters in use may be defined as appropriate to specific applications of this specification (e.g., the <ds:CanonicalizationMethod> element in [XMLDSIG-CORE2]).

dictionary QNameAware {
    DOMString Name;
};

2.2.1 Dictionary `QNameAware` Members

Name of type DOMString: The NCName name of an element or attribute.

dictionary Element : QNameAware {
    DOMString NS;
};

2.2.2 Dictionary `Element` Members

NS of type DOMString: The URI of the namespace to which this element belongs.

dictionary QualifiedAttribute : QNameAware {
    DOMString NS;
};

2.2.3 Dictionary `QualifiedAttribute` Members

NS of type DOMString: The URI of the namespace to which this attribute belongs.

dictionary UnqualifiedAttribute : QNameAware {
    DOMString ParentName;
    DOMString ParentNS;
};

2.2.4 Dictionary `UnqualifiedAttribute` Members

ParentNS of type DOMString: The URI of the namespace of this attribute's parent element.
ParentName of type DOMString: The NCName of this attribute's parent element.

dictionary XPath : QNameAware {
    DOMString NS;
};

2.2.5 Dictionary `XPath` Members

NS of type DOMString: The URI of the namespace to which this element belongs.

dictionary Parameters {
    boolean           IgnoreComments = true;
    boolean           TrimTextNodes = true;
    object            PrefixRewrite = "none";
    QNameAware[]      QNameAware = [];
    QNameAware[]      ReturnCharacters = [];
};

2.2.6 Dictionary `Parameters` Members

IgnoreComments of type boolean, defaulting to true: Whether to ignore comments during normalization.
PrefixRewrite of type object, defaulting to "none": With a string value of "none", prefixes are left unchanged. With a string value of "sequential", prefixes are changed to "n0", "n1", "n2" … except the special prefixes xml and xmlns, which are left unchanged. With a value of type HashMap, prefixes are rewritten only for namespaces whose URIs are defined in the map, except for xml and xmlns as described above.
QNameAware of type array of QNameAware, defaulting to []: A set of nodes whose entire content must be processed as QName-valued for the purposes of normalization, including prefix rewriting and recognition of prefix "visible utilization".
ReturnCharacters of type array of QNameAware, defaulting to []: A set of nodes whose contents should be returned as raw UTF-8 characters, not parsed.
TrimTextNodes of type boolean, defaulting to true: Whether to trim (i.e. remove leading and trailing whitespace from) all text nodes while normalizing. Adjacent text nodes must be coalesced prior to trimming. If an element has an xml:space="preserve" attribute, then text node descendants of that element are not trimmed regardless of the value of this parameter.

All of these parameters MUST be implemented.
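As an illustration only, the logical parameters could be carried in a client API along the following lines. The class names `NodeRef` and `Parameters`, the snake_case field names, and the flattening of the four node dictionaries into one record with a `kind` field are assumptions made for this sketch; the specification leaves the concrete representation to the application. The map form of PrefixRewrite is assumed to be keyed by namespace URI.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class NodeRef:
    """Hypothetical flattening of Element, QualifiedAttribute,
    UnqualifiedAttribute, and XPath into a single record."""
    kind: str             # "element", "qualified-attribute", "unqualified-attribute", "xpath"
    name: str             # NCName of the element or attribute
    ns: str = ""          # namespace URI, where applicable
    parent_name: str = ""  # only for unqualified attributes
    parent_ns: str = ""    # only for unqualified attributes


@dataclass
class Parameters:
    ignore_comments: bool = True
    trim_text_nodes: bool = True
    # "none", "sequential", or a map of namespace URI -> replacement prefix
    prefix_rewrite: Union[str, Dict[str, str]] = "none"
    qname_aware: List[NodeRef] = field(default_factory=list)
    return_characters: List[NodeRef] = field(default_factory=list)

    def __post_init__(self):
        rewrite = self.prefix_rewrite
        if isinstance(rewrite, str):
            if rewrite not in ("none", "sequential"):
                raise ValueError(f"unknown PrefixRewrite value: {rewrite!r}")
        else:
            for uri, prefix in rewrite.items():
                # An empty prefix would promote the namespace to the default,
                # which the specification treats as an error.
                if prefix == "":
                    raise ValueError(f"empty prefix supplied for {uri!r}")
```

A client might then write `Parameters(prefix_rewrite={"http://www.w3.org/2001/XMLSchema": "xsd"})` and rely on the defaults for everything else.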

In the XML Canonicalization space there were two separate canonicalization algorithms - Inclusive Canonicalization [XML-C14N11] and Exclusive Canonicalization [XML-EXC-C14N]. The major difference between these two algorithms is their treatment of namespace declarations and inherited attributes in the xml: namespace. But in the current version of Canonical XML 2.0, Inclusive canonicalization has been removed completely.

Exclusive canonicalization has been far more popular than inclusive, because of its "portability" property. I.e. if a subdocument is signed with exclusive canonicalization, and then this subdocument is moved off to a different XML context, the signature on that subdocument still remains valid. Inclusive canonicalization doesn't have this portability property, however inclusive canonicalization has an advantage over exclusive canonicalization 1.0, when it comes to QNames in content.

Exclusive canonicalization 1.0 only emits namespace declarations that it considers visibly utilized, so if there is a QName embedded in a text node or an attribute node, it does not recognize it. For example, in the attribute xsi:type="xsd:string", the "xsd" prefix is embedded in the content, so Exclusive canonicalization 1.0 will not consider the "xsd" prefix to be visibly utilized and hence will not emit the xsd namespace declaration. Not emitting the declaration makes it susceptible to certain wrapping attacks. Exclusive canonicalization 1.0 offers the "InclusiveNamespace" mechanism to deal with these kinds of prefixes. Any prefixes mentioned in this list will be treated inclusively, i.e. their namespace declarations will be emitted even if they are not used.

XML Normalization addresses the shortcomings of Exclusive Canonicalization 1.0 with the QNameAware parameter. This parameter can be used to list element or attribute nodes that are expected to have QNames. XML Normalization will scan for prefixes in these elements and attributes and consider them to be visibly utilized too. Since this is a superior approach, no equivalent to Inclusive canonicalization is defined in this specification.

Note

The algorithm for prefix scanning doesn't cover all kinds of prefix embedding. For example if a text node's value is a space separated list of QNames, this algorithm will not detect the prefixes of these QNames. It will only detect two kinds of embedding:

When the entire text node or attribute is a QName.
When a text node is an XPath expression containing prefixes.

Inclusive canonicalization also preserves the values of xml: attributes in context; it looks at the ancestors of the subdocument being processed, collects the value of any inheritable xml attributes, specifically xml:lang, xml:space and xml:base, from these ancestor elements, and emits them at the root of the subdocument. Exclusive canonicalization does not do this, as it violates the portability requirement. Likewise, XML Normalization ignores these attributes.

2.3 Processing Model

The basic normalization process consists of traversing the tree and outputting octets for each node. In DOM mode, this is literally an ordered tree traversal, while in Stream mode the traversal involves the parsing and posting of events for each element and node as it is encountered in the input stream.

Input: The XML subset consisting of an Inclusion list and an Exclusion list.

Processing for DOM mode

Sort inclusion list by document order: If the inclusion list only has the document node D there is nothing to sort. Otherwise remove all element nodes E_i that are descendants of some other element node in the inclusion list. Then sort the remaining element nodes E₁, E₂, … E_n by document order.
Normalize each subtree: For each element node E_i or document node D in the sorted list, do a depth first traversal to visit all the descendant nodes in the E_i subtree, and normalize each one of them in-place. While traversing, if the current node is an element and that element is in the exclusion list, prune the traversal, i.e. skip over that element and all its descendants.
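The two DOM-mode steps can be sketched with Python's `xml.etree.ElementTree` standing in for the DOM. The helper names `prepare_inclusion` and `traverse` are hypothetical, and the per-node normalization is reduced to a `visit` callback, so this shows only the ordering and pruning logic under those assumptions.

```python
import xml.etree.ElementTree as ET


def prepare_inclusion(root, inclusion):
    """Drop included elements nested inside other included elements,
    then sort the remaining apex elements by document order."""
    order = {el: i for i, el in enumerate(root.iter())}
    parent = {child: p for p in root.iter() for child in p}
    included = set(inclusion)

    def has_included_ancestor(el):
        p = parent.get(el)
        while p is not None:
            if p in included:
                return True
            p = parent.get(p)
        return False

    apexes = [el for el in included if not has_included_ancestor(el)]
    return sorted(apexes, key=order.__getitem__)


def traverse(node, exclusion, visit):
    """Depth-first walk that skips excluded elements and their descendants."""
    if node in exclusion:
        return  # prune the whole subtree
    visit(node)
    for child in node:
        traverse(child, exclusion, visit)
```

Each included node is visited exactly once, and nodes outside the subset are never touched, which is the property the simple tree walk relies on.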

Processing for Stream mode

Prepare a stack for storing element names: As each start-element token is encountered, add its QName to the stack. As each end-element token is encountered, its QName is removed from the top of the stack.
Parse the input octet-stream: Create events according to whether the current QName stack matches an element in the Inclusion list. If it also matches an element in the Exclusion list then the parser MUST NOT post an event. All attributes of an element must be collected prior to posting events for any attribute, so that namespace processing can correctly determine the utilization state of a given namespace.
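A minimal sketch of the Stream-mode bookkeeping, using Python's `xml.sax` as the underlying tokenizer. The class name `FilteringHandler`, the `included` and `excluded` predicates (which receive the current QName stack as a tuple), and the shape of the recorded events are assumptions for illustration. SAX already delivers all attributes of an element together in `startElement`, so the requirement to collect attributes before posting is satisfied by the underlying API here.

```python
import xml.sax


class FilteringHandler(xml.sax.ContentHandler):
    def __init__(self, included, excluded):
        super().__init__()
        self.included = included    # predicate over the QName stack
        self.excluded = excluded    # predicate over the QName stack
        self.stack = []             # QNames from the root to the current element
        self.include_depth = None   # depth at which reporting was switched on
        self.exclude_depth = None   # depth at which reporting was suppressed
        self.events = []

    def _reporting(self):
        return self.include_depth is not None and self.exclude_depth is None

    def startElement(self, name, attrs):
        self.stack.append(name)
        path = tuple(self.stack)
        if self.include_depth is None and self.included(path):
            self.include_depth = len(self.stack)
        if self._reporting() and self.excluded(path):
            self.exclude_depth = len(self.stack)
        if self._reporting():
            self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        if self._reporting():
            self.events.append(("end", name))
        # Leaving the element that opened an excluded or included region
        # closes that region.
        if self.exclude_depth == len(self.stack):
            self.exclude_depth = None
        if self.include_depth == len(self.stack):
            self.include_depth = None
        self.stack.pop()

    def characters(self, content):
        if self._reporting():
            self.events.append(("text", content))
```

For example, `xml.sax.parseString(data, FilteringHandler(lambda p: p == ("r", "a"), lambda p: p[-1] == "skip"))` would record events only for the `a` subtree, minus any `skip` elements inside it.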

During traversal of each node (or upon encountering each token type), normalize the value depending on its type as follows:

Root Node— Ignore the byte order mark, XML declaration, and anything from within the document type declaration. Continue traversal.
Element Nodes— Normalize the element's QName as appropriate, and process its child nodes, including attributes and namespaces. If the PrefixRewrite parameter is sequential or predefined, the element's QName will be written with the changed prefix.
If the element is identified by the ReturnCharacters parameter, then the source octet-stream for this element is used to replace the element node with a CDATA node. In Stream mode, all text encountered from the start of the start-element token to the end of the corresponding end-element token is reported as a CDATA block. In neither case is any normalization applied to the identified element or its content.
Attribute Nodes- Normalize the node's QName, and modify its string value. The string value of the node is modified by replacing all ampersands (&) with &amp;, all open angle brackets (<) with &lt;, all quotation mark characters with &quot;, and the whitespace characters #x9, #xA, and #xD with character references. The character references are written in uppercase hexadecimal with no leading zeroes (for example, #xD is represented by the character reference &#xD;).
If parameter PrefixRewrite is sequential or predefined and the attribute name has a namespace prefix, the prefix is changed to the rewritten prefix. Also with prefix rewriting enabled, the attribute content is treated specially if the attribute is among those enumerated for the QNameAware parameter. If so, the QName value of the attribute is rewritten with the new prefix.
Namespace Nodes- Process according to the namespace processing rules and include if the namespace is considered visibly utilized at this point. Regardless of utilization, the namespace's details should be recorded as 'in-scope' until the end of the current element.
Text Nodes- The string value is used, except that all ampersands are replaced by &amp;, all open angle brackets (<) are replaced by &lt;, all closing angle brackets (>) are replaced by &gt;, and all #xD characters are replaced by &#xD;.
If parameter TrimTextNodes is true and there is no xml:space="preserve" declaration in context, trim the leading and trailing whitespace. E.g. trim <A> <B/> to <A><B/> and trim <A> this is text </A> to <A>this is text</A>. Whitespace is as defined in [XML10] i.e. it consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.

Note
A DOM parser might split up a long text node into multiple adjacent text nodes, and a Stream parser might report multiple consecutive text tokens, some of which may be empty. Be aware when trimming whitespace in such cases; the net result should be equivalent to doing so as if the adjacent text nodes were concatenated.

Note
When any element is treated as character data due to the effects of the ReturnCharacters parameter, the resulting text node/event SHALL NOT be normalized according to these rules.

If parameter PrefixRewrite is sequential or predefined and if the parent element node is among those enumerated for the QNameAware parameter, then the QName value of the text node is rewritten with the new prefix.
Processing Instruction (PI) Nodes- these are not altered during normalization.
Comment Nodes- Deleted (or not reported) if generating normalized XML without comments. For normalized XML with comments, the comment is unchanged by the normalization algorithm.
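The attribute and text escaping rules, together with trimming across adjacent text chunks, can be sketched as below. The function names are hypothetical. Trimming is applied before escaping in this sketch, which is an assumption: the rules above do not state an order, and trimming first means a leading or trailing carriage return is removed rather than escaped.

```python
XML_WHITESPACE = " \t\r\n"  # #x20, #x9, #xD, #xA as defined in XML 1.0


def escape_attribute(value):
    """Escape an attribute's string value; '&' is handled first so that the
    references introduced by later replacements are not escaped again."""
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("\t", "&#x9;")
                 .replace("\n", "&#xA;")
                 .replace("\r", "&#xD;"))


def escape_text(value):
    """Escape a text node's string value."""
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace(">", "&gt;")
                 .replace("\r", "&#xD;"))


def normalize_text(chunks, trim=True, preserve_space=False):
    """Coalesce adjacent text chunks, optionally trim, then escape."""
    text = "".join(chunks)
    if trim and not preserve_space:
        text = text.strip(XML_WHITESPACE)
    return escape_text(text)
```

Joining the chunks before trimming gives the same result as if the parser had reported a single text node, which is the behavior the note on adjacent text nodes asks for.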

Note

Although some XML models such as DOM don't distinguish namespace declarations from attributes, Normalization needs to treat them separately. In this document, attribute nodes that are actually namespace declarations are referred to as "namespace nodes"; other attributes are called "attribute nodes".

2.4 Namespace Processing

As part of the normalization process, while traversing the subtree, use the following algorithm to look at all the namespace declarations in an element, and decide which ones to output.

2.4.1 Namespace concepts

The following concepts are used in Namespace processing:

Explicit and Implicit namespace declarations

In DOM, there is no special node for namespace declarations; they are present as regular attribute nodes. An "explicit" namespace declaration is an attribute node whose prefix is "xmlns" and whose localName is the prefix being declared.

DOM also allows declaring a namespace "implicitly", i.e. if a new DOM element or attribute is constructed using the createElementNS and createAttributeNS methods, then DOM adds a namespace declaration automatically when serializing the document.

Special namespaces

The "xml" and "xmlns" prefixes are reserved and have special behavior. See [XML-NAMES].

Apex nodes

An apex node is an element node in a document subset having no element node ancestor in the document subset.

Default namespace

The default namespace is declared by xmlns="...". To make the algorithm simpler this will be treated as a namespace declaration whose prefix value is "" i.e. an empty string.

Visibly utilized

This concept is required for exclusive normalization. An element E in the document subset visibly utilizes a namespace declaration, i.e. a namespace prefix P and bound value V, if any of the following conditions are true:

The element E itself has a qualified name that uses the prefix P. (Note if an element does not have a prefix, that means it visibly utilizes the default namespace.)
OR The element E is among those enumerated for the QNameAware parameter, and the QName value of the element uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace)
OR The element E is among those enumerated for the QNameAware parameter, and is listed as an XPathElement. The value of the element is to be interpreted as an XPath 1.0 expression, and any prefixes used in this XPath expression are considered to be visibly utilized.
OR An attribute A of that element has a qualified name that uses the prefix P, and that attribute is not in the exclusion list. (Note that unlike elements, if an attribute doesn't have a prefix, that means it is a locally scoped attribute. It does NOT mean that the attribute visibly utilizes the default namespace.)
OR An attribute A of that element is among those enumerated for the QNameAware parameter, and the QName value of the attribute uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace)
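The conditions above, other than the XPath case, can be sketched as a function that collects the visibly utilized prefixes of one element. The function names and the calling convention (QNames as plain strings, attributes as a dict, QNameAware membership passed in by the caller) are assumptions for this sketch. Extracting prefixes from an XPath expression is omitted because it needs a tokenizer, and excluded attributes are skipped entirely, including for the QNameAware check, which is also an assumption.

```python
def prefix_of(qname):
    """Prefix of a QName, or '' (the default namespace) when there is none."""
    prefix, sep, _local = qname.partition(":")
    return prefix if sep else ""


def visibly_utilized(name, attributes, text="", qname_aware_element=False,
                     qname_aware_attrs=(), excluded_attrs=()):
    """Return the set of prefixes visibly utilized by one element.

    name       -- the element's QName
    attributes -- mapping of attribute QName to string value
    text       -- the element's text content, used only if it is QNameAware
    """
    used = {prefix_of(name)}  # an unprefixed element uses the default namespace
    if qname_aware_element:
        used.add(prefix_of(text.strip()))
    for attr_name, value in attributes.items():
        if attr_name == "xmlns" or attr_name.startswith("xmlns:"):
            continue  # namespace declarations are not attribute nodes here
        if attr_name in excluded_attrs:
            continue
        if ":" in attr_name:
            # Unprefixed attributes are locally scoped and use no namespace.
            used.add(prefix_of(attr_name))
        if attr_name in qname_aware_attrs:
            used.add(prefix_of(value.strip()))
    return used
```

With `xsi:type` listed as QName-aware, an element carrying `xsi:type="xsd:string"` reports `xsd` as utilized, which is the case Exclusive canonicalization 1.0 misses.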

2.4.2 Namespace Prefix Rewriting

When the parameter PrefixRewrite="sequential" or PrefixRewrite="predefined" is set, all the prefixes except "xml" are rewritten to new prefixes. In the normalized output there is a one to one mapping between namespace URIs and rewritten prefixes. E.g. if in the input document fragment, a particular prefix is declared to many different namespace URIs at different parts of the document, during normalization this prefix will get rewritten to different prefixes, one rewritten prefix for each different namespace URI. Similarly if in the input document, many prefixes are declared to the same namespace URI, all of these prefixes will be normalized to the same rewritten prefix.

With PrefixRewrite="sequential" the prefixes are rewritten to "n0", "n1", "n2", … etc.

With PrefixRewrite="predefined" the prefix for any namespace in the predefined set is replaced using the value provided by the input set.

Prefixes are considered for rewriting only when they are visibly utilized, not when they are declared.
Once a namespace URI has been assigned a prefix, it always gets that prefix everywhere in the document.
Element nodes are visited in document order.
At each element node, all the visibly utilized prefixes are considered. The namespace URIs for these visibly utilized prefixes are sorted in lexical order, duplicate namespace URIs are removed, those namespace URIs that have already been assigned prefixes are removed, and then the remaining namespace URIs are assigned prefixes sequentially.

Prefix Rewriting also considers QNames in content, and during normalization the prefixes in these QNames are also rewritten.

Note

With PrefixRewrite="sequential", the normalized output will never have a default namespace, as that is also rewritten into a "nN" style prefix. With PrefixRewrite="predefined" the default namespace is rewritten with an explicit prefix only if one has been specified in the input set. Note that when using predefined it is not possible to promote a namespace to the default by supplying a prefix of "" (the empty string); this is an error.

2.4.3 Namespace processing algorithm

Initialization: For sequential prefix rewriting maintain a counter N. This counter should be set to 0 at the beginning of the normalization process. Also maintain a map of namespace URI to rewritten prefixes; this map should be initialized to empty.

The following steps need to be executed at every Element node E.

Step 1: Create a list of visibly utilized prefixes.

If E itself has a qualified name that uses the prefix P, then P is visibly utilized. Note if E does not have a prefix, that means it visibly utilizes the default namespace.
If an attribute A of that element E has a qualified name that uses the prefix P, and that attribute is not in the exclusion list, then P is visibly utilized. Note that, unlike elements, if an attribute doesn't have a prefix, that means it is a locally scoped attribute. It does NOT mean that the attribute visibly utilizes the default namespace.
If there is a QNameAware parameter, check whether E or any of its attributes is enumerated in it as follows:
- If there is an Element subchild, whose Name and NS attributes match E's localname and namespace respectively, then E is expected to have a single text node child containing a QName. Extract the prefix from this QName, and consider this prefix as visibly utilized.
- If there is a QualifiedAttr subchild, whose Name and NS attributes match one of E's qualified attribute's localname and namespace respectively, then that attribute is expected to contain a QName. Extract this prefix from the QName and consider this prefix as visibly utilized.
- If there is an UnqualifiedAttr subchild, whose Name attribute matches one of E's unqualified attribute's name, and its ParentName and ParentNS attributes match E's localname and namespace respectively, then that attribute is expected to contain a QName. Extract this prefix from the QName and consider this prefix as visibly utilized.
- If there is an XPathElement subchild, whose Name and NS attributes match E's localname and namespace respectively, then E is expected to have a single text node child containing an XPath 1.0 expression. Extract the prefixes from this XPath by using the following algorithm. All of these extracted prefixes should be considered as visibly utilized.
  - Search for single colons : in the XPath expression, but do not consider single colons inside quoted strings. Double colons are used for axes, e.g. in self::node() , "self:" is not a prefix, but an axis name.
  - The prefix will be present just before the single colon. Go backwards from the colon, skip whitespace, and extract the prefix, by collecting characters till the first non NCName match. e.g. in /soap : Body, extract the "soap". The NCName production is defined in [XML-NAMES].
  This can be evaluated using Perl-style regular expressions as follows. Note that the regular expressions here are provided as an example only; they are not normative.
  1. First remove all single quoted and double quoted strings from the XPath, because prefixes cannot be present there, i.e. substitute s/"[^"]*"//g and s/'[^']*'//g. Removing the quoted strings eliminates false positives in the next step.
  2. In the resultant string, search for single colons and get the word just before each colon, i.e. search for matches of m/([\w.-]+)\s*:(?!:)/. The capture group is required, so that the second colon of an axis separator such as self:: cannot match with an empty prefix. Note that prefixes follow the NCName production, i.e. they consist of alphanumeric characters, hyphens, underscores, or dots, but cannot start with a digit, hyphen, or dot. In an NCName, the allowed alphanumeric characters are not just ASCII, but any Unicode alphanumeric characters. The regular expression provided here is a much simplified form of the NCName production.
- If the PrefixRewrite parameter is set to sequential, each of the prefixes found in the above steps needs to be replaced by a new prefix. For efficiency, consider combining this step of searching for prefixes with the subsequent step of replacing them.
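The XPath prefix extraction described above can be sketched in Python as follows. This is a non-normative illustration only: the function name `extract_xpath_prefixes` and the two pattern names are hypothetical, and the character class is the same simplified stand-in for the NCName production used in the regular expressions above (it would, for instance, accept a prefix that starts with a digit).

```python
import re

# Quoted XPath string literals cannot contain prefixes, so they are dropped first.
QUOTED_LITERAL = re.compile(r'"[^"]*"|\'[^\']*\'')
# A run of name characters, optional whitespace, then a colon that is not
# followed by a second colon (which would mark an axis such as "self::").
PREFIX_BEFORE_COLON = re.compile(r"([\w.-]+)\s*:(?!:)")


def extract_xpath_prefixes(expression):
    """Return the prefixes used in an XPath 1.0 expression, in order of first use."""
    without_literals = QUOTED_LITERAL.sub("", expression)
    prefixes = []
    for match in PREFIX_BEFORE_COLON.finditer(without_literals):
        prefix = match.group(1)
        if prefix not in prefixes:
            prefixes.append(prefix)
    return prefixes
```

For the expression `/soap : Body` this yields the single prefix `soap`, while `self::node()` yields no prefixes because both colons belong to the axis separator.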

Create a list containing the namespace declarations for these visibly utilized prefixes. Remove the "xml" prefix from this list if present.

Note

XML Normalization never emits the declaration for the xml or xmlns prefixes. As mentioned in [XML-NAMES] a valid XML document should never have the declaration for xmlns, so XML Normalization should never encounter this declaration. Also a valid XML document can optionally declare the xml prefix, but if present it MUST be bound to http://www.w3.org/XML/1998/namespace. XML Normalization SHOULD ignore this declaration.

Step 2: If the PrefixRewrite parameter is set to "sequential" or "predefined", then compute new prefixes for all the namespace declarations in the list from Step 1, as follows:

Ignore the prefix value in the namespace declaration, and only take the namespace URI. Put all these namespace URIs in a list.
Sort this list of namespace URIs by lexicographic(ascending) order.
Remove duplicates from this list.
Create a list of rewritten namespace declarations as follows:
Iterate through the namespace URI list - if a namespace URI has already been assigned a prefix, use that. Otherwise:
- If PrefixRewrite="sequential", assign a new prefix value "nN" to the namespace URI, where N is the current value of the counter, and then increment the counter. The counter should be set to 0 at the beginning of the normalization process. (E.g. if the value of this counter was 5 when the traversal reached this element, and this element had 3 prefixes to be output, then use the prefixes "n5", "n6", "n7" and set the counter to 8 after that.)
- If PrefixRewrite="predefined", then look in the input set for the namespace's URI. If a match is found, assign the prefix from the match. Otherwise, the prefix remains unchanged.
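Step 2 can be sketched in Python as shown below. This is an illustrative sketch, not a normative implementation: the function `assign_prefixes` and its parameters are hypothetical names, the map of rewritten prefixes is modelled as a plain `dict`, and Python's default string ordering (by code point) is assumed to be an acceptable stand-in for lexicographic order.

```python
def assign_prefixes(namespace_uris, rewritten, counter, predefined=None):
    """Assign rewritten prefixes for one element and return the updated counter.

    namespace_uris: URIs of the visibly utilized declarations (prefixes ignored).
    rewritten:      dict of URI -> rewritten prefix, updated in place.
    counter:        next number to use for sequential "nN" prefixes.
    predefined:     None for sequential mode, else a dict of URI -> prefix.
    """
    # sorted(set(...)) sorts the URIs and removes duplicates in one step.
    for uri in sorted(set(namespace_uris)):
        if uri in rewritten:
            continue  # a URI keeps its prefix everywhere in the document
        if predefined is None:
            rewritten[uri] = f"n{counter}"
            counter += 1
        elif uri in predefined:
            rewritten[uri] = predefined[uri]
        # otherwise (predefined mode, no match) the prefix remains unchanged
    return counter
```

Starting with a counter of 5 and three namespace URIs that have not yet been assigned, the sketch assigns "n5", "n6", and "n7" and returns 8, matching the example in the text.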

Step 3: Filter the list to remove prefixes that have already been output.

Take the list of visibly utilized prefix declarations from Step 1, or if Prefix Rewriting is enabled then the modified list from Step 2.
If any of the namespace declarations in this list has already been output during the normalization of one of E's ancestors, say E_j, and has not been redeclared to a different value by an element between E_j and E, then remove it from this list.

Step 4: Sort this list of namespace declarations in lexicographic (ascending) order of prefixes. In case of prefix rewriting, sort by rewritten prefixes, not original prefixes.
Note that default namespace declaration has no prefix, so it is considered lexicographically least.

Step 5: Output each of these namespace nodes, as specified in the Processing model.
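Steps 3 and 4 amount to a filter followed by a sort. The Python sketch below illustrates this under some assumptions: declarations are modelled as a `dict` of prefix to URI, the declarations already output by ancestors (and still in effect) are modelled the same way, and the default namespace is represented by the empty string. The name `declarations_to_emit` is hypothetical.

```python
def declarations_to_emit(visible, already_output):
    """Return the (prefix, uri) pairs that must be output at this element.

    visible:        dict of prefix -> URI for the visibly utilized declarations
                    (after prefix rewriting, if enabled).
    already_output: dict of prefix -> URI output by an ancestor and still in effect.
    """
    # Step 3: drop declarations an ancestor has already output with the same URI.
    pending = [
        (prefix, uri)
        for prefix, uri in visible.items()
        if already_output.get(prefix) != uri
    ]
    # Step 4: sort by prefix; the empty string (default namespace) sorts first.
    return sorted(pending, key=lambda declaration: declaration[0])
```

Because the default namespace is keyed by the empty string, it naturally sorts ahead of every named prefix.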

2.4.4 Example of normalization with prefix rewriting

The following XML snippet is used to illustrate the various PrefixRewrite options.

Example 1

<wsse:Security  
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
<wsse:UserName wsu:Id="i1">
...
</wsse:UserName>
<wsse:Timestamp wsu:Id="i2">
...
</wsse:Timestamp>
</wsse:Security>

2.4.4.1 With `PrefixRewrite="none"`

Example 2

<wsse:Security 
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
<wsse:UserName
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
wsu:Id="i1">
...
</wsse:UserName>
<wsse:Timestamp
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
wsu:Id="i2">
...
</wsse:Timestamp>
</wsse:Security>

Note how the "wsu" prefix declaration is present in wsse:Security, but is not utilized. Normalization will "push the declaration down" into <UserName> and <Timestamp> where it is really used, i.e. the wsu declaration will be output twice, once in <UserName> and another in <Timestamp>, as shown above.

2.4.4.2 With `PrefixRewrite="sequential"`

Example 3

<n0:Security
xmlns:n0="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
<n0:UserName
xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
n1:Id="i1">
...
</n0:UserName>
<n0:Timestamp
xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
n1:Id="i2">
...
</n0:Timestamp>
</n0:Security>

With sequential prefix rewriting, the "wsse" prefix is rewritten to "n0" and the "wsu" prefix is rewritten to "n1".

2.4.4.3 With `PrefixRewrite="predefined"`

Using the following predefined namespace prefixes:

http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd = "secutil"

Example 4

<wsse:Security 
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
<wsse:UserName
xmlns:secutil="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
secutil:Id="i1">
...
</wsse:UserName>
<wsse:Timestamp
xmlns:secutil="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
secutil:Id="i2">
...
</wsse:Timestamp>
</wsse:Security>

Note that the "wsu" prefix was rewritten to "secutil" while the "wsse" prefix remained unchanged.

2.5 Attribute processing

Note

Namespace declarations are not considered as attributes, they are processed separately as namespace nodes.

Processing the attributes of an element E consists of the following steps:

Ignore any attributes that are present in the exclusion list. However note that namespace nodes cannot be excluded.
Sort all the attributes in increasing lexicographic order with namespace URI as the primary key and local name as the secondary key (an empty namespace URI is lexicographically least).
If it is a qualified attribute and the PrefixRewrite parameter is sequential, modify the QName of the attribute name to use the new prefix. i.e. one of n0, n1, n2, ... etc. Do not do this for the xml prefix, as this is not changed during prefix rewriting.
If the attribute is among those enumerated by the QNameAware parameter, then change the QName in that attribute value to use the new prefix.
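The exclusion and ordering steps can be sketched in Python as follows. The representation is an assumption made for illustration: each attribute is a `(namespace_uri, local_name, value)` tuple, unqualified attributes use the empty string as their namespace URI, and the exclusion list is a set of `(namespace_uri, local_name)` pairs. The function name `order_attributes` is hypothetical.

```python
def order_attributes(attributes, excluded=frozenset()):
    """Drop excluded attributes, then sort by namespace URI and local name."""
    kept = [attr for attr in attributes if (attr[0], attr[1]) not in excluded]
    # The empty namespace URI compares lowest, so unqualified attributes come first.
    return sorted(kept, key=lambda attr: (attr[0], attr[1]))
```

Prefix rewriting of attribute names and of QName-aware attribute values would then be applied to the attributes in the returned list.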

3. Algorithm for DOM Normalization

This section is non-normative.

This section presents an IDL representation of the normalization algorithm for DOM parsers, with function descriptions in the form of pseudocode.

The DOM normalization algorithm consists of two components: a HashMap, which is a simple dictionary mapping namespace URIs to prefixes; and an interface representing the normalizer functionality itself.

3.1 The `HashMap` type

This section is non-normative.

[Constructor]
interface HashMap {
    readonly    attribute unsigned long count;
    getter DOMString valueForKey ([TreatNullAs = EmptyString] DOMString key);
    setter void      setValueForKey ([TreatNullAs = EmptyString] DOMString key, DOMString? value);
    void             removeAll ();
};

3.1.1 Attributes

This section is non-normative.

count of type unsigned long, readonly: The number of items in the map.

3.1.2 Methods

removeAll

Removes all values in the map.

No parameters.

Return type: void

setValueForKey

Assigns a value for a key. A null value removes the entry from the map.

Parameter	Type	Nullable	Optional	Description
key	`DOMString`	✘	✘
value	`DOMString`	✔	✘

Return type: setter void

valueForKey

Fetches a value stored by the given key.

Parameter	Type	Nullable	Optional	Description
key	`DOMString`	✘	✘

Return type: getter DOMString

3.2 The `DOMNormalizer` Interface

[Constructor]
interface DOMNormalizer {
    readonly    attribute unsigned int prefixCounter;
    readonly    attribute HashMap      rewrittenPrefixes;
                attribute Parameters   properties;
                attribute DOMString[]  outputPrefixes;
    void        normalize (object<> inclusionList, object<> exclusionList);
    void        normalizeSubtree (object node);
    void        processNode (object node, HashMap namespaceContext);
    void        processDocument (object documentNode, HashMap namespaceContext);
    void        processElement (object elementNode, HashMap namespaceContext);
    void        processText (object textNode, HashMap namespaceContext);
    void        processComment (object commentNode, HashMap namespaceContext);
    void        addNamespaces (object elementNode, HashMap namespaceContext);
    DOMString[] processNamespaces (object elementNode, HashMap namespaceContext);
};

3.2.1 Attributes

outputPrefixes of type array of DOMString: An array of prefixes which have been output and are thus 'in scope' for the current element.
prefixCounter of type unsigned int, readonly: This is a counter only for prefix rewriting in sequential mode. It is initialized to zero.
properties of type Parameters: The parameters to the normalization process.
rewrittenPrefixes of type HashMap, readonly: A hash table of uri -> rewrittenPrefix. It is initialized to empty. Finding out the rewritten prefix for an original prefix is a two step lookup: first look up the URI for the original prefix in the namespaceContext hash table, then look up the rewritten prefix for the URI in the rewrittenPrefixes hash table.
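The two-step lookup described for rewrittenPrefixes can be sketched in Python as follows. Both hash tables are modelled as plain `dict` objects, and the function name `rewritten_prefix_for` is hypothetical. Falling back to the original prefix when no mapping exists is an assumption, chosen to match the "predefined" mode, where unmatched prefixes remain unchanged.

```python
def rewritten_prefix_for(prefix, namespace_context, rewritten_prefixes):
    """Resolve an original prefix to its rewritten prefix."""
    # Step one: original prefix -> namespace URI.
    uri = namespace_context.get(prefix)
    if uri is None:
        return prefix  # prefix is not declared in this context
    # Step two: namespace URI -> rewritten prefix.
    return rewritten_prefixes.get(uri, prefix)
```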

3.2.2 Methods

addNamespaces

Add namespaces from this element to the namespace context. This function is called for every ancestor element, and also at every element of the subtrees (minus the exclusion set and any subtrees of elements identified by the properties.ReturnCharacters array).

Pseudocode:

addNamespaces(element, namespaceContext)
{
    for each explicit and implicit namespace declaration in the element
    {
        if namespaceContext already has this prefix with the same URI
        {
            do nothing
        }
        else if namespaceContext already has this prefix with a different URI
        {
            update the namespaceContext hash table with the new prefix->URI mapping
            
            if this prefix exists in outputPrefixes
                remove it
        }
        else if namespaceContext doesn't have this prefix
        {
            add the new prefix -> URI mapping to the namespaceContext
        }
    } 
}

Parameter	Type	Nullable	Optional	Description
elementNode	`object`	✘	✘
namespaceContext	`HashMap`	✘	✘

Return type: void
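A possible Python rendering of the addNamespaces pseudocode is shown below. It is a sketch under stated assumptions: the element's declarations are passed in as a `dict` of prefix to URI rather than read from a DOM node, the namespace context is a `dict`, and outputPrefixes is modelled as a `set`. The name `add_namespaces` is hypothetical.

```python
def add_namespaces(declarations, namespace_context, output_prefixes):
    """Merge an element's namespace declarations into the namespace context."""
    for prefix, uri in declarations.items():
        if namespace_context.get(prefix) == uri:
            continue  # same binding already in scope: nothing to do
        if prefix in namespace_context:
            # The prefix is being rebound, so any earlier output of it
            # no longer applies and it must be declared again when used.
            output_prefixes.discard(prefix)
        namespace_context[prefix] = uri
```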

normalize

The top-level normalization function.

Pseudocode:

normalize(list of subtree, list of exclusion elements and attributes)
{
    put the exclusion elements and attributes in hash table for easier lookup
      
    sort the multiple subtrees by document order
      
    for each subtree
    {
        normalizeSubtree(subtree)
    }
}

Parameter	Type	Nullable	Optional	Description
inclusionList	`object<>`	✘	✘
exclusionList	`object<>`	✘	✘

Return type: void

normalizeSubtree

Normalize an individual subtree.

Pseudocode:

normalizeSubtree(node)
{
    if (node is the document node or a document root element) 
    {
        // (whole document is being processed, no ancestors to worry about)
        processNode(node)
    }
    else
    {
        starting from the element, walk up the tree to collect a list of
        ancestors
          
        for each of this node's ancestor elements starting with the document
        root, but not including the element itself 
            addNamespaces(element)
          
        processNode(node)
    }
}

Parameter	Type	Nullable	Optional	Description
node	`object`	✘	✘

Return type: void

processComment

Process a Comment node.

Pseudocode:

processComment(commentNode, namespaceContext)
{
    if properties.IgnoreComments
        remove the node from the DOM
}

Parameter	Type	Nullable	Optional	Description
commentNode	`object`	✘	✘
namespaceContext	`HashMap`	✘	✘

Return type: void

processDocument

Process the Document node.

Pseudocode:

processDocument(document, namespaceContext)
{
    for (each child node)
    {
        processNode(child, namespaceContext)
    }
}

Parameter	Type	Nullable	Optional	Description
documentNode	`object`	✘	✘
namespaceContext	`HashMap`	✘	✘

Return type: void

processElement

Process an Element node.

Pseudocode:

processElement(elementNode, namespaceContext)
{
    if elementNode exists in the exclusion hash table
      return
    
    if elementNode is listed in properties.ReturnCharacters
    {
        serialize elementNode as UTF-8 text
        replace elementNode with a text node containing that text
        return
    }

    make copies of namespaceContext and outputPrefixes in the stack
    //(by copying, any changes made can be undone when this function returns)

    nsToBeOutputList = processNamespaces(elementNode, namespaceContext)
    attributeList = []

    if (properties.PrefixRewrite != "none")
    {
        determine the namespace for elementNode and update its prefix according to
              namespaceContext and rewrittenPrefixes
        elementNode.namespace.prefix = new prefix value
    }

    for each of the namespaces in the nsToBeOutputList
        add appropriate "xmlns" attribute to attributeList

    for each non-namespace attribute in elementNode
    {
        replace/apply namespace prefix according to properties.PrefixRewrite
        if the attribute is listed in properties.QNameAware
            adjust prefixes within its value as appropriate

        add attribute to attributeList
    }

    elementNode.attributes = attributeList

    Loop through all child nodes and call
        processNode(child, copy(namespaceContext))

    remove namespace prefixes in nsToBeOutputList from outputPrefixes
}

Parameter	Type	Nullable	Optional	Description
elementNode	`object`	✘	✘
namespaceContext	`HashMap`	✘	✘

Return type: void

processNamespaces

Process the list of namespaces for this element.

Pseudocode:

processNamespaces(elementNode, namespaceContext)
{
    addNamespaces(elementNode, namespaceContext)

    create a list of visibly utilized prefixes - visiblePrefixes, which includes
        a) the prefix used by the element itself
        b) the prefixes used by all the qualified attributes of the element
        c) the prefix embedded in the attribute value of any QName aware attributes
        d) the prefix embedded in any text node child, if QName aware

    if properties.PrefixRewrite != "none"
    {
        newNamespaceURIs = []    // empty List

        for each prefix in visiblePrefixes
            get the URI for this prefix from the namespaceContext hash table
            check if the URI already exists in the rewrittenPrefixes hash table
            if it does not, add the URI to newNamespaceURIs

        sort the newNamespaceURIs list in lexical order

        if properties.PrefixRewrite = "sequential"
        {
            for each URI in the newNamespaceURIs list
                assign a prefix "nN" where N is value of prefixCounter
                increment prefixCounter by 1
                add the mapping URI -> nN  into the rewrittenPrefixes hash table
        }
        else if properties.PrefixRewrite is a HashMap
        {
            for each URI in the newNamespaceURIs list
                lookup the prefix for this URI in properties.PrefixRewrite
                if there is a prefix
                    add the mapping URI -> prefix into rewrittenPrefixes
        }
    }

    nsToBeOutputList = [] // empty list of prefix -> URI mappings

    for each prefix in visiblePrefixes 
    {
        find the URI that this prefix maps to in the namespaceContext hash table

        if PrefixRewrite != "none"
            convert this prefix to rewrittenPrefix, by using the URI to
            lookup the rewrittenPrefix in the rewrittenPrefixes hash table

        if this prefix (original or rewritten) does not exist in outputPrefixes
            add this prefix to outputPrefixes 
            add the prefix -> URI mapping to nsToBeOutputList
    }

    sort the nsToBeOutputList by the prefix

    return nsToBeOutputList
}

Parameter	Type	Nullable	Optional	Description
elementNode	`object`	✘	✘
namespaceContext	`HashMap`	✘	✘

Return type: DOMString[]

processNode

Redirects to the appropriate node processing function.

Pseudocode:

processNode(node, namespaceContext)
{
    call the appropriate function - processDocument, processElement,
    processText, ... depending on the node type.
}

Parameter	Type	Nullable	Optional	Description
node	`object`	✘	✘
namespaceContext	`HashMap`	✘	✘

Return type: void

processText

Process a Text node.

Pseudocode:

processText(textNode, namespaceContext)
{
    if this text node is outside document root
       return

    in the text replace 
       all ampersands by &amp;, 
       all open angle brackets (<) by &lt;, 
       all closing angle brackets (>) by &gt;, 
       and all #xD characters by &#xD;.

    if properties.TrimTextNodes is true and there is no xml:space="preserve"
            declaration in scope
    {
        if previous node was not a text node
            trim leading whitespace
        if next node is not a text node
            trim trailing whitespace
    }

    if properties.PrefixRewrite != "none" and this text node is a child of
            a QName aware element
    {
        search for embedded prefixes, and replace with rewritten prefixes
    }

    replace the text content of the node with the modified text
}

Parameter	Type	Nullable	Optional	Description
textNode	`object`	✘	✘
namespaceContext	`HashMap`	✘	✘

Return type: void
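The escaping and trimming portion of processText can be sketched in Python as follows. This is an illustration only: the function name `normalize_text` and its flag parameters are hypothetical, the check for an in-scope xml:space="preserve" declaration is assumed to have been folded into the `trim` flag by the caller, and the QName prefix rewriting step is omitted.

```python
XML_WHITESPACE = " \t\r\n"


def normalize_text(text, trim=False, previous_is_text=False, next_is_text=False):
    """Escape special characters and optionally trim, in the pseudocode's order."""
    # Ampersands must be replaced first so that the entities introduced
    # by the later replacements are not escaped a second time.
    text = (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\r", "&#xD;")
    )
    if trim:
        if not previous_is_text:
            text = text.lstrip(XML_WHITESPACE)
        if not next_is_text:
            text = text.rstrip(XML_WHITESPACE)
    return text
```

Because the sketch follows the order of the pseudocode, a carriage return is escaped before trimming takes place and is therefore no longer treated as trailing whitespace.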

4. Algorithm for Streaming Normalization

This section is non-normative.

Unlike DOM parsers, which represent an XML document as a tree of nodes, streaming parsers represent an XML document as a stream of events such as "start-element", "end-element", "text", etc. A document subset can also be represented as a stream of events. This stream of events is in exactly the same order as a tree walk, so the same approach can also be used to normalize an event stream. Below is a description of the SAX2 [SAX] event-handler interface, with comments on the application of normalization to the generated events.

Since this algorithm and that employed for StAX [XML-PARSER-STAX] rely on much the same parsing events, we leave the application of this algorithm to a 'pull' parser up to the reader.

4.1 The `ElementStack` Type

This section is non-normative.

The ElementContext dictionary is used to store information about a single element. One of these is pushed onto the stack during processing of a startElement() event, and it is removed while processing the corresponding endElement() event.

dictionary ElementContext {
    HashMap     namespaceContext = [];
    DOMString[] outputPrefixes = [];
    DOMString   elementQName = "";
    DOMString   localName = "";
    DOMString   prefix = "";
    boolean     isQNameAware = false;
};

4.1.1 Dictionary `ElementContext` Members

This section is non-normative.

elementQName of type DOMString, defaulting to "": The QName of the current element.
isQNameAware of type boolean, defaulting to false: Whether the element is QName aware and must have its contents scanned for mapped prefixes.
localName of type DOMString, defaulting to "": The element's unqualified name. This may be derived from the elementQName property.
namespaceContext of type HashMap, defaulting to []: The current namespace mapping context, for use with prefix rewriting.
outputPrefixes of type array of DOMString, defaulting to []: The list of currently output namespace prefixes, i.e. those that are considered visibly utilized at present.
prefix of type DOMString, defaulting to "": The element's namespace prefix. This may be derived from the elementQName property.

The ElementStack interface implements a basic stack of ElementContext dictionaries. Its push() operation duplicates some of the properties of the current top-of-stack object for you.

[Constructor]
interface ElementStack {
    unsigned int   count ();
    ElementContext push (DOMString QName);
    ElementContext top ();
    void           pop (DOMString QName);
};

4.1.2 Methods

count

Returns the number of items on the stack.

No parameters.

Return type: unsigned int

pop

Checks the topmost ElementContext to ensure that it matches the given QName, and removes it from the stack if it matches. If it does not match, a DOMException is raised.

Parameter	Type	Nullable	Optional	Description
QName	`DOMString`	✘	✘

Return type: void

push

Creates a duplicate of the ElementContext on the top of the stack and replaces its elementQName, localName, and prefix properties based on the provided QName parameter. The new object is placed on top of the stack and returned.

Parameter	Type	Nullable	Optional	Description
QName	`DOMString`	✘	✘

Return type: ElementContext

top

Returns the topmost ElementContext without modifying the stack.

No parameters.

Return type: ElementContext
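The ElementContext dictionary and ElementStack interface might be realised in Python roughly as follows. This is a sketch, not a reference implementation: the snake_case field names are adaptations of the IDL names, `ValueError` stands in for the DOMException named in the description of pop, and a deep copy is assumed so that changes made for a child element never leak into its parent's context.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ElementContext:
    namespace_context: dict = field(default_factory=dict)
    output_prefixes: list = field(default_factory=list)
    element_qname: str = ""
    local_name: str = ""
    prefix: str = ""
    is_qname_aware: bool = False


class ElementStack:
    def __init__(self):
        self._items = []

    def count(self):
        return len(self._items)

    def top(self):
        return self._items[-1]

    def push(self, qname):
        # Duplicate the current top of the stack (or start with defaults).
        context = copy.deepcopy(self._items[-1]) if self._items else ElementContext()
        # "a:b" -> ("a", "b"); an unprefixed name yields an empty prefix.
        prefix, _, local_name = qname.rpartition(":")
        context.element_qname = qname
        context.prefix = prefix
        context.local_name = local_name
        self._items.append(context)
        return context

    def pop(self, qname):
        if not self._items or self._items[-1].element_qname != qname:
            raise ValueError(f"unexpected end of element {qname!r}")
        self._items.pop()
```

Copying the context on push is what allows processing of an endElement() event to restore the parent's namespace state simply by popping.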

4.2 SAX2 Events

The following interface describes some events defined by the SAX2 parser specification. Any events not enumerated below are unchanged by this algorithm.

interface SAXEvents {
    const readonly int StartDocument = 1;
    const readonly int EndDocument = 2;
    const readonly int StartElement = 3;
    const readonly int EndElement = 4;
    const readonly int Characters = 5;
    const readonly int IgnorableWhitespace = 6;
    const readonly int ProcessingInstruction = 7;
    const readonly int Comment = 8;
    const readonly int CDATABlock = 9;
    const readonly int StartPrefixMapping = 10;
    const readonly int EndPrefixMapping = 11;
};

4.2.1 Constants

CDATABlock of type readonly int: The event contains some of the raw character data from within an XML <![CDATA[...]]> block.
Characters of type readonly int: Characters from a text node (but not a CDATA node) will be posted using this event. According to [SAX], this may contain raw entity codes; for normalization entity-replacement MUST be enabled. Thus any occurrences of &amp; will be replaced by the resulting & character, and so on.
Comment of type readonly int: An XML comment of the form <!-- A Comment --> was parsed. The event contains the text content of the comment, i.e. A Comment.
EndDocument of type readonly int: The document's outermost element has been closed. This is preceded by the EndElement event for that element.
EndElement of type readonly int: A closing element tag has been parsed. In the case of a self-closing or 'empty' element, this event will follow directly from the StartElement event for this same element.
EndPrefixMapping of type readonly int: The element containing an xmlns attribute has been closed.
IgnorableWhitespace of type readonly int: The parser encountered some whitespace characters which may be safely ignored; they are present for formatting purposes only and have no semantic or lexical meaning.
ProcessingInstruction of type readonly int: An XML processing instruction of the form <?name param1="1" param2="2"?> has been parsed. The event provides the name component along with the remaining characters as a single character string (i.e. param1="1" param2="2").
StartDocument of type readonly int: The start of the document was encountered. The next event will be a StartElement for the document's outermost element.
StartElement of type readonly int: An element's opening tag has been parsed. Information on the element's namespace and all attached attributes is included with this event.
StartPrefixMapping of type readonly int: The parser has encountered an xmlns attribute and has mapped a prefix to a URI.

4.3 SAX2 Normalization Algorithm

Below is a partial definition of a SAX2 event handler interface. The documentation for each event defines how the parser should normalize the parameters for that event.

Note

Note that handling of characters when TrimTextNodes is true involves buffering each Characters event until the next event arrives. If the next event is not also Characters, then the buffered text has trailing whitespace trimmed and its event is posted to the client. If TrimTextNodes is false, then no buffering occurs.

[Constructor]
interface SAX2Normalizer {
                attribute ElementStack elementStack;
                attribute Parameters   normalizationParameters;
                attribute char[]       currentCharacters;
                attribute HashMap      pendingNamespaces;
                attribute int          rewriteCounter;
                attribute HashMap      rewrittenPrefixes;
    void postStartPrefixMappingEvent (DOMString prefix, DOMString uri);
    void postStartElementEvent (DOMString uri, DOMString localName, DOMString qName, object[] attrList);
    void postEndElementEvent (DOMString uri, DOMString localName, DOMString qName);
    void postIgnorableWhitespace (char[] text);
    void postComment (char[] comment);
    void postCDATA (char[] data);
    void postCharacters (char[] text);
};

4.3.1 Attributes

currentCharacters of type array of char: When normalizationParameters.TrimTextNodes is true, the text for a Characters event is first placed into this variable. The event function is passed these characters once the following event has been received. In this way, the parser can determine whether to trim whitespace from the end of the string without accumulating the entire text block in memory.
elementStack of type ElementStack: A stack of element information representing the current path into the XML document's tree. A new ElementContext is pushed upon each SAXEvents.StartElement event, and is popped upon the corresponding SAXEvents.EndElement event.
normalizationParameters of type Parameters: All normalization parameters are stored here.
pendingNamespaces of type HashMap: Records all namespace prefix to URI mappings reported
rewriteCounter of type int: When normalizationParameters.PrefixRewrite is "sequential", this attribute is used to generate the new, numbered prefixes. It is initialized to zero.
rewrittenPrefixes of type HashMap: A map of namespace URIs to prefixes, containing only those which have been reassigned in accordance with normalizationParameters.PrefixRewrite.

4.3.2 Methods

postCDATA

As per XML Canonicalization, CDATA sections are replaced with their character content. This method instead posts a Characters event.

Parameter	Type	Nullable	Optional	Description
data	`char[]`	✘	✘

Return type: void

postCharacters

Certain characters are replaced with character entities and the characters are either posted directly or, if TrimTextNodes is enabled, they are buffered in case of needing to trim trailing whitespace based on the type of the next event.

Pseudocode:

void postCharacters(text)
{
    if normalizationParameters.TrimTextNodes is true
    {
        if currentCharacters is empty       // better: if previous event was not EndElement, Characters, or CDATA
        {
            // start of a text node
            trim leading whitespace
        }
        else
        {
            output any buffered characters (no trimming)
            currentCharacters := []
        }
    }
    
    replace all instances of "&" with "&amp;"
    replace all instances of "<" with "&lt;"
    replace all instances of ">" with "&gt;"
    replace all carriage returns ('\r') with "&#xD;"
    replace all tabs ('\t') with "&#x9;"
    
    if normalizationParameters.TrimTextNodes is true
    {
        currentCharacters := text
    }
    else
    {
        post the event immediately: characters(text)
    }
}

Parameter	Type	Nullable	Optional	Description
text	`char[]`	✘	✘

Return type: void
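
The escaping and buffering steps of postCharacters can be illustrated with a minimal Python sketch. The names `escape_text`, `CharacterBuffer`, `posted`, and `flush_trimmed` are hypothetical and not part of this specification. The sketch also assumes that buffered text is kept unescaped and escaped only on output, so that trailing tabs and carriage returns can still be recognized as whitespace when trimming; the pseudocode above escapes before buffering.

```python
XML_WHITESPACE = " \t\r\n"


def escape_text(text: str) -> str:
    """Apply the replacements listed in the postCharacters pseudocode."""
    # "&" is replaced first so that ampersands introduced by the later
    # replacements are not escaped a second time.
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\r", "&#xD;")
        .replace("\t", "&#x9;")
    )


class CharacterBuffer:
    """Hypothetical helper standing in for currentCharacters handling."""

    def __init__(self, trim_text_nodes: bool):
        self.trim_text_nodes = trim_text_nodes
        self.current = ""  # buffered text, kept unescaped (sketch assumption)
        self.posted = []   # stands in for the downstream Characters events

    def post_characters(self, text: str) -> None:
        if not self.trim_text_nodes:
            # No trimming requested: post the event immediately.
            self.posted.append(escape_text(text))
            return
        if self.current == "":
            # Start of a text node: trim leading whitespace.
            text = text.lstrip(XML_WHITESPACE)
        else:
            # More text follows, so the buffered text is not trailing.
            self.posted.append(escape_text(self.current))
        self.current = text

    def flush_trimmed(self) -> None:
        """Trim and post buffered text before an element event."""
        trimmed = self.current.rstrip(XML_WHITESPACE)
        if trimmed:
            self.posted.append(escape_text(trimmed))
        self.current = ""
```

In this sketch, `flush_trimmed` plays the role of the "trim and post any buffered characters" step that opens the start and end element methods.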

postComment

Posts a Comment event, unless IgnoreComments is true, in which case no event is posted.

Parameter	Type	Nullable	Optional	Description
comment	`char[]`	✘	✘

Return type: void

postEndElementEvent

End element events only require prefix rewriting for the qName parameter, if appropriate.

Pseudocode:

void postEndElementEvent(uri, localName, qName)
{
    trim and post any buffered characters
    
    context := elementStack.top()
    elementStack.pop(qName)    // throws an exception if qNames do not match
    
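    if qName has a prefix and normalizationParameters.PrefixRewrite is not "none"
    {
        // mirror postStartElementEvent: only rewrite when a replacement prefix was recorded
        if rewrittenPrefixes(uri) is not null
        {
            prefix := rewrittenPrefixes(uri)
            qName := prefix + ":" + localName
        }
    }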
    
    post event: endElement(uri, localName, qName)
}

Parameter	Type	Nullable	Optional
uri	`DOMString`	✘	✘
localName	`DOMString`	✘	✘
qName	`DOMString`	✘	✘

Return type: void
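
The following Python sketch illustrates the end element handling. The `ElementStack` class here is a minimal stand-in for the type defined in section 4.1, and `end_element_qname` is a hypothetical helper. The sketch assumes that unprefixed names, and names whose namespace has no recorded replacement prefix, are left unchanged, matching the behaviour of postStartElementEvent.

```python
class ElementStack:
    """Minimal stand-in for the ElementStack type of section 4.1."""

    def __init__(self):
        self._qnames = []

    def push(self, qname):
        self._qnames.append(qname)

    def pop(self, qname):
        # Mirrors "throws an exception if qNames do not match".
        if not self._qnames or self._qnames[-1] != qname:
            raise ValueError(f"unbalanced end element: {qname}")
        return self._qnames.pop()


def end_element_qname(uri, local_name, qname, prefix_rewrite, rewritten_prefixes):
    """Return the qName to report in the EndElement event."""
    if prefix_rewrite == "none" or ":" not in qname:
        return qname
    prefix = rewritten_prefixes.get(uri)
    if prefix is None:
        # No replacement was recorded for this namespace (sketch assumption).
        return qname
    return f"{prefix}:{local_name}"
```

The stack is popped with the original qName, before any rewriting, because that is the name under which the element was pushed.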

postIgnorableWhitespace

Posts an IgnorableWhitespace event, unless TrimTextNodes is true, in which case no event is posted.

Parameter	Type	Nullable	Optional	Description
text	`char[]`	✘	✘

Return type: void

postStartElementEvent

When a start element event is to be sent, the following additional processing occurs to modify the parameters of that event. Note that attribute values are also normalized according to section 3.3.3 of [XML10].

Pseudocode:

void postStartElementEvent(uri, localName, qName, attrList)
{
    trim and post any buffered characters
    
    if normalizationParameters.ReturnCharacters references this element
    {
        postEvent(CDATABlock, element outer XML)
        skip processing of element subtree and EndElement event
        return
    }
    
    context := elementStack.push(qName)
    
    for each [prefix, nsUri] pair in pendingNamespaces
    {
        if context.namespaceContext(prefix) does not match nsUri
        {
            context.namespaceContext(prefix) := nsUri
            context.outputPrefixes(prefix) := null  // remove from outputPrefixes
        }
    }
    
    pendingNamespaces.removeAll()
    
    for each xmlns or xmlns:prefix attribute in attrList
    {
        remove attribute from attrList
    }
    
    if element is QName aware
        context.isQNameAware := true
    
    // get a HashMap of prefix -> uri
    // this also rewrites contents of QNameAware attributes
    usedNamespaces := visiblyUsedNamespaces(context, attrList)
    
    if qName has a prefix and normalizationParameters.PrefixRewrite is not "none"
    {
        prefix := element prefix
        if rewrittenPrefixes(uri) is not null
        {
            prefix := rewrittenPrefixes(uri)
        }
        else if normalizationParameters.PrefixRewrite is "sequential"
        {
            prefix := "nN" where N is the value of rewriteCounter
            increment rewriteCounter
            rewrittenPrefixes(uri) := prefix
        }
        else if normalizationParameters.PrefixRewrite is a HashMap and it contains a value for the uri
        {
            prefix := normalizationParameters.PrefixRewrite(uri)
            rewrittenPrefixes(uri) := prefix
        }
        
        qName := prefix + ":" + localName
    }
    
    append any default attributes for the element to attrList
    
    for each [name, value] in attrList
    {
        if name has a prefix other than 'xml' and normalizationParameters.PrefixRewrite is not "none"
        {
            // all prefixes have been enumerated by now
            split name into prefix and local
            attrUri := context.namespaceContext(prefix)
            if rewrittenPrefixes(attrUri) is not null
            {
                prefix := rewrittenPrefixes(attrUri)
                name := prefix + ":" + local     // replace name in attrList
            }
        }
        
        normalize attribute value
    }
    
    for each [prefix, uri] pair in usedNamespaces
    {
        if prefix is an empty string
        {
            insert new attribute with name "xmlns" and value uri at start of attributes
        }
        else
        {
            insert new attribute with name "xmlns:" + prefix and value uri at start of attributes
        }
    }
    
    post event: startElement(uri, localName, qName, attrList)
}

Parameter	Type	Nullable	Optional
uri	`DOMString`	✘	✘
localName	`DOMString`	✘	✘
qName	`DOMString`	✘	✘
attrList	`object[]`	✘	✘

Return type: void
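
The prefix selection step for the element name can be sketched in Python as follows. The class name `PrefixRewriter` and the method `prefix_for` are hypothetical. The sketch assumes that PrefixRewrite is either the string "none", the string "sequential", or a dictionary of namespace URI to prefix.

```python
class PrefixRewriter:
    """Hypothetical helper holding rewriteCounter and rewrittenPrefixes."""

    def __init__(self, prefix_rewrite):
        # "none", "sequential", or a dict mapping namespace URI -> prefix
        self.prefix_rewrite = prefix_rewrite
        self.rewrite_counter = 0      # initialized to zero, as in 4.3.1
        self.rewritten_prefixes = {}  # namespace URI -> reassigned prefix

    def prefix_for(self, uri, original_prefix):
        """Return the prefix to use for a prefixed element in namespace uri."""
        if self.prefix_rewrite == "none":
            return original_prefix
        known = self.rewritten_prefixes.get(uri)
        if known is not None:
            # The namespace was already reassigned: reuse that prefix.
            return known
        if self.prefix_rewrite == "sequential":
            prefix = f"n{self.rewrite_counter}"
            self.rewrite_counter += 1
        elif isinstance(self.prefix_rewrite, dict) and uri in self.prefix_rewrite:
            prefix = self.prefix_rewrite[uri]
        else:
            # No rule applies: the original prefix is kept and not recorded.
            return original_prefix
        self.rewritten_prefixes[uri] = prefix
        return prefix
```

Because the lookup is keyed by namespace URI, two elements that bind different prefixes to the same namespace receive the same rewritten prefix.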

postStartPrefixMappingEvent

Stores the mapping in pendingNamespaces. All pending mappings are placed into an element's context during the next StartElement event.

Parameter	Type	Nullable	Optional	Description
prefix	`DOMString`	✘	✘
uri	`DOMString`	✘	✘

Return type: void
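
A minimal Python sketch of how pending mappings might be recorded and later applied is shown below. The class name `NamespaceTracker` and the method `apply_pending` are hypothetical, and the sketch assumes that both namespaceContext and outputPrefixes behave as dictionaries keyed by prefix.

```python
class NamespaceTracker:
    """Hypothetical helper standing in for pendingNamespaces handling."""

    def __init__(self):
        self.pending_namespaces = {}  # prefix -> namespace URI

    def post_start_prefix_mapping(self, prefix, uri):
        # Only record the mapping; nothing is posted at this point.
        self.pending_namespaces[prefix] = uri

    def apply_pending(self, namespace_context, output_prefixes):
        """Move pending mappings into the new element's context."""
        for prefix, uri in self.pending_namespaces.items():
            if namespace_context.get(prefix) != uri:
                namespace_context[prefix] = uri
                # The binding changed, so the prefix has not yet been
                # output with its new value.
                output_prefixes.pop(prefix, None)
        self.pending_namespaces.clear()
```

A mapping that repeats a binding already in scope leaves the context untouched, so the declaration is not marked for output again.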

XML Normalization

W3C Editor's Draft 15 March 2013
