Canonical XML Version 2.0

Abstract

Canonical XML Version 2.0 is a major rewrite of Canonical XML Version 1.1 and Exclusive Canonical XML 1.0 to address issues around performance, streaming, hardware implementation, robustness, minimizing attack surface, determining what is signed, and more. It combines the inclusive and exclusive canonicalization algorithms into a single algorithm that takes the canonicalization mode as a parameter.

Any XML document is part of a set of XML documents that are logically equivalent within an application context, but which vary in physical representation based on syntactic changes permitted by XML 1.0 [XML10] and Namespaces in XML 1.0 [XML-NAMES]. This specification describes a method for generating a physical representation, the canonical form, of an XML document that accounts for the permissible changes. Except for limitations regarding a few unusual cases, if two documents have the same canonical form, then the two documents are logically equivalent within the given application context. Note that two documents may have differing canonical forms yet still be equivalent in a given context based on application-specific equivalence rules for which no generalized XML specification could account.

Canonical XML Version 2.0 is applicable to XML 1.0. It is not defined for XML 1.1.

1. Introduction

1.1 Terminology

The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in RFC 2119 [RFC2119].

See [XML-NAMES] for the definition of QName.

document subset: A document subset is a portion of an XML document that may not include all of the nodes in the document.
canonical form: The canonical form of an XML document is the physical representation of the document produced by the method described in this specification.
canonical XML: The term canonical XML refers to XML that is in canonical form. The XML canonicalization method is the algorithm defined by this specification that generates the canonical form of a given XML document or document subset. The term XML canonicalization refers to the process of applying the XML canonicalization method to an XML document or document subset.
subtree: Subtree refers to one XML element node and all that it contains. In XPath terminology it is an element node and all its descendant nodes.
DOM: DOM, or Document Object Model, is a model for representing an XML document as a tree structure. The W3C DOM standard [DOM-LEVEL-2-CORE] is one such DOM, but this specification does not require this particular set of DOM APIs. Any similar model can be used as long as it has a tree representation of the XML document whose root is a document node and whose descendants are element nodes, attribute nodes, text nodes, etc.
DOM parser: A software module that reads an XML document and constructs a DOM tree.
Stream parser: A software module that reads an XML document and constructs a stream of XML events like "beginElement", "text", "endElement". [StAX] is an example of a stream parser.

1.2 Applications

Since the XML 1.0 Recommendation [XML10] and the Namespaces in XML 1.0 Recommendation [XML-NAMES] define multiple syntactic methods for expressing the same information, XML applications tend to take liberties with changes that have no impact on the information content of the document. XML canonicalization is designed to be useful to applications that require the ability to test whether the information content of a document or document subset has been changed. This is done by comparing the canonical form of the original document before application processing with the canonical form of the document that results from the application processing.

For example, a digital signature over the canonical form of an XML document or document subset would allow the signature digest calculations to be oblivious to changes in the original document's physical representation, provided that the changes are defined to be logically equivalent by XML 1.0 or Namespaces in XML 1.0. During signature generation, the digest is computed over the canonical form of the document. The document is then transferred to the relying party, which validates the signature by reading the document and computing a digest of the canonical form of the received document. The equivalence of the digests computed by the signing and relying parties (and hence the equivalence of the canonical forms over which they were computed) ensures that the information content of the document has not been altered since it was signed.

Note: Although not stated as a requirement on implementations, nor formally proved to be the case, it is the intent of this specification that if the text generated by canonicalizing a document according to this specification is itself parsed and canonicalized according to this specification, the text generated by the second canonicalization will be the same as that generated by the first canonicalization.
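The digest comparison and the repeated-canonicalization intent described above can be illustrated with a minimal sketch. It uses the `canonicalize` function from Python's standard `xml.etree.ElementTree` module (Python 3.8 or later), which implements a later C14N 2.0 text; its options and defaults may not correspond exactly to the parameters defined in this document. The sample documents and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# Two physically different serializations of the same information content:
# attribute order, quote style, empty-element syntax and the XML declaration differ.
signed = '<?xml version="1.0"?><doc b="2" a="1"><item/></doc>'
received = "<doc a='1'   b='2'><item></item></doc>"


def digest_of_canonical_form(xml_text: str) -> str:
    """Return the SHA-256 hex digest of the canonical form of xml_text."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The signer and the relying party each compute a digest independently.
signer_digest = digest_of_canonical_form(signed)
verifier_digest = digest_of_canonical_form(received)
same_content = signer_digest == verifier_digest

# Canonicalizing the canonical form again is expected to change nothing.
once = canonicalize(signed)
twice = canonicalize(once)
```

In this sketch `same_content` is true even though the two input strings differ octet for octet, and `once` and `twice` are identical.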

1.3 Limitations

Two XML documents may have differing information content that is nonetheless logically equivalent within a given application context. Although two XML documents are equivalent (aside from limitations given in this section) if their canonical forms are identical, it is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their canonical forms are identical. Such a method is unachievable, in part due to application-specific rules such as those governing unimportant whitespace and equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies established by other W3C Recommendations and Working Drafts. Accounting for these additional equivalence rules is beyond the scope of this work. They can be applied by the application or become the subject of future specifications.

The canonical form of an XML document may not be completely operational within the application context, though the circumstances under which this occurs are unusual. This problem may be of concern in certain applications since the canonical form of a document and the canonical form of the canonical form of the document are equivalent. For example, in a digital signature application, it cannot be established whether the operational original document or the non-operational canonical form was signed because the canonical form can be substituted for the original document without changing the digest calculation. However, the security risk only occurs in the unusual circumstances described below, which can all be resolved or at least detected prior to digital signature generation.

The difficulties arise due to the loss of the following information not available in the data model:

base URI, especially in content derived from the replacement text of external general parsed entity references
notations and external unparsed entity references
attribute types in the document type declaration

In the first case, note that a document containing a relative URI [URI] is only operational when accessed from a specific URI that provides the proper base URI. In addition, if the document contains external general parsed entity references to content containing relative URIs, then the relative URIs will not be operational in the canonical form, which replaces the entity reference with internal content (thereby implicitly changing the default base URI of that content). Both of these problems can typically be solved by adding support for the xml:base attribute [XMLBASE] to the application, then adding appropriate xml:base attributes to the document element and all top-level elements in external entities. In addition, applications often have an opportunity to resolve relative URIs prior to the need for a canonical form. For example, in a digital signature application, a document is often retrieved and processed prior to signature generation. The processing should create a new document in which relative URIs have been converted to absolute URIs, thereby mitigating any security risk for the new document.

In the second case, the loss of external unparsed entity references and the notations that bind them to applications means that canonical forms cannot properly distinguish among XML documents that incorporate unparsed data via this mechanism. This is an unusual case precisely because most XML processors currently discard the document type declaration, which discards the notation, the entity's binding to a URI, and the attribute type that binds the attribute value to an entity name. For documents that must be subjected to more than one XML processor, the XML design typically indicates a reference to unparsed data using a URI in the attribute value.

In the third case, the loss of attribute types can affect the canonical form in different ways depending on the type. Attributes of type ID cease to be ID attributes. Hence, any XPath expressions that refer to the canonical form using the id() function cease to operate. The attribute types ENTITY and ENTITIES are not part of this case; they are covered in the second case above. Attributes of enumerated type and of type ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, and NOTATION fail to be appropriately constrained during future attempts to change the attribute value if the canonical form replaces the original document during application processing. Applications can avoid the difficulties of this case by ensuring that an appropriate document type declaration is prepended prior to using the canonical form in further XML processing. This is likely to be an easy task since attribute lists are usually acquired from a standard external DTD subset, and any entity and notation declarations not also in the external DTD subset are typically constructed from application configuration information and added to the internal DTD subset.

While these limitations are not severe, it would be possible to resolve them in a future version of XML canonicalization if, for example, a new version of XPath were created based on the XML Information Set [XML-INFOSET] currently under development at the W3C.

1.4 Requirements for 2.0

Canonical XML 2.0 solves many of the major issues that have been identified by implementers with Canonical XML 1.0 [XML-C14N] and 1.1 [XML-C14N11].

1.4.1 Performance

A major factor in performance issues noted in XML Signature is often Canonical XML 1.1 processing. Canonicalization will be slow if the implementation uses the Canonical XML 1.1 specification as a formula without any attempt at optimization. This specification rectifies this problem by incorporating lessons learned from implementation into the specification. Most mature canonicalization implementations solve the performance problem by inspecting the signature first, to see if it can be canonicalized using a simple tree walk algorithm whose performance is similar to regular XML serialization. If not, they fall back to the expensive nodeset-based algorithm.

The use cases that cannot be addressed by the simple tree walk algorithm are mostly edge cases. This specification restricts the input to the canonicalization algorithm, so that implementations can always use the simple tree walk algorithm.

C14N 1.x uses an "XPath 1.0 Nodeset" to describe a document subset. This is the root cause of the performance problem and can be solved by not using a nodeset. This version of the specification does not use a nodeset, visits each node exactly once, and only visits the nodes that are being canonicalized.

1.4.2 Streaming

A streaming implementation is required to be able to process very large documents without holding them all in memory; it should be able to process documents one chunk at a time.

1.4.3 Robustness

Whitespace handling was a common cause of signature breakage. XML libraries allow one to "pretty print" an XML document, and most people wrongly assume that the whitespace introduced by pretty printing will be removed by canonicalization. That is not the case. This specification adds three techniques to improve robustness:

Optionally remove leading and trailing whitespace from text nodes,
Allow for QNames in content, particularly in the xsi:type attribute,
Optionally rewrite prefixes
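The effect of pretty printing on the canonical form, and of optional trimming, can be sketched as follows. The example relies on `xml.etree.ElementTree.canonicalize` from Python's standard library (3.8 or later), whose `strip_text` option is assumed here to play the role of the text-trimming technique listed above; the sample documents are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

compact = "<a><b>x</b></a>"
pretty = "<a>\n  <b>x</b>\n</a>"   # same content after "pretty printing"

# Without trimming, the indentation is part of the canonical form,
# so a digest over the pretty-printed document would not match.
differs_untrimmed = canonicalize(compact) != canonicalize(pretty)

# With trimming enabled, leading and trailing whitespace in text nodes
# is removed and both serializations agree.
trimmed_compact = canonicalize(compact, strip_text=True)
trimmed_pretty = canonicalize(pretty, strip_text=True)
agrees_trimmed = trimmed_compact == trimmed_pretty
```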

1.4.4 Simplicity

C14N 1.x algorithms are complex and depend on a full XPath library. This increases the work required for scripting languages to use XML Signatures. This specification addresses this issue by not using the complex nodeset model, and therefore not relying completely on XPath. It also introduces a minimal canonicalization mode.

2. Canonical XML 2.0

2.1 Data Model

The input to the canonicalization algorithm consists of an XML document subset and a set of options. The XML document subset can be expressed in two ways: with a DOM model or with a Stream model.

In the DOM model the XML subset is expressed as:

Inclusion List: Either the document Node D or a list of one or more element nodes E₁, E₂, ... E_n.
(If out of this list, one element node E_i is a descendant of another E_j, then that element node E_i is ignored.)
Exclusion List (optional): A list of zero or more element nodes E₁, E₂, ... E_m and a list of zero or more attribute nodes A₁, A₂, ... A_M.
These attribute nodes should not be namespace declarations or attributes in the xml namespace.

The XML subset consists of all the nodes in the Inclusion list and their descendants, minus all the nodes that are in the Exclusion list and their descendants.

The element nodes in the Inclusion list are also referred to as apex nodes.

Note: This input model is a very limited form of the generic XPath Nodeset that was the input model for Canonical XML 1.x. It is designed to be simple and allow for a high performance algorithm, while still supporting the most essential use cases. Specifically:

This model does not support re-inclusion; i.e. all the exclusions are applied after all the inclusions. It is effectively a simplified form of the XPath Filter 2 model [XMLDSIG-XPATH-FILTER2] with one intersect followed by one optional subtract operation. Re-inclusion complicates the canonicalization algorithm, especially in the areas of namespace and xml attribute inheritance.
Exclusion is limited to complete subtrees and attribute nodes. Other kinds of nodes (text, comment, PI) cannot be excluded.
Attribute exclusion is also limited, such that namespace declaration and attributes from the xml namespace cannot be excluded.
Some examples of subsets that were permitted in Canonical XML 1.x, but not in this new version:
- A subset consisting of a single attribute all by itself.
- A subset consisting of an attribute without its owner element.
- A subset consisting of a text node all by itself.
- A subset consisting of a text node without its parent node.
- A subset consisting of an element without some of its text node children.

Note: Canonical XML 2.0, unlike earlier versions, does not support direct input of an octet stream. The transformation of such a stream into the input model required by this specification is application-specific and should be defined in specifications that reference or make use of this one.

2.2 Parameters

Instead of separate algorithms for each variant of canonicalization, this specification takes the approach of a single algorithm subject to a variety of parameters that change its behavior to address specific use cases.

The following is a list of the logical parameters supported by this algorithm. The actual serialization that expresses the parameters in use may be defined as appropriate to specific applications of this specification (e.g., the <ds:CanonicalizationMethod> element in [XMLDSIG-CORE2]).

Name	Values	Description	Default
`ExclusiveMode`	true or false	whether namespaces are handled inclusively or exclusively. In exclusive mode, the InclusiveNamespaces parameter can be specified to list the prefixes that are to be treated inclusively	false
`InclusiveNamespace`	space separated list of prefixes	list of prefixes to be treated inclusively. Special token #default indicates the default namespace.	empty
`IgnoreComments`	true or false	whether to ignore comments during canonicalization	true
`TrimTextNodes`	true or false	whether to trim (i.e. remove leading and trailing whitespaces) all text nodes when canonicalizing. Adjacent text nodes must be coalesced prior to trimming. If an element has an xml:space="preserve" attribute, then text node descendants of that element are not trimmed regardless of the value of this parameter.	false
`Serialization`	serializeXML or serializeEXI	whether to do the normal XML serialization (`http://www.w3.org/2010/xml-c14n2#serializeXML`), or do an EXI serialization (`http://www.w3.org/2010/xml-c14n2#serializeEXI`) - which is useful if the original document to be canonicalized is already in EXI format.	serializeXML
`PrefixRewrite`	none, sequential, derived	with none, prefixes are not changed, with sequential prefixes are changed to n1, n2, n3 ... and with derived, each prefix is changed to nSuffix, where the suffix is derived by doing a digest of the namespace URI.	none
`SortAttributes`	true or false	whether the attributes need to be sorted before canonicalization. In some environments the order of attributes changes in transit so sorting is important.	true
`XmlAncestors`	inherit, none	whether to inherit the simple inheritable attributes (`xml:lang` and `xml:space`) and combine `xml:base` from ancestors, similar to Canonical XML 1.1, or to completely ignore xml attributes in ancestors, similar to Exclusive Canonical XML 1.0	inherit
`QNameAware`	an enumeration of qualified element names, qualified attribute names, and unqualified attribute names (identified by name, and parent qualified name)	set of nodes whose entire content must be processed as QName-valued or [CURIE]-valued for the purposes of canonicalization, including prefix rewriting and recognition of prefix "visible utilization"	empty set

The defaults are chosen for equivalence to Canonical XML 1.1 with comments ignored.
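As a minimal sketch of how an implementation might carry these logical parameters, the table can be modeled as an immutable record with the listed defaults. The class name `C14N2Parameters`, the field names, and the helper `parse_inclusive_namespaces` are hypothetical and are not defined by this specification; mapping the `#default` token to the empty string is likewise an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

SERIALIZE_XML = "http://www.w3.org/2010/xml-c14n2#serializeXML"
SERIALIZE_EXI = "http://www.w3.org/2010/xml-c14n2#serializeEXI"


@dataclass(frozen=True)
class C14N2Parameters:
    """Logical parameters from the table, with the defaults listed there."""
    exclusive_mode: bool = False
    inclusive_namespaces: Tuple[str, ...] = ()   # "" stands for #default
    ignore_comments: bool = True
    trim_text_nodes: bool = False
    serialization: str = SERIALIZE_XML
    prefix_rewrite: str = "none"
    sort_attributes: bool = True
    xml_ancestors: str = "inherit"
    qname_aware: FrozenSet[str] = frozenset()

    def __post_init__(self):
        # The table lists "derived" while the namespace algorithm text uses
        # "digest"; this sketch accepts both spellings.
        if self.prefix_rewrite not in ("none", "sequential", "derived", "digest"):
            raise ValueError(f"unknown PrefixRewrite value: {self.prefix_rewrite!r}")
        if self.xml_ancestors not in ("inherit", "none"):
            raise ValueError(f"unknown XmlAncestors value: {self.xml_ancestors!r}")
        if self.serialization not in (SERIALIZE_XML, SERIALIZE_EXI):
            raise ValueError(f"unknown Serialization value: {self.serialization!r}")


def parse_inclusive_namespaces(value: str) -> Tuple[str, ...]:
    """Split a space-separated prefix list, mapping '#default' to ''."""
    return tuple("" if token == "#default" else token for token in value.split())


defaults = C14N2Parameters()
exclusive = C14N2Parameters(
    exclusive_mode=True,
    inclusive_namespaces=parse_inclusive_namespaces("xsi #default"),
)
```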

2.2.1 Conformance profiles

Implementations may not support all of these parameters. We have identified the following profiles.

Name	Objective	Supported parameters	Unsupported parameters
"1.x features"	Only supports features in Canonical XML 1.x and Exclusive Canonical XML 1.0	Needs to support `ExclusiveMode=true/false`, `InclusiveNamespace`, `IgnoreComments=true/false`, `SortAttributes=true` and `XMLAncestors=inherit/none`.	Assumes defaults for other parameters, i.e. `TrimTextNodes=false`, `Serialization=Xml`, `PrefixRewrite=none`, `QNameAware=""`
"1.x Simple Exclusive"	Only a subset of Exclusive Canonical XML 1.0.	Needs to support `ExclusiveMode=true`, `XMLAncestors=none` and `SortAttributes=true`. The input to canonicalization should only be a single complete subtree identified by ID. There is no XPath involved in this profile and hence none of the associated complexities of visible utilization of prefixes in `IncludedXPath` and `ExcludedXPath`.	Assumes defaults for other parameters, i.e. `InclusiveNamespace=""`, `IgnoreComments=true`, `TrimTextNodes=false`, `Serialization=Xml`, `PrefixRewrite=none`, `QNameAware=""`
"Streaming"	Similar to the profile "1.x features" but supports streaming XPath. Note that `SortAttributes` and `XMLAncestors` may be difficult to support (see the Streaming canonicalization proposal).

2.3 Processing Model

The basic canonicalization process consists of traversing the tree and outputting octets for each node.

Input: The XML subset consisting of an Inclusion list and an Exclusion list.

Processing

Sort inclusion list by document order: If inclusion list only has the document node D there is nothing to sort. Otherwise remove all element nodes E_i that are descendants of some other element node in the inclusion list. Then sort the remaining element nodes E₁, E₂, ...E_n by document order.
Canonicalize each subtree: For each element node E_i or document node D in the sorted list, do a depth-first traversal to visit all the descendant nodes in the E_i subtree, and canonicalize each one of them. While traversing, if the current node is an element and that element is in the exclusion list, prune the traversal, i.e. skip over that element and all its descendants.
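The two processing steps above (normalizing the inclusion list, then a pruned depth-first traversal) might be sketched as follows, using Python's `xml.etree.ElementTree` as the tree model. The function name `subset_elements` is hypothetical. ElementTree has no document node and no attribute nodes, so the sketch only covers element nodes in the inclusion and exclusion lists, and it yields elements instead of emitting octets.

```python
import xml.etree.ElementTree as ET


def subset_elements(root, inclusion, exclusion):
    """Yield the elements of the document subset in document order."""
    order = {el: i for i, el in enumerate(root.iter())}       # document order index
    parent = {child: p for p in root.iter() for child in p}   # child -> parent
    included = set(inclusion)
    excluded = set(exclusion)

    def has_included_ancestor(el):
        p = parent.get(el)
        while p is not None:
            if p in included:
                return True
            p = parent.get(p)
        return False

    # Step 1: drop elements that are descendants of another included
    # element, then sort the remaining apex nodes by document order.
    apexes = sorted(
        (el for el in included if not has_included_ancestor(el)),
        key=order.__getitem__,
    )

    # Step 2: depth-first traversal of each apex subtree, pruning any
    # element found in the exclusion list together with its descendants.
    def walk(el):
        if el in excluded:
            return
        yield el
        for child in el:
            yield from walk(child)

    for apex in apexes:
        yield from walk(apex)


doc = ET.fromstring("<r><a><b/><c><d/></c></a><e><f/></e></r>")
a, e = doc.find("a"), doc.find("e")
b, c = a.find("b"), a.find("c")

# b is ignored as an inclusion entry because its ancestor a is also included;
# c and its descendant d are pruned by the exclusion list.
visited = [el.tag for el in subset_elements(doc, [e, b, a], [c])]
```

Each element is visited at most once, and no node outside the subset is visited after the initial indexing pass, which this sketch performs only to obtain document order and parent links.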

During traversal of each subtree, generate the canonicalized text depending on the node type as follows:

Root Node- Ignore the byte order mark, the XML declaration, and everything within the document type declaration. Traverse through the children.
Element Nodes- The canonicalized result is an open angle bracket (<), the element QName, the result of processing the namespaces, the result of processing the attributes, a close angle bracket (>), traverse the child nodes of the element, an open angle bracket (<), a forward slash (/), the element QName, and a close angle bracket (>). Note if the prefix rewriting parameter is set, the QNames will be written with the changed prefixes.
Attribute Nodes- a space, the node's QName, an equals sign, an open quotation mark (double quote), the modified string value, and a close quotation mark (double quote). The string value of the node is modified by replacing all ampersands (&) with &amp;, all open angle brackets (<) with &lt;, all quotation mark characters with &quot;, and the whitespace characters #x9, #xA, and #xD, with character references. The character references are written in uppercase hexadecimal with no leading zeroes (for example, #xD is represented by the character reference &#xD;).
If the prefix rewriting parameter is set, and the attribute name has a namespace prefix, the prefix is changed to the rewritten prefix. Also with prefix rewriting enabled, the attribute content is treated specially if the attribute is among those enumerated for the QNameAware option. If so, the QName or [CURIE] value of the attribute is rewritten with the new prefix.
Namespace Nodes- Take the ordered list of namespace nodes resulting from namespace processing, and process each namespace node N in the same way as an attribute node.
Text Nodes- the string value, except all ampersands are replaced by &amp;, all open angle brackets (<) are replaced by &lt;, all closing angle brackets (>) are replaced by &gt;, and all #xD characters are replaced by &#xD;.
If parameter TrimTextNodes is true and there is no xml:space="preserve" declaration in context, trim the leading and trailing space. E.g. trim <A> <B/> to <A><B/> and trim <A> this is text </A> to <A>this is text</A>.
Note: The DOM parser might have split up a long text node into multiple adjacent text nodes, some of which may be empty. Be aware when trimming whitespace in such cases; the net result should be equivalent to doing so as if the adjacent text nodes were concatenated.

If the prefix rewriting parameter is set, and if the parent element node is among those enumerated for the QNameAware option, then the QName or CURIE value of the text node is rewritten with the new prefix.
Processing Instruction (PI) Nodes- The opening PI symbol (<?), the PI target name of the node, a leading space and the string value if it is not empty, and the closing PI symbol (?>). If the string value is empty, then the leading space is not added. Also, a trailing #xA is rendered after the closing PI symbol for PI children of the root node with a lesser document order than the document element, and a leading #xA is rendered before the opening PI symbol of PI children of the root node with a greater document order than the document element.
Comment Nodes- Nothing if generating canonical XML without comments. For canonical XML with comments, generate the opening comment symbol (<!--), the string value of the node, and the closing comment symbol (-->). Also, a trailing #xA is rendered after the closing comment symbol for comment children of the root node with a lesser document order than the document element, and a leading #xA is rendered before the opening comment symbol of comment children of the root node with a greater document order than the document element. (Comment children of the root node represent comments outside of the top-level document element and outside of the document type declaration).
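The character replacements for attribute and text nodes, and the trimming of coalesced text nodes, can be sketched as below. The function names are hypothetical. The sketch assumes that "whitespace" for trimming means the XML whitespace characters #x20, #x9, #xA and #xD, and it does not cover prefix rewriting or QNameAware content.

```python
XML_WHITESPACE = " \t\r\n"


def escape_attribute_value(value: str) -> str:
    """Apply the attribute-node replacements; '&' must be handled first."""
    return (
        value.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace('"', "&quot;")
        .replace("\t", "&#x9;")
        .replace("\n", "&#xA;")
        .replace("\r", "&#xD;")
    )


def escape_text(value: str) -> str:
    """Apply the text-node replacements; '&' must be handled first."""
    return (
        value.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\r", "&#xD;")
    )


def serialize_attribute(qname: str, value: str) -> str:
    """A space, the QName, '=', and the escaped value in double quotes."""
    return f' {qname}="{escape_attribute_value(value)}"'


def trimmed_text(adjacent_chunks, trim=True, preserve_space=False) -> str:
    """Coalesce adjacent text nodes, then trim unless xml:space="preserve" applies."""
    text = "".join(adjacent_chunks)
    if trim and not preserve_space:
        text = text.strip(XML_WHITESPACE)
    return escape_text(text)


attr = serialize_attribute("title", 'x<"y"&\n')
text = escape_text("a < b > c & d\r")
joined = trimmed_text(["  this is", "", " text  "])
kept = trimmed_text(["  this is", "", " text  "], preserve_space=True)
```

Coalescing before trimming matters: trimming each chunk separately would turn the example into "this istext".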

Note that although some XML models such as DOM don't distinguish namespace declarations from attributes, canonicalization needs to treat them separately. In this document, attribute nodes that are actually namespace declarations are referred to as "namespace nodes"; other attributes are called "attribute nodes".

2.4 The Need for Exclusive XML Canonicalization

In some cases, particularly for signed XML in protocol applications, there is a need to canonicalize a subdocument in such a way that it is substantially independent of its XML context. This is because, in protocol applications, it is common to envelope XML in various layers of message or transport elements, to strip off such enveloping, and to construct new protocol messages, parts of which were extracted from different messages previously received. If the pieces of XML in question are signed, they need to be canonicalized in a way such that these operations do not break the signature but the signature still provides as much security as can be practically obtained.

2.4.1 A Simple Example

As a simple example of the type of problem that changes in XML context can cause for signatures, consider the following document:

<n1:elem1 xmlns:n1="http://b.example">
    content
</n1:elem1>

This is then enveloped in another document:

<n0:pdu xmlns:n0="http://a.example">
   <n1:elem1 xmlns:n1="http://b.example">
       content
   </n1:elem1>
</n0:pdu>

The first document above is in canonical form. But assume that document is enveloped as in the second case. The subdocument with elem1 as its apex node can be extracted from this second case with an XPath expression such as:

/descendant::n1:elem1

The result of performing inclusive canonicalization to the resulting xml subset is the following (except for line wrapping to fit this document):

<n1:elem1 xmlns:n0="http://a.example"
          xmlns:n1="http://b.example">
    content
</n1:elem1>

Note that the n0 namespace has been included by inclusive canonicalization because it includes namespace context. This change would break a signature over elem1 based on the first version.

2.4.2 General Problems with re-Enveloping

As a more complete example of the changes in canonical form that can occur when the enveloping context of a document subset is changed, consider the following document:

<n0:local xmlns:n0="foo:bar" xmlns:n3="ftp://example.org">
   <n1:elem2 xmlns:n1="http://example.net">
       <n3:stuff xmlns:n3="ftp://example.org"/>
   </n1:elem2>
</n0:local>

And the following, which has been produced by changing the enveloping of elem2:

<n2:pdu xmlns:n1="http://example.com" xmlns:n2="http://foo.example">
   <n1:elem2 xmlns:n1="http://example.net">
       <n3:stuff xmlns:n3="ftp://example.org"/>
   </n1:elem2>
</n2:pdu>

Assume an xml subset produced from each case by applying the following XPath expression:

/descendant::n1:elem2

Applying inclusive canonicalization to the xml subset produced from the first document yields the following serialization:

<n1:elem2 xmlns:n0="foo:bar" xmlns:n1="http://example.net" xmlns:n3="ftp://example.org">
    <n3:stuff></n3:stuff>
</n1:elem2>

However, although elem2 is represented by the same octet sequence in both pieces of external XML above, the Canonical XML version of elem2 from the second case would be as follows:

<n1:elem2 xmlns:n1="http://example.net" xmlns:n2="http://foo.example">
    <n3:stuff xmlns:n3="ftp://example.org"></n3:stuff>
</n1:elem2>

Note that the change in context has resulted in many changes in the subdocument as serialized by inclusive canonicalization. In the first example, n0 had been included from the context, and the presence of an identical n3 namespace declaration in the context had elevated that declaration to the apex of the canonicalized form. In the second example, n0 has gone away but n2 has appeared, and n3 is no longer elevated. But not all context changes have an effect. In the second example, the presence of the n1 prefix namespace declaration in the context has no effect because of the existing declaration at the elem2 node.

On the other hand, using Exclusive canonicalization the physical form of elem2 as extracted by the XPath expression above is as follows:

<n1:elem2 xmlns:n1="http://example.net">
    <n3:stuff xmlns:n3="ftp://example.org"></n3:stuff>
</n1:elem2>

in both cases.

2.5 Namespace Processing

As part of the canonicalization process, while traversing the subtree, use the following algorithm to look at all the namespace declarations in an element, and decide which ones to output.

2.5.1 Namespace concepts

The following concepts are used in Namespace processing:

Explicit and Implicit namespace declarations

In DOM, there is no special node for namespace declarations; they are present as regular attribute nodes. An "explicit" namespace declaration is an attribute node whose prefix is "xmlns" and whose localName is the prefix being declared.
DOM also allows declaring a namespace "implicitly", i.e. if a new DOM element or attribute is constructed using the createElementNS and createAttributeNS methods, then DOM adds a namespace declaration automatically when serializing the document.

Apex nodes

An apex node is an element node in a document subset having no element node ancestor in the document subset.

Default namespace

The default namespace is declared by xmlns="...". To make the algorithm simpler this will be treated as a namespace declaration whose prefix value is "" i.e. an empty string.

Visibly utilized

This concept is required for exclusive canonicalization. An element E in the document subset visibly utilizes a namespace declaration, i.e. a namespace prefix P and bound value V, if any of the following conditions are true:

The element E itself has a qualified name that uses the prefix P. (Note if an element does not have a prefix, that means it visibly utilizes the default namespace.)
OR The element E is among those enumerated for the QNameAware option, and the QName or CURIE value of the element uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace)
OR An attribute A of that element has a qualified name that uses the prefix P, and that attribute is not in the exclusion list. (Note: unlike elements, if an attribute doesn't have a prefix, that means it is a locally scoped attribute. It does NOT mean that the attribute visibly utilizes the default namespace.)
OR An attribute A of that element is among those enumerated for the QNameAware option, and the QName or CURIE value of the attribute uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace)
OR (TBD) Some special attribute or text nodes may contain an XPath expression, e.g. the IncludedXPath and ExcludedXPath attributes in an XML Signature 2.0 Transform. Any prefixes used in this XPath expression are considered to be visibly utilized.
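The element-name and attribute-name conditions above can be sketched as a small function. The names `prefix_of` and `visibly_utilized_prefixes` are hypothetical. The sketch assumes that namespace declarations have already been separated from ordinary attributes, represents the default namespace by the empty string, and deliberately leaves out the QNameAware and XPath conditions.

```python
def prefix_of(qname: str) -> str:
    """Return the prefix of a QName, or '' if it has none."""
    prefix, sep, _ = qname.partition(":")
    return prefix if sep else ""


def visibly_utilized_prefixes(element_qname, attribute_qnames, excluded_attributes=()):
    """Return the set of prefixes visibly utilized by an element.

    An unprefixed element visibly utilizes the default namespace ('').
    An unprefixed attribute is locally scoped and utilizes no namespace.
    Attributes in the exclusion list are ignored.
    """
    used = {prefix_of(element_qname)}
    for name in attribute_qnames:
        if name in excluded_attributes:
            continue
        prefix = prefix_of(name)
        if prefix:
            used.add(prefix)
    return used


apex = visibly_utilized_prefixes("n1:elem2", [])
typed = visibly_utilized_prefixes("elem", ["id", "xsi:type"])
pruned = visibly_utilized_prefixes("n1:e", ["a:x", "b:y"], excluded_attributes={"a:x"})
```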

2.5.2 Namespace processing algorithm

Step 1: First, determine the namespaces to be output for an element E.

Find a list of namespace declarations that are in scope for this element E by looking at both implicit and explicit namespace declarations in this element and its ancestors.
If any namespace declaration in this list has already been output during the canonicalization of one of element E's ancestors, say E_j, and has not been redeclared to a different value since then, i.e. has not been redeclared by an element between E_j and E, then remove it from this list.
Of this list, check if there are any prefixes that are to be processed in exclusive mode. This is indicated by parameter ExclusiveMode="true" and this prefix being absent from parameter InclusiveNamespaces. For the prefixes that are to be treated in exclusive mode, check if the prefix is visibly utilized by this element E, and if it is not then remove it.
Return the list of namespace declarations left on the list.
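Step 1 can be sketched as a filter over the in-scope declarations. The function name and argument shapes below are hypothetical: declarations are modeled as prefix-to-URI dictionaries, the default namespace is the empty-string prefix (so the `#default` token is assumed to have been mapped to `""` already), and `already_output` is assumed to hold the bindings currently in effect in the output produced for E's ancestors. Undeclaration of the default namespace is not modeled.

```python
def namespaces_to_output(in_scope, already_output, visibly_utilized,
                         exclusive_mode=False, inclusive_prefixes=()):
    """Return the {prefix: uri} declarations to render on element E."""
    result = {}
    for prefix, uri in in_scope.items():
        if already_output.get(prefix) == uri:
            continue   # rendered by an ancestor and not redeclared since
        treated_exclusively = exclusive_mode and prefix not in inclusive_prefixes
        if treated_exclusively and prefix not in visibly_utilized:
            continue   # exclusive mode: E does not visibly utilize it
        result[prefix] = uri
    return result


# In-scope declarations for n1:elem2 inside the n0:local envelope of section 2.4.2.
in_scope = {
    "n0": "foo:bar",
    "n3": "ftp://example.org",
    "n1": "http://example.net",
}

inclusive_apex = namespaces_to_output(in_scope, {}, {"n1"})
exclusive_apex = namespaces_to_output(in_scope, {}, {"n1"}, exclusive_mode=True)

# The child n3:stuff: only what the apex has not already rendered.
inclusive_child = namespaces_to_output(in_scope, inclusive_apex, {"n3"})
exclusive_child = namespaces_to_output(in_scope, exclusive_apex, {"n3"},
                                       exclusive_mode=True)
```

Under these assumptions the results agree with the serializations in section 2.4.2: inclusive mode lifts n0 and n3 to elem2, while exclusive mode renders n1 on elem2 and n3 on stuff.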

Step 2: If the PrefixRewrite option is set to a value other than "none", then compute new prefixes for all the namespace declarations in this list, except the prefixes starting with "xml", as follows:

For PrefixRewrite="sequential" sort this list of namespace declarations by URI. Then assign a new prefix value "nN" to each prefix, incrementing the value of N for every prefix. The counter should be set to 0 in the beginning of the canonicalization. (E.g. if the value of this counter was 5 when the traversal reached this element, and this element had 3 prefixes to be output, then use the prefixes "n5", "n6", "n7" and set the counter to 8 after that).
For PrefixRewrite="digest" assign new prefix values "nD" to each prefix in this list, where D is the SHA1 digest of the URI, expressed as a hexadecimal string using the characters '0'-'9' and 'a'-'f'. Before digesting, the URI should be converted to octets using US-ASCII encoding.

The "sequential" mode of prefix rewriting has the advantage of a smaller canonicalization output than the "digest" mode, but the downside is that it may result in different namespace prefixes in different contexts (see the example below). With the "digest" mode the namespace prefixes will be identical across documents and contexts. Note: with prefix rewriting the default namespace is never output as such; it is also rewritten into a new prefix.

Note: with exclusive canonicalization, namespace declarations are output only when they are utilized. This may lead to one declaration being output multiple times, and if the PrefixRewrite parameter is set to "sequential", it may be rewritten to a different value every time.
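
The two rewriting modes of Step 2 can be sketched as follows. This is a non-normative illustration: the function names are hypothetical, and the shared counter is modelled with itertools.count so that it persists across elements.

```python
import hashlib
from itertools import count


def rewrite_sequential(uris, counter):
    """Assign "nN" prefixes to the namespace URIs of one element.

    The URIs are sorted first, and the counter (created once per
    canonicalization, starting at 0) advances once per declaration.
    """
    return {uri: f"n{next(counter)}" for uri in sorted(uris)}


def rewrite_digest(uri):
    """Return "n" followed by the lowercase hex SHA1 digest of the URI."""
    octets = uri.encode("ascii")  # US-ASCII octets, as required above
    return "n" + hashlib.sha1(octets).hexdigest()


# Usage: a counter that has already reached 5 yields n5, n6, n7.
counter = count(5)
mapping = rewrite_sequential(["urn:c", "urn:a", "urn:b"], counter)
```

Because the digest prefix depends only on the URI, repeated declarations of the same namespace always receive the same prefix, whereas the sequential prefix depends on how many declarations were output before it.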

Step 3: If SortAttributes="true" which is the default, then sort this list of namespaces as follows:
In case of PrefixRewrite="none" sort the namespace declarations in ascending lexicographic order of prefix (the default namespace declaration has no prefix, so it is lexicographically least).
In case of PrefixRewrite="sequential" or PrefixRewrite="digest" sort them in ascending order of namespace URI.
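
A minimal sketch of the Step 3 ordering, assuming each declaration is held as a (prefix, URI) pair with "" standing for the default namespace; the function name is hypothetical.

```python
def sort_namespace_declarations(decls, prefix_rewrite="none"):
    """Order (prefix, uri) pairs as described in Step 3.

    Python compares strings by code point, so the empty prefix of the
    default namespace sorts before every other prefix.
    """
    if prefix_rewrite == "none":
        return sorted(decls, key=lambda d: d[0])  # by prefix
    return sorted(decls, key=lambda d: d[1])      # by namespace URI


decls = [("wsu", "urn:b"), ("", "urn:z"), ("wsse", "urn:c")]
by_prefix = sort_namespace_declarations(decls)
by_uri = sort_namespace_declarations(decls, "sequential")
```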

Step 4: Output each of these namespace nodes, as specified in the Processing model.

2.5.3 Example of exclusive canonicalization with prefix rewriting

The following XML snippet is used to demonstrate the various PrefixRewrite options.

<wsse:Security  
  xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
  xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
    <wsse:UserName wsu:Id="i1">
        ...
    </wsse:UserName>
    <wsse:Timestamp wsu:Id="i2">
        ...
    </wsse:Timestamp>
</wsse:Security>

2.5.3.1 With `PrefixRewrite="none"`

<wsse:Security 
  xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
    <wsse:UserName
      xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
      wsu:Id="i1">
        ...
    </wsse:UserName>
    <wsse:Timestamp
      xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
      wsu:Id="i2">
        ...
    </wsse:Timestamp>
</wsse:Security>

Note how the "wsu" prefix declaration is present in wsse:Security, but is not utilized. So exclusive canonicalization will "push the declaration down" into <UserName> and <Timestamp> where it is really used, i.e. the wsu declaration will be output twice, once in <UserName> and another in <Timestamp>, as shown above.

2.5.3.2 With `PrefixRewrite="sequential"`

<n0:Security
  xmlns:n0="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
    <n0:UserName
      xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
      n1:Id="i1">
        ...
    </n0:UserName>
    <n0:Timestamp
      xmlns:n2="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
      n2:Id="i2">
        ...
    </n0:Timestamp>
</n0:Security>

With sequential prefix rewriting, the wsu namespace is emitted twice, each time with a different prefix ("n1" and "n2"), as shown above.

2.5.3.3 With `PrefixRewrite="digest"`

<n533be3d902dc7f54d5027ddd5917639d584e9d38:Security 
  xmlns:n533be3d902dc7f54d5027ddd5917639d584e9d38="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
    <n533be3d902dc7f54d5027ddd5917639d584e9d38:UserName
      xmlns:ne2891a804ace8fbcc4a500f1dbc94cf01e38e023="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" 
      ne2891a804ace8fbcc4a500f1dbc94cf01e38e023:Id="i1">
        ...
    </n533be3d902dc7f54d5027ddd5917639d584e9d38:UserName>
    <n533be3d902dc7f54d5027ddd5917639d584e9d38:Timestamp
      xmlns:ne2891a804ace8fbcc4a500f1dbc94cf01e38e023="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" 
      ne2891a804ace8fbcc4a500f1dbc94cf01e38e023:Id="i2">
        ...
    </n533be3d902dc7f54d5027ddd5917639d584e9d38:Timestamp>
</n533be3d902dc7f54d5027ddd5917639d584e9d38:Security>

With digest prefix rewriting the wsu namespace is emitted twice as well, but with the same prefix each time. The downside is that the prefixes are very long.

2.6 Attribute processing

Note: namespace declarations are not considered as attributes, they are processed separately as namespace nodes.

Processing the attributes of an element E consists of the following steps:

If E is an apex node, then examine all ancestor element nodes of E for the nearest occurrences of simple inheritable attributes in the xml namespace, such as xml:lang and xml:space that are not already present in E's attributes. Then temporarily add these attributes to E's attribute list.
(Do this step only if the parameter XmlAncestors is set to "inherit".)

The xml:base attribute is not a simple inheritable attribute and requires special processing beyond a simple redeclaration. Collect the values of xml:base for all of E's ancestors, starting with the document root element, and including E itself into an ordered list. If there are two or more values in the list, combine them two at a time starting from the beginning, using the join-URI-references function. E.g. if the list has X₁,X₂, ... X_m, then join X₁ and X₂ first, then join the result with X₃ and so on.
(Do this step only if the parameter XmlAncestors is set to "inherit").
Ignore any attributes that are present in the exclusion list. However note that namespace nodes and xml: attributes cannot be excluded.
Sort all the attributes in increasing lexicographic order with namespace URI as the primary key and local name as the secondary key (an empty namespace URI is lexicographically least).
If the PrefixRewrite option is set to other than "none", modify the QNames for the attribute name to use the new prefixes. Also, if the attribute is among those enumerated for the QNameAware option, then change its QName or CURIE value to use the new prefix.
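
The exclusion and sorting steps can be illustrated with the following Python sketch. It assumes attributes are held as (namespace URI, local name, value) tuples; the function name and the data layout are illustrative only, and inheritance of xml: attributes and prefix rewriting are not covered.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"


def prepare_attributes(attrs, excluded=frozenset()):
    """Drop excluded attributes, then sort by (namespace URI, local name).

    attrs: list of (namespace_uri, local_name, value); "" means no namespace
    excluded: set of (namespace_uri, local_name) pairs
    """
    kept = [
        a for a in attrs
        # xml: attributes cannot be excluded, so they are always kept
        if a[0] == XML_NS or (a[0], a[1]) not in excluded
    ]
    # The empty namespace URI compares lowest, so unqualified attributes come first.
    return sorted(kept, key=lambda a: (a[0], a[1]))
```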

2.7 join-URI-References function

The join-URI-References function takes xml:base attribute values from all the ancestor elements and combines them to create a value for an updated xml:base attribute. A simple method for doing this is similar to that found in sections 5.2.1, 5.2.2 and 5.2.4 of RFC 3986 with the following modifications:

Perform RFC 3986 section 5.2.1. "Pre-parse the Base URI" modified as follows.
- The scheme component is not required in the base URI (Base). (i.e. Base.scheme may be null)
- Replace a trailing ".." segment with "../" segment before processing.
Section 5.2.4. "Remove Dot Segments" is modified as follows:
- Keep leading "../" segments
- Replace multiple consecutive "/" characters with a single "/" character.
- Append a "/" character to a trailing ".." segment
The "Remove Dot Segments" algorithm is modified to ensure that a combination of two xml:base attribute values that include relative path components (i.e., path components that do not begin with a '/' character) results in an attribute value that is a relative path component.
Perform RFC 3986 section 5.2.2. "Transform References" modified as follows to ignore the fragment part of R:
- After parsing R set R.fragment = null

The following examples illustrate the modification of the "Remove Dot Segments" algorithm:

"abc/" and "../" should result in ""
"../" and "../" are combined as "../../" and the result is "../../"
".." and ".." are combined as "../../" and the result is "../../"

4. Pseudocode

This section presents the entire canonicalization algorithm in pseudocode. It is not normative.

4.1 canonicalize()

Top level canonicalize function.

canonicalize(list of subtree, list of exclusion elements and attributes, properties)
{
   put the exclusion elements and attributes in hash table for easier lookup
   
   sort the multiple subtrees by document order
   
   for each subtree
      canonicalizeSubtree(subtree) 
}

4.2 canonicalizeSubtree()

Canonicalize an individual subtree.

For efficiency the routines below maintain two contexts

namespaceContext: namespaceContext is a hash table of prefix -> (uri, hasBeenOutput, newPrefix).
- uri is the namespace URI that this prefix maps to.
- hasBeenOutput a boolean flag that indicates whether that namespace declaration has been output
- newPrefix the rewritten value of the prefix.
At the beginning of the canonicalization initialize this to contain only one entry: the default namespace mapped to an empty URI, with hasBeenOutput = true. A prefix value of "" can be used to denote the default namespace.
xmlattribContext: xmlattribContext is a hash table of name -> value.

canonicalizeSubtree(node)
{
   initialize namespaceContext to contain the default prefix, mapped
   to an empty URI, and hasBeenOutput to true 
   
   if (node is the document node or a document root element) 
   {
      // (whole document is being processed, no ancestors to worry about)
      call processNode(node, namespaceContext)
   }
   else
   {
      starting from the element, walk up the tree to collect a list of
      ancestors 
    
      for each of these ancestor elements starting with the document
      root, but not including the element itself 
        addNamespaces(ancestorElem, namespaceContext)

      initialize xmlattribContext to empty

      for each of these ancestor elements starting with the document
      root, and also including the element itself 
        addXMLAttributes(ancestorElem, xmlattribContext)
      
      if there are any attributes in xmlattribContext 
         temporarily add/replace these XML attributes in node
            
      processNode(node, namespaceContext)
      
      restore the original XML attributes
   }   
}

4.3 processNode()

Redirect to appropriate node processing function

processNode(node, namespaceContext)
{
  call the appropriate function - processDocument, processElement, processTextNode, ... depending on the node type.
}

4.4 processDocument()

Process the Document Node.

processDocument(document, namespaceContext)
{
  Loop through all child nodes and call
    processNode(child, namespaceContext)
}

4.5 processElement()

Process an Element Node.

processElement(element, namespaceContext)
{
  if this element exists in the exclusion hash table
    return
    
  make a copy of xmlattribContext and namespaceContext
  //(by copying, any changes made can be undone when this function returns)
  
  nsToBeOutputList = processNamespaces(element, namespaceContext)
  
  output('<')
  if PrefixRewrite is sequential or digest, temporarily modify the QName to have the new prefix value as determined from the namespaceContext
 
  output(element QName)  

  for each of the namespaces in the nsToBeOutputList
    output this namespace declaration 
    
  sort the non-namespace attributes by namespace URI first, then by attribute name
  output each of these attributes with its original QName, or a modified QName if PrefixRewrite is sequential or digest
  
  output('>')
  
  Loop through all child nodes and call
    processNode(child, namespaceContext)
  
  output('</')
  output(element QName)
  output('>')
  
  restore xmlattribContext and namespaceContext
}

4.6 processText()

Process a Text Node.

processText(textNode)
{
  if this text node is outside document root
     return
     
  in the text replace 
    all ampersands by &amp;, 
    all open angle brackets (<) by &lt;, 
    all closing angle brackets (>) by &gt;, 
    and all  #xD characters by &#xD;.
    
  If TrimTextNodes is true and there is no xml:space="preserve" declaration in scope
    trim leading and trailing space
      
  output(text)
}

Note: The DOM parser might have split up a long text node into multiple adjacent text nodes, some of which may be empty. In that case be careful when trimming the leading and trailing space: the net result should be the same as if the adjacent text nodes were concatenated into one.
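
The text node handling above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the steps are applied in the order listed in the pseudocode, and treating space, tab and line feed as the characters to trim is an assumption of this sketch.

```python
def canonical_text(text, trim_text_nodes=False, space_preserve=False):
    """Escape a text node and optionally trim it.

    space_preserve stands for an xml:space="preserve" declaration in scope.
    """
    # Ampersands must be replaced first, otherwise the "&" introduced by
    # the later replacements would be escaped a second time.
    text = (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\r", "&#xD;"))
    if trim_text_nodes and not space_preserve:
        text = text.strip(" \t\n")  # assumed set of trimmable characters
    return text
```

When a parser has split one logical text node into several adjacent nodes, the pieces would need to be concatenated before calling such a function so that trimming applies only to the outer edges.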

4.7 processPI()

Process a Processing Instruction (PI) Node.

processPI(piNode)
{
  if after document element
    output('#xA')
    
  output('<?')
  output(the PI target name of the node)
  output(a leading space)
  output(the PI string value)
  output('?>') 

  if before document element
    output('#xA')
}

4.8 processComment()

Process a Comment Node.

processComment(commentNode)
{
  if ignoreComments
    return
    
  if after document element
    output('#xA')
    
  output('<!--')
  output(string value of node)
  output('-->')

  if before document element
    output('#xA')
}
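
The processPI and processComment routines share the same line feed handling and can be sketched together in Python. The function names and the position argument are hypothetical; "before" and "after" are taken to mean before or after the document element. Omitting the space when the PI has no string value follows Canonical XML 1.x and is an assumption with respect to the pseudocode above.

```python
def _wrap(markup, position):
    """Add the #xA separator for nodes outside the document element."""
    if position == "after":
        return "\n" + markup
    if position == "before":
        return markup + "\n"
    return markup  # inside the document element: no extra line feed


def canonical_pi(target, data, position="inside"):
    # Assumption: the separating space is emitted only for non-empty data.
    body = target + (" " + data if data else "")
    return _wrap("<?" + body + "?>", position)


def canonical_comment(text, position="inside", ignore_comments=False):
    if ignore_comments:
        return ""
    return _wrap("<!--" + text + "-->", position)
```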

4.9 addNamespaces()

Add namespaces from this element to the namespace context. This function is called for every ancestor element, and also at every element of the subtrees (minus the exclusion elements).

addNamespaces(element, namespaceContext)
{
  for each of the explicit and implicit namespace declarations in the element
  {
     if there is already a declaration for this prefix, and this
     declaration is different from existing declaration 
     overwrite the URI, and set hasBeenOutput to false
      
     if there is no entry for this prefix
     add an entry for this URI, and set hasBeenOutput to false
         
  } 
}
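
A Python sketch of the same bookkeeping is shown below, assuming the declarations of an element are available as a prefix-to-URI dictionary. NsEntry, new_context and add_namespaces are hypothetical names, and the newPrefix field is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class NsEntry:
    uri: str
    has_been_output: bool = False


def new_context():
    """Initial context: the default namespace mapped to an empty URI."""
    return {"": NsEntry("", has_been_output=True)}


def add_namespaces(declarations, context):
    """Merge an element's declarations (prefix -> uri) into the context."""
    for prefix, uri in declarations.items():
        entry = context.get(prefix)
        # A new prefix, or a redeclaration to a different URI, must be
        # output again; an identical redeclaration leaves the entry alone.
        if entry is None or entry.uri != uri:
            context[prefix] = NsEntry(uri, has_been_output=False)
```

Replacing the entry rather than mutating it means a shallow copy of the dictionary taken by the caller before descending into an element still sees the old entry for any redeclared prefix.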

4.10 processNamespaces()

Process the list of namespaces for this element.

processNamespaces(element, namespaceContext)
{
  addNamespaces(element, namespaceContext)
  
  initialize nsToBeOutputList to empty list
  
  for each prefix in the namespaceContext for which hasBeenOutput is false
  {
     if ExclusiveMode and this prefix is not in the inclusiveNamespacesList
     {
         if the prefix is visibly utilized by this element
             add the prefix to the nsToBeOutputList and set
            hasBeenOutput to true 
     }
     else
         add the prefix to the nsToBeOutputList and set hasBeenOutput to true    
  }
  
  if (PrefixRewrite is none)
  {
    sort the nsToBeOutputList by the prefix
  }
  else if (PrefixRewrite is sequential) 
  {
    sort the nsToBeOutputList by URI
    assign new prefix values "nN" to each prefix in this
    nsToBeOutputList where N represents an incremented counter value,
    i.e. n0, n1, n2 ..
    // the counter should be set to 0 in the beginning of the canonicalization
    // note: prefix numbers are assigned in the order that the
    // prefixes are present in nsToBeOutputList
  }
  else if (PrefixRewrite is digest)
  {
    sort the nsToBeOutputList by URI
    assign new prefix values "nD" to each prefix in this nsToBeOutputList where
      D represents the SHA1 digest of the URI represented as a hex string
  }
  
  return nsToBeOutputList    
}

4.11 addXMLAttributes()

Combine/modify the 3 special xml attributes: xml:lang, xml:space and xml:base.

addXMLAttributes(element, xmlattribContext)
{
   for each of the xml: attributes of this element
   {

      case xml:lang attribute 
        if XmlAncestors is inherit then store this attribute value, else do nothing

      case xml:space attribute 
        if XmlAncestors is inherit then store this attribute value, else do nothing

      case xml:base attribute 
        if XmlAncestors is inherit, and there is a previous value of xml:base
           then do a "join-URI-References" to combine the new value and the old value 
        else do nothing
   } 
}
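
The accumulation of xml: attributes along the ancestor chain can be sketched as follows. This is an illustration under stated assumptions: attributes are passed as a name-to-value dictionary, the join argument is a caller-supplied stand-in for the join-URI-References function, and storing the first xml:base value when no previous value exists is an interpretation of the pseudocode above.

```python
def add_xml_attributes(attrs, context, join, xml_ancestors="inherit"):
    """Fold one element's xml: attributes into the running context.

    attrs: dict of attribute name -> value for the current element
    context: dict of name -> value accumulated from the ancestors so far
    join: callable(old_base, new_base) -> combined xml:base value
    """
    if xml_ancestors != "inherit":
        return  # nothing is inherited in any other mode
    for name in ("xml:lang", "xml:space"):
        if name in attrs:
            context[name] = attrs[name]  # the nearest occurrence wins
    if "xml:base" in attrs:
        if "xml:base" in context:
            context["xml:base"] = join(context["xml:base"], attrs["xml:base"])
        else:
            context["xml:base"] = attrs["xml:base"]  # assumed first-value rule
```

Calling the function once per ancestor, starting from the document root, combines the xml:base values two at a time in the order described in section 2.6.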

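The following rows give sample inputs and the corresponding outputs of the modified "Remove Dot Segments" algorithm described in section 2.7. A row with no Output value produces the empty string.

Input	Output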
no/.././/pseudo-netpath/seg/file.ext	pseudo-netpath/seg/file.ext
no/..//.///pseudo-netpath/seg/file.ext	pseudo-netpath/seg/file.ext
yes/no//..//.///pseudo-netpath/seg/file.ext	yes/pseudo-netpath/seg/file.ext
no/../yes	yes
no/../yes/	yes/
no/../yes/no/..	yes/
../../no/../..	../../../
no/../..	../
no/..
no/../
/a/b/c/./../../g	/a/g
mid/content=5/../6	mid/6
../../..	../../../
no/../../	../
..yes/..no/..no/..no/../../../..yes	..yes/..yes
..yes/..no/..no/..no/../../../..yes/	..yes/..yes/
../..	../../
../../../	../../../
.
./
./.
//no/..	/
../../no/..	../../
../../no/../	../../
yes/no/../	yes/
yes/no/no/../..	yes/
yes/no/no/no/../../..	yes/
yes/no/../yes/no/no/../..	yes/yes/
yes/no/no/no/../../../yes	yes/yes
yes/no/no/no/../../../yes/	yes/yes/
/no/../	/
/yes/no/../	/yes/
/yes/no/no/../..	/yes/
/yes/no/no/no/../../..	/yes/
../../..no/..	../../
../../..no/../	../../
..yes/..no/../	..yes/
..yes/..no/..no/../..	..yes/
..yes/...no/..no/..no/../../..	..yes/
..yes/..no/../..yes/..no/..no/../..	..yes/..yes/
/..no/../	/
/..yes/..no/../	/..yes/
/..yes/..no/..no/../..	/..yes/
/..yes/..no/..no/..no/../../..	/..yes/
/	/
/.	/
/./	/
/./.	/
/././	/
/..	/
/../..	/
/../../..	/
//..	/
//..//..	/
//..//..//..	/
/./..	/
/./.././..	/
/./.././.././..	/
..	../
../	../
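
The rows above are consistent with the modified "Remove Dot Segments" rules of section 2.7. The following Python sketch is one possible way to reproduce them; it is not part of this specification, the function name is hypothetical, and it operates on the path component only (no scheme, authority, query or fragment).

```python
def remove_dot_segments(path):
    """Apply the modified "Remove Dot Segments" rules to a path."""
    absolute = path.startswith("/")
    # A trailing "." or ".." segment denotes a directory, so it is
    # treated as if it were followed by "/".
    wants_slash = path.endswith("/") or path.split("/")[-1] in (".", "..")
    out = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue  # collapses runs of "/" and drops "." segments
        if segment == "..":
            if out and out[-1] != "..":
                out.pop()           # cancel the preceding real segment
            elif not absolute:
                out.append("..")    # keep leading "../" segments
            # for an absolute path there is nothing above the root
        else:
            out.append(segment)
    result = ("/" if absolute else "") + "/".join(out)
    if wants_slash and out:
        result += "/"
    return result
```

Two relative xml:base values can then be combined by concatenating them and normalizing the result, e.g. "abc/" followed by "../" normalizes to the empty string.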