Canonical XML
Version 1.0

W3C Working Draft 13 June 2000

This version: http://www.w3.org/TR/2000/WD-xml-c14n-20000613
Latest version: http://www.w3.org/TR/xml-c14n
Previous version: http://www.w3.org/TR/2000/WD-xml-c14n-20000601
Editor(s): John Boyer, PureEdge Solutions Inc., jboyer@PureEdge.com

Abstract

This specification describes a method for generating a physical representation, the canonical form, of an input XML document, that does not vary under syntactic variations of the input that are defined to be logically equivalent by the XML 1.0 Recommendation [XML]. If an XML document is changed by an application, but its Canonical-XML form has not changed, then the changed document and the original document are considered equivalent for the purposes of many applications. This document does not establish a method such that two XML documents are equivalent if and only if their canonical forms are identical.

Status of this document

This is the second draft of a proposal that (1) serves as an alternative approach to the Canonical XML specification using the XPath [XPath] data model, and (2) includes a few substantive changes that affect the canonical serialization of an XML document. It is not necessary for implementations to use XPath to generate the canonical form of an XML document. XPath simply provides a data model that is simplified compared to InfoSet, yet sufficient for the purpose of canonicalization. XPath also provides an expression syntax for describing the desired portion of a whole document. Any variances that result between this specification's use of the XPath [XPath] data model and the XML Information Set [InfoSet] will be reported to the XML Information Set's comments list.

Prior versions of this document were published by the XML Core Working Group (the last of which was the 20000119 draft), which delegated the completion of this specification to the IETF/W3C XML Signature Working Group. We expect continued substantive discussion with respect to the treatment of XML namespaces, but hope to address that (and any other issues) quickly so that we can issue a second Last Call at the beginning of July 2000.

The XML Signature and XML WGs and other interested parties are invited to comment on this proposed direction, review the specification and report implementation experience. While we welcome implementation experience reports, the XML Signature Working Group will not allow early implementation to constrain its ability to make changes to this specification.

Please send comments to the editors and cc: the list <w3c-ietf-xmldsig@w3.org>. Publication as a Working Draft does not imply endorsement by the W3C membership or IESG. It is inappropriate to cite W3C Drafts as other than "work in progress." A list of current W3C working drafts can be found at http://www.w3.org/TR/. Current IETF drafts can be found at http://www.ietf.org/1id-abstracts.html.

There have been no solicitations nor declarations regarding patents related to this specification within the Signature WG.

1 Introduction

The XML 1.0 Recommendation [XML] specifies the syntax of a class of resources called XML documents. It is possible for XML documents which are equivalent for the purposes of many applications to differ in physical representation. In particular, they may differ in their entity structure, attribute ordering, and character encoding.

It is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their canonical forms are identical. Such a method is unachievable, in part due to application-specific rules such as those governing unimportant whitespace and equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies established by other W3C Recommendations and Working Drafts. Accounting for these additional equivalence rules is beyond the scope of this work. They can be applied by the application or become the subject of future specifications.

The XPath 1.0 Recommendation [XPath] specifies a data model for representing an input XML document as well as an expression syntax for describing portions of the document (as well as arbitrary strings, booleans and numbers). When an XPath expression is used to describe portions of an XML document, the result is called a document subset.

This specification describes a method for generating a physical representation of an input XML document or document subset that does not vary under syntactic variations of the input XML document that are defined to be logically equivalent by the XML 1.0 Recommendation. The input must be a well-formed XML document with an optional XPath expression and evaluation context. The output physical representation is called a canonical form or simply Canonical XML.

The Canonical XML generated for an entire XML document is well-formed. The canonical form of an XML document subset may not be well-formed XML. However, since the canonical form will often be subjected to further XML processing, most XPath expressions provided for canonicalization will be designed to produce a document subset that is a well-formed XML document or external general parsed entity.

Canonical XML is designed to be used by applications that require the ability to test whether a document or document subset has been changed in a way that is not defined to be logically equivalent by the XML 1.0 Recommendation. For example, a digital signature over the canonical form of an XML document or document subset would allow the signature digest calculations to be oblivious to changes in the document's physical representation provided that the changes are defined to be logically equivalent by the XML 1.0 Recommendation.
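As a rough illustration of this use case, the following Python sketch digests the canonical form of documents that differ only in attribute order, quoting style and the use of an empty-element tag. It relies on xml.etree.ElementTree.canonicalize from the Python standard library (3.8 and later), which implements the later Canonical XML 2.0 Recommendation rather than the method of this draft. The SHA-256 digest, the helper name canonical_digest and the sample documents are likewise assumptions chosen for the example.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

def canonical_digest(xml_text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded canonical form."""
    canonical = canonicalize(xml_text)  # C14N 2.0, not this draft's method
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same logical content, different physical representation.
doc_a = '<order id="7" currency="USD"><item/></order>'
doc_b = "<order currency='USD'   id='7'><item></item></order>"
# A change that is not logically equivalent (different attribute value).
doc_c = '<order id="8" currency="USD"><item/></order>'

same = canonical_digest(doc_a) == canonical_digest(doc_b)
changed = canonical_digest(doc_a) != canonical_digest(doc_c)
```

Under these assumptions, same and changed are both expected to be true: the digest ignores the representational differences between doc_a and doc_b but detects the altered attribute value in doc_c.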

2 Canonical XML Data Model

The data model used to create Canonical XML is equivalent to the data model defined in the XPath 1.0 Recommendation [XPath]. Although an implementation of this specification need not be based on an XPath implementation, this specification discusses the canonicalization method based on the XPath definition of a node-set.

Under the XPath data model, an XML processor is used to perform the following tasks in order:

Canonical XML requires that the input document be well-formed XML, but the input need not be validated. However, Canonical XML requires that attribute value normalization and entity reference resolution be performed in accordance with the behaviors of a validating XML processor. Thus, the declarations in the document type declaration are used to help create the canonical form, but the document type declaration is not retained in the canonical form (in part because it is omitted from the XPath data model and in part because it is not needed by the canonical form).

In the XPath data model, there exist the following node types: root, element, comment, processing instruction, text, attribute and namespace. There exists a single root node whose children are processing instruction nodes and comment nodes to represent information outside of the document element (and outside of the document type declaration). The root node also has a single element node representing this top-level element. Each element node can have child nodes of type element, text, processing instruction, and comment. The attributes and namespaces associated with an element are not considered to be child nodes of the element, but they are associated with the element by inclusion in the element's attribute and namespace axes. Note that attribute and namespace axes may not directly correspond to the text appearing in the element's start tag in the original document.

Although the XML 1.0 Recommendation states that an XML processor need not provide the text of comments, the XPath data model supports comments, so Canonical XML may include comments. However, since XML 1.0 did not require comments to be provided, comment nodes are excluded by default.

An element has attribute nodes to represent the non-namespace attribute declarations appearing in its start tag as well as nodes to represent default attributes that were not specified and not declared as #implied.

By virtue of the XPath data model, Canonical XML is namespace-aware [Names], but it cannot and therefore does not account for namespace equivalencies via namespace prefix rewriting (see below). In the XPath data model, each element and attribute has a name returned by the function name() which can, at the discretion of the application, be the QName appearing in the original document. Canonical XML requires that the XML processor retain sufficient information such that the QName of the element as it appeared in the original document can be provided.

An element E has namespace nodes that represent its namespace declarations, any namespace declarations made by its ancestors that have not been overridden in E's declarations, the default namespace if it is non-empty, and the declaration of the prefix xml. The XPath data model expects the XML processor to convert relative URIs to absolute URIs.
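The set of namespace nodes for an element can be pictured with a small Python sketch. The function name in_scope_namespaces, the dictionary representation (prefix mapped to URI, with the empty string standing for the default namespace) and the example URIs are assumptions made for illustration. Conversion of relative URIs to absolute URIs is not shown.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"

def in_scope_namespaces(declarations_root_to_e):
    """Compute {prefix: URI} for an element E.

    `declarations_root_to_e` lists the namespace declarations of each
    element from the document element down to E itself. The default
    namespace uses the empty string as its prefix.
    """
    scope = {"xml": XML_NS}        # the xml prefix is always declared
    for declared in declarations_root_to_e:
        scope.update(declared)     # nearer declarations override ancestors
    if scope.get("") == "":        # xmlns="" leaves no default namespace
        del scope[""]
    return scope

scope = in_scope_namespaces([
    {"": "urn:example:default", "a": "urn:example:a"},  # an ancestor of E
    {"a": "urn:example:a2", "": ""},                    # E itself
])
```

Here E overrides the prefix a and empties the default namespace, so scope holds only the xml prefix and the overriding binding for a.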

Character content is represented in the XPath data model with text nodes. All consecutive characters are placed into a single text node. Furthermore, the text node's characters are represented in the UCS character domain. Canonical XML does not perform character model normalization (see below).

The Canonical XML generator is specified in terms of producing a canonical form by processing an XPath node-set. The node-set is defined to be the result of setting an initial evaluation context of:

then evaluating the expression

(//. | //@* | //namespace::*)[not(self::comment())]

This expression generates a node-set containing every node of the XML document except the comments.

3 Document Order for Canonical XML

Although an XPath node-set is defined to be unordered, the XPath 1.0 Recommendation [XPath] defines the term document order to be the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities, except for namespace and attribute nodes whose document order is application-dependent.

During XPath expression evaluation, Canonical XML imposes no order on the namespace and attribute axes of elements. After evaluating the expression, a node-set is processed by imposing the following additional document order rules on the namespace and attribute nodes of an element:

Lexicographic comparison is based on the UCS codepoint values, which is equivalent to lexical ordering based on UTF-8.
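The equivalence between codepoint order and UTF-8 byte order can be checked with a short Python sketch. The sample names are arbitrary and chosen only for the example. The UTF-16 comparison is included purely as a contrast and is not part of this specification.

```python
# Sample names spanning ASCII, Latin-1, the upper BMP and a supplementary plane.
names = ["b", "a", "\u00e9", "\uff5e", "\U00010348", "ab"]

by_codepoint = sorted(names)  # Python compares str values by code point
by_utf8 = sorted(names, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(names, key=lambda s: s.encode("utf-16-be"))

orders_agree = by_codepoint == by_utf8
utf16_differs = by_codepoint != by_utf16
```

For this sample, sorting by UTF-8 bytes reproduces the codepoint order, while sorting by UTF-16 code units does not, because U+10348 is encoded with surrogates that sort below U+FF5E.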

4 Generation of Canonical XML

The XPath node-set is converted into a UTF-8 string by generating the representative text for each node in the node-set in ascending document order with a UTF-8 encoding. No node is processed more than once. Note that processing an element node E includes the processing of all members of the node-set for which E is an ancestor. Therefore, directly after the representative text for E is generated, E and all nodes for which E is an ancestor are removed from the node-set (or some logically equivalent operation occurs such that the node-set's next node in document order has not been processed).

The result of processing a node depends on its type and on whether or not it is in the node-set. If a node is not in the node-set, then no text is generated for the node except for the result of processing its namespace and attribute axes (elements only) and its children (elements and the root node). If the node is in the node-set, then text is generated to represent the node in the canonical form in addition to the text generated by processing the node's namespace and attribute axes and child nodes.

NOTE: The node-set is treated as a set of nodes, not a list of subtrees. To canonicalize an element including its namespaces, attributes, and content, the node-set must actually contain all of the nodes corresponding to these parts of the document, not just the element node.
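The distinction made in the note can be sketched in Python with a toy tree. The Node class, the emit function and the use of a node's label as placeholder text are all hypothetical. The actual text generated for each node type is defined by this specification and is not reproduced here, and attribute and namespace axes are left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def emit(node, node_set, out):
    """Append placeholder text for `node` only if it is in the node-set,
    but visit its children either way."""
    if id(node) in node_set:
        out.append(node.label)
    for child in node.children:
        emit(child, node_set, out)
    return out

text = Node("text")
inner = Node("inner", [text])
outer = Node("outer", [inner])

# Selecting only the outer element does not pull in its descendants.
only_outer = emit(outer, {id(outer)}, [])
# A descendant is emitted even when an intermediate ancestor is not selected.
outer_and_text = emit(outer, {id(outer), id(text)}, [])
```

In this sketch only_outer contains the single label for outer, and outer_and_text contains the labels for outer and text but not inner.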

The text generated for a node is dependent on the node type and given in the following list:

The QName of a node is either the local name if the namespace prefix string is empty or the namespace prefix, a colon, then the local name of the element. The namespace prefix used in the QName MUST be the same one which appeared in the input document.
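The QName rule translates directly into a few lines of Python. The function name qname is an assumption made for the example.

```python
def qname(prefix: str, local_name: str) -> str:
    """Build a QName: the local name alone if the prefix is empty,
    otherwise the prefix, a colon, then the local name."""
    return local_name if prefix == "" else f"{prefix}:{local_name}"

unprefixed = qname("", "doc")
prefixed = qname("svg", "rect")
```

The prefix passed in is expected to be the one that appeared in the input document, since this specification does not rewrite prefixes.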

5 XML Document Subsets

Some applications require the ability to create a physical representation for an XML document subset (other than the one generated by default, which is technically a document subset because the comments are omitted). Canonical XML implementations based on XPath can provide this functionality with little additional overhead. The following additional steps must be taken:

The node-set passed to the canonical form generator is calculated by setting the initial evaluation context as described in the section Canonical XML Data Model, except replacing the variable bindings and namespace declarations with those provided above, then evaluating X.

The resultant node-set MUST contain a comment node for each comment of the input document, except those comments excluded by the expression X.

The processing of the element node is also modified slightly when the XPath expression is not the default given in the Canonical XML Data Model. The method for processing the attribute axis of an element E in the node-set is enhanced if the element's parent is not in the node-set. All element nodes along E's ancestor axis are examined for nearest occurrences of attributes in the xml namespace, such as xml:lang and xml:space (whether or not they are in the node-set). From this list of attributes, remove any that are in E's attribute axis (whether or not they are in the node-set). Then, lexicographically merge this attribute list with the nodes of E's attribute axis that are in the node-set. The result of visiting the attribute axis is computed by processing the attribute nodes in this merged attribute list.
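The attribute handling described above can be sketched in Python as follows. The function name, the dictionary representation of attribute axes and the sample attributes are hypothetical. The sketch assumes the caller has already established that E is in the node-set and that its parent is not, and it sorts the merged list by QName as a stand-in for the ordering rules of this specification.

```python
def attributes_to_process(ancestor_attrs_nearest_first, e_axis, e_in_node_set):
    """Return the merged (qname, value) list for an element E.

    ancestor_attrs_nearest_first: list of {qname: value}, parent first.
    e_axis: {qname: value} for E's whole attribute axis.
    e_in_node_set: set of qnames from e_axis that are in the node-set.
    """
    inherited = {}
    for attrs in ancestor_attrs_nearest_first:
        for name, value in attrs.items():
            if name.startswith("xml:") and name not in inherited:
                inherited[name] = value  # nearest occurrence wins
    for name in e_axis:
        inherited.pop(name, None)        # E's own axis takes precedence
    merged = {n: v for n, v in e_axis.items() if n in e_in_node_set}
    merged.update(inherited)
    return sorted(merged.items())        # assumption: order by QName

result = attributes_to_process(
    [{"xml:lang": "en", "id": "p"}, {"xml:lang": "fr", "xml:space": "preserve"}],
    {"xml:space": "default", "class": "x"},
    {"class"},
)
```

In this example the nearest xml:lang is inherited, while the ancestor's xml:space is dropped because E has its own xml:space attribute, even though that attribute is not in the node-set.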

NOTE: XML entities can derive application-specific meaning from anywhere in the XML markup as well as by rules not expressed in XML 1.0. Clearly, these rules cannot be specified in this document, so the author of the expression X must be responsible for creating an expression that preserves the information necessary to capture the full semantics of the members of the resulting node-set.

Appendix A: Resolutions

Although this specification now defines Canonical XML in terms of the XPath data model rather than XML InfoSet, the canonical form described in this document is quite similar in most respects to the canonical form described in prior versions of the Canonical XML specification. However, there are some differences. This section discusses the differences and provides a rationale for the changes.

A.1 No Character Model Normalization

The Unicode standard [Unicode] allows multiple different representations of certain "precomposed characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. The W3C has recommended a normalized representation [CharModel]. Prior drafts of Canonical XML used this normalized form. However, most XML 1.0 processors do not perform this normalization. Furthermore, applications that must solve this problem typically perform the character model normalization as character content is created, which would obviate the need for character model normalization during canonicalization. Therefore, character model normalization has been moved out of scope for Canonical XML.
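The following Python sketch shows the two representations of "ç" and how an application could normalize character content itself as it is created. The use of Unicode Normalization Form C through unicodedata.normalize is an assumption made for illustration and is not claimed to be the exact form specified by [CharModel].

```python
import unicodedata

precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

# Without normalization the two sequences differ, and so do their UTF-8 octets.
differ_as_written = precomposed != decomposed
utf8_lengths = (len(precomposed.encode("utf-8")),
                len(decomposed.encode("utf-8")))

# An application that normalizes on creation sees a single representation.
equal_after_nfc = (unicodedata.normalize("NFC", precomposed)
                   == unicodedata.normalize("NFC", decomposed))
```

Because Canonical XML leaves these sequences untouched, two such documents have different canonical forms unless the application normalizes the content beforehand.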

A.2 No Namespace Prefix Rewriting

Prior drafts of the Canonical XML specification described a method for rewriting namespace prefixes such that two documents having logically equivalent namespace declarations would also have identical namespace prefixes. However, the statement in Namespaces in XML that "the prefix functions only as a placeholder for a namespace name" is incorrect. Namespace prefixes can impart information value in an XML document if they are referenced in an attribute value or element content (for example, an element or attribute containing an XPath expression). Thus, rewriting the namespace prefixes would damage such a document by changing its meaning (and it cannot be logically equivalent if its meaning has changed). The theorems below state the results more formally.

Theorem 1: With namespace rewriting, there exist two XML documents D1 and D2 that are logically equivalent yet their canonical forms are not equal.

Proof: Let D1 be a document containing an XPath in an attribute value or element content that refers to namespace prefixes used in D1. Further assume that the namespace prefixes in D1 will all be rewritten by the canonicalization method. Let D2 = D1, then modify the namespace prefixes in D2 and modify the XPath expression's references to namespace prefixes such that D2 and D1 remain logically equivalent. Since namespace rewriting does not include occurrences of namespace references in attribute values and element content, the canonical form of D1 does not equal the canonical form of D2 because the XPath will be different. []

Remark: The same condition exists if we remove namespace rewriting. The purpose of this theorem is simply to show that namespace rewriting does not accomplish the goal for which it is intended.

Theorem 2: With namespace rewriting, there exist two XML documents D1 and D2 that have equivalent canonical forms and yet are not logically equivalent.

Proof: Let D1 be a document containing an XPath in an attribute value or element content that refers to namespace prefixes used in D1. Further assume that the namespace prefixes in D1 will all be rewritten by the canonicalization method. Now let D2 = the canonical form of D1. Clearly, the canonical forms of D1 and D2 are equivalent (since D2 is the canonical form of the canonical form of D1), yet D1 and D2 are not logically equivalent because the aforementioned XPath works in D1 and doesn't work in D2. []

Remark: Since D1 and D2 are not logically equivalent, and D2 is the canonical form of D1, we can conclude that namespace rewriting is harmful rather than simply ineffective.
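The situation in Theorem 2 can be made concrete with a toy Python model. The dictionary representation of a document, the functions rewrite_prefixes and select_resolves, and the replacement prefix n0 are all hypothetical. The sketch stands in for a prefix-rewriting canonicalizer in general and does not reproduce the scheme of any prior draft.

```python
def rewrite_prefixes(doc, mapping):
    """Rename prefixes in the declarations and the element name only,
    leaving attribute values untouched, as a prefix rewriter would."""
    prefix, _, local = doc["element"].partition(":")
    return {
        "declared": {mapping[p]: uri for p, uri in doc["declared"].items()},
        "element": f"{mapping[prefix]}:{local}",
        "select": doc["select"],   # the XPath in the attribute value is kept
    }

def select_resolves(doc):
    """Check that the prefix used in the XPath is declared in the document."""
    prefix = doc["select"].partition(":")[0]
    return prefix in doc["declared"]

# D1 carries an XPath ("a:item") in an attribute value.
d1 = {"declared": {"a": "urn:example"}, "element": "a:list", "select": "a:item"}
# D2 plays the role of the canonical form of D1 under prefix rewriting.
d2 = rewrite_prefixes(d1, {"a": "n0"})

works_in_d1 = select_resolves(d1)
works_in_d2 = select_resolves(d2)
```

In this model the XPath resolves in d1 but not in d2, since the prefix a it refers to is no longer declared after rewriting.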

The conclusion to be drawn from these theorems is that namespace prefixes should not be altered by XML canonicalization. Applications that need to test for logical equivalence will need to perform more sophisticated tests than mere octet stream comparison. However, this is quite likely to be necessary in any case in order to test for logical equivalencies based on application rules as well as rules from other XML-related recommendations, working drafts, and future works.

A.3 Handling of Default Namespace

Prior drafts of the Canonical XML specification stated that the default namespace is not used. In the XPath data model, a non-empty default namespace is indicated by a namespace node with an empty local name. An empty namespace is indicated by the absence of such a node. In keeping with the policy of not rewriting namespace prefixes, which includes not adding prefixes that were not in the source document, the default namespace system has been added to Canonical XML. When there is no default namespace node, the canonicalization method indicates this with xmlns="" even if the source document did not contain this declaration explicitly (because there is no way to find out whether it did or not). The result is logically equivalent but, like the addition of default attribute nodes, implies that XPath expression authors should be wary of creating expressions that test for the position of attribute or namespace nodes (they are bound to fail in most cases because the sorting of namespace and attribute axes occurs only on output, not during the XPath expression evaluation).
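The treatment of a missing default namespace node can be sketched in Python. The function name default_namespace_declaration and the dictionary of in-scope namespaces (with the empty string as the key of the default namespace) are assumptions made for the example, and escaping of the URI value is omitted.

```python
def default_namespace_declaration(in_scope):
    """Return the default namespace declaration text for an element.

    When no default namespace node exists, xmlns="" is produced, whether
    or not the source document contained that declaration explicitly.
    """
    uri = in_scope.get("", "")
    return f'xmlns="{uri}"'

without_default = default_namespace_declaration({"a": "urn:example:a"})
with_default = default_namespace_declaration({"": "urn:example:d"})
```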

A.4 Order of Namespace Declarations and Attributes

Prior drafts of the Canonical XML specification alternated between namespace declarations and attribute declarations. This is part of the namespace prefix rewriting scheme, which this specification eliminates. This specification follows the XPath data model of putting all namespace nodes before all attribute nodes.

A.5 Handling of Whitespace Outside Document Element

Prior drafts of the Canonical XML specification placed a #xA after each PI outside of the document element as well as a #xA after the end tag of the document element. The method in this specification performs the same function except for omitting the final #xA after the last PI (or comment or end tag of the document element). This technique ensures that PI (and comment) children of the root are separated from markup by a linefeed even if the root node or the document element is omitted from the output node-set.
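One way to picture this rule is as a join over the text of the root node's children, sketched below in Python. The function name join_top_level and the sample strings are placeholders. The text generated for each individual node is assumed to have been produced already.

```python
def join_top_level(pieces):
    """Separate the text of the root's children with #xA,
    without adding a final #xA after the last one."""
    return "\n".join(pieces)

canonical = join_top_level(["<?pi before?>", "<doc></doc>", "<?pi after?>"])
ends_with_linefeed = canonical.endswith("\n")
```

The result has a linefeed between each pair of top-level items and none after the last, which matches the omission of the final #xA described above.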

Appendix B: References

Appendix C: Acknowledgements (Non-Normative)

The following people provided valuable feedback that improved the quality of this specification:
