Syntax WG Response to XML-C14N Comments

This version:
Tim Bray (Textuality)
Joseph Reagle (W3C)

1. Multiple Levels.

The C14N Requirements Document (section 2.7) states that:

The specification proposes to describe three levels of canonicalization, changeable by the needs of the DSig WG:

Level-0: Do nothing to the XML document.
Level-1: Convert the encoding of the document to UTF-8 if it is not already UTF-8.
Level-2: Whatever is decided upon as full Canonicalization.

The group considers this requirement to be met by the current draft, since Level-2 c14n can be accomplished merely by applying the character-model rules in the I18n WG draft [CharModel]. The XML Signature WG has indicated that it will not require "mandatory implementation" of Level-2 as specified in [c14n]. However, it has also expressed interest in a required "mandatory to implement" specification that does UTF-8 and CR/LF processing. The Syntax WG would like the XML Signature WG to propose such an algorithm -- as tasked by the Simon action item in the 990830 XML Signature minutes -- and we agree to review the proposal.
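To make the "UTF-8 and CR/LF processing" idea concrete, here is a minimal sketch of what such a step might look like. It is not the algorithm the XML Signature WG is asked to propose; the function name and the choice to map both CR LF and bare CR to LF are assumptions made for illustration, and the sketch deliberately ignores the encoding declaration inside the document.

```python
def utf8_and_line_ends(data: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: transcode to UTF-8 and normalize line ends to LF."""
    text = data.decode(source_encoding)
    # Assumption: CR LF pairs and bare CRs both become a single LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


# A UTF-16 document with mixed line endings.
sample = "<doc>caf\u00e9\r\n</doc>\r".encode("utf-16")
print(utf8_and_line_ends(sample, "utf-16"))
```

A real proposal would also have to say what happens to an XML declaration that names the original encoding, which this sketch leaves untouched.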

2. From: Hiroshi Maruyama

1. The commentator questions the need for the trailing newline character. We put this in for two reasons: first, it would be nice to keep open the possibility of creating canonical XML in a text editor, most of which insist on inserting a trailing newline. Second, the absence of a trailing newline tends to mess up the operation of some automatic comparison tools such as "diff".

2. The commentator questions the omission of PIs. Upon discussion, and given the changes since our WD in the information available in an internal Infoset draft (the DTD information item now has a children property containing the comments and PIs that occur in the DTD) [infoset], the Syntax WG achieved consensus on including in the canonical form all PIs except those that appear in the scope of the document type declaration, which obviously includes those in the external subset and external parameter entities.

For PIs outside the root element, we will terminate each with a single newline.
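As an illustration only, the newline rule for PIs outside the root element could be sketched as follows. The function names are hypothetical, the PI syntax is the ordinary XML 1.0 form, and placing a newline after the root element (the trailing newline discussed in item 1) before any following PIs is an assumption, not something this response specifies.

```python
def serialize_pi(target: str, data: str = "") -> str:
    """Serialize one processing instruction in ordinary XML 1.0 syntax."""
    return f"<?{target} {data}?>" if data else f"<?{target}?>"


def serialize_top_level(pis_before, root_xml, pis_after):
    """Hypothetical helper: emit top-level PIs, each ended by one newline."""
    out = [serialize_pi(t, d) + "\n" for t, d in pis_before]
    # Assumption: the root element is followed by a single newline.
    out.append(root_xml + "\n")
    out.extend(serialize_pi(t, d) + "\n" for t, d in pis_after)
    return "".join(out)


print(serialize_top_level([("xml-stylesheet", 'href="a.css"')], "<doc/>", [("done", "")]))
```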

3. The commentator observes that the namespace handling may produce very large canonical-form documents. We agree, but believe the benefit (context-independence) justifies this.

3. From: Richard D. Brown

1. The commentator regrets the absence of the DTD from the canonical form. The WG observes that information from the DTD that can affect the actual information set (default attributes and so on) is propagated from the DTD into the canonical form. What's left is a set of syntactic rules governing which elements can contain which others, and which attributes may be attached, that only come into play in preserving validity when the document is being modified. We do not believe this latter set of information (that which is not propagated into the resulting XML) is material (for most applications) to the meaning of a document, and that information could change without affecting the validation of a signature. Consequently, this seems out-of-scope for the c14n work, and the requirement to use a full validating parser to generate the canonical form would make it substantially more expensive.
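The distinction can be seen with an ordinary non-validating parser. The sketch below uses Python's standard xml.etree.ElementTree, which is not a canonicalizer and is used here only as a stand-in; the memo document and its attribute are invented for the example. The attribute default declared in the internal subset shows up on the parsed element, while the content model is simply not carried into the re-serialized output.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the DTD supplies a default for the "status" attribute.
source = """<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!ATTLIST memo status CDATA "draft">
]>
<memo>hello</memo>"""

root = ET.fromstring(source)
print(root.attrib)        # the defaulted attribute is part of the parsed data
print(ET.tostring(root))  # the DOCTYPE itself is not re-emitted
```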

2. We acknowledge and will fix the editorial error in 5.2: &#x13; should be &#13;.
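The difference between the two references is a matter of radix, which a quick check makes plain: the x prefix marks a hexadecimal character reference, so &#x13; names code point 19 rather than the carriage return at code point 13.

```python
hex_reference = int("13", 16)   # value named by &#x13;
dec_reference = int("13", 10)   # value named by &#13;

print(hex_reference, dec_reference)
print(chr(dec_reference) == "\r")   # decimal 13 is the carriage return
print(chr(hex_reference) == "\r")   # hexadecimal 13 is a different control character
```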

3 and 4. The commentator argues that the first bullet set is not accurate with respect to XML attribute-value normalization, and that the first bullet list conflicts with the second and with 5.7. The WG observes that the draft is unclear in that the prose between the two bullet lists doesn't make it clear that the second describes what the canonical form does to address the issues raised in the first.
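For reference, the XML 1.0 attribute-value normalization at issue can be observed with a non-validating parser. This sketch uses Python's standard xml.etree.ElementTree and an invented element; it shows only the XML 1.0 behaviour for a CDATA attribute and says nothing about what the draft's bullet lists state. A literal tab or newline in the attribute value is delivered as a space, while a newline written as a character reference survives.

```python
import xml.etree.ElementTree as ET

# The Python string holds a literal tab, a literal newline, and the reference &#10;.
source = '<e a="x\ty\nz&#10;w"/>'

value = ET.fromstring(source).get("a")
print(repr(value))
```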

ACTION ITEM: James Clark - double-check that we are in fact correct in our relationship to attribute-value normalization

4. From: Milton M. Anderson

The commentator regrets the preservation of all whitespace in the canonical form, including that "between tags", and argues that when modifying canonical-form XML, it will be necessary to copy in and remember the location of all whitespace characters.

The WG observes, based on long experience with SGML, that the rules necessary to determine which white space should be retained and which discarded, in the presence of all possible combinations of element and mixed content, rapidly become extremely complex, hard to understand, and hard to implement.

This is why XML 1.0 took a deliberate decision to preserve all whitespace.

Since XML 1.0 is clear that all whitespace is part of the data in the XML document, it seems unavoidable that it should be preserved in the canonical form.

The observation that this will require apps to track where line-ends and other white space were in order to preserve them is correct, but should not present great problems since all conformant XML processors are required to preserve this information when reading XML documents.
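A short sketch of what "preserving this information" looks like in practice, using Python's standard xml.etree.ElementTree as one example of a parser API (the document is invented). The whitespace between tags is handed to the application as ordinary character data, so an application that keeps what the parser gives it can write the same whitespace back out.

```python
import xml.etree.ElementTree as ET

source = "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"
root = ET.fromstring(source)
first, second = root

# Whitespace between tags is reported as text, not discarded.
print(repr(root.text))    # before the first child
print(repr(first.tail))   # between the two children
print(repr(second.tail))  # after the last child

# Serializing the untouched tree reproduces the original whitespace.
print(ET.tostring(root, encoding="unicode") == source)
```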

5. From: John Cowan

The commentator objects to the usage of the term "combining character" and instead wants "precomposed characters". The problem is real but the solution requires a bit more subtlety: the problem is exactly the possibility of using either combining *or* precomposed characters for the same logical content, and the text could be a bit clearer. ACTION ITEM: Tim Bray, clarify text.
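The underlying problem can be shown in a few lines. The sketch below uses Python's standard unicodedata module; the choice of "é" as the sample and of NFC as the normalization form are assumptions for illustration, not a statement of what the canonical form will require. The two spellings are the same logical content but different code point sequences, and therefore different UTF-8 bytes, until one normalization form is applied to both.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

print(precomposed == combining)                                  # differ as sequences
print(precomposed.encode("utf-8"), combining.encode("utf-8"))    # differ as bytes
print(unicodedata.normalize("NFC", combining) == precomposed)    # agree after NFC
```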


Canonical XML. Bray, Clark, and Tauber.
XML Canonicalization Requirements. James Tauber, Joel Nava.
Character Model for the World Wide Web, ed. Martin J. Dürst.
XML Information Set, eds. John Cowan and David Megginson. Available at .
Namespaces in XML, eds. Tim Bray, Dave Hollander, and Andrew Layman. Available at .
The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.
Extensible Markup Language (XML) 1.0, eds. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. 10 February 1998. Available at .