This document outlines use cases, requirements, and design choices for XML Security 2.0, specifically Canonical XML 2.0 and XML Signature 2.0. It includes a proposed simplification of the XML Signature Transform mechanism, intended to improve security, performance, and streamability, and to ease adoption.

Requirements and Design Options

Web Services Security

Assumptions

Message content will be provided and processed by multiple software components acting autonomously. The XML will make use of multiple namespaces, potentially with duplicate element names.
Messages may pass through multiple intermediary nodes, which may add, remove, or alter content in either the SOAP header or body.

Requirements

In general, it must be possible to provide ephemeral authentication, integrity protection, and confidentiality of message content, including attachments, using a variety of technologies. In some cases, messages with signatures may be stored for purposes of dispute resolution.
Any or all parts of a message may be signed and/or encrypted zero or more times, in any order. Signatures and encryptions may overlap. A receiver must be able to properly verify signatures and decrypt data in the proper order (assuming access to the necessary secrets or trust points) based on nothing but the message.
It must be possible to determine whether the correct portions of the message have been signed and encrypted with the correct keys according to policy.
To the extent allowed by the ordering of data and cryptographic operations, it should be possible for a sender or a receiver to perform processing in a single pass over the message.

Enable Integrity Protection of Portions of Binary Content

Binary Portions Use Case

A digital image file contains raw image data and optional metadata, such as the date the photo was taken, exposure information, search information, and a general description. A photographer wants to use an XML signature to digitally sign a photo so that modifications can be detected, but still wants to allow other users to add new metadata to it. This is possible only if the photographer signs just the raw image data and excludes the metadata.

Binary Portions Requirements

The XML Signature 1.0 specification allows authors of XML Signatures to sign a subset of an XML document, but it does not define any grammar that allows a subset of a non-XML resource to be signed. The requirement for the next version of the XML Signature specification is to define a mechanism that allows a subset of a non-XML resource to be signed.

Canonicalization

Besides the explicit design principles and requirements in [[XML-CANONICAL-REQ]], the Canonical XML and Exclusive Canonicalization specifications are guided by a number of design decisions that we present and discuss in this section.

Historical requirements

The basic idea of a canonical XML is to have a representation of an XML document (the output being a concrete string of bytes) that captures some kind of "essence" of the document, while disregarding certain properties that are considered artifacts of the input document (thought of, again, as an octet stream), and deemed to be safely ignorable.

The historic Canonical XML Requirements [[XML-CANONICAL-REQ]] include:

The specification for Canonical XML shall describe how to derive the canonical form of any XML document. Every XML document shall have a unique canonical form.
The canonical form of an XML document shall be a well formed XML document with the following invariant property:
- Any XML document, say X, processed by a canonicalizer, will produce an XML Document X'.
- X' passed through the same canonicalizer must produce X'.
- X' passed through any other conforming canonicalizer should produce X', or else one of them is not conformant.

In other words, Canonicalization is historically thought of as a well-defined, idempotent mapping from the set of XML documents into itself.
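
The invariant can be checked mechanically. The following is a minimal sketch, assuming Python 3.8 or later, whose standard library function `xml.etree.ElementTree.canonicalize` implements C14N 2.0; the sample documents are invented for illustration and are not taken from the requirements document.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations of the same document: attribute order, quote style and
# the empty-element form differ, but the content is identical.
doc_a = '<doc b="1" a="2"><item/></doc>'
doc_b = "<doc a='2' b='1'><item></item></doc>"

canonical_a = canonicalize(doc_a)
canonical_b = canonicalize(doc_b)

# Both inputs map to a single canonical form.
same_form = canonical_a == canonical_b
# Canonicalizing a canonical form changes nothing (idempotence).
idempotent = canonicalize(canonical_a) == canonical_a
```

Here both `same_form` and `idempotent` are true, which is the property X -> X' -> X' described above, checked for one canonicalizer only.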

In its main use case, XML Signature, Canonical XML [[XML-C14N]] (and its cousin, Exclusive Canonicalization [[XML-EXC-C14N]]) is actually used to fulfill a number of distinct functions:

Canonical XML is used as the canonical mapping from a node-set to an octet stream whenever such a mapping is required to connect distinct transforms to each other.
Canonical XML is used to serialize the ds:SignedInfo element before it is hashed as part of the signing process; note that this element does not necessarily exist as a serialization.
Canonical XML is used to discard artifacts of a specific representation before that representation is hashed in the course of either signature generation or validation.

Modified Requirements

This section summarizes a number of design options that arise when some of the requirements listed above are relaxed.

Only use Canonicalization for pre-hashing

Canonicalization need not be a general-purpose transform that can be used anywhere in a transform chain. Its only use would be to produce an octet stream that will be hashed.

Currently, canonicalization is used whenever there is an impedance mismatch, with one transform emitting binary data and the next transform requiring a nodeset. This is not required of a 2.0 version.

XML Canonicalization is also used in some other specifications, such as DSS, to do some cleanup of the XML. This is not required of a 2.0 version.

Canonical output need not be valid XML

Assuming that canonicalization must be performed as the last step of reference processing, before the resulting octet stream is hashed, the requirement that XML canonicalization produce valid XML could be relaxed. This relaxation enables some interesting options: namespace prefixes can be expanded, tag names in closing tags can be omitted, and the EXI serialization format can be used. A possible design is described in [[XMLDSIG-THOMPSON]].
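
To make the idea concrete, here is a minimal sketch of a hash-only serialization that expands prefixes into namespace URIs and omits tag names in closing tags. It is a hypothetical format invented for this illustration, not the design in [[XMLDSIG-THOMPSON]]; the class and function names are assumptions, and character escaping is deliberately left out.

```python
import xml.etree.ElementTree as ET


class ExpandedNameWriter:
    """Hypothetical parser target that emits a non-XML, hash-only form."""

    def __init__(self):
        self.parts = []

    def start(self, tag, attrs):
        # ElementTree reports names already expanded as {namespace-uri}local.
        attr_text = "".join(
            f' {name}="{value}"' for name, value in sorted(attrs.items())
        )
        self.parts.append(f"<{tag}{attr_text}>")

    def end(self, tag):
        # The closing tag omits the element name.
        self.parts.append("</>")

    def data(self, text):
        self.parts.append(text)

    def close(self):
        return "".join(self.parts)


def serialize_for_hashing(xml_text):
    parser = ET.XMLParser(target=ExpandedNameWriter())
    parser.feed(xml_text)
    return parser.close()


form_x = serialize_for_hashing('<x:a xmlns:x="urn:e"><x:b>t</x:b></x:a>')
form_y = serialize_for_hashing('<y:a xmlns:y="urn:e"><y:b>t</y:b></y:a>')
```

Both inputs yield the same string, `<{urn:e}a><{urn:e}b>t</></>`, which is not well-formed XML but is still a deterministic octet stream suitable for hashing.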

Define a well-defined (and limited) serialization for `ds:SignedInfo`

For every application of XML Signature, a ds:SignedInfo element needs to be hashed and signed. This step always involves canonicalization of a document subset. While some parts of ds:SignedInfo include an open content model (ds:Object, in particular), there is a large class of signatures for which the content model of ds:SignedInfo is well-understood. A special-purpose canonicalization algorithm might be cost-effective if it can reduce the computational cost for canonicalizing ds:SignedInfo in a suitably large portion of use cases.

Limit the acceptable inputs for Canonicalization

This design option could manifest itself in several ways.

Constrain the classes of node-sets that are acceptable.

There is no need to be able to canonicalize a fully generic nodeset. The nodeset is an XPath concept, and a generic nodeset can contain many strange things, such as attribute nodes without their containing element, or namespace nodes that have been removed while the corresponding namespace declarations remain. Such cases only increase the complexity of the canonicalization algorithm without adding any value.

Instead of a generic nodeset, canonicalization needs to work on a different data model:

Start with a subtree or a set of subtrees. These subtrees must be rooted at element nodes. For example, these subtrees can't be a single text node or a single attribute node.
Optionally from this set, exclude some subtrees (of element nodes) or exclude some attribute nodes. Only regular attributes can be excluded, not attributes that are namespace declarations or in the xml namespace.
Optionally to this set, reinclude some subtrees (of element nodes). (Note: this is not supported in Canonical XML 2.0, in order to support goals related to simplicity.)

This data model avoids namespace nodes completely. It only deals with namespace declarations. It also prohibits attribute nodes without parent element nodes. Another simplification in this model is that if an element node is present, all of its namespace declarations and all of its child text nodes must be present as well.
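
The attribute-exclusion rule in this data model can be sketched as follows. This is an illustration only, assuming Python 3.8 or later and the `exclude_attrs` option of the standard library's `xml.etree.ElementTree.canonicalize`; the helper names `check_excludable` and `canonicalize_without` are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# ElementTree writes namespaced attribute names as {namespace-uri}local.
XML_NAMESPACE = "{http://www.w3.org/XML/1998/namespace}"


def check_excludable(attr_name):
    """Reject attributes that the data model does not allow to be excluded."""
    if attr_name == "xmlns" or attr_name.startswith("xmlns:"):
        raise ValueError(f"namespace declaration cannot be excluded: {attr_name}")
    if attr_name.startswith(XML_NAMESPACE) or attr_name.startswith("xml:"):
        raise ValueError(f"xml namespace attribute cannot be excluded: {attr_name}")


def canonicalize_without(xml_text, excluded_attrs):
    """Canonicalize a whole document, dropping only regular attributes."""
    for name in excluded_attrs:
        check_excludable(name)
    return canonicalize(xml_text, exclude_attrs=excluded_attrs)


result = canonicalize_without('<a keep="1" drop="2"><b drop="3"/></a>', ["drop"])
```

In this example `result` is `<a keep="1"><b></b></a>`, while a request to exclude `xmlns:p` or `xml:lang` is refused before any canonicalization takes place.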

Constrain the classes of XML documents that are acceptable.

Canonical XML currently expends much complexity on merging relative URI references appearing in xml:base attributes. A revised version of Canonical XML could be defined to fail on documents in which the xml:base URI reference cannot be successfully absolutized.

Enable optional prefix rewriting

Handling of namespaces is a known major source of complexity in Canonical XML (and, to a lesser extent, in Exclusive Canonicalization). At least part of this complexity is due to a design decision to preserve namespace prefixes, which in turn is necessary to protect the meaning of QNames.

Canonical XML should support the option of namespace prefix rewriting, optionally including rewriting prefixes that are embedded in the content as QNames. This can include, for example, QNames inside an xsi:type attribute. QNames embedded in xsi:type are easy to detect, but some other instances of QNames in content may be hard to detect, so prefix rewriting may break the meaning of QNames. The advantage of prefix rewriting is that it avoids attaching significance to the prefix name: two different prefix names are considered semantically equivalent if they map to the same namespace URI. In this case they should canonicalize to the same value, as will happen with prefix rewriting. Prefixes may be rewritten using unique string values, URIs or other mechanisms, depending on the specification design.
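
The effect of prefix rewriting can be observed with an existing C14N 2.0 implementation. The sketch below assumes Python 3.8 or later and the `rewrite_prefixes` option of `xml.etree.ElementTree.canonicalize`; the documents and the namespace URI are invented for illustration.

```python
from xml.etree.ElementTree import canonicalize

# The same content written with two different prefixes for one namespace URI.
doc_x = '<x:order xmlns:x="urn:example:orders"><x:item/></x:order>'
doc_y = '<y:order xmlns:y="urn:example:orders"><y:item/></y:order>'

# Without rewriting, the prefix is significant and the canonical forms differ.
plain_x = canonicalize(doc_x)
plain_y = canonicalize(doc_y)

# With rewriting, both documents canonicalize to the same octets.
rewritten_x = canonicalize(doc_x, rewrite_prefixes=True)
rewritten_y = canonicalize(doc_y, rewrite_prefixes=True)

prefixes_matter = plain_x != plain_y
prefixes_ignored = rewritten_x == rewritten_y
```

Both flags are true here. Note that this example contains no QNames in content; handling those requires telling the canonicalizer where they occur, which is the hard part described above.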

Transformation Simplification

Discussion

One use of an XML Signature is for integrity protection, to determine if content has been changed. Content is identified by one or more ds:Reference elements, causing that content to be located and hashed. In the current XML Signature Second Edition processing model each ds:Reference may include a transform chain to apply one or more transforms before hashing the content for inclusion in a signature.

Obviously a signature operation may occur in a workflow after various transformations have been performed on content, as long as the content can be identified by a ds:Reference at the appropriate point in that workflow. In this sense, XML Signature could be viewed as a step in a processing model, for example in XProc [[XPROC]]. What is referred to here is not such application processing steps, but only the limited case of transforms defined and processed as part of the XML Signature processing.

There are cases however where transformations must occur as part of signature processing itself. The reasons for these are more limited, however, so we propose in this document to simplify such processing. Reasons include the following:

Signing only pertains to a portion of the content, but the entire content has meaning outside of signing. Thus the signing operation should be able to sign a selected portion of content (and this may be also specified by signing all apart from a portion to be excluded).
A signature XML element may be included with the content, yet upon verification the signature element itself is excluded from the content that is verified.
Some content within a signature element might be included in signing and verification (e.g. signature properties) even though the signature element itself is not.
Sometimes it may be necessary to sign not the raw data, but the data that a user actually sees. This is called the "sign what you see" requirement in Section 8.1.2 of the XML Signature specification. This might require, for example, using XSLT to transform the raw data into an HTML form, and signing this HTML data.

Well-defined signature processing is necessary to handle needs specific to signing, but should not be expected to handle arbitrary processing that could be handled just as well as part of a workflow outside of signing.

As an example of the need to sign or verify a portion of the content, suppose you have a document with the familiar "office use only" section. When a user signs the document, the document subset should be the entire document less the "office use only" section. This way, any change made to the document in any place except the "office use only" section would invalidate the signature. The purpose of a digital signature is to become invalid when any change is made, except those anticipated by the signer. Thus, subtraction filtering is the best fit for a document subset signature.

By comparison, if a document subset signature merely selects the portion of the document to be signed, then additions can be made not only to the "office use only" section but also to any other location in the document that is outside of the selected portions. It is entirely too easy to exploit the document semantics and inject unintended side effects. That is why exclusion is necessary. Everything is signed apart from the excluded portion, thus eliminating the possibility of unwanted, undetected additions.
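
The "office use only" example can be sketched with subtraction filtering as follows. This is an illustration, not a signature implementation: it only computes the digest that would be signed, it assumes Python 3.8 or later and the `exclude_tags` option of `xml.etree.ElementTree.canonicalize`, and the element names and the helper `digest_excluding` are hypothetical.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def digest_excluding(xml_text, excluded_tags):
    """Digest the whole document except the subtrees of the excluded tags."""
    canonical = canonicalize(xml_text, exclude_tags=excluded_tags)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


EXCLUDED = ["officeUse"]

signed = digest_excluding(
    "<form><applicant>Alice</applicant><officeUse></officeUse></form>", EXCLUDED)
# Anticipated change: the office fills in its own section.
annotated = digest_excluding(
    "<form><applicant>Alice</applicant><officeUse>approved</officeUse></form>", EXCLUDED)
# Unanticipated change: signed content is altered.
tampered = digest_excluding(
    "<form><applicant>Mallory</applicant><officeUse></officeUse></form>", EXCLUDED)
# Unanticipated addition outside the excluded section.
injected = digest_excluding(
    "<form><applicant>Alice</applicant><extra>x</extra><officeUse></officeUse></form>",
    EXCLUDED)
```

Only `annotated` matches `signed`. The injected element changes the digest because everything outside the excluded subtree is covered, which is exactly what a pure selection of `applicant` would fail to detect.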

Requirements

There are specific requirements associated with Signature transform processing:

Enable applications to determine what is signed.

Support "see what you sign" by allowing applications to determine what was included for signing and possibly confirm that with users. The current unrestricted transform model makes it very difficult to inspect the signature to determine what was really signed without actually executing all the transforms.

Enable higher performance and streamability

Signing XML data should be almost as fast as serializing the XML to bytes (using an identity transformer) and then signing the bytes. Currently, transforms are defined in terms of a "nodeset", and a nodeset implies using a DOM parser, which is very slow. It should be possible to sign documents using a streaming XML parser, in which the whole document is never loaded into memory at once.

Avoid performance penalties and security risks associated with arbitrary transformations by restricting the possible transformation technologies.

Even with this restriction, such generality may still be applied in a workflow outside of signature processing.

Define a more robust canonicalization

There are many problems with the current canonicalization algorithms. For example, people are often taken aback when they are told that canonicalization does not remove whitespace between tags. Whitespace in base64-encoded content causes problems as well. The significance of prefix names is yet another source of issues. Schema-aware canonicalization is another possibility, but it may have issues related to requiring a schema.

Enable applications to determine what is signed

The current Transform chain model is very procedural: it can contain XPath, C14N, EnvelopedSig, Base64, XSLT, and other transforms any number of times, in any order. While this gives a lot of flexibility to the signer, it makes it extremely hard for the verifier to determine what was actually signed.

Current mechanisms to determine what is signed

Applications usually follow one of these mechanisms to determine what is signed:

Trust the signer completely

Some applications do not inspect the transform chain at all. They expect that the signer has sent a meaningful and safe transform chain, and since the transform chain is also signed, they are assured that the chain has not changed in transit.

This does not work for scenarios where the verifier has little trust in the signer. As an example, suppose there is an application that expects requests to be signed with the user's password, and there are tens of thousands of users. This application will of course not trust all of its users. Given the possibility of DoS attacks, and given that some transforms can change what is really signed, it will not want to run a chain of transforms that it doesn't understand.

Check predigested data

Some XML signature libraries have a provision to return the predigested data back to the application, i.e. the octet stream that results from running all the transforms, including an implicit canonicalization at the end.

The predigested data, however, cannot be easily compared with the expected data. Suppose the application expects XML elements A, B and C to be signed. It cannot just convert A, B and C to octet streams and search for them inside the predigested octet stream: the predigested data is canonicalized, so the search might fail. This mechanism is also subject to wrapping attacks, as there is no information as to which part of the original document produced the predigested data.
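
A small example shows why the naive search fails. It is a sketch, assuming Python 3.8 or later and the standard library's `xml.etree.ElementTree.canonicalize` as a stand-in for the implicit final canonicalization; the document is invented for illustration.

```python
from xml.etree.ElementTree import canonicalize

document = '<root><A y="2" x="1"/><B/></root>'
# The application's own serialization of the element it expects to be signed.
expected_fragment = '<A y="2" x="1"/>'

# Stand-in for the predigested data returned by a signature library.
predigested = canonicalize(document)

found_in_source = expected_fragment in document
found_in_predigested = expected_fragment in predigested
# Canonicalizing the expected element first makes the search succeed here,
# but only because this example has no inherited namespace context.
found_after_c14n = canonicalize(expected_fragment) in predigested
```

Here `found_in_source` is true and `found_in_predigested` is false, because canonicalization reordered the attributes and expanded the empty element. Even when the last check succeeds, the match says nothing about where in the original document the octets came from.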

Check nodeset just before canonicalization

If the transform chain only has nodeset->nodeset transforms (i.e. XPath or EnvelopedSig) in the beginning, followed by one final nodeset->binary transform (i.e. a C14N transform), then an implementation can return the nodeset just before the canonicalization. Unlike the predigested data, this is much easier to compare - DOM specifically has a method to compare nodes for equality, so this method could be used to compare the expected nodeset with the nodeset just before canonicalization.

Unfortunately this mechanism does not work if there is any transform that causes an internal conversion from nodeset->binary->nodeset, because in such a case the nodes cannot be compared any more. An XSLT transform does this kind of conversion, as does the DecryptTransform.

Put restrictions on transforms

Many higher level protocols put restrictions on the transforms. For example, ebXML specifies that there should be exactly two transforms, namely XPath and then the EnvelopedSig transform. SAML specifies there should be only one transform, the EnvelopedSig transform. This is not a generic solution, but it works well for these specific cases.

Problems with Id based references and XPath Transforms

The XPath transform is a very useful transform to specify what is to be signed. Id based mechanisms are simpler, but they have many problems:

An Id identifies a complete subtree; if some parts of the subtree have to be excluded, an XPath has to be used.
An Id attribute has to be of type ID. If there is no schema/DTD information it is not possible to determine the type. Some implementations get around this by having certain reserved names, e.g. xml:id or wsu:id. These attributes are allowed everywhere and assumed to be of type ID even if there is no schema available.
Ids usually require schema changes, i.e. the schema has to identify which elements can have ID attributes.
Ids can also lead to wrapping attacks.

These problems are solved with XPath, but XPath has problems of its own:

A regular XPath Filter specifies XPaths "inside out". Anything more difficult than the simplest XPath requires using the "count" and other special functions. The XPath is often so complex that it is almost impossible to determine what is being signed by looking at the XPath expression.
XPath Filter 2.0 solves this problem and lets people write regular XPath, but it hasn't gained wide acceptance because it is optional. It also offers too much unneeded flexibility, allowing any number of union, intersect and subtract operations in any order. This flexibility again makes things harder for the verifier.
Unlike an Id, which can appear only once per reference, an XPath transform can be anywhere in the transform chain. For example, a transform chain can have XPath->C14N->XPath. A verifier getting this kind of transform chain would be clueless about the intent of the transform.

Required "declarative selection"

It would be preferable if, instead of using transforms, the signature were more declarative and clearly separated selection from canonicalization. For example, it could list all the URIs, Ids, included XPaths, and excluded XPaths of the elements that are signed, and then apply canonicalization. This would make it easier for the verifier to first inspect the signature to determine what is signed and compare it against a policy. To give one example, there might be a WS-SecurityPolicy with an expected list of XPaths. Only if the signature matches this list will the verifier perform the canonicalization to compute the digests.
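
The inspect-then-canonicalize order can be sketched as a simple policy check. This is a minimal illustration under stated assumptions: the `Selection` type, the `acceptable` function, and the comparison of XPaths as plain strings are all hypothetical simplifications, not part of any specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selection:
    """Hypothetical declarative selection carried by one reference."""
    included: frozenset
    excluded: frozenset = frozenset()


def acceptable(declared, required, allowed_exclusions):
    """Check a declared selection against policy before any digesting.

    Everything the policy requires must be included, and nothing may be
    excluded beyond what the policy tolerates.
    """
    return (required <= declared.included
            and declared.excluded <= allowed_exclusions)


policy_required = frozenset({"/Envelope/Body"})
policy_allowed_exclusions = frozenset({"/Envelope/Body/Annotations"})

good = Selection(included=frozenset({"/Envelope/Body", "/Envelope/Header/To"}))
carved = Selection(included=frozenset({"/Envelope/Body"}),
                   excluded=frozenset({"/Envelope/Body/Payment"}))

good_ok = acceptable(good, policy_required, policy_allowed_exclusions)
carved_ok = acceptable(carved, policy_required, policy_allowed_exclusions)
```

The first selection is accepted and the second is rejected without parsing the signed content or running any canonicalization, which is the benefit of separating selection from processing.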

Avoid Security risks

The XML Signature Best Practices document [[XMLDSIG-BESTPRACTICES]] points out many potential security risks in XML Signatures.

Order of operations

Reference validation before signature validation is extremely susceptible to denial of service attacks in some scenarios.

Insecurities in XSLT transforms

XSLT is a complete programming language. An untrusted XSLT can use deeply nested loops to launch DoS attacks, or use "user defined extensions" like "os.exec" to execute system commands.

Full expansion of Nodesets

As mentioned above, full expansion of an XPath nodeset results in a huge amount of memory usage, and this can be exploited for DoS attacks.

Complex XPaths

XPath Filter 1.0 requires very complex-looking XPaths. These are very hard to understand, and an application can potentially be fooled into believing something is signed when it actually is not. Complex XPaths can also consume too many resources.

Wrapping attacks

ID based references and the lack of a mechanism to determine what was really signed can enable wrapping attacks [[MCINTOSH-WRAP]].

Problems with RetrievalMethod

RetrievalMethod can lead to infinite loops. Transforms in RetrievalMethod can also lead to many attacks, and these cannot be solved by changing the order of operations.

These security risks need to be addressed in the new specification.

Enable higher performance and streamability

XML Signature should not require DOM. There are existing streaming XML Signature implementations but they make various assumptions. It would be better to formalize these assumptions and requirements at the standardization level, rather than leave it up to each implementation.

Overheads of DOM

DOM parsers have a large overhead. Suppose there is a 1MB XML document. If this is loaded into memory as a byte array, it remains a 1MB byte array. But if it is parsed into a DOM, it explodes to 5-10x in size. This is because in DOM each XML node has to become an object. Objects have overheads for memory bookkeeping, virtual function tables, and so on. Each XML node also needs parent, next sibling, and previous sibling pointers, as well as prefix, namespaceURI, and similar fields, which could be objects themselves. All of these eat up memory, and it is a popular misconception that memory is very cheap. Even if this memory were only a temporary allocation, it would still be expensive: in garbage-collected languages, allocating and freeing too much memory triggers the garbage collector too often, which drastically slows down the system. This 10x DOM explosion can also exhaust physical memory and force more pages to be swapped to and from disk. That is why web services often use streaming XML parsers on the server side. DOM parsers will croak and groan if asked to process multiple large XML documents simultaneously, whereas streaming XML parsers will happily chug along because of their low memory consumption.

One Pass

It is important to distinguish between one-pass processing and streamability. Streamability means not requiring the whole document to be available in a parsed form for random access, i.e. not requiring a DOM. While one pass is desirable, two passes do not take away all the merits of streaming. Suppose the signature value comes before the data to be signed. The signature value then cannot be written in the first pass, only in the second, but this is not really bad from a performance point of view. Say the document is being streamed out into a 1MB byte array: in the first pass, write some dummy bytes for the signature value and remember their location; in the second pass, just update this location with the actual signature bytes. The second pass is therefore very quick.
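
The placeholder technique can be sketched as follows. This is an illustration under loud assumptions: an HMAC-SHA256 hex string stands in for a real signature value, no XML or canonicalization is involved, and the function name `write_signed_message` and the message layout are hypothetical.

```python
import hashlib
import hmac

# Length of the stand-in signature value: 64 hex characters for HMAC-SHA256.
SIGNATURE_LENGTH = 64


def write_signed_message(header, payload_chunks, key):
    out = bytearray()
    out += header

    # First pass: reserve room for the signature and remember where it is.
    signature_offset = len(out)
    out += b"0" * SIGNATURE_LENGTH

    # Still the first pass: stream the payload out while feeding the MAC.
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for chunk in payload_chunks:
        mac.update(chunk)
        out += chunk

    # Second pass: overwrite the dummy bytes in place. No data is re-read.
    signature = mac.hexdigest().encode("ascii")
    out[signature_offset:signature_offset + SIGNATURE_LENGTH] = signature
    return bytes(out)


message = write_signed_message(b"<header>", [b"part one, ", b"part two"], b"demo-key")
```

The "second pass" touches only the reserved bytes, so its cost is independent of the size of the payload.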

Also, streamability does not require any particular ordering of the subelements of the Signature element. It can be assumed that the entire Signature element (assuming it is a detached or enveloped signature) will be loaded into a Java/C++ object, so the order of the elements inside the Signature element does not affect streamability.

Verification in particular cannot be one pass. Suppose you have a signed 1GB incoming message, which you need to verify first and then upload to a database. You have to make two passes over this data: a first pass to verify and a second pass to upload to the database. These two cannot be combined into one pass because the verification result is determined only after reading the last byte.

Nodeset

The main impediment to streamability is the transform chain, because many of the transforms are defined on nodesets and nodeset requires a DOM. An XPath transform is the biggest culprit as there are many XPath expressions which cannot be streamed. It is necessary to define a streamable subset of XPath (which has been done for XPath 1.0, see [[XMLDSIG-XPATH]]).

Nodesets have another big problem. This nodeset concept was borrowed from XPath 1.0, and an XPath nodeset introduces a new kind of XML node - the namespace node. Namespace nodes are different from namespace declarations in an important way - they are not inherited. This means they need to be repeated for every node for which they are applicable. To give an example, if there is a document with 100 namespace declarations at the top element and with 99 child elements of the top element, a regular DOM will have only 200 nodes (1 top element node + 99 child element nodes + 100 attribute nodes), whereas a nodeset will have 10,100 nodes (1 top element + 99 child elements + 100*100 namespace nodes).
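
The arithmetic in this example generalizes as follows. The sketch assumes, as the example does, that all namespace declarations sit on the top element and that the children are leaf elements; the function names are hypothetical.

```python
def dom_node_count(namespace_declarations, child_elements):
    # One top element, its children, and one attribute node per declaration.
    return 1 + child_elements + namespace_declarations


def nodeset_node_count(namespace_declarations, child_elements):
    # Namespace nodes are not inherited: every element (the top element and
    # each child) carries its own copy of every in-scope namespace node.
    elements = 1 + child_elements
    return elements + namespace_declarations * elements


dom_nodes = dom_node_count(100, 99)          # 200
nodeset_nodes = nodeset_node_count(100, 99)  # 10100
```

The DOM grows linearly in the number of declarations and elements, while the fully expanded nodeset grows with their product, which is what makes it attractive for denial of service attacks.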

A naive implementation that uses the nodeset as defined will therefore be very slow, and will also be subject to various denial of service attacks. A smart implementation can try not to expand the nodeset fully and use inheritance instead, but then it won't be fully compliant with the XML Signature spec. This is because an XPath filter can address each of the namespace nodes individually and filter them out, even though this is meaningless in XML. The Y4 test vector in the Exclusive Canonicalization Implementation and Interoperability Report has an example of this. Because of these performance problems some implementations do not support this Y4 test vector or only support it conditionally.

Streaming XPath Profile for XML Signature 2.0

XML Signature requires a profile of XPath to enable streaming.

Signature verification can be done in two passes. The first pass is a very cursory pass to collect the signature element and signing keys from the document. Signatures are often present at the beginning of the document, so this is usually a very short pass. At the end of the first pass, the IncludedXPath and ExcludedXPath are taken from each reference and used to construct "state machines" from these XPaths.

After the first pass, the second pass is performed. In this pass the document is parsed using a streaming XML parser to generate XML events. These events are fed into the state machines. If an event is accepted by an IncludedXPath but not by an ExcludedXPath, it is included: the event is passed on to a streaming canonicalizer, and then to a streaming digestor. At the end of the second pass, the result is a digest for each reference.
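
The second pass can be sketched with a streaming parser whose events are filtered by path and fed straight into a digest. This is a deliberately reduced illustration: the "XPaths" are limited to absolute child-axis paths such as /doc/body (an assumption far narrower than any real profile), the serialization is a naive stand-in for a streaming canonicalizer, and the names `SelectingDigester` and `streaming_digest` are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


class SelectingDigester:
    """Parser target that digests only events selected by the given paths."""

    def __init__(self, included, excluded):
        self.included = included
        self.excluded = excluded
        self.path = []                  # names of the currently open elements
        self.digest = hashlib.sha256()

    def _selected(self):
        current = "/" + "/".join(self.path)

        def covers(prefix):
            return current == prefix or current.startswith(prefix + "/")

        return (any(covers(p) for p in self.included)
                and not any(covers(p) for p in self.excluded))

    def _emit(self, text):
        if self._selected():
            self.digest.update(text.encode("utf-8"))

    def start(self, tag, attrs):
        self.path.append(tag)
        attr_text = "".join(f' {k}="{v}"' for k, v in sorted(attrs.items()))
        self._emit(f"<{tag}{attr_text}>")

    def end(self, tag):
        self._emit(f"</{tag}>")
        self.path.pop()

    def data(self, text):
        self._emit(text)

    def close(self):
        return self.digest.hexdigest()


def streaming_digest(chunks, included, excluded):
    parser = ET.XMLParser(target=SelectingDigester(included, excluded))
    for chunk in chunks:                # the document is never held as a tree
        parser.feed(chunk)
    return parser.close()


original = "<doc><body>hello</body><office>draft</office></doc>"
digest = streaming_digest([original[:10], original[10:]], ["/doc"], ["/doc/office"])
```

Only the open-element stack is kept in memory, so the memory cost depends on document depth rather than document size. Changing the content of `office` leaves `digest` unchanged, while changing `body` alters it.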

The operation and requirements of this XPath profile are different from those of other XPath profiles, such as that for XSLT template processing [[XSLT21]]. For this reason, XML Security requires its own XPath profile, although it might be suitable for other uses as well.

The reason the XSLT XPath profile is not suitable is that the assumptions and requirements are different. In XSLT processing the XPaths are not known in advance. The XSLT processor has to be ready to process any XPath that it comes across, so it maintains a context. This context consists of all the ancestors of the current element and some histograms so that it can process the position() function. The XPath needs to be evaluated with only this context and nothing else. This is a fundamental difference from the XML Signature model. In XML Signature, the XPaths are known in advance and are continuously evaluated for every node. But in XSLT, they are evaluated only once.

The XPath subset is defined as the kind of subset that can be evaluated with the XPath context. In the XSLT profile, for example, all sideways axes (following, preceding, following-sibling, and preceding-sibling) are disallowed by the subset. But the Signature subset allows following and following-sibling.

Another big difference is the way this subset is defined. XML Signature defines the subset by syntax. Although this kind of definition is simpler to define and understand, it results in XPaths that are allowed in one syntax but not in another. For example, /a/b is allowed, but (/a)/b is not allowed in XML Signature. XSLT defines the subset by a "data flow graph". This has restrictions like once you start going up, you can't go down. (See the seven such rules in http://www.w3.org/TR/xslt-21/#streamability-conditions.) While XML Signature is very strict in allowing only attributes in predicates, XSLT is much more lax. For example, /a[b] is not allowed in XML Signature but is allowed in XSLT, because rule 4 says that it is OK to go downwards as long as you don't revisit a node more than once.

Another difference arising from this evaluation model is that XSLT allows relative XPaths; in fact, that is a very important part of XSLT. There is always a current context node when evaluating an XSLT XPath, so it allows the parent and ancestor axes.

In summary, the two subsets have completely different purposes, and there is no benefit in making them similar; doing so would only cripple both use cases.

There are subsets whose use cases are similar to XML Signature where XPath expressions are known in advance and XPath expressions are used for selection. An example is the WS-Transfer use case.
