Canonical EXI

1. Introduction

The EXI 1.0 Recommendation [Efficient XML Interchange (EXI) Format 1.0] specifies the syntax of a class of resources called EXI streams. It is possible for EXI streams which are equivalent for the purposes of many applications to differ in physical representation. For example, they may differ in their datatype representation and attribute ordering. It is the goal of this specification to establish a method for determining whether two documents are identical, or whether an application has not changed a document, except for transformations permitted by EXI 1.0.

1.1 Notational Conventions and Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear EMPHASIZED in this document, are to be interpreted as described in RFC 2119 [IETF RFC 2119].

The term canonical is used throughout this document to denote a normative form in regard to the physical representation. The term canonical EXI refers to EXI that is in canonical form produced by the method described in this specification.

The term sorted lexicographically denotes lexicographical ordering of strings which is done by comparing character by character. Individual characters are ordered by comparing their Unicode code points.

1.2 Need of Canonical EXI

Many environments and device classes have difficulty handling plain-text XML for various reasons (e.g., document size and processing overhead). The W3C Efficient XML Interchange (EXI) format has been developed to provide a solution to these issues and to extend the use of XML and its tools.

With EXI, constrained environments and device classes (with limited memory, bandwidth, and processing power) can become part of the XML world. However, some use cases also require a canonical representation of the XML-based data in order to compare logical and physical equivalence. Hence, there is a need to support EXI canonicalization without going through plain-text XML in situations where nothing but EXI is available.

1.3 Applications

One application field for the canonical form of an XML-based document or document subset is digital signatures. During signature generation, the digest is computed over the canonical form of the document or the document subset, respectively. The document is then transferred to the relying party, which validates the signature by reading the document and computing a digest of the canonical form of the received document (see 3.2 Signature Processing Steps). If the two digests are equal, the relying party can be assured that the information content of the document has not been altered since it was signed.

Although EXI supports plain-text XML Signature by preserving XML information such as comments and prefixes (see EXI Best Practices for XML Signature), this strategy is not suited for all environments and use cases.

1.4 Limitations

It is the goal of this specification to provide a canonical EXI form for various use cases. For example, restricted and very limited devices should be able to create or check against a canonical EXI stream. This applies to devices that may be able to speak only a given EXI language (according to an XML Schema) or that support only a subset of all EXI features.

Therefore, the process for building a Canonical EXI stream is based on knowledge of the EXI options in use. Moreover, there is not one canonical EXI stream for an EXI event sequence but many, depending on the schema knowledge in use and the corresponding EXI options and fidelity settings. These EXI options MUST be set or known (e.g., out-of-band) so that two parties produce the same octet stream. This also implies that the process for building a Canonical EXI stream does not take into account the EXI header and its options but is instead based solely on the EXI Body stream.

2. EXI Canonicalization

The EXI Canonicalization steps and algorithms described below expect EXI events (e.g., SE, NS, and AT events) as input. The input MUST be a sequence of EXI events, and the output is a canonicalized EXI stream. Following the presented algorithms guarantees that logically identical documents produce identical serialized EXI Body stream representations (assuming the same EXI coding options).

Note:

An EXI stream can be passed to a final recipient over multiple intermediary nodes. In general, it is feasible to parse and re-encode the EXI stream on such an intermediary node without affecting the canonical EXI stream. However, please note that altering the EXI Options (e.g., preserve option or schemaId) used to encode the body of the EXI stream may lead to irrecoverable data loss or differences.

2.1 EXI Alignment Options and Streams

EXI provides four alignment options, namely bit-packed, byte-alignment, pre-compression, and compression.

The canonicalized EXI form is the resulting EXI stream following the rules defined in this document. When the alignment option compression is set for an EXI stream, its canonical form is computed as if the EXI stream was encoded using the alignment option pre-compression.

EXI processors may make use of padding bits, for example to make the length of the EXI stream byte-aligned. In a Canonical EXI stream padding bits, if necessary, MUST always be represented as a sequence of 0 (zero) bits.
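As a minimal sketch of the zero-padding rule, the following helper packs a bit sequence into octets and fills the unused bits of the final octet with 0 bits. The function name and the string-of-bits input are assumptions made for illustration; they are not defined by EXI or by this specification.

```python
def pack_bits_zero_padded(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into octets.

    Hypothetical helper: if the bit count is not a multiple of 8,
    the final octet is completed with 0 (zero) padding bits.
    """
    if any(bit not in "01" for bit in bits):
        raise ValueError("bits must contain only '0' and '1'")
    padding = (-len(bits)) % 8          # number of padding bits needed
    padded = bits + "0" * padding       # padding is always zero bits
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
```

For instance, the three bits "101" are emitted as the single octet 0xA0 (binary 10100000).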

Each EXI stream begins with an EXI Header but it MUST NOT be taken into account when building the canonical EXI form.

2.2 EXI Event Selection

EXI processors represent a given event such as a start element or an attribute by serializing an event code first, followed by the corresponding event content. Each event code is represented by a sequence of 1 to 3 parts that uniquely identifies an event.

In situations where an EXI processor has more than one possible event (or event code) at its disposal, the canonical EXI form prescribes which event, and thus which event code, has to be chosen.

That said, it is not uncommon for an EXI processor to have a certain flexibility in choosing the appropriate EXI grammar production or, respectively, the appropriate event. Moreover, the availability of grammar productions is subject to the convention used by the application. A prominent convention is the [Efficient XML Interchange (EXI) Profile], which is more restrictive than the [Efficient XML Interchange (EXI) Format 1.0] specification in regard to which productions are usable.

After excluding productions that are not usable according to the convention in use, a canonical EXI processor MUST also apply the following order:

1. Use the event with the most accurate event content first.
   For Start Element events the order is as follows:
   1. SE ( qname )
   2. SE ( uri : * )
   3. SE ( * )
   For Attribute events the order is as follows:
   1. AT ( qname )
   2. AT ( uri : * )
   3. AT ( * )
2. If the accuracy is the same, use the event with the fewest event code parts.

The following example shows the available productions for an example DocContent grammar. From the perspective of the [Efficient XML Interchange (EXI) Format 1.0] specification, it is perfectly fine to match a start element "A" with event code 0 (zero) or 4 (four). The canonical EXI form prescribes event code 0 (zero).

Example 2-1. Example productions with event codes


Syntax                      Event Code
DocContent
    SE ("A") DocEnd         0
    SE ("B") DocEnd         1
    SE ("C") DocEnd         2
    SE ("D") DocEnd         3
    SE (*) DocEnd           4
    DT DocContent           5.0
    CM DocContent           5.1.0
    PI DocContent           5.1.1
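The selection order described in this section can be sketched as follows. This is an illustrative model only: the Production class, the use of None to denote a wildcard, and the function names are assumptions, and the productions passed in are expected to be those that remain usable under the convention in use. Ties beyond the two stated criteria are not addressed by the rules above; the sketch simply keeps the first candidate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Production:
    """Hypothetical model of an SE or AT production."""
    uri: Optional[str]          # None stands for a wildcard uri
    local_name: Optional[str]   # None stands for a wildcard local-name
    event_code: Tuple[int, ...]  # 1 to 3 event code parts


def accuracy_rank(p: Production) -> int:
    """0 for (qname), 1 for (uri : *), 2 for (*)."""
    if p.local_name is not None:
        return 0
    return 1 if p.uri is not None else 2


def matches(p: Production, uri: str, local_name: str) -> bool:
    if p.local_name is not None:
        return p.uri == uri and p.local_name == local_name
    return p.uri is None or p.uri == uri


def select_production(productions: Sequence[Production],
                      uri: str, local_name: str) -> Production:
    """Pick the most accurate match, then the fewest event code parts."""
    candidates = [p for p in productions if matches(p, uri, local_name)]
    if not candidates:
        raise LookupError("no usable production")
    return min(candidates, key=lambda p: (accuracy_rank(p), len(p.event_code)))


# The SE productions of Example 2-1 (elements in no namespace).
doc_content = [
    Production("", "A", (0,)),
    Production("", "B", (1,)),
    Production("", "C", (2,)),
    Production("", "D", (3,)),
    Production(None, None, (4,)),
]
```

With these productions, a start element "A" selects event code 0, while an element that has no dedicated production falls back to the wildcard with event code 4.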

2.3 EXI Stream Order

In general, a canonical EXI processor SHALL NOT change the order of the EXI input sequence. The only exceptions to this statement are sequences of attributes and/or namespace declarations.

The EXI specification defines that namespace (NS) and attribute (AT) events associated with a given element occur directly after the start element (SE) event in the following order:

...

AT (xsi:type)

AT (xsi:nil)

...

In addition, canonical EXI specifies that namespace declarations for a given element MUST be sorted lexicographically according to the NS prefix. Further, canonical EXI strictly requires that an xsi:type or an xsi:nil attribute MUST occur before other AT events even if it does not impact grammar selection. Moreover, attributes other than xsi:type and xsi:nil for a given element MUST be sorted lexicographically, first by qname local-name then by qname uri.

Note:

Optimizations such as pruning insignificant xsi:type values (e.g., xsi:type="xsd:string" for string values) or insignificant xsi:nil values (e.g., xsi:nil="false") are prohibited for a Canonical EXI processor.
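The ordering rules of this section can be sketched as follows. The tuple layouts for namespace declarations and attributes, as well as the function name, are assumptions made for illustration. The sketch relies on Python comparing strings by Unicode code point, which corresponds to the definition of "sorted lexicographically" in 1.1.

```python
from typing import List, Tuple

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

NsDecl = Tuple[str, str]          # (uri, prefix), hypothetical layout
Attribute = Tuple[str, str, str]  # (uri, local-name, value), hypothetical layout


def canonical_order(ns_decls: List[NsDecl],
                    attributes: List[Attribute]) -> Tuple[List[NsDecl], List[Attribute]]:
    """Return NS and AT events of one element in canonical order."""
    # Namespace declarations are sorted by prefix.
    ns_sorted = sorted(ns_decls, key=lambda ns: ns[1])

    def at_key(at: Attribute):
        uri, local_name, _value = at
        if uri == XSI_NS and local_name == "type":
            return (0, "", "")        # xsi:type always comes first
        if uri == XSI_NS and local_name == "nil":
            return (1, "", "")        # xsi:nil follows xsi:type
        return (2, local_name, uri)   # others: local-name first, then uri

    return ns_sorted, sorted(attributes, key=at_key)
```

Applied to the input of Example A-1, the sketch yields the NS events in the order bla, foo and the AT events in the order a, b, c.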

2.4 EXI Datatypes

This section describes the built-in EXI datatype representations used for representing content items in canonical EXI streams.

A value content item that can be represented by the associated EXI datatype MUST be represented with the associated datatype representation. When the strict option is false, attributes and character events that cannot be represented by the associated EXI datatype representations (e.g., schema-invalid values) MUST use the additional untyped AT and CH terminal symbols.

Note:

A Canonical EXI processor MUST NOT account for XML schema validity (just like an EXI processor). The verification is based solely on EXI grammars and EXI datatypes.

When the Preserve.lexicalValues option is true, individual items are represented as String. Each value MUST be represented as a String with the associated restricted character set, if such a set is defined for the associated datatype representation (see Restricted Character Sets for Built-in EXI Datatype Representations). String content items associated with a restricted character set MUST also follow the rules described in 2.4.6 Restricted Character Sets.

When the Preserve.lexicalValues option is false, a value content item MUST be represented with the associated datatype representation. The following sub-sections describe the Canonical EXI behavior for datatypes that otherwise may not lead to a uniquely defined representation.

Canonical EXI processors SHOULD support string-based EXI input stream values that, according to Canonical EXI, must be represented with an EXI datatype other than String (e.g., the value "0.1230" typed as String, which as a Canonical EXI Float would be mantissa 123 and exponent -3). However, due to the increased code footprint and processing complexity, Canonical EXI processors are only REQUIRED to support EXI input streams that already use the corresponding datatype representation. Be aware of this restriction when passing EXI streams to a recipient that is required to create the canonical EXI form.

2.4.1 Unsigned Integer

The EXI specification defines that the Unsigned Integer datatype representation supports unsigned integer numbers of arbitrary magnitude. EXI processors SHOULD support arbitrarily large Unsigned Integer values. EXI processors MUST support Unsigned Integer values less than 2147483648.

Canonical EXI processors MUST use the Unsigned Integer datatype representation even if a value goes beyond the value 2147483647.
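The following sketch illustrates why no upper bound is needed. It assumes the octet layout of the EXI 1.0 Unsigned Integer (7 value bits per octet, least significant group first, most significant bit set while more octets follow); values beyond 2147483647 simply use additional octets. The function name is hypothetical.

```python
def encode_unsigned_integer(value: int) -> bytes:
    """Encode a non-negative integer of arbitrary magnitude.

    Assumed layout: 7 value bits per octet, least significant
    group first; the high bit flags that another octet follows.
    """
    if value < 0:
        raise ValueError("Unsigned Integer cannot be negative")
    out = bytearray()
    while True:
        septet = value & 0x7F
        value >>= 7
        if value:
            out.append(septet | 0x80)   # more octets follow
        else:
            out.append(septet)          # last octet
            return bytes(out)
```

For example, 2147483648 (2^31) is encoded in five octets: 0x80 0x80 0x80 0x80 0x08.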

2.4.2 Enumeration

The EXI Enumeration assigns to each item an unsigned integer value that corresponds to its ordinal position in the enumeration in schema-order starting with position zero. When there is more than one item that represents the same value in the enumeration, the value MUST be represented by using the first ordinal position that represents the value.
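A minimal sketch of the first-ordinal-position rule is given below. It assumes that the enumeration items are already available as parsed values in schema order, so that equality is tested on values rather than on lexical forms; the function name is hypothetical.

```python
from decimal import Decimal
from typing import Any, Sequence


def enumeration_ordinal(enum_values: Sequence[Any], value: Any) -> int:
    """Return the first ordinal position (starting at zero) whose item equals value."""
    for position, candidate in enumerate(enum_values):
        if candidate == value:
            return position       # first match wins, later duplicates are ignored
    raise ValueError("value is not part of the enumeration")


# Hypothetical decimal enumeration in which "1.0" and "1.00" denote the same value.
items = [Decimal("1.0"), Decimal("2"), Decimal("1.00")]
```

Here the value 1 is represented by ordinal position 0, never by position 2.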

2.4.3 Float

The EXI Float datatype uses two consecutive EXI Integers. The first Integer represents the mantissa of the floating point number and the second Integer represents the base-10 exponent of the floating point number.

The canonical EXI Float MUST respect the following constraints.

A mantissa value of -0 MUST be changed to 0. If the mantissa is 0, then the exponent MUST be 0. If the mantissa is not 0, mantissas MUST have no trailing zeros.
An exponent value of -0 MUST be changed to 0.

Given an EXI Float value that consists of one integer representing its mantissa and the other integer representing its exponent, Canonical EXI processors MUST find an equivalent canonical EXI Float that satisfies the above constraints, where the rules for determining equivalence are described below.

Two floats A and B, denoted as (mantissa, exponent) pairs (mA, eA) and (mB, eB) where eA >= eB, are equivalent under the following circumstances:

1. Both mantissa and exponent are the same between the two floats.
2. Otherwise, if the two exponents are different (i.e., eA > eB), substitute A with A2, where A2 has exponent eB and mantissa mA * 10^(eA-eB). If A2 and B are equivalent per rule 1 above, A and B are equivalent.

The appendix section A.2 EXI Floats depicts one example algorithm for finding the canonical EXI Float that is equivalent to a given EXI Float value.
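The constraints and the equivalence rule of this section can be sketched for a (mantissa, exponent) pair as follows. This is one possible strategy, not a prescribed one; the function names are hypothetical, and special values as well as the value ranges of the EXI Float representation are deliberately not handled.

```python
from typing import Tuple

ExiFloat = Tuple[int, int]  # (mantissa, exponent), hypothetical layout


def canonical_float(mantissa: int, exponent: int) -> ExiFloat:
    """Return the canonical pair for a finite (mantissa, exponent) value."""
    if mantissa == 0:
        return (0, 0)                 # zero mantissa forces exponent 0
    while mantissa % 10 == 0:         # strip trailing zeros of the mantissa
        mantissa //= 10
        exponent += 1
    return (mantissa, exponent)


def floats_equivalent(a: ExiFloat, b: ExiFloat) -> bool:
    """Equivalence test: align A to the smaller exponent, then compare."""
    (m_a, e_a), (m_b, e_b) = a, b
    if e_a < e_b:                     # make sure eA >= eB
        (m_a, e_a), (m_b, e_b) = (m_b, e_b), (m_a, e_a)
    return m_a * 10 ** (e_a - e_b) == m_b
```

For example, the pair (1230, -4) is canonicalized to (123, -3), and both pairs are equivalent under the rule above.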

2.4.4 Date-Time

The EXI Date-Time is a sequence of values representing the individual components of the Date-Time.

The values MUST be canonicalized according to XML Schema dateTime canonical representation.

2.4.5 Strings and String Table

A String value MUST be represented as a string value hit if possible. Unless the convention used by the application dictates differently (e.g., EXI Profile parameter localValuePartitions set to "0"), EXI processors MUST first try to represent the string value as a local value hit and, only when this is not successful, as a global value hit.

Note:

A String value miss MAY also need to follow the rules described in 2.4.6 Restricted Character Sets according to the given restricted character set, if available.

Note that a Canonical EXI processor MUST also respect the XML schema whiteSpace facet, if available.
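The lookup order for string values can be sketched as follows. The list-based partitions, the use_local_partition flag (standing in for a convention such as localValuePartitions set to "0"), and the returned labels are assumptions made for illustration; updating the string table after a miss is outside the scope of the sketch.

```python
from typing import List, Optional, Tuple


def choose_string_representation(value: str,
                                 local_values: List[str],
                                 global_values: List[str],
                                 use_local_partition: bool = True
                                 ) -> Tuple[str, Optional[int]]:
    """Prefer a local value hit, then a global value hit, else a miss."""
    if use_local_partition and value in local_values:
        return ("local-hit", local_values.index(value))
    if value in global_values:
        return ("global-hit", global_values.index(value))
    return ("miss", None)
```

A value present in both partitions is therefore always reported as a local hit unless the local partition is disabled.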

2.4.6 Restricted Character Sets

Restricted Character Sets in EXI make it possible to restrict the characters of the string datatype. The canonical representation dictates that characters from the restricted character set MUST use the corresponding n-bit Unsigned Integer. Hence, only characters that are not in the set SHALL be represented by the n-bit Unsigned Integer N followed by the Unicode code point of the character represented as an Unsigned Integer.
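As a minimal sketch of this rule, the following function maps the characters of a string to symbolic tokens instead of actual bits. It assumes that the restricted character set is passed in its defined order, that N is the number of characters in the set, and that n = ceil(log2(N + 1)) as in EXI 1.0. The token layout and the function name are hypothetical, and the length prefix of the string is omitted.

```python
from typing import List, Sequence, Tuple


def encode_restricted(value: str, charset: Sequence[str]) -> List[Tuple]:
    """Return symbolic tokens for value under a restricted character set."""
    size = len(charset)                      # N
    if size == 0:
        raise ValueError("restricted character set must not be empty")
    n = size.bit_length()                    # equals ceil(log2(N + 1)) for N >= 1
    index_of = {ch: i for i, ch in enumerate(charset)}
    tokens: List[Tuple] = []
    for ch in value:
        if ch in index_of:
            # Characters in the set MUST use their n-bit index.
            tokens.append(("nbit", n, index_of[ch]))
        else:
            # Only characters outside the set use N plus the code point.
            tokens.append(("nbit", n, size))
            tokens.append(("uint", ord(ch)))
    return tokens
```

With the set "abc" (N = 3, n = 2), the character "a" is always emitted as the 2-bit value 0; emitting it as the escape value 3 followed by code point 97 would not be canonical.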

2.4.7 Datatype Representation Map

The EXI option datatypeRepresentationMap may specify an alternate set of datatype representations for typed values in the EXI body stream. This specification does not define any canonicalization rules for alternate representations. Other specifications and/or groups making use of this feature MAY describe a canonical form.

3. Canonical EXI Applications

3.1 Canonicalization Method

EXI Canonicalization may be used as a canonicalization method algorithm in XML Signature and XML Encryption. The identifier http://www.w3.org/TR/exi-c14n hereby specifies the rules of this document.

3.2 Signature Processing Steps

The figure below describes the processing steps involved when Canonical EXI is used for signing an EXI document or a fragment and the signature value is embedded within the document. First, the EXI stream or the fragment of the EXI stream to be signed has to be transformed into a canonical form according to the requirements given in this document (see 2. EXI Canonicalization). Then, the canonical representation is used to determine the signature value based on the intended signature algorithm. At this point, the signature value can be set within the EXI document, which can then be transmitted to the recipient.

To validate the signature value, the receiver has to build the canonical EXI stream for the signed portion. Note that this step can be skipped if the receiver knows in advance that the EXI stream already fulfills the requirements of Canonical EXI. Finally, to determine the correctness of the signature value, the value the receiver determines (based on the signature algorithm) is compared with the embedded signature value provided by the sender.
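The processing steps can be sketched as follows. This is only an outline under stated assumptions: the canonicalize argument is a caller-supplied placeholder for an implementation of 2. EXI Canonicalization, and HMAC-SHA-256 from the Python standard library merely stands in for the intended signature algorithm, which this document does not prescribe.

```python
import hashlib
import hmac
from typing import Callable

Canonicalizer = Callable[[bytes], bytes]  # placeholder for EXI canonicalization


def sign(exi_body: bytes, key: bytes, canonicalize: Canonicalizer) -> bytes:
    """Sender side: canonicalize first, then compute the signature value."""
    canonical = canonicalize(exi_body)
    return hmac.new(key, canonical, hashlib.sha256).digest()


def verify(exi_body: bytes, key: bytes, signature: bytes,
           canonicalize: Canonicalizer, already_canonical: bool = False) -> bool:
    """Receiver side: rebuild the canonical form unless known to be canonical."""
    canonical = exi_body if already_canonical else canonicalize(exi_body)
    expected = hmac.new(key, canonical, hashlib.sha256).digest()
    # Constant-time comparison with the embedded signature value.
    return hmac.compare_digest(expected, signature)
```

The already_canonical flag models the case in which the receiver knows in advance that the received stream fulfills the requirements of Canonical EXI.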

Figure 3-1. Canonical EXI used in Signature

4. Resolutions

This section discusses a number of key decision points. A rationale for each decision is given and background information is provided.

4.1 No XML Relationship

This document assumes as input a sequence of EXI events. Hence, there is no strong relationship with XML nor with Canonical XML.

The working group concluded that applications such as XML Signature do not see justifiable benefit from Canonical EXI. Therefore, plain-text XML document use cases should use Canonical XML.

4.2 No EXI Header

An EXI stream is an EXI Header followed by an EXI body.

The EXI Header MUST NOT be taken into account when building the canonical EXI form. This strategy ensures that the flexibility in the EXI Header, such as the optional presence of an EXI Cookie, does not lead to different physical representations.

Therefore, the process for building a Canonical EXI stream is based on knowledge of the EXI options in use. These EXI options MUST be set or known (e.g., out-of-band) so that two parties produce the same octet stream. That said, Canonical EXI is based solely on the EXI Body stream.

4.3 No Unicode Normalization

The Unicode standard allows multiple different representations of certain characters. Thus two character sequences that have the same appearance and meaning when printed or displayed may differ in sequences of code points. EXI Canonicalization uses as its input EXI events in which Strings are represented as sequences of Unicode code points.

A canonical EXI processor MUST NOT change the code points as it is not allowed to alter any other event. However, character model normalization may become an issue when working with plain-text XML.

A Canonical EXI Examples

A.1 EXI Stream Order

Example A-1. Attribute and Namespace Declaration Sorting

EXI Stream (Input EXI Event Sequence)
SD	SE(root)	NS(www.foo.com, foo)	NS(www.bla.com, bla)	AT(c)	AT(b)	AT(a)	EE	ED
Canonical EXI Stream
SD	SE(root)	NS(www.bla.com, bla)	NS(www.foo.com, foo)	AT(a)	AT(b)	AT(c)	EE	ED

A.2 EXI Floats

The Float datatype representation can be converted to the canonical form going through the following steps. Note, implementations are free to choose any strategy as long as the constraints in 2.4.3 Float are met.

Example A-2. Example algorithm for converting float values to the canonical form

Let the float value have a decimal notation of the form <before>.<after> where before represents the value before the decimal point and after represents the value after the decimal point. The canonical representation of the mantissa and exponent SHALL be determined as follows:

1. Initialize the exponent with the value 0 (zero) and jump to step 2.
2. Examine the float value and extract the portion before and after the decimal point. If the value after the decimal point can be represented as 0 (zero) without losing precision, jump to step 4, otherwise to step 3.
3. Decrement the exponent by 1 (one) and shift the decimal point of the float value by one digit to the right. Jump back to step 2.
4. The portion before the decimal point can be safely converted to the signed mantissa value. Jump to step 5.
5. If the signed mantissa is not equal to 0 (zero), not equal to -0 (negative zero), and contains a trailing zero, jump to step 6, otherwise to step 7.
6. Increment the exponent by 1 (one) and shift the mantissa by one digit to the right. Jump back to step 5.
7. If the mantissa is equal to -0, set the mantissa value to 0 (zero). Finished.

The following examples show possible float values alongside their canonical form.

Example A-3. Canonicalized EXI Float values

Float Value     Canonical EXI Float Value
                Mantissa    Exponent
123.012300      1230123     -4
0.0             0           0
-0.0            0           0
1.0             1           0
-1230.01        -123001     -2
0.1230          123         -3
12300           123         2
12.0            12          0
120E-1          12          0
1.2E1           12          0
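The values of Example A-3 can be reproduced with the following sketch. It does not walk through the steps of Example A-2 literally; instead it uses the decimal module of the Python standard library to obtain the digits and the exponent of the lexical value and then strips trailing zeros, which is assumed to be an acceptable alternative strategy since only the constraints in 2.4.3 Float have to be met. The function name is hypothetical, and special values such as INF or NaN are rejected rather than handled.

```python
from decimal import Decimal
from typing import Tuple


def canonical_float_from_lexical(text: str) -> Tuple[int, int]:
    """Return the canonical (mantissa, exponent) for a decimal lexical value."""
    sign, digits, exponent = Decimal(text).as_tuple()
    if not isinstance(exponent, int):
        # Infinity and NaN carry a non-integer exponent marker.
        raise ValueError("special values are not covered by this sketch")
    mantissa = int("".join(str(d) for d in digits))
    if mantissa == 0:
        return (0, 0)                 # covers 0.0 as well as -0.0
    while mantissa % 10 == 0:         # remove trailing zeros
        mantissa //= 10
        exponent += 1
    return (-mantissa if sign else mantissa, exponent)
```

For example, canonical_float_from_lexical("0.1230") yields mantissa 123 and exponent -3.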

B References

Efficient XML Interchange (EXI) Format 1.0: Efficient XML Interchange (EXI) Format 1.0, John Schneider and Takuki Kamiya, Editors. World Wide Web Consortium. The latest version is available at http://www.w3.org/TR/exi/. (See http://www.w3.org/TR/2011/REC-exi-20110310/.)
XML Schema Datatypes: XML Schema Part 2: Datatypes Second Edition, P. Byron and A. Malhotra, Editors. World Wide Web Consortium, 2 May 2001, revised 28 October 2004. The latest version is available at http://www.w3.org/TR/xmlschema-2. (See http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/.)
XML Canonical: Canonical XML - Version 1.0, John Boyer, Editor. World Wide Web Consortium, W3C Recommendation 15 March 2001. The latest version is available at http://www.w3.org/TR/xml-c14n. (See http://www.w3.org/TR/xml-c14n.)
Efficient XML Interchange (EXI) Impacts: Efficient XML Interchange (EXI) Impacts, Jaakko Kangasharju, Editor. World Wide Web Consortium. The latest version is available at http://www.w3.org/TR/exi-impacts/. (See http://www.w3.org/TR/2008/WD-exi-impacts-20080903.)
Efficient XML Interchange (EXI) Best Practices: Efficient XML Interchange (EXI) Best Practices, Mike Cokus and Daniel Vogelheim, Editors. World Wide Web Consortium. The latest version is available at http://www.w3.org/TR/exi-best-practices/. (See http://www.w3.org/TR/2007/WD-exi-best-practices-20071219/.)
Efficient XML Interchange (EXI) Profile: Efficient XML Interchange (EXI) Profile, Youenn Fablet and Daniel Peintner, Editors. World Wide Web Consortium. The latest version is available at http://www.w3.org/TR/exi-profile/. (See http://www.w3.org/TR/2012/WD-exi-profile-20120731/.)
IETF RFC 2119: Key words for use in RFCs to Indicate Requirement Levels, S. Bradner, Author. Internet Engineering Task Force, June 1999. Available at http://www.ietf.org/rfc/rfc2119.txt. (See http://www.ietf.org/rfc/rfc2119.txt.)

Canonical EXI

W3C First Public Working Draft 24 September 2013
