Re: Comments: Canonical EXI -- Last Call Working Draft from John Schneider on 2015-09-30 (public-exi-comments@w3.org from September 2015)

From: John Schneider <john.schneider@agiledelta.com>
Date: Wed, 30 Sep 2015 16:56:26 -0700
To: "Peintner, Daniel (ext)" <daniel.peintner.ext@siemens.com>
Cc: "public-exi-comments@w3.org" <public-exi-comments@w3.org>
Message-Id: <374D0CF7-339F-4C99-95AE-C5C6410821EB@agiledelta.com>
Daniel,

Thank you very much for taking the time to review and address our comments on the EXI canonicalization specification. I very much appreciate your consideration. I particularly appreciate your decision to base EXI canonicalization on the XML Infoset rather than an input EXI document. I know this will take a bit of editing, but it is the right thing to do and will expand the benefits of EXI canonicalization to a broader set of XML users.

I’ve included some follow-up comments and clarifications on selected sections in-line below. I hope they are helpful and will further contribute to the quality of the EXI canonicalization standard. 

Please let me know if I can assist with further questions or clarifications.

	Thank you again!

	John

> 

> # Working group decisions
> 
>> 11. Section 4.4.1: The last sentence of this section specifies that
>> all canonical EXI processors MUST support arbitrarily large integer
>> values. This means there will be some canonical EXI documents that
>> devices without support for arbitrarily large integers cannot process.
>> Recommend you consider updating this definition so it is possible to
>> generate a canonical representation for any EXI document that any
>> device that meets the minimum EXI processing requirements can handle.
>> In particular, recommend you consider changing this definition such
>> that canonical EXI processors MUST represent all Unsigned Integer
>> values using the Unsigned Integer datatype representation when strict
>> is true. However, when strict is false canonical EXI processors must
>> represent Unsigned Integer values greater than 2147483647 using the
>> String datatype representation. This would enable devices with limited
>> capabilities to at least read, display and retransmit arbitrarily large
>> values — even if they don’t have the capability to process them.
> 
> We think that retransmitting arbitrarily large values is also doable if the device is not capable of representing them properly.

Yes, I agree. However, displaying or otherwise making use of the value on limited devices is problematic without increasing the code footprint with data structures and algorithms for representing and processing big numbers. 

> Note: a limited intermediary device can fall back to string. The only device that has to use integer encoding is the one that checks the signature and requires a canonicalized document.

Unfortunately, most limited devices occur at the edge of the network, so this wouldn’t help most use cases. It would also negate the possibility of using canonical EXI as a transmission format. Using canonical EXI as a transmission format is particularly attractive for limited devices because it eliminates the need to implement the canonicalization algorithms on the limited devices, saving valuable code footprint, memory and CPU cycles.

All that said, we spent a bit more time thinking about this feature and if you’re not willing to consider alternate solutions, we can live with it as is. This is an edge case that most devices will not encounter. When it does occur, it may be possible to handle the required translations and associated re-computation of signatures on more capable, trusted gateway nodes. I don’t like this solution as much as the one we originally proposed because it is more complex and would require gateway nodes to be implicitly trusted, which is not always possible. However, it’s a trade-off that could still keep simple devices simple.
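
For reference, the rule we originally proposed in comment 11 can be sketched in a few lines. This is only an illustration of the proposal, not of the current specification; the function name and the returned labels are hypothetical and chosen for readability.

```python
# Sketch of the rule proposed in comment 11 (NOT the current specification).
# The function name and the returned labels are hypothetical.

MAX_SMALL_UNSIGNED = 2147483647  # threshold named in comment 11 (2**31 - 1)


def choose_representation(value: int, strict: bool) -> str:
    """Return the datatype representation the proposal would select."""
    if value < 0:
        raise ValueError("Unsigned Integer values cannot be negative")
    # strict=true: always use the Unsigned Integer representation.
    # strict=false: use it only up to the threshold.
    if strict or value <= MAX_SMALL_UNSIGNED:
        return "UnsignedInteger"
    # strict=false and value above the threshold: fall back to String so
    # that limited devices can still read, display and retransmit it.
    return "String"
```

Under this rule, a device without big-number support never has to decode an Unsigned Integer larger than 2147483647 unless strict is true.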

> 
>> 12. Section 4.4.5: This section states that EXI Date-Time values
>> MUST be canonicalized according to the XML Schema dateTime canonical
>> representation. While this definition might be convenient, it is not
>> entirely appropriate for canonicalization and will lead to surprising
>> results for some. The canonical form for XML Schema dateTime values is
>> defined to make it easy to determine whether two Date-Time values refer
>> to the same instant, regardless of the timezone used. However, for many
>> applications, the Date-Time timezone is an important piece of information
>> that should be preserved. As such, it will be surprising if the digital
>> signature is not able to detect changes to this information. In addition,
>> those using canonical EXI as a transmission format will be surprised if
>> the canonical EXI format loses all their timezone information and changes
>> all Date-Time values to GMT. Recommend this section be updated to exclude
>> canonicalization of timezones in Date-Time values.
> 
> We do see the point in your explanation. Nevertheless, there are also circumstances and use cases that require timezone normalization. Let's suppose people are agreeing on a meeting. The starting time can be expressed in many forms, including with different time zones, and all forms are just fine, e.g.:
> 
> 2015-08-11T23:00:00+09:00
> 2015-08-11T16:00:00+02:00
> 2015-08-11T14:00:00Z (UTC)
> 2015-08-11T07:00:00-07:00
> 
> In fact the idea behind normalizing to UTC is to make those lexical values interchangeable in terms of signature validation.
> Also, what if an intermediary node changes the timezone when processing and re-transmitting the time? With your suggestion to exclude the timezone you cannot confidently validate the data any more.
> Your comment #3 brings up a similar use-case allowing intermediaries to use binding-models and/or type-aware processing.

Thank you for your explanation and for recognizing the validity of our point. I see your point as well. It seems that preserving timezone is important to some classes of applications, but unimportant to others. As such, some applications will want the signature to break if timezones change and some will not. I know we have customer use cases that require preservation of timezone information and I’m concerned that standardizing on one approach that does not satisfy both classes of users will cause splintering of the standard.

As such, I recommend you consider introducing a preserveTimezone switch that allows users to express what they need. The default value could be false, but having the ability to set it to true would support both classes of users. 
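
To illustrate what such a switch would change, here is a minimal sketch using the four meeting times from your example. The function and its preserve_timezone parameter are hypothetical stand-ins for the proposed option, and the sketch works on lexical dateTime strings only; it does not model the EXI Date-Time encoding itself.

```python
from datetime import datetime, timezone


def canonical_datetime(lexical: str, preserve_timezone: bool = False) -> str:
    """Hypothetical sketch of dateTime canonicalization with an opt-out."""
    # Map the "Z" suffix to an explicit offset so that older Python
    # versions of datetime.fromisoformat() can parse it.
    parsed = datetime.fromisoformat(lexical.replace("Z", "+00:00"))
    if parsed.tzinfo is None or preserve_timezone:
        # Keep the original offset (or the absence of one) untouched.
        return parsed.isoformat()
    # Default behaviour: normalize to UTC, as the current draft requires.
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


meeting_times = [
    "2015-08-11T23:00:00+09:00",
    "2015-08-11T16:00:00+02:00",
    "2015-08-11T14:00:00Z",
    "2015-08-11T07:00:00-07:00",
]
```

With the default setting, all four values collapse to the same canonical string, so a signature cannot detect a timezone change. With preserve_timezone set to true, the four values stay distinct and any change to the timezone breaks the signature.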

> 
> 
>> 1. Architecture & Design: The specification defines canonical EXI with
>> respect to an input EXI stream. This limits one’s ability to use canonical
>> EXI with traditional XML or other XML Infoset representations and creates
>> a poor architectural fit with the rest of the XML stack of technologies
>> that are defined with respect to the XML Infoset. The strict dependency
>> on an EXI input stream, the EXI options document and the EXI schemaId
>> creates intrinsic incompatibilities with XML, which does not support
>> these EXI-specific artifacts. This leads to practical implementation
>> problems, such as the inability for canonical EXI to support digital
>> signatures through XML intermediary nodes, which you identified at the
>> end of section A.1.
>> 
>> To be useful in all XML contexts and with all XML technologies, EXI
>> canonicalization must be defined with respect to the XML Infoset. We
>> recommend you update the specification to define canonical EXI with
>> respect to a given XML Infoset, a given XML Schema and a given set of
>> EXI options. The schema and EXI options may be provided any number of
>> ways, as you describe well in section C.2. As with EXI, the user should
>> be allowed to embed these in the EXI header when it is advantageous,
>> but should not be required to do so when it is not. Mandating the
>> inclusion of the EXI options and a schemaID in every message is at
>> odds with EXI’s efficiency objectives and makes it onerous to use
>> canonical EXI as a transmission format. As you point out in section
>> C.1., using canonical EXI as a transmission format can eliminate
>> the need to perform [redundant] canonicalization at the receiver —
>> further increasing efficiency. We have users that currently employ
>> canonical EXI this way and it is very advantageous to them. However,
>> requiring the EXI options and schemaId in every message would quickly
>> overwhelm the benefits of using canonical EXI as a transmission format.
> 
>> 4. Section 3: As mentioned above, making the EXI options document and
>> the EXI schemaId mandatory in every canonical EXI document is at odds
>> with the efficiency objectives of EXI. In many or perhaps even most
>> use cases that require efficiency, these can be (and are) provided
>> out of band or specified by a higher-level protocol. As such, including
>> them in every canonical EXI message introduces unnecessary overhead
>> and provides no value since all cooperating nodes already have this
>> information.
> 
>> Furthermore, forcing the inclusion of a schemaId in every message does
>> not actually solve the problem of ensuring the sender and receiver use
>> the same schemas. The EXI schemaId is not guaranteed to be unique and
>> it would be easy for a sender and receiver to end up using the same schemaId
>> for two different versions of the same schema or even two completely
>> different schemas (breaking any signature that depends on schemaId).
>> There are more reliable ways to ensure senders and receivers are using
>> the same schemas for encoding/decoding EXI documents. This problem is
>> not unique to EXI canonicalization and the EXI canonicalization
>> specification should not force a specific, sub-optimal solution on EXI
>> users. As with EXI, users should be allowed to use the EXI options
>> document and schemaId to address this issue, but they should not be
>> forced to do so if they have a better, more efficient solution that
>> is already working.
> 
>> 5. Section 4: As stated above, to be useful in all XML contexts and with
>> all XML technologies, EXI canonicalization must be defined with respect to
>> a given XML Infoset rather than a given EXI stream. The semantics of the
>> specification should be specified with respect to a given XML Infoset,
>> a given XML Schema and a given set of EXI options (independent of how
>> these are acquired).
> 
>> 15. Section A.1: The second paragraph states that Canonical EXI deals with
>> EXI documents. As alluded to in the third paragraph of this section, this is
>> not strictly true. Canonical EXI should be usable with and provide benefits
>> to XML, EXI or any other XML Infoset representation. However, as stated
>> earlier in these comments, canonical EXI must be defined with respect to
>> the XML Infoset rather than an EXI input document to achieve this. Defining
>> EXI canonicalization with respect to only EXI is limiting and fails to
>> realize the full potential of the technology.
>> 
>> The last sentence in this section also states that it is not possible to
>> use XML on intermediary nodes when Canonical EXI has been used for signing.
>> This is a limitation of the current specification and not of canonical EXI
>> in general. If you define canonical EXI with regard to a given XML Infoset,
>> XML Schema and given set of EXI options and ensure all EXI nodes use the
>> same XML Schema and EXI options, this limitation goes away. As stated earlier,
>> there are more reliable and efficient ways to ensure cooperating nodes use
>> the same XML Schemas and EXI options than including the EXI options document
>> and schemaId in every message. And these methods do not fail when transcoding
>> to XML because they do not depend on the XML/EXI message for the schema and
>> EXI options. The reason the current specification fails in this regard is
>> because it depends strictly on the EXI document to carry the options and
>> schemaId and transcoding to XML loses this information. As discussed earlier,
>> this is a design flaw that should be fixed.
> 
> 
>> 16. Section C.2: It is interesting and encouraging to see a good description
>> of best practices for sharing EXI options without the EXI options document.
>> This is the flexibility the specification should allow rather than mandating
>> that the EXI options and schemaId be specified inside every canonical EXI stream.
> 
> We agree with your comment that Canonical EXI should be based on XML Infoset and changed the specification accordingly. Thanks!

Excellent! This is great news. Thanks again for taking care of this. 

> 
> However, we disagree that the EXI options document and its schemaId creates incompatibilities or problems.

It looks like I didn’t successfully express my concerns in this area. Let me try to enumerate my points more clearly. Here are the reasons I think we should retain the current flexibility of the EXI format and not force all users to include the EXI options and schema ID in their Canonical EXI streams. 

1. Efficiency. For those that use Canonical EXI as a transmission format, inserting the EXI options and schema ID in every Canonical EXI stream increases the size of every transmission and can significantly reduce efficiency. This is especially true for use cases that involve a lot of smaller messages, large schema IDs and/or DTRs. Using Canonical EXI as a transmission format is important for use cases that require fast processing speeds because it eliminates the need for a redundant canonicalization step on the receiver (as noted in section C.1). It is also important for small device use cases because it eliminates the need to spend critical code footprint, memory and CPU resources on canonicalization. Canonical EXI will be less attractive for some of these use cases and unusable for others if using it as a transmission format comes with a significant performance penalty. Canonical EXI should support use cases that need both transmission efficiency and the ability to use it as a transmission format.

2. Redundancy. For those that already have a more efficient and/or sophisticated way to ensure the signer and validator use the same EXI options and schemas, forcing them to include the EXI options and schema ID in every Canonical EXI stream is completely redundant and provides no benefits. For example, schema negotiation can be used to identify the best common schemas between encoders, decoders, signers and validators. This is generally a superior solution to the EXI header because you never encode/sign with one schema only to later find out the decoder/validator can’t handle it. You can determine what schemas to use in advance and you can avoid adding schema ID overhead to every message. And, of course, if these systems also want the efficiencies associated with using Canonical EXI as a transmission format, inserting the EXI options and schema ID will introduce an unnecessary penalty in addition to providing no benefit.

3. Security. The stated reason for requiring the EXI options and schema ID in every Canonical EXI stream is for added security. However, providing the schema ID in the header is a very weak form of security. Those that are serious about security will not want to depend on the EXI header. They will sign the schemas. This is a more secure way to ensure the decoder/verifier is using exactly the same set of schemas as the encoder/signer. And for those using Canonical EXI as a transmission format, it is far more efficient to distribute the signature once with the schema than to transmit it with every single message. So again, for systems that need strong security, adding the EXI header is completely redundant and adds no value.

4. Architecture. The underlying cause for problems 2 and 3 above is a poor separation of concerns. All EXI processors need some mechanism to ensure they use the same schemas and options for encoding and decoding. Because this problem is not specific to EXI Canonicalization, creating a solution that is specific to or forced by EXI Canonicalization creates a poor separation of concerns. This poor separation of concerns leads to problems, such as duplication of the same functionality at different layers of the stack. But it also leads to added complexity, user confusion and higher testing/support costs. A clean, flexible architecture with a clear and consistent separation of concerns is always better. I believe the issues you are trying to address are very important, but EXI Canonicalization is not the right place to address them.
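
As one illustration of the Redundancy and Security points, here is a minimal sketch of an out-of-band check that both sides hold exactly the same schemas. It is an assumption for illustration only: the function name is hypothetical, and a plain SHA-256 fingerprint stands in for a real digital signature over the schemas, which a deployment that needs strong security would use instead.

```python
import hashlib
from typing import Iterable


def schema_fingerprint(schema_documents: Iterable[bytes]) -> str:
    """Return one hex digest covering a whole set of schema documents."""
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order the schemas are listed.
    for document in sorted(schema_documents):
        # Prefix each document with its length so that document boundaries
        # are unambiguous in the hashed byte stream.
        digest.update(len(document).to_bytes(8, "big"))
        digest.update(document)
    return digest.hexdigest()
```

The signer and verifier would compare (or sign and verify) this value once, for example during schema negotiation, rather than carrying a schema ID in every message. Unlike a schema ID, the fingerprint changes whenever any schema byte changes.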

So, forcing the inclusion of the EXI options and schema ID in every Canonical EXI stream definitely creates unnecessary inefficiencies for those that want to use Canonical EXI as a transmission format. For those that already have more sophisticated, more efficient and/or more secure ways to ensure the same schemas and options are used for signing and validating, it is wholly redundant and unnecessary. 

I agree there are definitely some use cases that will want to include the EXI options and schema ID in all their Canonical EXI streams. However, there are definitely other use cases that will not want this. Therefore, I recommend Canonical EXI provide the same flexibility as EXI. Retain EXI's capability to include the EXI options and schema ID in the header for those that need it, but do not force this for those that don’t. If you don’t support both use cases, you will certainly splinter the standard. I know we have current users that will not be able to use Canonical EXI if you force them to include the EXI options and schema ID in every Canonical EXI stream. 

> Adding the EXI options document is only required IF, e.g., validation takes place. Applications may still exchange this information out-of-band (appendix C.2 sketches various possibilities).

Right, and if they already have a reliable out-of-band way to guarantee the signer and verifier used the same EXI options and schema, including these in the Canonical EXI stream provides no benefits and can add overhead. We should not force it on those that do not need it. 

> Moreover, mandating the inclusion of the EXI options and a schemaID provides additional security while introducing minimal overhead, especially in circumstances that always use the same set of options.

As mentioned above, it provides only weak security. Those requiring stronger security will have no need for it and should not be forced to use it.

> That said, there is no "on-the-wire" penalty and the added processing cost is really limited.
> 

That is not correct. There can be a significant on-the-wire penalty for those that want the efficiencies associated with using Canonical EXI as their wire format. This is an important use case we should support without unnecessary performance penalties.
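
The arithmetic behind this is simple, and a small sketch makes it concrete. The function name is hypothetical and any byte counts passed to it are illustrative inputs, not measurements from a real deployment.

```python
def header_overhead_fraction(header_bytes: int, body_bytes: int) -> float:
    """Fraction of a transmitted message taken up by the EXI header."""
    if header_bytes < 0 or body_bytes < 0:
        raise ValueError("byte counts cannot be negative")
    total = header_bytes + body_bytes
    if total == 0:
        return 0.0
    return header_bytes / total
```

For a fixed header size, the fraction grows as the message body shrinks: a header as large as the body accounts for half of every transmission, which is why use cases with many small messages and large schema IDs are affected most.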

> 
> ________________________________
> From: John Schneider [john.schneider@agiledelta.com]
> Sent: Thursday, 16 July 2015 06:39
> To: public-exi-comments@w3.org
> Subject: Comments: Canonical EXI -- Last Call Working Draft
> 
> Dear EXI Friends and Colleagues,
> 
> Thank you for the opportunity to review the Last Call Working Draft of the Canonical EXI specification dated 21 May 2015. It is rewarding to see the work we started together long ago nearing completion. We’ve completed a comprehensive review of the specification and have provided our comments below. We have also implemented canonical EXI in selected Efficient XML products, deployed it with a set of users and incorporated their feedback and experience into our comments below.
> 
> Our comments are enumerated to facilitate discussion. I hope they are helpful and will contribute to the creation of a high-quality standard that will address critical needs in the EXI and XML Security domains.
> 
> Please let me know if you have any questions or if I can help to clarify any of the comments.
> 
> All the best!
> 
> John
> 
> AgileDelta, Inc.
> john.schneider@agiledelta.com
> http://www.agiledelta.com
> 
> ——— Specific Comments ———
> 
> 1. Architecture & Design: The specification defines canonical EXI with respect to an input EXI stream. This limits one’s ability to use canonical EXI with traditional XML or other XML Infoset representations and creates a poor architectural fit with the rest of the XML stack of technologies that are defined with respect to the XML Infoset. The strict dependency on an EXI input stream, the EXI options document and the EXI schemaId creates intrinsic incompatibilities with XML, which does not support these EXI-specific artifacts. This leads to practical implementation problems, such as the inability for canonical EXI to support digital signatures through XML intermediary nodes, which you identified at the end of section A.1.
> 
> To be useful in all XML contexts and with all XML technologies, EXI canonicalization must be defined with respect to the XML Infoset. We recommend you update the specification to define canonical EXI with respect to a given XML Infoset, a given XML Schema and a given set of EXI options. The schema and EXI options may be provided any number of ways, as you describe well in section C.2. As with EXI, the user should be allowed to embed these in the EXI header when it is advantageous, but should not be required to do so when it is not. Mandating the inclusion of the EXI options and a schemaID in every message is at odds with EXI’s efficiency objectives and makes it onerous to use canonical EXI as a transmission format. As you point out in section C.1., using canonical EXI as a transmission format can eliminate the need to perform [redundant] canonicalization at the receiver — further increasing efficiency. We have users that currently employ canonical EXI this way and it is very advantageous to them. However, requiring the EXI options and schemaId in every message would quickly overwhelm the benefits of using canonical EXI as a transmission format.
> 
> 2. Section 1, last sentence: Change “… whether two documents are identical …” to “… whether two documents are equivalent …”
> 
> 3. Section 1.2: We agree EXI canonicalization is important for EXI environments that cannot afford to revert to traditional XML canonicalization methods. In addition, we recommend you mention some of the ways EXI canonicalization is useful for traditional XML users. For example, EXI canonicalization provides the first type-aware canonicalization scheme that can discern that +1, 1, 1.0, 1e0 and 1E0 are equivalent representations of the same floating-point value. This allows intermediaries to use binding-models and/or type-aware processing without breaking signatures. In addition, with a fast EXI processor, EXI canonicalization can be much faster than traditional XML canonicalization and can help cure some of the well-known XML security bottlenecks.
> 
> 4. Section 3: As mentioned above, making the EXI options document and the EXI schemaId mandatory in every canonical EXI document is at odds with the efficiency objectives of EXI. In many or perhaps even most use cases that require efficiency, these can be (and are) provided out of band or specified by a higher-level protocol. As such, including them in every canonical EXI message introduces unnecessary overhead and provides no value since all cooperating nodes already have this information.
> 
> Furthermore, forcing the inclusion of a schemaId in every message does not actually solve the problem of ensuring the sender and receiver use the same schemas. The EXI schemaId is not guaranteed to be unique and it would be easy for a sender and receiver to end up using the same schemaId for two different versions of the same schema or even two completely different schemas (breaking any signature that depends on schemaId). There are more reliable ways to ensure senders and receivers are using the same schemas for encoding/decoding EXI documents. This problem is not unique to EXI canonicalization and the EXI canonicalization specification should not force a specific, sub-optimal solution on EXI users. As with EXI, users should be allowed to use the EXI options document and schemaId to address this issue, but they should not be forced to do so if they have a better, more efficient solution that is already working.
> 
> 5. Section 4: As stated above, to be useful in all XML contexts and with all XML technologies, EXI canonicalization must be defined with respect to a given XML Infoset rather than a given EXI stream. The semantics of the specification should be specified with respect to a given XML Infoset, a given XML Schema and a given set of EXI options (independent of how these are acquired).
> 
> 6. Section 4.2.1: Change “Prune productions” to “Select productions” in heading. Pruning productions will remove them from the grammars, changing the event codes of the following events and causing incompatibility with the EXI 1.0 specification. I expect the specification intends to specify which productions must be selected rather than removing productions from the grammars.
> 
> 7. Section 4.2.2: Change “Prune productions” to “Select productions” in heading. The word “prune” should also be replaced in the body of this section. See above rationale.
> 
> 8. Section 4.2.2: The meaning of this section is not entirely clear. Presumably, it is not possible with the current EXI specification to use a production that is not capable of representing the content value (by definition). Are there circumstances that this section is attempting to prohibit that are currently allowed by the EXI 1.0 specification?
> 
> 9. Section 4.2.3: Change heading “Use the event with the most accurate event” to “Use the event that matches most precisely” or something similar. Current wording is unclear.
> 
> 10. Section 4.4: The last sentence of this section indicates that Canonical EXI processors SHOULD be able to convert an untyped value to each datatype representation defined in EXI 1.0. This special language would not be required if EXI canonicalization were defined more generally with respect to the XML Infoset rather than an input EXI stream.
> 
> 11. Section 4.4.1: The last sentence of this section specifies that all canonical EXI processors MUST support arbitrarily large integer values. This means there will be some canonical EXI documents that devices without support for arbitrarily large integers cannot process. Recommend you consider updating this definition so it is possible to generate a canonical representation for any EXI document that any device that meets the minimum EXI processing requirements can handle. In particular, recommend you consider changing this definition such that canonical EXI processors MUST represent all Unsigned Integer values using the Unsigned Integer datatype representation when strict is true. However, when strict is false canonical EXI processors must represent Unsigned Integer values greater than 2147483647 using the String datatype representation. This would enable devices with limited capabilities to at least read, display and retransmit arbitrarily large values — even if they don’t have the capability to process them.
> 
> 12. Section 4.4.5: This section states that EXI Date-Time values MUST be canonicalized according to the XML Schema dateTime canonical representation. While this definition might be convenient, it is not entirely appropriate for canonicalization and will lead to surprising results for some. The canonical form for XML Schema dateTime values is defined to make it easy to determine whether two Date-Time values refer to the same instant, regardless of the timezone used. However, for many applications, the Date-Time timezone is an important piece of information that should be preserved. As such, it will be surprising if the digital signature is not able to detect changes to this information. In addition, those using canonical EXI as a transmission format will be surprised if the canonical EXI format loses all their timezone information and changes all Date-Time values to GMT. Recommend this section be updated to exclude canonicalization of timezones in Date-Time values.
> 
> 13. Section 4.4.6: The W3C is standardizing on Unicode Normalization Form C and recommending all web data be stored and transmitted in this form. It may be useful to state this and reference the relevant W3C specification here: http://www.w3.org/TR/charmod-norm/.
> 
> 14. Section 4.4.6: The last sentence in the second paragraph states that EXI processors must first try to represent the string value as a local hit and when this is not successful as a global hit. It might be useful to clarify that one of the reasons the attempt to represent the string value as a local hit may fail is because the string has already been used as a local hit previously. EXI supports only one local table hit per value.
> 
> 15. Section A.1: The second paragraph states that Canonical EXI deals with EXI documents. As alluded to in the third paragraph of this section, this is not strictly true. Canonical EXI should be usable with and provide benefits to XML, EXI or any other XML Infoset representation. However, as stated earlier in these comments, canonical EXI must be defined with respect to the XML Infoset rather than an EXI input document to achieve this. Defining EXI canonicalization with respect to only EXI is limiting and fails to realize the full potential of the technology.
> 
> The last sentence in this section also states that it is not possible to use XML on intermediary nodes when Canonical EXI has been used for signing. This is a limitation of the current specification and not of canonical EXI in general. If you define canonical EXI with regard to a given XML Infoset, XML Schema and given set of EXI options and ensure all EXI nodes use the same XML Schema and EXI options, this limitation goes away. As stated earlier, there are more reliable and efficient ways to ensure cooperating nodes use the same XML Schemas and EXI options than including the EXI options document and schemaId in every message. And these methods do not fail when transcoding to XML because they do not depend on the XML/EXI message for the schema and EXI options. The reason the current specification fails in this regard is because it depends strictly on the EXI document to carry the options and schemaId and transcoding to XML loses this information. As discussed earlier, this is a design flaw that should be fixed.
> 
> 16. Section C.2: It is interesting and encouraging to see a good description of best practices for sharing EXI options without the EXI options document. This is the flexibility the specification should allow rather than mandating that the EXI options and schemaId be specified inside every canonical EXI stream.
> 

AgileDelta, Inc.
john.schneider@agiledelta.com
http://www.agiledelta.com
w: 425-644-7122
m: 425-503-3403
f: 425-644-7126
Received on Wednesday, 30 September 2015 23:57:02 UTC