Issue on EXI value validation

Dear all,

While reading the EXI specification, I came across the following issue (two related sub-issues are also described below).

1.  Schema validation of XML values

As per the current specification, when a schema-informed grammar is in use, the encoder needs to check the validity of all XML values in order to select the correct production (section 8.5.4.4.1).
AIUI, validity is computed according to the schema in use.
This places a real burden on EXI implementations, mainly in terms of processing efficiency but also in terms of compression:

1)      The EXI encoder needs to keep all simple type information present in the schema (increased in-memory schema representation)

2)      The EXI encoder needs to implement XSD 1.0 Part 2 validation, notably regexp matching (increased code footprint)

3)      The EXI encoder needs to run Schema Part 2 validation on all XML values (processing penalty, regexp evaluation for instance; see the sketch after this list)

4)      Some schema-invalid values could still be represented correctly by the built-in EXI typed codec, yet have to take the string production (compression penalty)
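
To make the per-value work concrete, here is a rough sketch (my own Java code, with made-up facet values; java.util.regex stands in for a real XSD Part 2 regexp engine, which differs in several details) of the kind of check the encoder has to run on every value before it can pick a production:

import java.util.regex.Pattern;

public class FacetValidationSketch {

    // Schema-derived facets the encoder has to keep in memory.
    // The facets passed in below are made up for the example.
    static boolean isSchemaValid(String value, Integer lengthFacet, Pattern patternFacet) {
        if (lengthFacet != null && value.length() != lengthFacet) {
            return false;                                    // xs:length facet
        }
        if (patternFacet != null && !patternFacet.matcher(value).matches()) {
            return false;                                    // xs:pattern facet
        }
        return true;   // real validation covers many more facets and types
    }

    public static void main(String[] args) {
        // 'abc' against a length facet of 4: schema-invalid production.
        System.out.println(isSchemaValid("abc", 4, null));                          // false
        // 'ofo' against the pattern foo|oof: schema-invalid production.
        System.out.println(isSchemaValid("ofo", null, Pattern.compile("foo|oof"))); // false
    }
}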

Some examples that illustrate these issues:

-          A string 'abc' would be encoded using a schema-invalid production because its length is 3 and its simple type definition has a length facet of 4.

-          The float '1.0' would benefit from being encoded as a float even though its simple type restricts it to the range [0,1.0[.

-          The integer '19' would be correctly encoded using a 5-bit integer encoding even though its simple type states that the range is [0,18] (see the sketch after this list).

-          A string 'ofo' would benefit from being encoded according to the restricted character set derived from the regexp {foo|oof}, although it is an invalid value.
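
To work out the 5-bit case: if my reading of the bounded-range integer encoding is right, the width is n = ceil(log2(max - min + 1)) bits, and the small sketch below (my own code, not spec text) shows why 19 still fits in the codec for [0,18]:

public class BoundedIntegerSketch {

    // n-bit unsigned integer width for a bounded range [min, max]:
    // the smallest n such that 2^n >= (max - min + 1).
    static int bitsForRange(long min, long max) {
        long distinct = max - min + 1;
        int n = 0;
        while ((1L << n) < distinct) {
            n++;
        }
        return n;
    }

    // A value outside the schema range is still representable by the same
    // n-bit codec as long as (value - min) fits in n bits.
    static boolean fitsInCodec(long value, long min, long max) {
        long offset = value - min;
        return offset >= 0 && offset < (1L << bitsForRange(min, max));
    }

    public static void main(String[] args) {
        System.out.println(bitsForRange(0, 18));    // 5 bits for [0,18]
        System.out.println(fitsInCodec(19, 0, 18)); // true: 19 < 2^5 = 32
    }
}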

I understand that schema validation is well defined in the Schema Part 2 specification and already widely deployed, but the purpose of EXI is to achieve very good compression, not validation.
Would replacing the MUST statements with SHOULD statements (stating that valid values SHOULD be encoded with the schema-valid productions and invalid values SHOULD be encoded with the schema-invalid productions) be sufficient?
Or maybe there is a simple way to redefine the validation criterion in terms of whether the specific codec can actually represent a given XML value (schema-valid production) or not (schema-invalid production)?
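
As an illustration of what I mean, a "representability" check for the float case could be as simple as the hypothetical helper below (my own sketch, ignoring INF/NaN lexical forms and other details), instead of full facet validation:

import java.math.BigDecimal;

public class RepresentabilitySketch {

    enum Production { TYPED, STRING }

    // Hypothetical criterion: use the typed production whenever the lexical
    // value parses as a float, whatever the range or enumeration facets say.
    static Production chooseFloatProduction(String lexical) {
        try {
            new BigDecimal(lexical.trim());  // representable by the float codec
            return Production.TYPED;
        } catch (NumberFormatException e) {
            return Production.STRING;        // fall back to the string codec
        }
    }

    public static void main(String[] args) {
        // '1.0' violates a range facet of [0,1.0[ but is still a float,
        // so it would take the typed production.
        System.out.println(chooseFloatProduction("1.0"));  // TYPED
        System.out.println(chooseFloatProduction("abc"));  // STRING
    }
}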

2. Restricted charset behavior

Section 7.1.10.1 states that string characters not present in the restricted charset may be encoded using a specific technique.
According to section 8.5.4.4.1, restricted charset encoding will be limited to valid values.
Schema-valid values will contain only characters in the restricted charset, plus whitespace.
I am therefore wondering whether the ability to encode characters not in the restricted charset is restricted to whitespace only, or whether its purpose is broader.
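
For reference, my current reading of that "specific technique" is roughly the cost model sketched below (my own simplification, not spec text): in-set characters take n = ceil(log2(N + 1)) bits for an N-character set, and out-of-set characters take the n-bit escape plus their code point (assumed one octet here):

import java.util.Set;

public class RestrictedCharsetSketch {

    static int bits(int distinctValues) {
        int n = 0;
        while ((1 << n) < distinctValues) {
            n++;
        }
        return n;
    }

    // Rough size estimate: n bits per character, plus ~8 bits per escaped
    // code point for characters outside the restricted set.
    static int estimateBits(String value, Set<Character> restrictedSet) {
        int n = bits(restrictedSet.size() + 1);  // +1 for the escape code
        int total = 0;
        for (char c : value.toCharArray()) {
            total += n;
            if (!restrictedSet.contains(c)) {
                total += 8;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Set<Character> fromPattern = Set.of('f', 'o');         // derived from foo|oof
        System.out.println(estimateBits("ofo", fromPattern));  // 6 bits, all in set
        System.out.println(estimateBits("xyz", fromPattern));  // 30 bits, all escaped
    }
}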

3. Automatic selection of productions

The use of a restricted charset may be useful when compressing not only valid values but also invalid values.
But compression may suffer for invalid values that have few or no characters in the charset (and we cannot always change the schema to improve the situation).
While it can be difficult to draw a precise line in the specification, the best decision can always be made by the encoder, which knows both the string to encode and the restricted charset.
Would it be sensible to let the encoder decide which production to use: the one that uses the built-in typed codec or the one that uses the generic string codec?

This ability would also leave the door open for various optimization tricks. For instance, if the same float value ("0" typically) occurs 100 times in a document, it may be more compact to encode it with the string codec and then benefit from string table indexing than to encode the same value 100 times with the typed codec.
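
Here is the kind of back-of-the-envelope comparison I have in mind (all per-occurrence sizes below are rough assumptions of mine, not numbers from the spec):

public class RepeatedValueSketch {

    public static void main(String[] args) {
        int occurrences = 100;

        // Assumption: the typed float codec costs about 2 octets for "0"
        // (mantissa + exponent) at every occurrence.
        int typedBits = occurrences * 2 * 8;

        // Assumption: with the string codec, the first occurrence costs about
        // 2 octets (length marker + the character) and every later occurrence
        // is a string table hit of about one octet (hit marker + compact index).
        int stringBits = 2 * 8 + (occurrences - 1) * 8;

        System.out.println("typed codec : " + typedBits + " bits");   // 1600
        System.out.println("string codec: " + stringBits + " bits");  //  808
    }
}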

Although these use cases may sound somewhat marginal, adding this flexibility would hurt neither interoperability nor the decoder's footprint.
What do you think?

Regards,
                youenn

Received on Monday, 24 August 2009 15:07:46 UTC