There are 8 comments (sorted by their types, and the section they are about).
Section 3 states that:
"The built-in EXI grammars accept any XML document or fragment and may be augmented with productions derived from XML Schemas [XML Schema Structures]<http://www.w3.org/TR/exi/#schema1> [XML Schema Datatypes]<http://www.w3.org/TR/exi/#schema2>, RELAX NG schemas [ISO/IEC 19757-2:2003]<http://www.w3.org/TR/exi/#relaxng>, DTDs [XML 1.0]<http://www.w3.org/TR/exi/#XML10> [XML 1.1]<http://www.w3.org/TR/exi/#XML11> or other sources of information";
Section 5.4 states that :
"Section 8.5 Schema-informed Grammars<http://www.w3.org/TR/exi/#informedGrammars> describes the system to derive schema-informed grammars from XML Schemas."
Section 8.5 states that:
"This section describes the schema-informed grammars used by EXI when schema information is available to describe the contents of the EXI stream<http://www.w3.org/TR/exi/#key-existream>. Schema information used for processing an EXI stream is either indicated by the header option schemaID<http://www.w3.org/TR/exi/#key-schemaIDOption>, or communicated out-of-band in the absence of schemaID<http://www.w3.org/TR/exi/#key-schemaIDOption>. Schema-informed grammars are independent of any particular schema language and can be derived from XML Schemas [XML Schema Structures]<http://www.w3.org/TR/exi/#schema1> [XML Schema Datatypes]<http://www.w3.org/TR/exi/#schema2>, RELAX NG schemas [ISO/IEC 19757-2:2003]<http://www.w3.org/TR/exi/#relaxng>, DTDs [XML 1.0]<http://www.w3.org/TR/exi/#XML10> [XML 1.1]<http://www.w3.org/TR/exi/#XML11> or other schema languages for describing what is likely to occur in an EXI stream. "
The exact meaning of these three statements is somewhat unclear when they are read together.
In particular, it is easy to infer from the start of section 8.5 that this section defines a mapping from several languages (XSD, RNG, DTD...), which is not the case since the mapping is restricted to XSD only. In contrast, section 5.4 seems to imply that only XSD is currently supported (?).
It would be nice if, at a high level, the specification clearly stated the following two points:
- This specification defines schema-informed grammars that are schema-neutral.
- This specification defines one mapping to schema-informed grammars, the input being XSD.
Of course, the specification is free to state that other mappings can be done.
The additional related comment I have is that section 8.5 defines both what schema-informed grammars are and how to generate them from XSD.
The definition of the schema-informed grammars is tightly linked in the spec to XSD terms and concepts, GED for instance.
This is perfectly fine with me. However, to create another mapping, one would need to rewrite something similar to the whole of section 8.5.
Therefore the sentence quoted from section 8.5 could be clarified in that respect.
This is a mail about the @xsi:type feature in the EXI specification.
These thoughts are based on @xsi:type implementation feedback, which is what the CR period is all about.
Currently, @xsi:type is used as a dynamic typing mechanism that can improve compression at the EXI level.
This is a good idea. I would also point out that other dynamic typing mechanisms, e.g. typed APIs, could also be very useful for compressing XML content.
To illustrate the issue we have with the current @xsi:type behavior, I will take the XML signature schema as an example.
Currently we have the following XML signature definition:
<element name="DSAKeyValue" type="ds:DSAKeyValueType"/>
<element name="P" type="ds:CryptoBinary"/>
<element name="Q" type="ds:CryptoBinary"/>
Basically this means that elements P, Q and some others have base64Binary content (through CryptoBinary simple type definition).
For some applications, it will be especially important that these element contents be encoded using the base64Binary EXI encoding, since this is the bulk of the XML data.
An existing application that wants to ensure that these elements are correctly compressed using the base64Binary EXI encoding has some options:
1) Verify the EXI setup
a. Ensure that schema mode is in effect and that at least the aforementioned elements (P, Q...) have a schema-informed grammar associated with them
i. This fine-grained check is not practical; in practice it typically reduces to: does the EXI processor have the full XML signature schema or not?
b. But the application may not have a choice in and/or knowledge of the schema
2) Try to put @xsi:type within the produced XML documents so that EXI encoders will always encode data using the base64 codec
a. The application can set @xsi:type to xs:base64Binary
This actually works well at the EXI level, since base64Binary is available in all EXI processors.
Unfortunately, the resulting document is not valid, since the CryptoBinary type derives from base64Binary and not the reverse.
This may cause issues within applications. This simple solution is therefore unavailable :(
b. The only valid solution is the following:
But this requires the use/sharing of the CryptoBinary grammar, and we are back to case 1.
Even worse, @xsi:type may be useless in these cases:
- full schema is in use: this will be already encoded efficiently without @xsi:type
- no schema is in use: no grammar is retrieved from @xsi:type.
Note that this issue would not happen if we could modify the schema.
The definition of CryptoBinary has its own benefits (editorial, DTR usefulness, reuse, semantics...) and, in the end, it is the flexibility of the EXI technology that we should count on.
From this example, the current link between @xsi:type and the grammar selection seems too tight.
One possibility would be to update the @xsi:type production behavior to be slightly more generic:
- This updated @xsi:type production enables the grammar selection
o Using the existing QName mechanism defined by the specification
- This updated @xsi:type production does not automatically carry any infoset implication
o Somewhat similar to SC productions, which do not have any infoset implication
- This updated @xsi:type production may modify the infoset
o for instance by encoding a boolean that states whether @xsi:type is included in the infoset or not
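As a rough illustration of the proposal above, a decoder could treat the updated production as a grammar selector plus a flag. This is only a sketch of one possible reading: `TypeEvent`, `in_infoset` and `decode_type_event` are hypothetical names, and nothing here is defined by the EXI specification.

```python
from dataclasses import dataclass

XSD = "http://www.w3.org/2001/XMLSchema"


@dataclass(frozen=True)
class TypeEvent:
    """Hypothetical decoded form of the proposed @xsi:type production."""
    type_qname: tuple  # (namespace URI, local name)
    in_infoset: bool   # proposed boolean: is @xsi:type part of the infoset?


def decode_type_event(event, type_grammars, default_grammar):
    """Select the grammar from the QName; report the attribute only if flagged."""
    grammar = type_grammars.get(event.type_qname, default_grammar)
    attributes = []
    if event.in_infoset:
        # Only in this case does the application see an xsi:type attribute.
        attributes.append(("xsi:type", event.type_qname))
    return grammar, attributes


# Assumed grammar table: only the built-in base64Binary type is known.
grammars = {(XSD, "base64Binary"): "base64Binary-grammar"}

hidden = TypeEvent((XSD, "base64Binary"), in_infoset=False)
visible = TypeEvent((XSD, "base64Binary"), in_infoset=True)
```

With `in_infoset=False` the encoder gets the base64Binary codec without the decoded document carrying an invalid xsi:type, which is the case discussed for the P and Q elements.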
This modification has several benefits:
- This is a very small modification, so current EXI implementations and the current spec could be upgraded very easily
o Added complexity on EXI decoders is minimal and added complexity on EXI encoders can also be minimal
- Dynamic typing becomes even more usable
o The previous issue is solved
o EXI encoders could actually use value typing information given by XML typed APIs
o EXI encoders can have various ways to select the best compression-wise grammar
§ EXI decoders will just follow the decision
- This gives additional compression and efficiency improvements for some common cases
I do not see any real drawback in terms of complexity or compression for general use cases since a boolean does not cost a lot compared to the actual encoding cost of a @xsi:type.
In addition, applications that want to achieve the best compression in their environments should probably, as a first step, remove any non-necessary @xsi:type.
This is a mail about the @xsi:nil feature in EXI.
These thoughts are based on @xsi:nil implementation feedback, which is exactly what the CR period is all about.
Currently, similarly to @xsi:type, @xsi:nil may dynamically modify the grammar in use.
The EXI technology strives to get a good balance between simplicity and efficiency.
A sensible approach is to favor efficiency for usual cases and simplicity for the other cases.
For @xsi:nil, we think simplicity is better since this feature is rarely used in today's XML documents, at least in our environments.
1) The actual compression gain of the @xsi:nil specific behavior is small
2) The cost within the specification is not small
3) The cost (code size at least, possibly runtime speed also) within EXI codec runtimes is real
Our main issue with that feature is that it requires specific code not only at the schema mapping level but also at codec runtime level.
We would prefer this feature to be handled at the schema mapping level only.
It can easily be done by:
- adding @xsi:nil and EE productions wherever needed in strict mode
- removing any specific @xsi:nil dynamic grammar modification behavior and compression mode specific rule (codec runtime level).
The downside is a small compression loss, but I do not think this is a real issue:
- Compression loss would only happen when schema writers use nillable="true", which is not very common
o In those cases, the compression loss is not that high
§ At most, 1 bit for each instance of nillable="true" elements (due to one additional EE production).
§ This would happen only in strict/bit-packed/no-compression mode
- For an element with @xsi:nil="true", the compression loss is slightly higher
o Additional encoding of an EE production
o Additional SE entries may make the index of the last attribute productions longer
§ It depends on the exact schema/XML instance
- Schema writers are able to predict the cost of adding nillable="true" to their schemas
o If worried about EXI compaction results, they may well find more compact alternatives for their schemas (minOccurs="0" e.g.).
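The "at most 1 bit" estimate above can be checked with a little arithmetic. The sketch below assumes that, in bit-packed mode, an event code part that distinguishes m productions is written as an unsigned integer of ceil(log2 m) bits; `event_code_bits` and `extra_bits` are hypothetical helper names.

```python
def event_code_bits(production_count: int) -> int:
    """Bits for one event code part over `production_count` alternatives.

    Computes ceil(log2(m)) with integer arithmetic; a single production
    needs no bits at all.
    """
    if production_count < 1:
        raise ValueError("a grammar state needs at least one production")
    return (production_count - 1).bit_length()


def extra_bits(before: int, added: int = 1) -> int:
    """Growth of the code width when `added` productions join a state."""
    return event_code_bits(before + added) - event_code_bits(before)


# Cost of one extra EE production for states with 1..64 existing productions.
costs = [extra_bits(m) for m in range(1, 65)]
```

The cost is 1 bit only when the existing production count is a power of two (1, 2, 4, ...), and 0 otherwise, so a single extra production never costs more than 1 bit per event.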
Based on internal feedback, I would like to make the following observations on the current EXI specification.
Note that this is not a request for change of the EXI specification.
I think however that this may be of some use/interest for the community.
Currently, all attributes of an XML element are stored by EXI encoders before being actually written.
This is a different behavior from text XML writers that can write attributes as soon as applications provide them.
In some environments, this attribute storage behavior has a real processing cost that does not appear with traditional StAX-like text XML writers.
There are two main technical reasons for storing attributes:
- In schema mode, it is better to give specific attributes order so as to get good compression
- @xsi:type and/or @xsi:nil must appear first in schema and schemaless modes
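The buffering described above can be sketched as follows. The ordering used here (xsi:type, then xsi:nil, then the remaining attributes by local name and then namespace URI) is an assumption made for illustration, and `order_attributes` is a hypothetical helper, not part of any EXI API.

```python
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def order_attributes(attrs):
    """Return buffered attributes in the order an encoder would write them.

    `attrs` is a list of (namespace_uri, local_name, value) tuples collected
    while the application is still supplying attributes for one element.
    """
    def sort_key(attr):
        uri, local, _value = attr
        if uri == XSI and local == "type":
            rank = 0  # assumed: xsi:type goes first
        elif uri == XSI and local == "nil":
            rank = 1  # assumed: xsi:nil follows xsi:type
        else:
            rank = 2  # everything else, sorted by local name then URI
        return (rank, local, uri)

    return sorted(attrs, key=sort_key)


# Attributes in the order the application happened to supply them.
buffered = [
    ("", "id", "42"),
    (XSI, "nil", "true"),
    ("", "a", "x"),
    (XSI, "type", "xs:int"),
]
ordered = order_attributes(buffered)
```

The point is that nothing can be written until the last attribute has been supplied, which is where the storage cost comes from compared with a streaming text writer.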
The first reason is a strong one. I also note that EXI allows some flexibility in the attribute ordering if that better suits the application's needs. This seems very reasonable, as this added flexibility impacts neither performance nor interoperability.
The second reason seems weaker since, at least in our scenarios, @xsi:type/@xsi:nil do not appear very often anyway.
It would have been good to have the flexibility to put @xsi:type and @xsi:nil in the order desired by applications.
Of course, this would require changing the way these attributes actually affect the EXI grammars.
I did not do the full exercise, but I am confident that there are some reasonably simple workarounds that would get us back to a similar functionality level anyway.
The advantages would have been:
- No more special attribute behavior handling at the codec runtime level
o General spec simplification
o Smaller and potentially faster EXI codec runtimes
- Performance improvement by enabling streamed encoding of attributes
o At least in the case of built-in grammars but also in schema-deviation mode
o More consistent with some text XML writer behavior
This is feedback related to the EXI specification.
Currently, according to http://www.w3.org/TR/exi/#addingProductionsStrict, a @xsi:type production is added only when named subtypes are known to the EXI processor.
The intention, AIUI, is that as few @xsi:type productions as possible are actually added to grammars so as to get some compression gain.
I see a practical drawback with this current approach.
Some XML documents, valid according to the XML schema components used to generate the grammars, will not be encodable by EXI encoders.
Given the following schema and instance:
<xs:element name="test" type="xs:base64Binary"/>
<test xsi:type="xs:base64Binary" .../>
The instance is valid as per the schema but, according to our current interpretation, is not encodable in strict mode.
If that is not correct, could you clarify the specification?
If that is correct, this is clearly not practical, since one of the strict mode design goals was to encode at least all XML schema valid documents.
This case does not occur often at present. However, applications may start adding @xsi:type information more often to ensure value typing, even in schemaless mode.
The additional issue is that, in some environments, it may be tempting to use a global schema at the application level and a subset for the EXI transmission (to fit the lightest devices).
In those cases, this issue may cause trouble since perfectly application-schema-valid documents will not be encodable in strict mode, depending of course on the exact EXI schema subset. It could be expected that documents valid according to a particular schema could be encoded in strict mode with a subset of that schema.
Also, adding new grammars to a given grammar set would require potential modifications of the grammars themselves.
Getting back to the actual compression gain of the current approach, the benefit is not high.
At most, it gains 1 bit per element (in bit-packed mode only, no difference for the compression mode).
In practice, I even doubt that the compression gain is that high:
- A @xsi:type production is often added to simple typed elements (e.g. "xs:string" typed elements)
- A @xsi:type production has a minor impact on the code length as soon as the element content has optional attribute/child items
At the very least, the rule could be changed so that a @xsi:type production is added for all elements whose content is defined by a global type definition.
This seems more in line with the XML Schema specification. In addition, schema writers who want to squeeze out as many bits as possible could inline their schema type definitions.
This is feedback we have related to the EXI specification.
When preserveLexicalValues is true, @xsi:type should be encoded as a string, right?
In the case of preserving lexical values but not preserving namespaces (strict mode for instance),
the EXI stream will not contain any information about the URI of the @xsi:type qname value.
If that is correct, there is a potential issue:
- Encoder has the URI information
o Encoder will pick the grammar from the @xsi:type value (according to the spec)
- Decoder does not have the URI information
o It is not able to check whether there is a grammar associated with the @xsi:type value
o Decoder will pick the default grammar (this is my interpretation of the spec)
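The mismatch described above can be made concrete with a small sketch. It assumes the @xsi:type value travels as the lexical string "prefix:local" and that the decoder receives no prefix bindings because namespaces are not preserved; `resolve_qname` and `select_grammar` are hypothetical helpers, and the grammar names are placeholders.

```python
XSD = "http://www.w3.org/2001/XMLSchema"


def resolve_qname(lexical, prefix_map):
    """Resolve 'prefix:local' to (uri, local), or None if the prefix is unbound."""
    prefix, _sep, local = lexical.rpartition(":")
    uri = prefix_map.get(prefix)
    return None if uri is None else (uri, local)


def select_grammar(lexical, prefix_map, type_grammars, default="default-grammar"):
    """Pick the grammar for an @xsi:type value, falling back to the default."""
    return type_grammars.get(resolve_qname(lexical, prefix_map), default)


type_grammars = {(XSD, "base64Binary"): "base64Binary-grammar"}

# The encoder sees the in-scope namespace bindings of the source document.
encoder_choice = select_grammar("xs:base64Binary", {"xs": XSD}, type_grammars)

# The decoder only sees the string; no bindings were carried in the stream.
decoder_choice = select_grammar("xs:base64Binary", {}, type_grammars)
```

Under these assumptions the two sides select different grammars for the same element, so the rest of the element content would be decoded against the wrong productions.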
I am a bit surprised that @xsi:type dynamic typing would not be usable when preserving lexical values in strict mode...
Anyway, the spec should probably align the EXI encoder behavior with the EXI decoder.
The wording of @xsi:nil in section 9.2.1 seems slightly out of sync.
Whether the @xsi:nil value is placed inside or outside the structure channel depends on whether @xsi:nil "has a schema-valid value".
I would reword it in terms of the @xsi:nil value being encoded using the Boolean representation.
This does not change the behavior and I find it clearer this way.
We are currently testing our EXI implementation using the XML Schema of
Schemas (http://www.w3.org/2001/XMLSchema) to create a schema-informed
grammar. We run into the following problem:
- Section 7.3.1 says "When a schema is provided, the string table is
also pre-populated with the local name of each attribute, element and
type declared in the schema, partitioned by namespace URI and sorted
- Section D.3 says "When XML Schemas are used to inform the grammars for
processing EXI body, there an additional partition that is appended to
the local-name partitions." and goes on listing the relevant local names.
However, the list of local-names provided in Section D.3 is not
consistent with the one produced when processing the XML Schema of
Schemas: the former only contains the local names of XML Schema
predefined types, while the latter also contains the local names of
elements used to write a schema. Should we overwrite the initial entries
defined in Section D.3 with the complete set of entries? Should we
append the missing entries (in lexicographical order) to the existing
entries? It might be useful to clarify this specific case in the spec,
in order to ensure interoperability.
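The two readings asked about above give different string table indices, which is why the choice matters for interoperability. In this sketch the entry lists are small illustrative subsets rather than the real tables, Python's default string ordering stands in for the ordering the spec prescribes, and `overwrite` and `append_missing` are hypothetical names for the two readings.

```python
def overwrite(d3_entries, schema_local_names):
    """Reading 1: replace the D.3 entries with the full sorted set."""
    return sorted(set(d3_entries) | set(schema_local_names))


def append_missing(d3_entries, schema_local_names):
    """Reading 2: keep the D.3 entries in place, then append the sorted rest."""
    missing = sorted(set(schema_local_names) - set(d3_entries))
    return list(d3_entries) + missing


# Illustrative subset of predefined type names, as in Section D.3.
d3 = ["anyType", "boolean", "string"]
# Illustrative subset of local names found in the Schema of Schemas.
from_schema = ["anyType", "attribute", "boolean", "element", "string"]

table_1 = overwrite(d3, from_schema)
table_2 = append_missing(d3, from_schema)

# The same local name ends up with a different compact identifier.
index_1 = table_1.index("boolean")
index_2 = table_2.index("boolean")
```

An encoder using one reading and a decoder using the other would disagree on which local name a given compact identifier refers to.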