Warning:
This wiki has been archived and is now read-only.

EXI2

From W3C EXI WG's Public Wiki

Wishes List

This section is meant to collect wishes for capabilities and their use cases. Each item needs to describe at least the following:

  • Use case description (as specific as possible, including domains)
  • Relevant ideas (list ideas from Collection of ideas)
  • Alternatives

XMPP-Like Stream/Dynamic Schema

(YD)

The current XEP uses EXI in an irregular manner. I think EXI itself can solve the issues gracefully. XMPP uses XML Schema dynamically, with element-by-element message push (streaming). Thus, the features needed are: (1) an optional NOP production rule to fill a byte, and (2) an optional grammar switch production below a specific element (with a schemaID).

Alternatives: XEP-0322, or byte-aligned self-contained message (less efficient)

Random Read/Write on Slow Wireless Link

(YD)

Some classes of RFID tags have rewritable memory. Logistics gates must read and write them correctly over a wireless signal in less than a second. The memory is divided into segments. The current non-XML format uses an index at segment 0, and data in the other segments can be accessed quickly through the index. If they (ISO) want to update the data format to an XML-based format, EXI should be the first candidate. However, EXI does not allow random access to an arbitrary element. If there were a standard way to put an index in the options header, random access should be easy. The index should be something like a list of self-contained elements, with each entry holding (an XPath to specify the element, the byte position of the head of the self-contained element).

Alternative: Application-specific Index document and content documents (one document per segment)
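A minimal sketch of the kind of index this item asks for, assuming each entry is simply a pair of (XPath, byte position) and that the tag memory uses fixed-size segments. The names IndexEntry, build_index and segment_for, and the 64-byte segment size used in the usage note, are hypothetical and not part of any EXI specification.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    xpath: str        # path identifying the self-contained element
    byte_offset: int  # position of the element's first byte in the stream


def build_index(entries):
    """Return a lookup table from XPath to index entry."""
    return {entry.xpath: entry for entry in entries}


def segment_for(entry, segment_size):
    """Map a byte position to (segment number, offset inside that segment)."""
    return divmod(entry.byte_offset, segment_size)


index = build_index([
    IndexEntry("/tag/header", 0),
    IndexEntry("/tag/item[1]", 70),
    IndexEntry("/tag/item[2]", 130),
])
```

With 64-byte segments, segment_for(index["/tag/item[2]"], 64) tells the reader to fetch segment 2 and start decoding 2 bytes into it, without touching the segments in between.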

EXI 2.0 - Collection of ideas

(Improvement) EXI Options Document

schemaId #1

There seems to be common agreement that the XML Schema for EXI Options Document could be less restrictive in regard to schemaId allowing users to type-cast the schema id to any simple type. Type-aware schema IDs may be beneficial for many use cases.

The initial counterargument against this possibility was that EXI Options Document parsers would need to be able to process any XSD type, which is true. However, this already seems to be the case in EXI 1.0, given the user-defined meta-data section. Further, unless xsi:nil and xsi:type are used, the stream does not change.


schemaId #2

As pointed out in an email, it sounds reasonable to reduce the overhead of indicating the use of "schema-less EXI streams" and "EXI streams with built-in XSD datatypes". Doing so would also make it possible to require its presence in the EXI Header Options.

(Improvement) SelfContained Elements within Compressed Streams

In EXI 1.0 the selfContained option MUST NOT appear in an EXI options document when one of the "compression", "pre-compression" or "strict" elements is present in the same options document.

For the option "strict" there are very good reasons for doing so

  • No second level event portion available to indicate SC event
  • (most restrictive EXI stream in regard to validation, competitive with hand-optimized binary formats, ...)

Anyhow, it seems possible to support "compression" and "pre-compression" by closing the current block after each SC event in a similar fashion as it is done when blockSize value is reached.

Use-case: very large "paged" XML documents where, on the one hand, performance is much better when compression is used, but some kind of random access (link to page) is still sensible.
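The idea of closing the current block after each SC event can be illustrated outside of EXI with plain DEFLATE. The sketch below is only an analogy under stated assumptions: each "page" is compressed as its own raw DEFLATE stream (RFC 1951) and a list of byte offsets stands in for whatever index would point at the pages. It does not model EXI channels or event codes, and the function names are hypothetical.

```python
import zlib


def compress_pages(pages):
    """Compress every page as an independent raw DEFLATE stream.

    Returns the concatenated bytes and the start offset of each page,
    followed by one final offset marking the end of the data.
    """
    blob = bytearray()
    offsets = []
    for page in pages:
        offsets.append(len(blob))
        compressor = zlib.compressobj(wbits=-15)  # raw DEFLATE, no zlib header
        blob += compressor.compress(page) + compressor.flush()
    offsets.append(len(blob))
    return bytes(blob), offsets


def read_page(blob, offsets, k):
    """Decompress page k only, without touching the other pages."""
    return zlib.decompress(blob[offsets[k]:offsets[k + 1]], wbits=-15)


pages = [b"<page n='%d'>lorem ipsum</page>" % i * 40 for i in range(3)]
blob, offsets = compress_pages(pages)
```

Because every page starts with a fresh compressor, read_page(blob, offsets, 1) works without decoding page 0. The price is that redundancy shared between pages is no longer exploited.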


(Improvement) Index for SelfContained EXI Streams

EXI allows SelfContained elements to be created so that dedicated portions of an EXI stream can be skipped or jumped to (et cetera). Without an index structure this feature might not be as useful. EXI does not define any index structure yet.

The WG might want to consider defining such an index structure.

Use Cases:

  • JS script representation (skipping over JS functions or load functions on demand)
  • Fast(er) access on rather larger EXI streams
  • ....

(Improvement) Reduce Overhead of xsi:type cast

Currently a type cast is very costly, given that such a cast has to identify the type out of all namespaces and all possible local-names. At least in STRICT mode it would make sense to restrict the cast to only the "possible" sub-types.

The current approach is very beneficial w.r.t. re-use of code and simplicity. That said, in some use cases it is too generic and not compact. Maybe there is a good alternative.

YF: We could also have a fixed list of types that can be referenced by index (no need to store all types) OR change localName representation for xsi:type from using "7.3.3 Partitions Optimized for Frequent use of String Literals" to "7.3.2 Partitions Optimized for Frequent use of Compact Identifiers"
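A rough way to see the potential saving: EXI encodes a choice among m alternatives (compact identifiers, event code parts) as an n-bit unsigned integer with n = ceil(log2 m). The sketch below compares indexing into every type a schema declares with indexing into only the possible sub-types. The type counts (300 and 3) are hypothetical, and the real xsi:type QName encoding involves further components (URI and local-name partitions) that are not modeled here.

```python
def n_bit_width(alternatives: int) -> int:
    """Bits needed to distinguish `alternatives` values: ceil(log2(alternatives))."""
    if alternatives < 1:
        raise ValueError("need at least one alternative")
    return (alternatives - 1).bit_length()


ALL_TYPES = 300        # hypothetical: every named type across all namespaces
POSSIBLE_SUBTYPES = 3  # hypothetical: types derived from the declared type

print(n_bit_width(ALL_TYPES))          # 9
print(n_bit_width(POSSIBLE_SUBTYPES))  # 2
```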

(Improvement) SelfContained Elements with some prior learning

  • SC always start with empty string tables and no prior grammar learning.
  • While grammar learning may not hurt a lot, empty string tables can harm compression.
  • It would be good if some means to prefill string tables with a shared common set of information would be available
    • Either through a more precise schema that can fill string tables
    • Or through a bootstrap EXI section that is supposed to be decoded before each SC

Use-case: cameras streaming video sequence and related XML metadata

See below for "Shared String Table" section.

(Improvement) Redundant second level productions in strict FALSE

When applying the steps prescribed in "8.5.4.4.1 Adding Productions when Strict is False", an EE production is added to the grammars only if it is not already available as a first level production (having an event code of length 1). The same approach seems applicable to the addition of AT(*), CH [untyped value] and SE(*) when there is an attribute wildcard, mixed content and an element wildcard, respectively.

The conditional addition of these productions seems justified, as at most these productions will differ from their first level counterparts by the non-terminal symbol on the right-hand side.

This change (removing the second level productions AT(*), CH [untyped value] and SE(*) when available at level 1), if implemented, will change the number of bits required for encoding second level productions and thus will lead to incompatibilities with current EXI 1.0. This change is advisable only if the new version of the specification is meant to be backwards incompatible with EXI 1.0. During the WG teleconference on 2012-12-12 it was also discussed that introducing the conditional addition could have a negative impact on implementation complexity.

YF: I would suggest doing the reverse.. having the same 2nd level productions all the time to simplify processing.

(New Feature) Flush in Stream

XMPP uses a long-standing XML stream as its communication format. Nodes need to flush their stream at the end of each message, iq, and presence tag to push the event update to the other end of the communication. A bit-packed EXI stream cannot be used for XMPP because the stream cannot be flushed unless it is on a byte boundary. (YD)

On the 2013-03-20 conference call, two ideas for this were raised: (1) selfContained and (2) CH("") for padding. Approach (1) prohibits re-use of string tables (but seems to work). For approach (2), I'm not sure whether it works (is allowed by the spec) or not. (YD)

XMPP already uses a solution that uses EXI body for each stanza. Not sure if flushing helps at this point.
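The byte-boundary problem can be shown with a toy bit writer. This is a sketch only: the BitWriter class is hypothetical, and padding with zero bits is an assumption for illustration, since which event (a NOP production, CH(""), or something else) would carry the padding is exactly the open question in this item.

```python
class BitWriter:
    """Accumulates bits most-significant-first and emits whole bytes."""

    def __init__(self):
        self.out = bytearray()
        self.acc = 0    # bits collected for the byte in progress
        self.nbits = 0  # how many bits of that byte are filled (0..7)

    def write(self, value, width):
        """Append the lowest `width` bits of `value`."""
        for shift in range(width - 1, -1, -1):
            self.acc = (self.acc << 1) | ((value >> shift) & 1)
            self.nbits += 1
            if self.nbits == 8:
                self.out.append(self.acc)
                self.acc, self.nbits = 0, 0

    def flush(self):
        """Pad with zero bits up to the next byte boundary; return the pad size."""
        pad = (8 - self.nbits) % 8
        if pad:
            self.write(0, pad)
        return pad


writer = BitWriter()
writer.write(0b101, 3)  # a 3-bit event code leaves the stream mid-byte
padding = writer.flush()
```

After the 3-bit write nothing can be sent yet; flush() adds 5 padding bits so that one complete byte is available to push to the peer.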

(New Feature) Streaming

Related to Flush in Stream. Consider the use case of a never-ending source or producer of EXI, for example XMPP, a sensor producing information, or a Twitter feed. The producer needs to emit either a stream of N+ documents or a document that never ends. As with "Flush in Stream", part of the solution might be adding flush to the writer, but this doesn't solve the reader problem: how does a reader know not to over-read? On many protocols a reader may be doing blocking IO, requesting that a buffer (say 1k) be filled or that EOF be reached. Whether flushing data from the writer actually causes the reader to return is protocol dependent. Streaming support may depend on sufficient information being sent earlier so that readers know exactly how much to request from the input; they can then still do buffered blocking IO but return to the application when an element or document is complete. In addition, it would be useful to support streaming a sequence of documents instead of, or in addition to, individual elements. This also requires the reader to know how far to look ahead without blocking or expecting EOF. (DL)

This issue does not appear to be unique to EXI. Therefore, solutions should be considered elsewhere regardless of the use of EXI.
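One common way to give a reader "sufficient information sent earlier" is length-prefixed framing at the transport level, which fits the remark that the issue is not unique to EXI. The sketch below assumes a 4-byte big-endian length prefix in front of each document; that framing and the function names are assumptions for illustration and are not defined by EXI.

```python
import io
import struct


def write_frame(stream, payload):
    """Write a 4-byte big-endian length followed by the payload."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_exact(stream, n):
    """Read exactly n bytes, looping because read() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream ended inside a frame")
        buf += chunk
    return bytes(buf)


def read_frame(stream):
    """Read one complete frame without requesting bytes beyond its end."""
    (length,) = struct.unpack(">I", read_exact(stream, 4))
    return read_exact(stream, length)


link = io.BytesIO()
write_frame(link, b"first EXI document")
write_frame(link, b"second EXI document")
link.seek(0)
```

Each read_frame(link) call asks for exactly the bytes of one document, so a blocking reader returns to the application as soon as that document is complete.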

(New Feature) Standard Format of Schema-Informed Grammar

Making grammar out of a schema is a complex and error-prone part of EXI implementation. If there is a format to describe canonical grammar without specific implementation dependency, it should help deployment of EXI in many platforms. (YD)

This may also help with the 'dynamic schema' problem, which has been dealt with through an awkward 'schema upload/download' in XEP-0322 (XMPP). (2013-07-25 added by YD)

(Improvement) Schema-Informed Grammar for Wildcard Content

XMPP uses wildcards to embed extensions in the XML stream, and the extended part may be a significant part of the stream. It also provides an XML schema for each extension. Ideally, a schema-informed grammar could be used to encode extended elements, instead of the Ur-Type grammar used in EXI 1.0, by dynamically extending the grammar in the middle of the stream. (YD)

(New Feature) BigFloat

EXI 1.0 Float restricts the allowed value ranges of the mantissa and the exponent and prohibits out-of-range values. Are both limits required? Adding a flag for NaN and INF would be much simpler and more convenient for really, really big floating point numbers. (YD)

EXI Float can already represent all values in the range defined by xsd:double. The problem statement does not describe why we need to consider to further expand the range.

(New Feature) Proper support of xsd:union

EXI 1.0 maps union types to the exi:string datatype. What would be the burden of properly supporting xsd:union by allowing a mapping to one of the actual memberTypes?

see also old email discussion (https://lists.w3.org/Archives/Member/member-exi-wg/2010May/0056.html)

XBRL and MPEG-7 schemas use union datatypes, and in both cases proper union codec would make data representation more compact.

(Improvement) Memory-sensitive EXI Grammar approach

EXI grammars tend to grow rapidly if maxOccurs values in XML Schema are rather large numbers (e.g., maxOccurs="1000"). The benefit of the current approach of expanding all grammars does not seem justifiable. A simple and powerful mechanism could be that all maxOccurs values greater than 1 (> 0) are interpreted like unbounded (maxOccurs="unbounded"). This preprocessing of schema information can easily be done up front, before feeding XSDs to EXI processors. (DP)

Users sometimes don't know the consequence of setting maxOccurs to large values. Others just don't want "unbounded" and want to specify a theoretical bound. Other binary XML formats, such as BiM, already treat large values as unbounded.

The proposed modification is beneficial in regard to memory (EXI grammars at runtime) and also when serializing EXI grammars.

(Potential solution discussed at the TPAC 2014 F2F) Define some schema/grammar transformation rule as a reference and let applications decide how schemaId can identify the transformation and the origin of the schema.

minOccurs will be same (set to zero? one?)

Problem Schema when changed from maxOccurs="10" to "unbounded" (causes UPA constraint to fail)

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:complexType name="XX">
        <xs:sequence>
            <xs:element name="a" type="xs:int" minOccurs="10" maxOccurs="10"/>
            <xs:element name="a" type="xs:int"/>
        </xs:sequence>
    </xs:complexType>
</xs:schema>
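The up-front preprocessing step could look roughly like the sketch below, which rewrites every numeric maxOccurs greater than 1 to "unbounded" using only the Python standard library. The function name relax_max_occurs is hypothetical, minOccurs is left untouched (one of the open questions above), and no UPA check is performed, so running it on the problem schema produces exactly the rewritten schema that is reported above to fail the UPA constraint.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:complexType name="XX">
        <xs:sequence>
            <xs:element name="a" type="xs:int" minOccurs="10" maxOccurs="10"/>
            <xs:element name="a" type="xs:int"/>
        </xs:sequence>
    </xs:complexType>
</xs:schema>
"""


def relax_max_occurs(xsd_text):
    """Rewrite maxOccurs values > 1 to "unbounded"; return (root, change count)."""
    root = ET.fromstring(xsd_text)
    changed = 0
    for node in root.iter():  # maxOccurs may sit on elements, sequences, wildcards
        value = node.get("maxOccurs")
        if value is not None and value != "unbounded" and int(value) > 1:
            node.set("maxOccurs", "unbounded")
            changed += 1
    return root, changed


root, changed = relax_max_occurs(SCHEMA)
elements = root.findall(f".//{{{XS}}}element")
```

A real tool would also have to validate the result, since the transformation can turn a valid schema into an invalid one.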

(Improvement) Better String Table in terms of Local ID

The "permanently unassigned" state of local IDs makes implementation too complex. Is there any simpler approach? (YD)

(Improvement) Number of Significant Digits

We need an EXI property for defining significant digits in a floating point number. This can greatly reduce the number of bits needed to represent a value. This can be a major compaction (and possibly performance) benefit for numeric datasets.

Such a capability might also assist during comparison of data reduction techniques, to ensure that "apples to apples" measurements are actually occurring.
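To illustrate why a significant-digits property could shrink floating point values: an EXI Float is a decimal mantissa and exponent pair, so rounding to fewer significant digits directly shortens the mantissa. The sketch below uses Python's decimal module; the function name and the idea of passing the digit count as a parameter are assumptions, no such EXI option exists today, and special values (NaN, INF) are not handled.

```python
from decimal import Context


def to_mantissa_exponent(text, digits):
    """Round a decimal literal to `digits` significant digits.

    Returns (mantissa, exponent) with value == mantissa * 10 ** exponent.
    """
    rounded = Context(prec=digits).create_decimal(text)
    sign, digit_tuple, exponent = rounded.as_tuple()
    mantissa = int("".join(map(str, digit_tuple)))
    # Move trailing zeros from the mantissa into the exponent.
    while mantissa and mantissa % 10 == 0:
        mantissa //= 10
        exponent += 1
    return (-mantissa if sign else mantissa), exponent


full = to_mantissa_exponent("3.14159265358979", 15)
short = to_mantissa_exponent("3.14159265358979", 4)
```

Here full is (314159265358979, -14) and short is (3142, -3): the mantissa drops from 49 bits to 12 bits of magnitude.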

(Improvement) EXI Properties Embedded in XML Schema

EXI properties need to be consistently represented in XML schema so that tools can apply the best EXI techniques by default.

Adding such EXI property values for widely used XML vocabularies can turn XML Schemas into a tool for compression, not just validation and data modeling.

(New Feature) Shared String Table

Some use cases (e.g. WoT TD) want to have some well-known values leveraged for string encoding.

(New Feature) Split String Representation

Some use cases depend on longer string values that vary only slightly (e.g., URIs).
http://www.w3.org/TR/exi/#header
http://www.w3.org/TR/exi/#streams

In EXI 1.0 these strings have to be encoded twice (including the shared prefix "http://www.w3.org/TR/exi/"). There is one trick that works for Character (CH) events, but not for AT events, in the case of non-strict encoding. The CH event can be split, such as CH("http://www.w3.org/TR/exi/") + CH("#header").

It would be nice to have intrinsic support for such a feature in EXI 2.0 also for attributes and strict EXI documents (DP).
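The split itself is simple to sketch: find the longest already-known prefix and encode only the remainder as a new string. The helper below is hypothetical, and how the set of known prefixes would be maintained (for example from earlier string table entries) is left open.

```python
def split_on_known_prefix(value, known_prefixes):
    """Return (longest known prefix of value, remaining suffix)."""
    best = max(
        (p for p in known_prefixes if value.startswith(p)),
        key=len,
        default="",
    )
    return best, value[len(best):]


known = ["http://www.w3.org/", "http://www.w3.org/TR/exi/"]
prefix, suffix = split_on_known_prefix("http://www.w3.org/TR/exi/#streams", known)
```

For the second URI above, prefix is "http://www.w3.org/TR/exi/" (a potential string table hit) and only "#streams" remains to be written out in full.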

(Improvement) Change default EXI option values for Strings

The "default" value for valueMaxLength and valuePartitionCapacity is unbounded. In some use cases this default behaviour is problematic (DP).

For example, consider use cases with very long strings and a very long EXI stream. The EXI string table tends to grow rapidly until it exceeds the available memory. Moreover, the EXI encoder may choose not to index new strings, whereas the decoder needs to keep all strings in memory.

Further, the likelihood of string table hits for strings with a length greater than 32 vanishes.

One may argue that such use cases need to set the EXI options properly. On the other hand the default value should be usable in all (or at least most) of the cases.

Proposal: the default value for valueMaxLength may be something like 32 and valuePartitionCapacity another "defined" (even large) value.
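The effect of a bounded valueMaxLength on table growth can be sketched with a toy value table. This only models the rule that strings longer than valueMaxLength are not added; valuePartitionCapacity and its replacement behaviour are deliberately left out, and the function name is hypothetical.

```python
def build_value_table(values, value_max_length=None):
    """Assign IDs to distinct values, skipping those longer than the limit.

    value_max_length=None models the current unbounded default.
    """
    table = {}
    for value in values:
        too_long = value_max_length is not None and len(value) > value_max_length
        if not too_long and value not in table:
            table[value] = len(table)
    return table


values = ["ok", "x" * 100, "ok", "warn", "y" * 5000]
unbounded = build_value_table(values)
bounded = build_value_table(values, value_max_length=32)
```

With the unbounded default the decoder keeps all four distinct strings, including the 5000-character one; with a limit of 32 only "ok" and "warn" are retained.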

(New Feature) EXI Grammar Description

Prior work looked at Grammar Exchange Format. However this did not reach a conclusion.

Perhaps describing an EXI grammar from the perspective of the EXI Recommendation, via an XML schema, is a better starting point. ("Starting from scratch" if you will.)

It might then be easier to determine (a) does the XML document accurately express an EXI grammar, and (b) does it meet the necessary goals of the original proposals?

Such an approach would seem to be able to avoid any implementation-specific biases. It would also help support the use case of limited-memory mobile devices with schema-informed grammars.

Of note is that such an XML format can itself be compressed using EXI, and also loaded by any EXI engine, so the internal format of a given EXI application no longer matters.

(Improvement) EXI Compression other than DEFLATE

When the value of the compression option is set to true, each compressed stream in a block is stored using the standard DEFLATE Compressed Data Format defined by RFC 1951 (see https://www.w3.org/TR/exi/#CompressedStreams).

How could we allow for other compression schemes?

Points to be careful about:

  • Ensuring compatibility among EXI datasets
  • Meeting W3C patent policy to ensure that additions are usable in long term

These issues can be considered as part of a Call for Contributions, if we think they are worth pursuing. It will be easier than the first time since requirements are now very well defined.

EXI 1.0 - Known issues

Fallback grammar

We need a very simple mechanism to allow processors to always fall back to the xsd:anyType grammar, or possibly the element-fragment grammar, in the next revision of EXI (see EXI profile efforts).

Abstract elements and types

In EXI 1.0, abstract elements themselves are included as particles in proto-grammars. The WG then noted we should have excluded them.

Similarly, EXI should not allow abstract types to be used as the value of the xsi:type attribute. Though this aspect was not discussed, it should be discussed together with the handling of abstract elements.

Random access

  • like a video format for films, maybe support an instruction to reset all string tables and start over, e.g. to allow random access to individual pages of a document; this was a use case brought to the original Binary XML Workshop, and is something you can't do with plain text XML. Could be very powerful when combined with HTTP Range. (LQ)
  • similar to - Skip, Jump, Index, of document structure of EXI stream (to make a quick access to significant elements) (YD)
  • Note that Random Access is likely to be in direct contradiction to streaming. If both are supported, a means of identifying which sections of the EXI stream can be randomly accessed and which cannot is necessary. (DL)

EXI 1.0 already has support for random access with Self-Contained Element. What is still missing?

Other ideas that may or may not be helpful

HTML 5

  • Are there changes to EXI that would make it ideal for HTML 5 support? (LQ)

XDM

  • What about transmitting an XDM instance? Maybe this would facilitate a distributed XProc, or XSLT or XQuery? (LQ)
    • W3C Recommendation: XQuery and XPath Data Model (XDM)

JSON

  • A "schema" for JSON might be interesting... (LQ)

Please see Efficient XML Interchange (EXI) for JSON

Editorial Stuff

Re-Organized Modularized Specs According to Use Case Analysis

One of the reasons why EXI is difficult to implement is that there are too many corner cases that must be supported to cover 'everything in XML'. On the other hand, only a few (public) use cases rely on the full specification. I would rather propose a slimmed-down, simplified version of the EXI spec, with add-on specs describing less commonly used features such as DataTypeRepresentation. (YD)

Overall spec writing approach for future specs

Revise spec writing approach (see https://lists.w3.org/Archives/Member/member-exi-wg/2013May/0066.html)

  • create one core spec that deals with EXI datatypes and its association to EXI grammars
  • for schema languages we define a separate document that describes how a certain schema language definitions maps to EXI grammars and the related string tables etc. Essentially everything that must be shared between parties to be able to interoperate.
  • provide mappings for XML Schema (as exists now), WebIDL, perhaps the EXI JSON schema
  • define a standardized EXI grammar format? Using EBNF in the Recommendation itself still seems suitable.

Minor improvements

  • Re-arrange subsections in 7.1 Built-in EXI Datatype Representations to reflect codec dependencies (1. n-bit, 2. boolean et cetera) (see also: https://www.w3.org/XML/Group/EXI/wiki/File:Exi-primitives.png )
  • Consider creating informative state transition diagrams illustrating (and comparing the symmetry) of the EXI compression and decompression algorithms

Terminology

  • We may want distinct terms for a single DFA for a content model and for the whole set of state machines for a document. For example, content grammar and document grammar? (YD)

Best Practice Update

  • Update of best practice with regards to discussions in IETF (-exi suffix problem) (YD)
  • 'recommended set of options ordered in implementation difficulty' may help option negotiation.