This document is also available in these non-normative formats: XML.
Copyright © 2007 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document is the specification of the Efficient XML Interchange (EXI) format. EXI is a very compact representation for the eXtensible Markup Language (XML) Information Set that is intended to simultaneously optimize performance and the utilization of computational resources. The EXI format uses a hybrid approach drawn from the information and formal language theories, plus practical techniques verified by measurements, for entropy encoding XML information. Using a relatively simple algorithm, which is amenable to fast and compact implementation, and a small set of data types, it reliably produces efficient encodings of XML event streams. The event production system and format definition of EXI are presented.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the first Public Working Draft of the Efficient XML Interchange (EXI) Format 1.0 specification and is intended for review by W3C members and other interested parties. It has been developed by the Efficient XML Interchange (EXI) Working Group, which is part of the Extensible Markup Language (XML) Activity.
The features and algorithms described in the normative portion of the document are specified in enough detail to be adequate for early implementation experiments, except for Section 9.5.3 Schema-informed Element Grammar, which currently describes this feature by example. Other features the group is considering are found in the non-normative Appendix E Format Features Under Consideration, for which only brief descriptions are provided, and should probably not yet be considered for implementation.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
Comments on this document are invited and are to be sent to the public public-exi@w3.org mailing list (public archive).
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1. Introduction
1.1 History and Design
1.2 Notational Conventions and Terminology
2. Design Principles
3. Basic Concepts
4. EXI Streams
5. EXI Header
5.1 Distinguishing Bits
5.2 EXI Format Version
5.3 EXI Options
6. Encoding EXI Streams
6.1 Determining Event Codes
6.2 Representing Event Codes
6.3 Fidelity Options
7. Representing Event Content
7.1 Built-in EXI Datatypes Representation
7.1.1 Binary
7.1.2 Boolean
7.1.3 Decimal
7.1.4 Float
7.1.5 Integer
7.1.6 Unsigned Integer
7.1.7 QName
7.1.8 Date-Time
7.1.9 n-bit Unsigned Integer
7.1.10 String
7.1.11 List
7.2 Enumerations
7.3 String Table
7.3.1 String Table Partitions
7.3.2 Partitions Optimized for Frequent use of Compact Identifiers
7.3.3 Partitions Optimized for Frequent use of String Literals
7.4 Pluggable CODECS
8. EXI Compression
8.1 Blocks
8.2 Channels
8.2.1 Structure Channel
8.2.2 Value Channels
8.3 Compressed Streams
9. EXI Grammars
9.1 Grammar Notation
9.1.1 Fixed Event Codes
9.1.2 Variable Event Codes
9.2 Grammar Event Codes
9.3 Pruning Unneeded Productions
9.4 Built-in XML Grammars
9.4.1 Built-in Document Grammar
9.4.2 Built-in Fragment Grammar
9.4.3 Built-in Element Grammar
9.5 Schema-informed Grammars
9.5.1 Schema-informed Document Grammar
9.5.2 Schema-informed Fragment Grammar
9.5.3 Schema-informed Element Grammar
9.5.3.1 Element Content Models
9.5.3.2 Undeclared Productions
9.5.3.3 Combined Grammar
10. Conformance
A References
A.1 Normative References
A.2 Other References
B Infoset Mapping
B.1 Document Information Item
B.2 Element Information Items
B.3 Attribute Information Item
B.4 Processing Instruction Information Item
B.5 Unexpanded Entity Reference Information item
B.6 Character Information item
B.7 Comment Information item
B.8 Document Type Declaration Information item
B.9 Unparsed Entity Information Item
B.10 Notation Information Item
B.11 Namespace Information Item
C XML Schema for EXI Options Header
D Initial Entries in String Table Partitions
D.1 Initial Entries in Uri Partition
D.2 Initial Entries in Prefix Partitions
D.3 Initial Entries in Local-Name Partitions
E Format Features Under Consideration (Non-Normative)
E.1 Bounded Integers
E.2 Strict Schemas
E.3 Restricted Character Sets
E.4 IEEE Floats
E.5 Bounded Tables
E.6 Grammar Coalescence
E.7 Byte-Aligned Mode
E.8 Indexed Elements
F Example Encoding (Non-Normative)
G Acknowledgements (Non-Normative)
The Efficient XML Interchange (EXI) format is a very compact, high performance XML representation that was designed to work well for a broad range of applications. It simultaneously improves performance and significantly reduces bandwidth requirements without compromising efficient use of other resources such as battery life, code size, processing power, and memory.
EXI uses a grammar-driven approach that achieves very efficient encodings using a straightforward encoding algorithm and a small set of data types. Consequently, EXI processors are relatively simple and can be implemented on devices with limited capacity.
EXI is schema "informed", meaning that it can utilize available schema information to improve compactness and performance, but does not depend on accurate, complete or current schemas to work. It supports arbitrary schema extensions and deviations and also works very effectively with partial schemas or in the absence of any schema. The format itself also does not depend on any particular schema language, or format, for schema information.
[Definition:] A program module called an EXI processor, whether implemented in software or hardware, is used by application programs to encode their structured data into EXI streams and/or to decode EXI streams to make the structured data accessible to them. This document not only specifies the EXI format, but also defines errors that EXI processors are required to detect and act upon.
EXI is the result of extensive work carried out by the W3C's XML Binary Characterization (XBC) and Efficient XML Interchange (EXI) Working Groups. XBC was chartered to investigate the costs and benefits of an alternative form of XML, and formulate a way to objectively evaluate the potential of a substitute format for XML. Based on XBC's recommendations, EXI was chartered, first to measure, evaluate, and compare the performance of various XML technologies (using metrics developed by XBC [XBC Measurement Methodologies]), and then, if it appeared suitable, to formulate a recommendation for a W3C format specification. The measurement results and analyses are presented elsewhere [EXI Measurements Note]. The format described in this document is the specification so recommended.
The functional requirements of the EXI format are those that were prepared by the XBC WG in their analysis of the desirable properties of a high performance encoding for XML [XBC Properties]. Those properties were derived from a very broad set of use cases also identified by the XBC working group [XBC Use Cases].
The design of the format presented here is largely based on the results of the measurements carried out by the group to evaluate the performance characteristics (mainly of processing efficiency and compactness) of various existing formats. The EXI format is based on Efficient XML [Efficient XML], including, for example, the basic heuristic grammar approach, compression algorithm, and resulting entropy encoding. Present work centers on evaluating and integrating some features from other measured format technologies into EXI (see Appendix E Format Features Under Consideration).
EXI is compatible with XML at the XML Information Set [XML Information Set] level, rather than at the XML syntax level. This permits it to encapsulate an efficient alternative syntax and grammar for XML, while facilitating at least the potential for minimizing the impact on XML application interoperability.
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear EMPHASIZED in this document, are to be interpreted as described in RFC 2119 [IETF RFC 2119]. Other terminology used to describe the EXI format is defined in the body of this specification.
The terms event and stream are used throughout this document to denote EXI event and EXI stream, respectively, unless they are explicitly qualified to mean otherwise.
This document specifies an abstract grammar for EXI. In grammar notation, all terminal symbols are represented in plain text and all non-terminal symbols are represented in italics. Grammar productions are represented as follows:
| LeftHandSide : Event NonTerminal |
A set of one or more grammar productions that share the same left-hand-side non-terminal symbol is often presented together, along with event codes that uniquely identify events among the collocated productions, as follows:
| LeftHandSide : | |
| Event 1 NonTerminal 1 | EventCode 1 |
| Event 2 NonTerminal 2 | EventCode 2 |
| Event 3 NonTerminal 3 | EventCode 3 |
| ... | |
| Event n NonTerminal n | EventCode n |
Section 9.1 Grammar Notation introduces additional notations for describing productions and event codes in grammars. Those additional notations facilitate concise representation of the EXI grammar system.
Terminal symbols that are qualified with a qname permit the use of a wildcard symbol (*) in place of a qname. The terminal symbol SE (*) matches a start element (SE) event with any qname. Similarly, the terminal symbol AT (*) matches an attribute (AT) event with any qname.
Several prefixes are used throughout this document to designate certain namespaces. The bindings shown below are assumed; however, any prefixes can be used in practice if they are properly bound to the namespaces.
| Prefix | Namespace Name |
|---|---|
| exi | http://www.w3.org/2007/07/exi |
| xml | http://www.w3.org/XML/1998/namespace |
| xsd | http://www.w3.org/2001/XMLSchema |
| xsi | http://www.w3.org/2001/XMLSchema-instance |
In describing the layout of an EXI format construct, a pair of square brackets [ ] is used to surround the name of a field to denote that the occurrence of the field is optional in the structure of the part or component that contains the field.
In arithmetic expressions, the notation ⌈x⌉ where x represents a real number denotes the ceiling of x, that is, the smallest integer greater than or equal to x.
The following design principles were used to guide the development of EXI and encourage consistent design decisions. They are listed here to provide insight into the EXI design rationale and to anchor discussions on desirable EXI traits.
One of the primary objectives of EXI is to maximize the number of systems, devices and applications that can communicate using XML data. Specialized approaches optimized for specific use cases should be avoided.
To reach the broadest set of small, mobile and embedded applications, simple, elegant approaches are preferred to large, analytical or complex ones.
EXI must be competitive with hand-optimized binary formats so it can be used by applications that require this level of efficiency.
EXI must deal flexibly and efficiently with documents that contain arbitrary schema extensions or deviate from their schema. Documents that contain schema deviations should not cause encoding to fail.
EXI must integrate well with existing XML technologies, minimizing the changes required to those technologies. It must be compatible with the XML Information Set [XML Information Set], without significant subsetting or supersetting, in order to maintain interoperability with existing and prospective XML specifications.
EXI achieves broad generality, flexibility, and performance, by unifying concepts from formal language theory and information theory into a single, relatively simple algorithm. The algorithm uses a grammar to determine what is likely to occur at any given point in an XML document and encodes the most likely alternatives in fewer bits. The fully generalized algorithm works for any language that can be described by a grammar (e.g., XML, Java, HTTP, etc.); however, EXI is optimized specifically for XML languages.
The built-in EXI grammar accepts any XML document or fragment and may be augmented with productions derived from XML Schemas [XML Schema Structures][XML Schema Datatypes], RELAX NG schemas [ISO/IEC 19757-2:2003], DTDs [XML 1.0] or other sources of information about what is likely to occur in a set of XML documents. The EXI encoder uses the grammar to map a stream of XML information items onto a smaller, lower entropy, stream of events.
The encoder then represents the stream of events using a set of simple variable length codes called event codes. Event codes are similar to Huffman codes [Huffman Coding], but are much simpler to compute and maintain. They are encoded directly as a sequence of values, or if additional compression is desired, they are passed to the EXI compression algorithm, which replaces frequently occurring event patterns to further reduce size.
When schemas are used, EXI also supports a user-customizable set of typed encodings for efficiently encoding typed values.
[Definition:] An EXI stream is an EXI header followed by an EXI stream body. It is the EXI stream body that carries the content of the document, while the EXI header amongst its roles communicates the options that were used for encoding the EXI stream body. Section 5. EXI Header describes the EXI header. Values in an EXI stream are packed into bytes most significant bit first.
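The most-significant-bit-first packing rule can be illustrated with a minimal sketch. The `BitWriter` class and its method names are hypothetical, introduced only for illustration, and the zero-padding of a final partial byte is an assumption of the sketch rather than something stated above.

```python
class BitWriter:
    """Hypothetical helper: packs n-bit unsigned values into bytes,
    most significant bit first."""

    def __init__(self):
        self._bits = []  # individual bits, in stream order

    def write(self, value, n):
        """Append the n-bit unsigned integer `value`, high-order bit first."""
        if not 0 <= value < (1 << n):
            raise ValueError("value does not fit in n bits")
        for shift in range(n - 1, -1, -1):
            self._bits.append((value >> shift) & 1)

    def to_bytes(self):
        """Return the packed bytes. Assumption of this sketch: a final
        partial byte is filled with zero bits."""
        padded = self._bits + [0] * (-len(self._bits) % 8)
        out = bytearray()
        for i in range(0, len(padded), 8):
            byte = 0
            for bit in padded[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        return bytes(out)


writer = BitWriter()
writer.write(0b10, 2)  # a 2-bit value lands in the two highest bits
writer.write(0, 1)     # a 1-bit value follows immediately
writer.write(0, 5)     # a 5-bit value completes the byte
packed = writer.to_bytes()
```

Because earlier values occupy the higher-order bits of each byte, a reader can recover the values by consuming bits from the top of each byte downwards.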
Applications that embed EXI streams in a container data format that identifies the content as an EXI stream and dictates the EXI format version and the EXI Options used for its encoding may wish to omit the EXI header. Although an EXI stream body is not a valid EXI stream, EXI processors MAY provide a capability to process an EXI stream body independent of an EXI stream.
[Definition:] The building block of an EXI stream body is an EXI event. An EXI stream body consists of a sequence of EXI events representing an EXI document or an EXI fragment.
The EXI events permitted at any given position in an EXI stream are determined by the EXI grammar. The events occur in a well-formed manner with matching start element and end element events in the same fashion as XML. The EXI grammar incorporates knowledge of the XML grammar and may be augmented and refined using schema information and fidelity options. EXI grammar is formally specified in section 9. EXI Grammars.
The following table summarizes the EXI events and associated content that occur in an EXI stream. In addition, the table includes the grammar notation used to represent each event in this specification. Each event in an EXI stream participates in a mapping system that relates events to XML Information Items so that an EXI document as a whole serves to represent an XML Information Set. The table shows XML Information Items relevant to each EXI event type. Appendix B Infoset Mapping describes the mapping system in detail.
| EXI Event Type | Content | Grammar Notation | Information Item |
|---|---|---|---|
| Start Document | | SD | B.1 Document Information Item |
| End Document | | ED | |
| Start Element | | SE ( qname ) | B.2 Element Information Items |
| | qname | SE ( * ) | |
| End Element | | EE | |
| Attribute | value | AT ( qname ) | B.3 Attribute Information Item |
| | qname, value | AT ( * ) | |
| Characters | value | CH | B.6 Character Information item |
| Namespace Declaration | prefix, uri | NS | B.11 Namespace Information Item |
| Comment | text | CM | B.7 Comment Information item |
| Processing Instruction | name, text | PI | B.4 Processing Instruction Information Item |
| DOCTYPE | name, public, system, text | DT | B.8 Document Type Declaration Information item |
| Entity Reference | name | ER | B.5 Unexpanded Entity Reference Information item |
Section 6. Encoding EXI Streams describes the algorithm used to encode events in the EXI stream. As indicated in the table above, there are some event types that carry content with their event instances while other event types function as markers without content. A grammar production may match a specific Element or Attribute by qname or match any Element or Attribute using wildcard notation. When a grammar matches an Element or Attribute by qname, the qname is not part of the content. When a grammar matches an Element or Attribute using wildcard notation, the qname is part of the content.
Each item in the event content has a data type associated with it as shown in the following table. The content of each event, if any, is encoded as a sequence of items, each of which is encoded according to its data type, in order starting with the first item followed by subsequent items.
| Content item | Used in | Type |
|---|---|---|
| name | PI, DT, ER | 7.1.10 String |
| prefix | NS | 7.1.10 String |
| public | DT | 7.1.10 String |
| qname | SE, AT | 7.1.7 QName |
| system | DT | 7.1.10 String |
| text | CM, PI | 7.1.10 String |
| uri | NS | 7.1.10 String |
| value | CH, AT | According to the schema type (see 7. Representing Event Content) if any is in effect and the preserve.lexicalValues option is set to false, otherwise 7.1.10 String |
Content items other than value have their inherent, fixed data types independent of their uses. The data type that governs each occurrence of the value item depends on the schema type, if any, that is in effect for the value in question. The type xsd:anySimpleType is used for values that do not have an associated schema type, are schema-invalid, or occur in mixed content. Section 7. Representing Event Content describes how each of the types listed above is encoded in an EXI stream.
Each EXI stream begins with an EXI header. The EXI header distinguishes EXI documents from text XML documents, identifies the version of the EXI format being used, and can specify the options used to encode the body of the EXI stream. The EXI header has the following structure:
| Distinguishing Bits |
| EXI Format Version | [EXI Options] | [Padding Bits] |
The EXI Options field within an EXI header is optional; its presence is indicated by the value of the presence bit that follows the Distinguishing Bits. The values 1 and 0 indicate presence and absence, respectively.
When the EXI options turn on compression of the EXI stream, padding bits of the minimum length required to make the whole length of the header byte-aligned are added at the end of the header.
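As a small worked illustration of the padding rule, the number of padding bits is the distance from the header length to the next multiple of eight. The function name `header_padding_bits` is hypothetical and the header length is assumed to be known in bits.

```python
def header_padding_bits(header_length_in_bits):
    """Hypothetical helper: minimum number of padding bits needed to
    make a header of the given bit length end on a byte boundary."""
    if header_length_in_bits < 0:
        raise ValueError("length must be non-negative")
    # Python's % always yields a non-negative result for a positive modulus.
    return -header_length_in_bits % 8
```

A header that is already byte-aligned needs no padding, so the function returns 0 for any multiple of eight.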
The following sections describe the remaining three parts of the header.
[Definition:] An EXI header starts with the Distinguishing Bits part, which is a two-bit field used to distinguish EXI documents from text XML documents. The first bit contains the value 1 and the second bit contains the value 0, as follows.
| 1 | 0 |
This bit sequence cannot occur as the first two bits of a well-formed XML document and represents the minimum length EXI document prefix required to distinguish EXI documents from XML documents.
XML processors are expected to consistently reject an EXI stream as soon as they read and process the first byte from the stream, since the bit sequence shown above does not constitute the first two bits of any well-formed XML document represented in any one of the conventional character encodings such as UTF-8, UTF-16, UCS-2, UCS-4, EBCDIC, ISO 8859, Shift-JIS and EUC, according to XML 1.0 [XML 1.0]. Systems that use EXI documents as well as XML documents can look at the Distinguishing Bits to determine whether to interpret a particular stream as XML or EXI.
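A minimal sketch of how a system that handles both EXI and XML might inspect the Distinguishing Bits is shown below. The function name is hypothetical; the sketch assumes the first byte of the stream is available as an integer and that bits are packed most significant bit first.

```python
def starts_with_distinguishing_bits(first_byte):
    """Hypothetical helper: True if the two highest-order bits of the
    first byte are 1 followed by 0."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    return (first_byte >> 6) == 0b10


# First bytes of some common text XML starts, for comparison:
#   0x3C is "<" in UTF-8 / ASCII, 0xEF starts a UTF-8 byte order mark,
#   0xFE and 0xFF start UTF-16 byte order marks, 0x4C is "<" in EBCDIC.
xml_first_bytes = [0x3C, 0xEF, 0xFE, 0xFF, 0x4C]
```

None of the listed XML first bytes has 1 0 as its two highest-order bits, so the check separates them from a stream that begins with the Distinguishing Bits.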
| Editorial note | |
| The distinguishing bits are sufficient to distinguish EXI streams from a reasonably broad range of popular content types that occur on the web. However, the introduction of an optional, larger set of bits for distinguishing EXI streams from any content type is also under consideration. | |
[Definition:] The third part in the EXI header is the EXI Format Version, which identifies the version of the EXI format being used. EXI format version numbers are integers. Each version of the EXI Format Specification specifies the corresponding EXI format version number to be used by conforming implementations. The EXI format version number that corresponds with this version of the EXI format specification is 0 (zero).
The first bit of the version field indicates whether the version is a preview or final version of the EXI format. A value of 0 indicates this is a final version and a value of 1 indicates this is a preview version. Final versions correspond to final, approved versions of the EXI format specification. An EXI processor that implements a final version of the EXI format specification is REQUIRED to process EXI streams that have a version field with its first bit set to 0 followed by a version number that corresponds to the version of the EXI specification the processor implements. Preview versions of the EXI format are useful for gaining implementation and deployment experience prior to finalizing a particular version of the EXI format. While preview versions may match drafts of this specification, they are not governed by this specification and the behaviour of EXI processors encountering preview versions of the EXI format is implementation dependent. Implementers are free to coordinate to achieve interoperability between different preview versions of the EXI format.
Following the first bit of the version is a sequence of one or more 4-bit unsigned integers representing the version number. The version number is determined by summing this sequence of 4-bit unsigned values. The sequence is terminated by any 4-bit unsigned integer with a value in the range 0-14. As such, the first 15 version numbers are represented by 4 bits, the next 15 are represented by 8 bits, etc.
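Given an EXI stream with its stream cursor positioned just past the first bit of the EXI format version field, the EXI format version number can be computed as follows, with the version number initially set to 1: read the next 4 bits as an unsigned integer and add the value to the version number; if the value read is 15, repeat; otherwise (the value is in the range 0-14), the current value of the version number is the EXI format version number.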
The following are example EXI format version numbers.
| EXI Format Version Field | Description |
|---|---|
| 1 0000 | Preview version 1 |
| 0 0000 | Final version 1 |
| 0 1110 | Final version 15 |
| 0 1111 0000 | Final version 16 |
| 0 1111 0001 | Final version 17 |
EXI processors conforming with the final version of this specification MUST use the 5-bit value 0 0000 as the version number.
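The version field layout described in this section can be sketched as follows. This is an illustrative sketch, not a normative algorithm: the function names are hypothetical, and the field is represented as a string of "0" and "1" characters (spaces ignored) purely for readability.

```python
def decode_version_field(field):
    """Hypothetical helper. Returns (is_preview, version_number,
    bits_consumed) for a version field given as a '0'/'1' string."""
    bits = field.replace(" ", "")
    if not bits:
        raise ValueError("empty version field")
    is_preview = bits[0] == "1"  # first bit: 1 = preview, 0 = final
    version = 1                  # version number is initially 1
    pos = 1
    while True:
        chunk = bits[pos:pos + 4]
        if len(chunk) < 4:
            raise ValueError("truncated version field")
        value = int(chunk, 2)
        pos += 4
        version += value         # sum the 4-bit unsigned values
        if value != 15:          # any value in 0-14 ends the sequence
            return is_preview, version, pos


def encode_version_field(version, is_preview=False):
    """Hypothetical inverse of decode_version_field."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    remaining = version - 1
    groups = []
    while remaining >= 15:
        groups.append("1111")
        remaining -= 15
    groups.append(format(remaining, "04b"))
    return ("1" if is_preview else "0") + " " + " ".join(groups)
```

Applied to the example fields in the table above, the decoder yields version numbers 1, 1, 15, 16 and 17, with the first entry flagged as a preview version.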
[Definition:] The fourth part of the EXI header is the EXI Options, which provides a way to specify the options used to encode the body of the EXI stream. They are represented as an XML document encoded using the EXI format described in this specification. This results in a very compact header format that can be read and written with very little additional software.
The presence of EXI Options in its entirety is optional in the EXI header, and it is predicated on the value of the presence bit that follows the Distinguishing Bits. When EXI Options are present in the header, an EXI Processor MUST observe the specified options to process the EXI stream that follows. Otherwise, an EXI Processor may obtain the EXI options using another mechanism.
EXI processors MAY provide external means for applications or users to specify EXI Options when the EXI header is absent. Such EXI processors are typically used in controlled systems where the knowledge about the effective EXI Options is shared prior to the exchange of EXI documents. The mechanism used to communicate out-of-band EXI Options and their representation in such systems are implementation dependent.
The following table describes the EXI options specified in the options field.
| EXI Option | Description |
|---|---|
| compression | EXI compression is used to achieve better compactness |
| fragment | Body is encoded as an EXI fragment instead of an EXI document |
| preserve | Specifies whether comments, processing instructions, etc. are preserved |
| schemaID | Identifies the schema information, if any, used to encode the body |
| codecMap | Identifies pluggable CODECs used to encode the body |
| blockSize | Specifies the block size used for EXI compression |
| [user defined] | User defined headers may be added |
The compression option is a Boolean used to increase compactness using additional computational resources. When set to true, the event codes and associated content are compressed according to 8. EXI Compression. Otherwise, the event codes and associated content are represented as a sequence of bit-encoded values.
The fragment option is a Boolean that indicates whether the EXI body is an EXI document or an EXI fragment. When set to true, the EXI body is an EXI fragment. Otherwise, the EXI body is an EXI document. [Definition:] EXI fragments are analogous in concept to external general parsed entities in XML in that they consist of a sequence of elements, processing instructions and comments in containers of their own that are physically separate from the documents in which they are to be used. An EXI fragment is formally defined in terms of its grammar in Section 9.4.2 Built-in Fragment Grammar. The XML Information Set that an EXI stream is mapped onto contains a document information item if the stream represents an EXI document; if the stream represents an EXI fragment, it does not. The order among elements, processing instructions and comments that appear at the root in an EXI fragment is deemed significant and MUST be preserved by EXI processors.
The preserve option is a set of Booleans that can be set independently to control whether certain information items are preserved in the EXI stream. 6.3 Fidelity Options describes the set of information items affected by the preserve option.
The schemaID option may be used to identify the schema information used to encode the EXI body. When the schemaID is nil, no schema information was used to encode the EXI body. When the schemaID option is absent (i.e., undefined), no statement is made about the schema information used to encode the EXI body and it is assumed this information is communicated out of band.
The codecMap identifies pluggable CODECs used to encode the EXI body as described in 7.4 Pluggable CODECS.
The blockSize specifies the block size used for EXI compression. When the blockSize option is absent, the default blockSize of 1,000,000 is used. The default blockSize is intentionally large but can be reduced for processing large documents on devices with limited memory.
Appendix C XML Schema for EXI Options Header provides an XML Schema describing the EXI document format used to represent the options field. This EXI document format is designed to produce smaller headers for option combinations used when compactness is critical.
The options field is encoded as an EXI body using the default options specified by the following XML document:
<header xmlns="http://www.w3.org/2007/07/exi"> </header>
The rules for encoding a series of events as an EXI stream are very simple and are driven by a declarative set of grammars that describes the structure of an EXI stream. Every event in the stream is encoded using the same set of encoding rules, which are summarized as follows:
Namespace (NS) and attribute (AT) events are encoded in a specific order following the associated start element (SE) event. Namespace (NS) events are encoded first followed by the AT(xsi:type) event if present, followed by the AT(xsi:nil) event if present, followed by the rest of the attribute (AT) events. When schema-informed grammars are used for processing an element, attribute events occur in lexical order in the stream sorted first by qname localName then by qname uri. Otherwise, when built-in element grammars are used, attribute events can occur in any order. Namespace (NS) events can occur in any order regardless of the grammars used for processing the associated element.
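The ordering rule for schema-informed grammars can be sketched as a sort. The function name and the tuple shapes are hypothetical, and the sketch assumes that "lexical order" can be approximated by Python's default string comparison, which compares code points.

```python
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def order_events_after_start_element(namespaces, attributes):
    """Hypothetical helper.

    namespaces: list of (prefix, uri) pairs, kept in the given order.
    attributes: list of ((uri, local_name), value) pairs.
    Returns a list of ("NS", ...) and ("AT", ...) tuples in the order
    described for schema-informed grammars.
    """
    def rank(attribute):
        (uri, local_name), _value = attribute
        if (uri, local_name) == (XSI, "type"):
            return (0, "", "")          # AT(xsi:type) comes first
        if (uri, local_name) == (XSI, "nil"):
            return (1, "", "")          # then AT(xsi:nil)
        return (2, local_name, uri)     # then by localName, then uri

    ordered = sorted(attributes, key=rank)
    return [("NS", ns) for ns in namespaces] + [("AT", at) for at in ordered]


events = order_events_after_start_element(
    [("x", "urn:x")],
    [(("", "b"), "1"), ((XSI, "nil"), "false"), (("urn:x", "a"), "2"),
     (("", "a"), "3"), ((XSI, "type"), "x:T")],
)
```

With built-in element grammars the attribute sort would simply be skipped, since attribute events can then occur in any order.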
EXI uses the same simple procedure described above to encode well-formed documents, document fragments, schema-valid information items, schema-invalid information items, information items partially described by schemas and information items with no schema at all. Only the grammars that describe these items differ. For example, an element with no schema information is encoded according to the XML grammar defined by the XML specification, while an element with schema information is encoded according to the more specific grammar defined by that schema.
[Definition:] An event code is a sequence of 1 to 3 non-negative integers called parts. Each production in a grammar has an event code that distinguishes its event from that of other productions that share the same left-hand-side non-terminal symbol.
Section 6.1 Determining Event Codes describes in detail how the grammar is used to determine the event code of an event. Section 6.2 Representing Event Codes describes in detail how event codes are represented as bits. Section 6.3 Fidelity Options describes available fidelity options and how they affect the EXI stream. Section 7. Representing Event Content describes how the typed event contents are represented as bits.
The structure of an EXI stream is described by the EXI grammars, which are formally specified in section 9. EXI Grammars. Each grammar defines which events are permitted to occur at any given point in the EXI stream and provides a pre-assigned event code for each event.
For example, the grammar productions below describe the events that can occur in a schema-informed EXI stream after the Start-Document (SD) event provided there are four global elements defined in the schema and provide an event code for each event:
| Syntax | Event Code |
|---|---|
| DocContent | |
| SE ("A") DocEnd | 0 |
| SE ("B") DocEnd | 1 |
| SE ("C") DocEnd | 2 |
| SE ("D") DocEnd | 3 |
| SE (*) DocEnd | 4.0 |
| DT DocContent | 4.1 |
| CH DocContent | 4.2 |
| CM DocContent | 4.3.0 |
| PI DocContent | 4.3.1 |
At the point in an EXI stream where the above grammar productions are in effect, the event code of Start Element "A" (i.e. SE("A")) is 0. The event code of a DOCTYPE (DT) event at this point in the stream is 4.1, and so on.
Each event code is represented by a sequence of 1 to 3 parts that uniquely identify an event. Event code parts are encoded in order starting with the first part followed by subsequent parts.
When EXI compression is not in effect for the current processing of the stream, the i-th part of an event code is encoded using the minimum number of bits required to distinguish it from the i-th part of the other sibling event codes in the current grammar. Specifically, the i-th part of an event code is encoded as an n-bit unsigned integer (7.1.9 n-bit Unsigned Integer), where n is ⌈ log₂ m ⌉ and m is the number of distinct values used as the i-th part of its own and all its sibling event codes in the current grammar. Two event codes are siblings at the i-th part if and only if they share the same values in all preceding parts. All event codes are siblings at the first part.
When EXI compression is in effect, the i-th part of an event code is instead encoded using the minimum number of bytes required to distinguish it from the i-th part of the other sibling event codes in the current grammar. It is again encoded as an n-bit unsigned integer (7.1.9 n-bit Unsigned Integer), where n is ⌈ log₂ m ⌉ and m is the number of distinct values used as the i-th part of its own and all its sibling event codes in the current grammar. The number of bytes used by the n-bit unsigned integer representation in this case is ⌈ n / 8 ⌉.
Regardless of the EXI compression option, if there is only one distinct value for a given part, the part is omitted (i.e., encoded in log₂ 1 = 0 bits = 0 bytes).
For example, the nine event codes shown in the DocContent grammar above have first-part values ranging from 0 to 4, so five distinct values are needed to identify the first part. Therefore, when EXI compression is not in effect, the first part is encoded in ⌈ log₂ 5 ⌉ = 3 bits. In the same fashion, the second and third parts (if present) are encoded in ⌈ log₂ 4 ⌉ = 2 bits and ⌈ log₂ 2 ⌉ = 1 bit, respectively. On the other hand, when EXI compression is in effect, the number of bytes used is ⌈ 3 / 8 ⌉ = 1 byte for the first part, ⌈ 2 / 8 ⌉ = 1 byte for the second part and ⌈ 1 / 8 ⌉ = 1 byte for the third part.
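The calculation above can be reproduced with a short, non-normative Python sketch. The names `EVENT_CODES`, `part_widths`, `encode_bits` and `encoded_length_in_bytes` are illustrative only and are not defined by this specification; event codes are modelled as tuples of parts.

```python
# Event codes of the DocContent grammar above, written as tuples of parts.
EVENT_CODES = {
    'SE("A")': (0,), 'SE("B")': (1,), 'SE("C")': (2,), 'SE("D")': (3,),
    "SE(*)": (4, 0), "DT": (4, 1), "CH": (4, 2),
    "CM": (4, 3, 0), "PI": (4, 3, 1),
}
ALL_CODES = list(EVENT_CODES.values())


def part_widths(code, all_codes):
    """Return n = ceil(log2(m)) for each part of `code`."""
    widths = []
    for i in range(len(code)):
        # Siblings at part i share every preceding part with `code`.
        values = {c[i] for c in all_codes if len(c) > i and c[:i] == code[:i]}
        m = len(values)
        widths.append((m - 1).bit_length())  # integer form of ceil(log2(m))
    return widths


def encode_bits(code, all_codes):
    """Bit-packed form, used when EXI compression is not in effect."""
    widths = part_widths(code, all_codes)
    return " ".join(format(p, f"0{n}b") for p, n in zip(code, widths) if n > 0)


def encoded_length_in_bytes(code, all_codes):
    """Byte-aligned size, used when EXI compression is in effect."""
    return sum((n + 7) // 8 for n in part_widths(code, all_codes))
```

For the PI event, `encode_bits(EVENT_CODES["PI"], ALL_CODES)` yields `100 11 1` and `encoded_length_in_bytes` yields 3. A part with a single possible value gets width 0 and is therefore omitted from the output.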
The tables below illustrate how the event code of each event in the DocContent grammar above is encoded, first when EXI compression is not in effect and then when it is in effect.
| Event | Part 1 value | Part 2 value | Part 3 value | Event Code Encoding | # bits |
|---|---|---|---|---|---|
| SE ("A") | 0 | | | 000 | 3 |
| SE ("B") | 1 | | | 001 | 3 |
| SE ("C") | 2 | | | 010 | 3 |
| SE ("D") | 3 | | | 011 | 3 |
| SE (*) | 4 | 0 | | 100 00 | 5 |
| DT | 4 | 1 | | 100 01 | 5 |
| CH | 4 | 2 | | 100 10 | 5 |
| CM | 4 | 3 | 0 | 100 11 0 | 6 |
| PI | 4 | 3 | 1 | 100 11 1 | 6 |
| # distinct values (m) | 5 | 4 | 2 | | |
| # bits per part (⌈ log₂ m ⌉) | 3 | 2 | 1 | | |

| Event | Part 1 value | Part 2 value | Part 3 value | Event Code Encoding | # bytes |
|---|---|---|---|---|---|
| SE ("A") | 0 | | | 00000000 | 1 |
| SE ("B") | 1 | | | 00000001 | 1 |
| SE ("C") | 2 | | | 00000010 | 1 |
| SE ("D") | 3 | | | 00000011 | 1 |
| SE (*) | 4 | 0 | | 00000100 00000000 | 2 |
| DT | 4 | 1 | | 00000100 00000001 | 2 |
| CH | 4 | 2 | | 00000100 00000010 | 2 |
| CM | 4 | 3 | 0 | 00000100 00000011 00000000 | 3 |
| PI | 4 | 3 | 1 | 00000100 00000011 00000001 | 3 |
| # distinct values (m) | 5 | 4 | 2 | | |
| # bytes per part (⌈ n / 8 ⌉) | 1 | 1 | 1 | | |
Some XML applications do not require the entire XML feature set and would prefer to eliminate the overhead associated with unused features. For example, the SOAP 1.2 specification [SOAP 1.2] prohibits the use of XML processing-instructions. In addition, there are many data-exchange use cases that do not require XML comments or DTDs.
Applications can use a set of fidelity options to specify the XML features they require. As specified in section 9.3 Pruning Unneeded Productions, EXI processors MUST use these fidelity options to prune the events that are not required from the grammars, improving compactness and processing efficiency.
The table below lists the fidelity options supported by this version of the EXI specification and describes the effect setting these options has on the EXI stream.
| Fidelity option | Effect |
|---|---|
| Preserve.comments | CM events are preserved |
| Preserve.pis | PI events are preserved |
| Preserve.dtd | DOCTYPE and ER events are preserved |
| Preserve.prefixes | NS events and namespace prefixes are preserved |
| Preserve.lexicalValues | Lexical form of element and attribute values is preserved |
EXI processors may report an error if the application attempts to encode events that have been pruned from the grammar or may simply ignore these events.
The content of each event in an EXI body is represented according to its type (see Table 4-2). In the absence of external type information or when the preserve.lexicalValues option is set to true, all attribute and character values are typed as String.
[Definition:] EXI defines a minimal set of data types called Built-in EXI datatypes that define how values are represented in EXI streams as described in 7.1 Built-in EXI Datatypes Representation. The following table lists the built-in EXI datatypes, associated type identifiers and the XML Schema Language [XML Schema Datatypes] built-in types each is used to represent by default.
| Built-in EXI Datatype | EXI Datatype ID | XML Schema Types |
|---|---|---|
| Binary | xsd:base64Binary | base64Binary |
| Binary | xsd:hexBinary | hexBinary |
| Boolean | xsd:boolean | boolean |
| Date-Time | xsd:dateTime | dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth |
| Decimal | xsd:decimal | decimal |
| Float | xsd:double | float, double |
| Integer | xsd:integer | integer without minInclusive or minExclusive facets, or with minInclusive or minExclusive facet of negative value |
| Unsigned Integer | | nonNegativeInteger or integer with minInclusive or minExclusive facet value of 0 or above |
| String | xsd:string | string, anySimpleType, anyURI, duration, all types derived by union |
| List | | All types derived by list, including IDREFS and ENTITIES |
| n-bit Unsigned Integer | | |
| QName | | |
By default, types derived from the XML Schema types above are also represented by the associated built-in EXI datatype. When a type is derived, directly or indirectly, from more than one of the XML Schema types above, the closest ancestor is used to determine the built-in EXI datatype. For example, a value of XML Schema type xsd:int is represented by the same built-in type as XML Schema type xsd:integer. Although xsd:int is derived indirectly from xsd:integer and, further up, from xsd:decimal, a value of xsd:int is processed as an instance of xsd:integer because xsd:integer is closer to xsd:int than xsd:decimal is in the datatype inheritance hierarchy.
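The closest-ancestor rule can be illustrated by walking up the type hierarchy until a type listed in the table is reached. The following non-normative Python sketch covers only a fragment of the XML Schema built-in type hierarchy and ignores the minInclusive and minExclusive facets mentioned in the table; the names `BASE_TYPE`, `DEFAULT_DATATYPE` and `exi_datatype` are illustrative.

```python
# Fragment of the XML Schema built-in type hierarchy: type -> base type.
BASE_TYPE = {
    "short": "int",
    "int": "long",
    "long": "integer",
    "nonNegativeInteger": "integer",
    "integer": "decimal",
    "decimal": "anySimpleType",
    "token": "normalizedString",
    "normalizedString": "string",
    "string": "anySimpleType",
}

# Default associations taken from the table above (facets not considered).
DEFAULT_DATATYPE = {
    "integer": "Integer",
    "nonNegativeInteger": "Unsigned Integer",
    "decimal": "Decimal",
    "string": "String",
    "anySimpleType": "String",
}


def exi_datatype(schema_type):
    """Return the built-in EXI datatype of the closest listed ancestor."""
    current = schema_type
    while current not in DEFAULT_DATATYPE:
        # Raises KeyError for types outside this illustrative fragment.
        current = BASE_TYPE[current]
    return DEFAULT_DATATYPE[current]
```

With this fragment, `exi_datatype("int")` stops at xsd:integer and returns `"Integer"` without ever reaching xsd:decimal.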
Each EXI datatype identifier above is a QName. Datatype identifiers uniquely identify one of the built-in EXI datatypes. They are used by Pluggable CODECS to map XML Schema types to built-in EXI datatypes other than the ones that are associated with them by default. Not all built-in EXI datatypes are assigned datatype identifiers. Only those that have identifiers are usable by Pluggable CODECS for designating alternative representations.
The rules used to represent values of String depend on the content items to which the values belong. The value representation of certain content items involves the use of string tables, while other content items are represented using the encoding rule described in 7.1.10 String without involvement of string tables. The content items that use string tables, and how each of them uses string tables to represent its values, are described in 7.3 String Table.
Schemas can provide one or more enumerated values for types. When such pre-defined values are available, EXI exploits them to represent values of those types more efficiently than it otherwise would using built-in EXI datatypes. The encoding rule for representing a type of enumerated values is described in 7.2 Enumerations. Types that are derived from other types by union, and their subtypes, are always represented as String regardless of the availability of enumerated values. The representation of values whose schema type is QName, Notation or a type derived therefrom by restriction is also not affected by enumerated values, if any.
The following sections describe the encoding rules for representing built-in EXI datatypes.
Values typed as Binary are represented as a length-prefixed sequence of octets representing the binary content. The length is represented as an Unsigned Integer (see 7.1.6 Unsigned Integer).
When the EXI compression option is set to false, values typed as Boolean are represented using one bit, otherwise they are represented using one byte. The value zero (0) represents false and the value one (1) represents true.
Values typed as Decimal are represented as a Boolean sign (see 7.1.2 Boolean) followed by two Unsigned Integers (see 7.1.6 Unsigned Integer). A sign value of zero (0) is used to represent positive Decimal values and a sign value of one (1) is used to represent negative Decimal values. The first Unsigned Integer represents the integral portion of the Decimal value. The second Unsigned Integer represents the fractional portion of the Decimal value with the digits in reverse order to preserve leading zeros.
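As a non-normative illustration, the following Python sketch splits a decimal lexical value into the three components described above and reassembles it. The function names are illustrative; the sketch works on the component values only and does not produce the bit-level encoding of the Boolean and Unsigned Integers.

```python
def decimal_components(lexical):
    """Split a decimal lexical value into (sign, integral, reversed fraction)."""
    text = lexical.strip()
    sign = 1 if text.startswith("-") else 0
    text = text.lstrip("+-")
    integral, _, fraction = text.partition(".")
    # Reversing the fractional digits keeps leading zeros such as the 0 in ".034".
    return sign, int(integral or "0"), int(fraction[::-1] or "0")


def decimal_from_components(sign, integral, reversed_fraction):
    """Rebuild a decimal lexical value from its three components."""
    fraction = str(reversed_fraction)[::-1]
    return ("-" if sign else "") + f"{integral}.{fraction}"
```

For example, "-12.034" becomes sign 1, integral portion 12 and fractional portion 430, which reverses back to the digits "034" on decoding. Trailing zeros of the fractional part are not retained by this representation.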
Values typed as Float are represented as two consecutive Integers (see 7.1.5 Integer). The first Integer represents the mantissa of the floating point number and the second Integer represents the base-10 exponent of the floating point number. The range of the mantissa is -(2⁶³) to 2⁶³-1 and the range of the exponent is -(2¹⁴-1) to 2¹⁴-1. Values typed as Float with a mantissa or exponent outside the accepted range are represented as schema-invalid values.
The exponent value -(2¹⁴) is used to indicate one of the special values: infinity, negative infinity and not-a-number (NaN). An exponent value -(2¹⁴) with mantissa values 1 and -1 represents positive infinity (INF) and negative infinity (-INF) respectively. An exponent value -(2¹⁴) with any other mantissa value represents NaN.
A value represented as Float can be decoded by going through the following steps.
Note:
Support for IEEE float representation is currently under consideration. (See E.4 IEEE Floats)
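The interpretation of the mantissa and exponent described above can be sketched as follows. This is a non-normative Python illustration that assumes the two Integers have already been read from the stream; `float_from_parts` and `SPECIAL_EXPONENT` are illustrative names. The result is a Python float (an IEEE double), so exponents near the ends of the permitted range overflow to infinity or underflow to zero.

```python
import math

SPECIAL_EXPONENT = -(2 ** 14)  # marks INF, -INF and NaN


def float_from_parts(mantissa, exponent):
    """Interpret a (mantissa, base-10 exponent) pair as a floating point value."""
    if exponent == SPECIAL_EXPONENT:
        if mantissa == 1:
            return math.inf
        if mantissa == -1:
            return -math.inf
        return math.nan
    # Parsing the decimal text avoids rounding error from computing 10 ** exponent.
    return float(f"{mantissa}e{exponent}")
```

For instance, mantissa 125 with exponent -2 denotes 1.25.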
The Integer type supports signed integer numbers of arbitrary magnitude. Values typed as Integer are represented as a Boolean sign (see 7.1.2 Boolean) followed by an Unsigned Integer (see 7.1.6 Unsigned Integer). A sign value of zero (0) is used to represent positive integers and a sign value of one (1) is used to represent negative integers. For non-negative values, the Unsigned Integer holds the magnitude of the value. For negative values, the Unsigned Integer holds the magnitude of the value minus 1.
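The sign and magnitude mapping can be sketched as follows (non-normative Python; the function names are illustrative). Storing the magnitude minus 1 for negative values means that the pair (1, 0) denotes -1 rather than a second representation of zero.

```python
def integer_to_parts(value):
    """Map a signed integer to (sign, unsigned integer) as described above."""
    if value >= 0:
        return 0, value
    return 1, -value - 1  # magnitude minus 1


def integer_from_parts(sign, unsigned):
    """Inverse of integer_to_parts."""
    return -(unsigned + 1) if sign else unsigned
```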
The Unsigned Integer type supports unsigned integer numbers of arbitrary magnitude. Values typed as Unsigned Integer are represented using a sequence of octets. The sequence is terminated by an octet with its most significant bit set to 0. The value of the unsigned integer is stored in the least significant 7 bits of the octets as a sequence of 7-bit bytes, with the least significant byte first.
A value represented as Unsigned Integer can be decoded by going through the following steps with the initial value set to 0 and the initial multiplier set to 1.
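A non-normative Python sketch of the Unsigned Integer representation is shown below. It operates on whole octets held in a `bytes` object, follows the description above (7 value bits per octet, least significant group first, most significant bit used as a continuation flag) and starts decoding with a value of 0 and a multiplier of 1. The function names are illustrative.

```python
def encode_unsigned(value):
    """Encode a non-negative integer as a sequence of octets."""
    if value < 0:
        raise ValueError("Unsigned Integer cannot be negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more octets follow
        else:
            out.append(low7)         # most significant bit 0 terminates
            return bytes(out)


def decode_unsigned(octets):
    """Decode one Unsigned Integer from the start of `octets`."""
    result, multiplier = 0, 1
    for octet in octets:
        result += (octet & 0x7F) * multiplier
        multiplier *= 128
        if not octet & 0x80:
            return result
    raise ValueError("sequence ended before the terminating octet")
```

For example, 300 is encoded as the two octets 0xAC 0x02: the first carries the low seven bits (44) with the continuation bit set, and the second carries the value 2, giving 44 + 2 × 128 = 300.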
| Editorial note | |
| EXI also provides a modified representation for Integers that will not fit within a 64-bit integer to facilitate processing by devices that do not support big integers. This capability has not yet been specified. | |
Values of type QName are each represented by the following two Strings (see 7.1.10 String) that are encoded consecutively in the order they are numbered below.
If the QName is in no namespace, the uri is represented by a zero length String.
| Editorial note | |
| When the preserve.prefixes option is set to true, EXI also preserves the prefix of the QName. This capability has not yet been specified. | |
Values typed as Date-Time are encoded as a sequence of values representing the individual components of the Date-Time. The following table specifies each of the possible date-time components along with how they are encoded.
| Component | Value | Type |
|---|---|---|
| Year | Offset from 2000 | Integer ( 7.1.5 Integer) |
| MonthDay | Month * 31 + Day | 9-bit Unsigned Integer (7.1.9 n-bit Unsigned Integer) where day is a value in the range 0-30 and month is a value in the range 1-12. |
| Time | ((Hour * 60) + Minutes) * 60 + seconds | 17-bit Unsigned Integer (7.1.9 n-bit Unsigned Integer) |
| FractionalSecs | Fractional seconds | Unsigned Integer ( 7.1.6 Unsigned Integer) representing the fractional part of the seconds with digits in reverse order to preserve leading zeros |
| TimeZone | TZHours * 60 + TZMinutes | 11-bit Unsigned Integer (7.1.9 n-bit Unsigned Integer) representing a signed integer offset by 840 ( = 14 * 60 ) |
| presence | Boolean presence indicator | Boolean (7.1.2 Boolean) |
The variety of components that constitute a value and their appearance order depend on the XML Schema type associated with the value. The following table shows which components are included in a value of each XML Schema type that is relevant to Date-Time datatype. Items listed in square brackets are included if and only if the value of its preceding presence indicator (specified above) is set to true.
| XML Schema Type | Included Components |
|---|---|
| gYear | Year, presence, [TimeZone] |
| gYearMonth | Year, MonthDay, presence, [TimeZone] |
| date | Year, MonthDay, presence, [TimeZone] |
| dateTime | Year, MonthDay, Time, presence, [FractionalSecs], presence, [TimeZone] |
| gMonth | MonthDay, presence, [TimeZone] |
| gMonthDay | MonthDay, presence, [TimeZone] |
| gDay | MonthDay, presence, [TimeZone] |
| time | Time, presence, [FractionalSecs], presence, [TimeZone] |
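As a non-normative illustration, the following Python sketch computes the component values of a dateTime in the order given by the two tables above. The function name and parameters are illustrative. The `day_value` parameter is assumed to be already expressed in the 0-30 range stated in the component table, and the time zone is passed as a signed offset in minutes; the sketch yields component values only, not their bit-level encoding.

```python
def date_time_components(year, month, day_value, hour, minute, second,
                         fraction_digits="", tz_offset_minutes=None):
    """Return the ordered (component, value) pairs of a dateTime value."""
    parts = [
        ("Year", year - 2000),                         # offset from 2000
        ("MonthDay", month * 31 + day_value),          # fits in 9 bits
        ("Time", (hour * 60 + minute) * 60 + second),  # fits in 17 bits
    ]
    has_fraction = fraction_digits != ""
    parts.append(("presence", has_fraction))
    if has_fraction:
        # Digits are reversed to preserve leading zeros, e.g. "05" -> 50.
        parts.append(("FractionalSecs", int(fraction_digits[::-1])))
    has_timezone = tz_offset_minutes is not None
    parts.append(("presence", has_timezone))
    if has_timezone:
        # Signed offset shifted by 840 (= 14 * 60) so it is never negative.
        parts.append(("TimeZone", tz_offset_minutes + 840))
    return parts
```

With a time zone of -05:00 (an offset of -300 minutes), the TimeZone component is 540. The largest possible value, 840 + 840 = 1680, fits in the 11 bits allotted.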
When the EXI compression option is set to false, values of type n-bit Unsigned Integer are represented as an unsigned binary integer using n bits. Otherwise, they are represented as an unsigned integer using the minimum number of bytes required to store n bits. Bytes are ordered with the least significant byte first.
Values of type String are encoded as a length-prefixed sequence of characters. The length represents the number of characters in the string and is encoded as an Unsigned Integer (see 7.1.6 Unsigned Integer). Each character in the string is encoded as an Unsigned Integer (see 7.1.6 Unsigned Integer) representing its UCS code point.
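The String representation can be sketched as follows. This is a non-normative Python illustration that produces whole octets and does not involve the string table; `encode_unsigned` and `encode_string` are illustrative names.

```python
def encode_unsigned(value):
    """Unsigned Integer: 7 bits per octet, least significant group first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_string(text):
    """Length in characters, then one Unsigned Integer per code point."""
    encoded = bytearray(encode_unsigned(len(text)))
    for character in text:
        encoded += encode_unsigned(ord(character))
    return bytes(encoded)
```

Characters with code points below 128 occupy a single octet each, so "Hi" is encoded as 0x02 0x48 0x69. The character U+00E9 needs two octets (0xE9 0x01), yet still counts as one character in the length prefix.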
EXI uses a string table to represent certain content items more efficiently. Section 7.3 String Table describes the string table and how it is applied to different content items.
Values of type List are encoded as a length-prefixed sequence of values. The length is encoded as an Unsigned Integer (see 7.1.6 Unsigned Integer) and each value is encoded according to its type (see 7. Representing Event Content).
Values of enumerated types are encoded as n-bit Unsigned Integers (7.1.9 n-bit Unsigned Integer) where n = ⌈ log₂ m ⌉ and m is the number of items in the enumerated type. The value assigned to each item corresponds to its ordinal position in the enumeration in schema-order starting with position zero (0).
The exceptions are schema types derived from others by union and their subtypes, as well as QName, Notation and types derived therefrom by restriction. The values of such types are processed by their respective built-in EXI datatypes instead of being represented as enumerations.
EXI uses a string table to assign "compact identifiers" to some string values. Occurrences of string values found in the string table are represented using the associated compact identifier rather than encoding the entire "string literal". The string table is initially pre-populated with string values that are likely to occur in certain contexts and is dynamically expanded to include additional string values encountered in the document. The following content items are encoded using a string table:
The uris and local-names used in qname content items are also encoded using a string table. When a string value is found in the string table, the value is encoded using the compact identifier and no changes are made to the string table as a result. When a string value is not found in the string table, its string literal is encoded as a String without using a compact identifier, only after which the string table is augmented by including the string value with an assigned compact identifier.
The string table is divided into partitions and each partition is optimized for more frequent use of either compact identifiers or string literals depending on the purpose of the partition. Section 7.3.1 String Table Partitions describes how EXI string table is partitioned. Section 7.3.2 Partitions Optimized for Frequent use of Compact Identifiers describes how string values are encoded when the associated partition is optimized for more frequent use of compact identifiers. Section 7.3.3 Partitions Optimized for Frequent use of String Literals describes how string values are encoded when the associated partition is optimized for more frequent use of string literals.
The life cycle of a string table spans the processing of a single EXI stream. String tables are not represented in an EXI stream or exchanged between EXI processors. A string table cannot be reused across multiple EXI streams; therefore, EXI processors MUST use a string table that is equivalent to the one that would have been newly created and pre-populated with initial values for processing each EXI stream.
The string table is organized into partitions so that the indices assigned to compact identifiers can stay relatively small. A smaller number of indices improves both average compactness and the efficiency of table operations. Each partition has a separate set of compact identifiers and content items are assigned to specific partitions as described below.
Uri content items and the URI portion of qname content items are assigned to the uri partition. The uri partition is optimized for frequent use of compact identifiers and is pre-populated with initial entries as described in D.1 Initial Entries in Uri Partition. When a schema is provided, the uri partition is also pre-populated with the name of each namespace URI declared in the schema, appended in lexicographical order.
Prefix content items are assigned to partitions based on their associated namespace URI. Partitions containing prefix content items are optimized for frequent use of compact identifiers and the string table is pre-populated with entries as described in D.2 Initial Entries in Prefix Partitions.
Local-name content items and the local-name portion of qname content items are assigned to partitions based on the namespace URI of the NS event or qname content item of which the local-name is a part. Partitions containing local-name content items are optimized for frequent use of string literals and the string table is pre-populated with entries as described in D.3 Initial Entries in Local-Name Partitions. When a schema is provided, the string table is also pre-populated with the local name of each attribute, element and type declared in the schema, partitioned by namespace URI and sorted lexicographically.
Value content items are assigned to two partitions simultaneously: the global value partition and the "local" value partition that corresponds to the qname of the attribute or element in context. The assignment takes place when the string table is looked up and the string value is found in neither the global nor the "local" value partition. Partitions containing value content items are optimized for frequent use of string literals and are initially empty.
String table partitions that are expected to contain a relatively small number of entries used repeatedly throughout the document are optimized for the frequent use of compact identifiers. This includes the uri partition and all partitions containing prefix content items.
When a string value is found in a partition optimized for frequent use of compact identifiers, the string value is represented as the value (i+1) encoded as an n-bit Unsigned Integer (7.1.9 n-bit Unsigned Integer), where i is the value of the compact identifier, n is ⌈ log2 (m+1) ⌉ and m is the number of entries in the string table partition at the time of the operation.
When a string value is not found in a partition optimized for frequent use of compact identifiers, the String value is represented as zero (0) encoded as an n-bit Unsigned Integer, followed by the string literal encoded as a String (7.1.10 String). After encoding the String value, it is added to the string table partition and assigned the next available compact identifier m.
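The two cases above can be sketched in Python as follows. This is a non-normative illustration: the class name is illustrative, the initial entries used in the usage note are placeholders rather than the entries of Appendix D, and each result is returned as a tuple (value of the n-bit Unsigned Integer, n, string literal or None) instead of being written to a stream.

```python
class CompactIdPartition:
    """Partition optimized for frequent use of compact identifiers."""

    def __init__(self, initial_entries=()):
        self.entries = list(initial_entries)

    def encode(self, value):
        m = len(self.entries)
        n = m.bit_length()  # integer form of ceil(log2(m + 1))
        if value in self.entries:
            i = self.entries.index(value)  # compact identifier
            return (i + 1, n, None)
        # Miss: zero, then the literal; the value gets identifier m.
        self.entries.append(value)
        return (0, n, value)
```

Starting from a partition holding the two placeholder entries "urn:a" and "urn:b", encoding "urn:b" gives (2, 2, None). Encoding the unknown "urn:c" gives (0, 2, "urn:c") and adds it with compact identifier 2, so a second occurrence is encoded as (3, 2, None).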
The remaining string table partitions are optimized for the frequent use of string literals. This includes all string table partitions containing local-name content items and all string table partitions containing value content items.
When a string value is found in the partitions containing local-name content items, the string value is represented as zero (0) encoded as an Unsigned Integer (see 7.1.6 Unsigned Integer) followed by the compact identifier of the string value. The compact identifier of the string value is encoded as an n-bit unsigned integer (7.1.9 n-bit Unsigned Integer), where n is ⌈ log₂ m ⌉ and m is the number of entries in the string table partition at the time of the operation.
When a string value is not found in the partitions containing local-name content items, its string literal is encoded as a String (see 7.1.10 String) with the length of the string incremented by one. After encoding the string value, it is added to the string table partition and assigned the next available compact identifier m.
As described above, value content items are assigned to two partitions, a "local" value partition and the global value partition. When a string value is found in the "local" value partition, the string value is represented as zero (0) encoded as an Unsigned Integer (see 7.1.6 Unsigned Integer) followed by the compact identifier of the string value in the "local" value partition. When a string value is found in the global value partition, but not in the "local" value partition, the string value is represented as one (1) encoded as an Unsigned Integer (see 7.1.6 Unsigned Integer) followed by the compact identifier of the string value in the global value partition. The compact identifier is encoded as an n-bit unsigned integer (7.1.9 n-bit Unsigned Integer), where n is ⌈ log₂ m ⌉ and m is the number of entries in the associated partition at the time of the operation.
When a string value is not found in the global or "local" value partition, its string literal is encoded as a String (see 7.1.10 String) with the length incremented by two. After encoding the string value, it is added to both the associated "local" value string table partition and the global value string table partition.
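The lookup order for value content items can be sketched as follows. This non-normative Python illustration returns a description of what would be written rather than writing bits, and the class and tuple layouts are illustrative. It assumes that a value found only in the global partition is not added to the "local" partition, since the text above describes additions only for values found in neither partition.

```python
class ValuePartitions:
    """Global and per-qname "local" partitions for value content items."""

    def __init__(self):
        self.global_values = []
        self.local_values = {}  # qname -> list of string values

    def encode(self, qname, value):
        local = self.local_values.setdefault(qname, [])
        if value in local:
            n = (len(local) - 1).bit_length()  # ceil(log2(m))
            return ("local", 0, local.index(value), n)
        if value in self.global_values:
            n = (len(self.global_values) - 1).bit_length()
            return ("global", 1, self.global_values.index(value), n)
        # Miss: literal with its length incremented by two, then add to both.
        local.append(value)
        self.global_values.append(value)
        return ("literal", len(value) + 2, value)
```

Incrementing the literal length by two keeps the length values 0 and 1 free to signal a hit in the "local" and global partitions respectively.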
By default, each typed value in an EXI stream is represented by the associated built-in EXI datatype (see Table 7-1). However, [Definition:] EXI processors MAY provide the capability to specify different built-in types or user-defined encoder/decoders (CODECS) for representing specific schema types. This capability is called Pluggable CODECS.
EXI processors that support Pluggable CODECS MAY provide external means to define and install user-defined CODECS; the mechanisms for doing so are implementation dependent. EXI processors MAY also provide means for applications or users to specify alternate built-in types or user-defined CODECS for representing specific schema types; these mechanisms are likewise implementation dependent.
When an EXI processor encodes an EXI stream using Pluggable CODECS, it MUST specify in the EXI header each schema type that is not represented using the default built-in type and the alternate built-in type or user-defined CODEC used for each one unless the whole EXI Options part of the header is omitted. An EXI processor that attempts to decode an EXI stream that specifies a user-defined CODEC in the EXI header that it does not recognize MAY report a warning, but this is not an error. However, when an EXI processor encounters a typed value that was encoded by a user-defined CODEC that it does not support, it MUST report an error.
The EXI options header, when it appears in an EXI stream, MUST include a codecMap element for each schema type that is not represented using the default built-in type. The codecMap element includes two child elements. The QName of the first child element identifies the schema type that is not represented using the default built-in type and the QName of the second child element identifies the alternate built-in type or user-defined CODEC used to represent that type. Built-in types are identified by the type identifiers in Table 7-1.
For example, the following codecMap element indicates all values of type xsd:decimal in the EXI stream are represented using the built-in String type, which has the type ID xsd:string:
<codecMap xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:decimal/>
<xsd:string/>
</codecMap>
It is the responsibility of an EXI processor to interface properly with a particular implementation of built-in types or user-defined CODECs. In the example above, an EXI processor may need to provide a string value of the data being processed that is typed as xsd:decimal in order to interface with a built-in String type. In such a case, some EXI processors may have started with a decimal value and will need to translate the value into a string before passing the data to the built-in String type, while other EXI processors may already have a string value of the data and can pass it directly to the built-in String type without any translation.
As another example, the following codecMap element indicates all values of the user-defined type geo:geometricSurface are represented using the user-defined CODEC geo:geometricInterpolator:
When the value of the compression option is set to true, EXI compression is used to increase compactness utilizing additional computational resources. EXI compression combines knowledge of XML with a widely adopted, standard compression algorithm to achieve higher compression ratios than would be achievable by applying compression to the entire stream. Byte-aligned representations of event codes and content items are more amenable to compression algorithms compared to unaligned representations because most compression algorithms operate on series of bytes to identify redundancies in the octets. Therefore, when EXI compression is used, event codes and content items of EXI events are encoded as aligned bytes in accordance with 6.2 Representing Event Codes and 7. Representing Event Content.
EXI compression splits a sequence of EXI events into a number of contiguous blocks of events. Events that belong to the same block are transformed into lower entropy groups of similar values called channels, which are individually well suited for standard compression algorithms. To reduce compression overhead, smaller channels are combined before compressing them, while larger channels are compressed independently. The criteria EXI compression uses to define and combine channels are intentionally simple to facilitate implementation, reduce processing overhead, and avoid the need to encode channel ordering or grouping information in the format. The figure below presents a schematic view of the steps involved in EXI compression.

Figure 8-1. EXI Compression Overview
In the following sections, 8.1 Blocks defines blocks and explains how EXI events are partitioned into blocks. Section 8.2 Channels defines channels, their organization as well as how a group of channels correlate to its corresponding block of events. Section 8.3 Compressed Streams describes how some channels are combined as needed in preparation for applying compression algorithms on channels.
EXI compression partitions the sequence of EXI events into a sequence of one or more non-overlapping blocks. Each block preceding the final block contains the minimum set of consecutive events that have exactly blockSize Attribute (AT) and Character (CH) values, where blockSize is the block size of the EXI stream (see 5.3 EXI Options). The final block contains no more than blockSize Attribute (AT) and Character (CH) values.
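The partitioning rule can be sketched in Python as follows. This is a non-normative illustration in which each event is modelled as a tuple whose first item is the event type, and every AT and CH event is taken to carry exactly one value; `split_into_blocks` is an illustrative name.

```python
def split_into_blocks(events, block_size):
    """Partition events into blocks of block_size AT/CH values each."""
    blocks, current, values = [], [], 0
    for event in events:
        current.append(event)
        if event[0] in ("AT", "CH"):
            values += 1
            if values == block_size:
                # Minimum set of events holding exactly block_size values.
                blocks.append(current)
                current, values = [], 0
    if current or not blocks:
        blocks.append(current)  # final block, at most block_size values
    return blocks
```

A block closes immediately after its last counted value, so any events that follow that value (for example End Element events) fall into the next block.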
Events inside each block are multiplexed into channels. The first channel of each block is the structure channel described in Section 8.2.1 Structure Channel. The remaining channels in each block are value channels described in Section 8.2.2 Value Channels. The diagram below presents an exemplary view of the transformation in which events within a block are multiplexed into channels in one way and channels are demultiplexed into events in the other way.

Figure 8-2. Multiplexing EXI events into channels
The structure channel of each block defines the overall order and structure of the events in that block. It contains the event codes and associated content for each event in the block, except for Attribute (AT) and Character (CH) values, which are stored in the value channels. In addition, the values of two attribute events are stored in the structure channel instead of in value channels: the xsi:nil and xsi:type attributes that match a schema-informed grammar production. These attribute events are intrinsic to the grammar system and are therefore essential in processing the structure channel, because their values affect the grammar to be used for processing the rest of the elements on which they appear. All event codes and content in the structure channel occur in the same order as they occur in the EXI event sequence.
The values of the Attribute (AT) and Character (CH) events in each block are organized into separate channels based on the qname of the associated attribute or element. Specifically, the value of each Attribute (AT) event is placed in the channel identified by the qname of the Attribute and the value of each Character (CH) event is placed in the channel identified by the qname of its parent Start Element (SE) event. Each block contains exactly one channel for each distinct element or attribute qname that occurs in the block. The values in each channel occur in the order they occur in the EXI event sequence.
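Multiplexing a block into channels can be sketched as follows. This non-normative Python illustration models events as tuples, keeps qnames as plain strings and ignores the xsi:nil and xsi:type exceptions. Because a block may begin inside elements that were opened in an earlier block, the stack of open elements is passed in through the illustrative `open_elements` parameter.

```python
def multiplex(block, open_elements=None):
    """Split a block into a structure channel and per-qname value channels."""
    stack = list(open_elements or [])
    structure, channels = [], {}
    for event in block:
        kind = event[0]
        if kind == "SE":
            stack.append(event[1])
            structure.append(event)
        elif kind == "EE":
            stack.pop()
            structure.append(event)
        elif kind == "AT":
            _, qname, value = event
            structure.append(("AT", qname))  # value goes to the channel
            channels.setdefault(qname, []).append(value)
        elif kind == "CH":
            structure.append(("CH",))
            # CH values are keyed by the qname of the parent element.
            channels.setdefault(stack[-1], []).append(event[1])
        else:
            structure.append(event)
    return structure, channels
```

Python dictionaries preserve insertion order, so the keys of `channels` are already ordered by the first occurrence of a value in each channel.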
The channels in a block are further organized into compressed streams. Smaller channels are combined into the same compressed stream, while others are each compressed separately. Below are the rules applied within the scope of a block used to determine the channels to be combined together, the order of the compressed streams and the order amongst the channels that are combined into the same compressed stream.
If the block contains at most 100 values, the block will contain only 1 compressed stream containing the structure channel followed by all of the value channels. The order of the value channels within the compressed stream is defined by the order in which the first value in each channel occurs in the EXI event sequence.
If the block contains more than 100 values, the first compressed stream contains only the structure channel. The second compressed stream contains all value channels that contain fewer than 100 values. The remaining compressed streams each contain only one channel, each having more than 100 values. The order of the value channels within the second compressed stream is defined by the order in which the first value in each channel occurs in the EXI event sequence. Similarly, the order of the compressed streams following the second compressed stream in the block is defined by the order in which the first value of the channel inside each compressed stream occurs in the EXI event sequence.
Each compressed stream in a block is stored using the standard DEFLATE Compressed Data Format defined by RFC 1951 [IETF RFC 1951].
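The grouping rules and the use of DEFLATE can be sketched as follows. This non-normative Python illustration takes the value channels as a dictionary ordered by first occurrence and returns lists of channel names. Two points are assumptions of the sketch rather than statements of this specification: a channel with exactly 100 values is grouped with the smaller channels, and the second stream is left out when there are no small channels. The `deflate` helper uses the `zlib` module with a negative `wbits` value, which produces a raw RFC 1951 stream without a zlib header.

```python
import zlib


def group_streams(channels, threshold=100):
    """Return the channel names that make up each compressed stream."""
    total = sum(len(values) for values in channels.values())
    if total <= threshold:
        return [["structure"] + list(channels)]
    small = [q for q, v in channels.items() if len(v) <= threshold]  # assumption
    large = [q for q, v in channels.items() if len(v) > threshold]
    streams = [["structure"]]
    if small:  # assumption: an empty second stream is omitted
        streams.append(small)
    streams.extend([q] for q in large)
    return streams


def deflate(data):
    """Compress bytes as a raw DEFLATE stream (RFC 1951)."""
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()
```

A block whose channels hold 150, 3, 120 and 2 values is therefore split into four streams: the structure channel, the two small channels together, and one stream for each large channel.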
EXI is a knowledge-based encoding that uses a set of grammars to determine which events are most likely to occur at any given point in an EXI stream and encodes the most likely alternatives in fewer bits. It does this by mapping the stream of events to a lower entropy set of representative values and encoding those values using a set of simple variable length codes or an EXI compression algorithm.
The result is a very simple, small algorithm that uniformly handles schema-less encoding, schema-informed encoding, schema deviations, and any combination thereof in EXI streams. These variations do not require different algorithms or different parsers; they are simply informed by different combinations of grammars.
The following sections describe the grammars used to inform the EXI encoding.
Note:
The grammar semantics in this specification are written for clarity and generality. They do not prescribe a particular implementation approach.
Each grammar production has an event code, which is represented by a sequence of one to three parts separated by periods ("."). Each part is an unsigned integer. The following are examples of grammar productions with event codes as they appear in this specification.
| Productions | Event Codes | |||
|---|---|---|---|---|
| LeftHandSide 1 : | ||||
| Event 1 NonTerminal 1 | 0 | |||
| Event 2 NonTerminal 2 | 1 | |||
| Event 3 NonTerminal 3 | 2.0 | |||
| Event 4 NonTerminal 4 | 2.1 | |||
| Event 5 NonTerminal 5 | 2.2.0 | |||
| Event 6 NonTerminal 6 | 2.2.1 | |||
| LeftHandSide 2 : | ||||
| Event 1 NonTerminal 1 | 0 | |||
| Event 2 NonTerminal 2 | 1.0 | |||
| Event 3 NonTerminal 3 | 1.1 | |||
The number of parts in a given event code is called the event code's length. No two productions with the same non-terminal symbol on the left-hand-side are permitted to have the same event code.
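The notation and the uniqueness rule above can be illustrated with a small sketch. The function names `parse_event_code` and `check_unique` and the triple layout of a production are hypothetical choices made for this example only.

```python
def parse_event_code(text):
    """Parse an event code such as "2.2.0" into a tuple of unsigned integers."""
    parts = tuple(int(p) for p in text.split("."))
    if not 1 <= len(parts) <= 3 or any(p < 0 for p in parts):
        raise ValueError(f"not a valid event code: {text!r}")
    return parts


def check_unique(productions):
    """Reject two productions that share a left-hand side and an event code.

    productions: list of (left_hand_side, right_hand_side, event_code_text).
    """
    seen = set()
    for lhs, _rhs, code in productions:
        key = (lhs, parse_event_code(code))
        if key in seen:
            raise ValueError(f"duplicate event code {code} for {lhs}")
        seen.add(key)


# The productions of "LeftHandSide 1" from the example table above.
left_hand_side_1 = [
    ("LeftHandSide 1", "Event 1 NonTerminal 1", "0"),
    ("LeftHandSide 1", "Event 2 NonTerminal 2", "1"),
    ("LeftHandSide 1", "Event 3 NonTerminal 3", "2.0"),
    ("LeftHandSide 1", "Event 4 NonTerminal 4", "2.1"),
    ("LeftHandSide 1", "Event 5 NonTerminal 5", "2.2.0"),
    ("LeftHandSide 1", "Event 6 NonTerminal 6", "2.2.1"),
]
check_unique(left_hand_side_1)

# The event code's length is the number of its parts.
length = len(parse_event_code("2.2.0"))
```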
Some non-terminal symbols are used on the right-hand-side of a production without an event prefixed to them. Such non-terminal symbols are macros. A macro captures a recurring set of productions under a single symbol, so that the symbol can be used in the grammar representation instead of repeating all the productions it represents every time they are needed.
| ABigProduction 1 : | |||
| Event 1 NonTerminal 1 | 0 | ||
| Event 2 NonTerminal 2 | 1 | ||
| LEFTHANDSIDE 1 (2.0) | 2.0 | ||
| ABigProduction 2 : | |||
| Event 1 NonTerminal 1 | 0 | ||
| LEFTHANDSIDE 1 (1.1) | 1.1 | ||
| Event 2 NonTerminal 2 | 1.2 | ||
Because non-terminal macros are injected into the right-hand-side of more than one production, the event codes of productions with these macro non-terminals on the left-hand-side are not fixed, but will have different event code values depending on the context in which the macro non-terminal appears. This specification calls these variable event codes and uses variables in place of individual event code parts to indicate the event code parts are determined by the context. Below are some examples of variable event codes:
| LEFTHANDSIDE 1 (n.m) : | ||||
| EVENT 1 NONTERMINAL 1 | n.0 | |||
| EVENT 2 NONTERMINAL 2 | n.1 | |||
| EVENT 3 NONTERMINAL 3 | n. m+2 | |||
| EVENT 4 NONTERMINAL 4 | n. m+3 | |||
| EVENT 5 NONTERMINAL 5 | n. m+4.0 | |||
| EVENT 6 NONTERMINAL 6 | n. m+4.1 | |||
Unless otherwise specified, the variable n evaluates to the event code of the production in which the macro non-terminal LEFTHANDSIDE 1 appears on the right-hand-side. Similarly, the expression n. m represents the first two parts of the event code of the production in which the macro non-terminal LEFTHANDSIDE 1 appears on the right-hand-side.
Non-terminal macros are used in this specification for notational convenience only. They are not non-terminals, even though they are used in place of non-terminals. Productions that use non-terminal macros on the right-hand-side need to be expanded by macro substitution before such productions are interpreted. Therefore, ABigProduction 1 and ABigProduction 2 shown in the preceding example are equivalent to the following set of productions derived by expanding the non-terminal macro symbol LEFTHANDSIDE 1 and evaluating the variable event codes.
| ABigProduction 1 : | ||||
| Event 1 NonTerminal 1 | 0 | |||
| Event 2 NonTerminal 2 | 1 | |||
| EVENT 1 NONTERMINAL 1 | 2.0 | |||
| EVENT 2 NONTERMINAL 2 | 2.1 | |||
| EVENT 3 NONTERMINAL 3 | 2.2 | |||
| EVENT 4 NONTERMINAL 4 | 2.3 | |||
| EVENT 5 NONTERMINAL 5 | 2.4.0 | |||
| EVENT 6 NONTERMINAL 6 | 2.4.1 | |||
| ABigProduction 2 : | ||||
| Event 1 NonTerminal 1 | 0 | |||
| EVENT 1 NONTERMINAL 1 | 1.0 | |||
| EVENT 2 NONTERMINAL 2 | 1.1 | |||
| Event 2 NonTerminal 2 | 1.2 | |||
| EVENT 3 NONTERMINAL 3 | 1.3 | |||
| EVENT 4 NONTERMINAL 4 | 1.4 | |||
| EVENT 5 NONTERMINAL 5 | 1.5.0 | |||
| EVENT 6 NONTERMINAL 6 | 1.5.1 | |||
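The macro substitution shown in the preceding tables can be reproduced with a short sketch. The representation of the macro as a list of templates and the name `expand` are hypothetical; only the variable event codes themselves come from the example above.

```python
# Each template pairs a right-hand side with a function that evaluates its
# variable event code for given values of n and m.
LEFTHANDSIDE_1 = [
    ("EVENT 1 NONTERMINAL 1", lambda n, m: (n, 0)),
    ("EVENT 2 NONTERMINAL 2", lambda n, m: (n, 1)),
    ("EVENT 3 NONTERMINAL 3", lambda n, m: (n, m + 2)),
    ("EVENT 4 NONTERMINAL 4", lambda n, m: (n, m + 3)),
    ("EVENT 5 NONTERMINAL 5", lambda n, m: (n, m + 4, 0)),
    ("EVENT 6 NONTERMINAL 6", lambda n, m: (n, m + 4, 1)),
]


def expand(macro, n, m):
    """Evaluate every variable event code of a macro in the context n.m."""
    return [
        (rhs, ".".join(str(part) for part in code(n, m)))
        for rhs, code in macro
    ]


# LEFTHANDSIDE 1 (2.0) as used in ABigProduction 1.
in_production_1 = expand(LEFTHANDSIDE_1, 2, 0)
# LEFTHANDSIDE 1 (1.1) as used in ABigProduction 2.
in_production_2 = expand(LEFTHANDSIDE_1, 1, 1)

for rhs, code in in_production_1 + in_production_2:
    print(rhs, code)
```

With n.m = 1.1 the expanded codes skip 1.2, which is the code already held by Event 2 NonTerminal 2 in ABigProduction 2.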
Each production rule in the EXI grammar includes an event code value that approximates the likelihood that the associated production rule will be matched over the other productions with the same left-hand-side non-terminal symbol. Ultimately, the event codes determine the value(s) by which each non-terminal symbol will be represented in the EXI stream.
To understand how a given event code approximates the likelihood that a given production will be matched, it is useful to visualize as a tree the event codes for a set of production rules that have the same non-terminal symbol on the left-hand-side. For example, the following set of productions:
| ElementContent : | ||||
| EE | 0 | |||
| SE (*) ElementContent | 1.0 | |||
| CH ElementContent | 1.1 | |||
| ER ElementContent | 1.2 | |||
| CM ElementContent | 1.3.0 | |||
| PI ElementContent | 1.3.1 | |||
represents a set of information items that might occur as element content after the start tag. Using the production event codes, we can visualize this set of productions as follows:

Figure 9-1. Event code tree for ElementContent grammar
where the non-terminal symbols are represented by the leaf nodes of the tree and the event code of each production rule that contains a non-terminal symbol defines a path from the root of the tree to the node associated with that symbol. We call this the event code tree for a given set of productions.
An event code tree is similar to a Huffman tree [Huffman Coding] in that shorter paths are generally used for symbols that are considered more likely. However, event code trees are far simpler and less costly to compute and maintain. Event code trees are shallow and contain at most three levels. In addition, the length of each event code in the event code tree is assigned statically without analyzing the data. This classification provides some of the benefits of a Huffman tree without the cost.
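As an illustration, an event code tree for the ElementContent productions can be built directly from the event codes. This is a minimal sketch: the nested-dictionary representation and the names `build_tree` and `depth` are choices made for this example, and each leaf is labeled with the event of its production.

```python
ELEMENT_CONTENT = [
    ("EE", "0"),
    ("SE (*)", "1.0"),
    ("CH", "1.1"),
    ("ER", "1.2"),
    ("CM", "1.3.0"),
    ("PI", "1.3.1"),
]


def build_tree(productions):
    """Build a nested dict in which each event code is a root-to-leaf path."""
    root = {}
    for event, code in productions:
        parts = [int(p) for p in code.split(".")]
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating inner nodes
        node[parts[-1]] = event  # the last part selects the leaf
    return root


def depth(tree):
    """Number of levels below the root (a leaf has depth 0)."""
    if not isinstance(tree, dict):
        return 0
    return 1 + max(depth(child) for child in tree.values())


tree = build_tree(ELEMENT_CONTENT)
levels = depth(tree)
```

Here `tree` places EE directly under the root, SE (*), CH and ER one level down, and CM and PI on the third level, so `levels` is 3, the maximum the text allows.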
As discussed in section 6.3 Fidelity Options, applications MAY provide a set of fidelity options to specify the XML features they require. EXI processors MUST use these fidelity options to prune the events that are not required from the grammars, improving compactness and processing efficiency.
For example, the following set of productions represents the set of information items that might occur as element content after the start tag.
| ElementContent : | |||
| EE | 0 | ||
| SE (*) ElementContent | 1.0 | ||
| CH ElementContent | 1.1 | ||
| ER ElementContent | 1.2 | ||
| CM ElementContent | 1.3.0 | ||
| PI ElementContent | 1.3.1 | ||
If an application sets the fidelity options preserve.comments, preserve.pis and preserve.dtd to false, the productions matching comment (CM), processing instruction (PI) and entity reference (ER) events are pruned from the grammar, producing the following set of productions:
| ElementContent : | ||||
| EE | 0 | |||
| SE (*) ElementContent | 1.0 | |||
| CH ElementContent | 1.1 | |||
Removing these productions from the grammar tells EXI processors that comments, processing instructions and entity references will never occur in the EXI stream, which reduces the entropy of the stream, allowing it to be encoded in fewer bits.
Each time a production is removed from a grammar, the event codes of the other productions with the same non-terminal symbol on the left-hand-side MUST be adjusted to keep them contiguous if its removal has left the remaining productions with non-contiguous event codes.
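Pruning followed by renumbering can be sketched as below. This is one possible reading of the rule, not a prescribed procedure. The option-to-event mapping follows the ElementContent example above, while the names `renumber` and `prune` are hypothetical, an option missing from the `options` dictionary is assumed to be true, and a group left with a single production keeps its extra part because the text does not say whether such a group collapses.

```python
ELEMENT_CONTENT = [
    ("EE", (0,)),
    ("SE (*)", (1, 0)),
    ("CH", (1, 1)),
    ("ER", (1, 2)),
    ("CM", (1, 3, 0)),
    ("PI", (1, 3, 1)),
]

# Fidelity option -> event pruned when the option is false (from the example).
FIDELITY_EVENTS = {
    "preserve.comments": "CM",
    "preserve.pis": "PI",
    "preserve.dtd": "ER",
}


def renumber(codes):
    """Map each event code to a code whose parts are contiguous from 0."""
    firsts = sorted({code[0] for code in codes})
    remap = {old: new for new, old in enumerate(firsts)}
    result = {}
    for old in firsts:
        group = [code for code in codes if code[0] == old]
        tails = [code[1:] for code in group if code[1:]]
        sub = renumber(tails)  # renumber the next level independently
        for code in group:
            tail = code[1:]
            result[code] = (remap[old],) + (sub[tail] if tail else ())
    return result


def prune(productions, options):
    """Drop productions for disabled options, then close gaps in the codes."""
    dropped = {
        event
        for option, event in FIDELITY_EVENTS.items()
        if not options.get(option, True)  # assumption: missing means true
    }
    kept = [(event, code) for event, code in productions if event not in dropped]
    mapping = renumber([code for _, code in kept])
    return [(event, mapping[code]) for event, code in kept]


all_off = prune(
    ELEMENT_CONTENT,
    {"preserve.comments": False, "preserve.pis": False, "preserve.dtd": False},
)
no_dtd = prune(ELEMENT_CONTENT, {"preserve.dtd": False})
```

With all three options false, `all_off` matches the pruned table above. With only `preserve.dtd` false, removing ER leaves a gap at 1.2, so CM and PI move from 1.3.0 and 1.3.1 to 1.2.0 and 1.2.1.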
This section describes the built-in XML grammar used by EXI when no additional information is available to describe the contents of the EXI stream. The built-in XML grammar is used when no schema exists, for elements with unrestricted types (e.g., xsd:anyType) and for schema extensions and deviations that are not declared by the schema.
A built-in XML grammar is self-evolving. While an EXI stream is being processed, the built-in grammar continuously incorporates the knowledge learned from that stream, refining itself for subsequent uses of the grammar within the processing of that single stream.
In the absence of additional information about the content of the EXI stream, the following grammar describes the events that will occur in an EXI document.
| Syntax | Event Code | ||
|---|---|---|---|
| Document : | |||
| SD DocContent | 0 | ||
| DocContent : | |||
| SE (*) DocEnd | 0 | ||
| DT DocContent | 1.0 | ||
| CM DocContent | 1.1.0 | ||
| PI DocContent | 1.1.1 | ||
| DocEnd : | |||
| ED | 0 | ||
| CM DocEnd | 1.0 | ||
| PI DocEnd | 1.1 | ||
| Semantics: |
|---|
All productions in the built-in Document grammars of the form LeftHandSide : SE (*) RightHandSide are evaluated as follows:
In the absence of additional information about the contents of an EXI stream, the following grammar describes the events that will occur in an EXI fragment. The grammar shown below is the initial set of productions that belong to a built-in fragment grammar at the start of stream processing. It is supplemented by a semantic description that explains the rules used to evolve the built-in fragment grammar, so that it continuously improves and is better prepared for subsequent uses of the same grammar during the rest of the processing of the stream.
| Syntax | Event Code | ||
|---|---|---|---|
| Fragment : | |||
| SD FragmentContent | 0 | ||
| FragmentContent : | |||
| SE (*) FragmentContent | 0 | ||
| ED | 1 | ||
| CM FragmentContent | 2.0 | ||
| PI FragmentContent | 2.1 | ||
| Semantics: |
|---|
All productions in the built-in Fragment grammars of the form LeftHandSide : SE (*) RightHandSide are evaluated as follows:
All productions of the form LeftHandSide : SE (qname) RightHandSide that were previously added to the grammar upon the first occurrence of the element that has the qualified name qname are evaluated as follows when they are matched:
[Definition:] EXI defines a built-in element grammar that is used in the absence of additional information about the contents of an EXI element prior to its processing. The built-in element grammar shown below is prescribed by EXI to reflect, in general, the events that will occur in an element and the order among them, without any further constraint on what is likely or unlikely to occur inside elements.
A single instance of the built-in element grammar is shared by those elements in a stream that have the same qualified name and do not have additional a priori constraints as to their content. A separate instance of the built-in element grammar is assigned to each qualified name upon the first occurrence of an element with that qualified name. Thereafter, the grammar continuously evolves by reflecting the knowledge learned while processing the content of those elements. The grammar shown below is the initial set of productions that belong to a built-in element grammar at the time a new instance is created. It is supplemented by a semantic description that explains the rules the grammar applies to itself in order to evolve and be better prepared for subsequent uses of the same grammar instance during the rest of the processing of the stream.
| Syntax | Event Code | ||
|---|---|---|---|