W3C

Efficient XML Interchange Evaluation

W3C Working Draft 28 July 2008

This version:
http://www.w3.org/TR/2008/WD-exi-evaluation-20080728
Latest version:
http://www.w3.org/TR/exi-evaluation
Editor:
Carine Bournez, W3C

Abstract

This Working Draft is an evaluation of the Efficient XML Interchange (EXI) Format 1.0 with reference to the Properties identified by the XML Binary Characterization (XBC) Working Group, relative to XML, gzipped XML and ASN.1 PER. It is conducted using the XBC measurement methodology. For the "compactness" and "processing efficiency" Properties, performance is measured with the EXI measurement framework, over the test data collected for the EXI measurements, which represent the XBC Use Cases.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This First Public Working Draft presents an initial evaluation of the EXI Format 1.0 conducted by the EXI Working Group as a per-charter requirement for the publication of the EXI Format 1.0 specification as a Last Call Working Draft. This draft includes results for all properties except "processing efficiency". The formats and alternative technologies to which we wish to compare results include at least ASN.1 PER and gzipped XML.

This document was developed by the Efficient XML Interchange (EXI) Working Group. It is intended to be completed and published as a Working Group Note.

Comments on this document are invited and are to be sent to the public public-exi@w3.org mailing list (public archive). If substantive comments are received, the Working Group may revise this Working Group Note.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.


1. Objectives

This document presents the anticipated benefits of the EXI format 1.0 compared to XML and gzipped XML. Additionally, tests for compactness include comparison to ASN.1 PER. The points of comparison are the requirements set by the EXI Working Group charter, based on the results of the XML Binary Characterization Working Group.

This summarized evaluation of the EXI format uses the testing framework built during the first phase of the EXI Working Group's work so as to select a baseline candidate technology. Although this evaluation aims at demonstrating EXI benefits in the targeted XBC Use Cases, it can be read as a summary of the EXI measurements Note.

2. Background

The methodology used in the evaluation relies on previous work on measurements. The Properties referred to in this document have been defined by the XBC Working Group. The methodology for measurement is detailed in the XBC measurement methodology document. For convenience, Appendix A gives an overview of the properties definitions, as well as some details of their measurements.

In addition, two Properties require an implementation to be evaluated: Compactness and Processing Efficiency. These Properties have been tested using the EXI measurement framework and the associated methodology.

3. Evaluation Results

At the time of the first publication of this document, the Working Group has not tested conformance of implementations. The methodology and framework designed and implemented on Japex by the Working Group are used for the properties that require implementation testing. The other properties can be asserted by checking the specification only.

3.1. Compactness results

This test has been run over the same set of documents as the EXI Working Group's framework test data. The following graphs show the resulting size as a percentage of the original XML document size, sorted by the EXI result, for the sake of legibility (i.e. "best" results on the left). The implementation of EXI used for the measurements is Efficient XML 3.0. It implements the specification of the EXI format 1.0 at the time of writing.

[Figure: compactness of EXI compared with gzipped XML]
[Figure: compactness of EXI compared with ASN.1 PER]

For each test case, the testing framework uses the most appropriate application class: whenever a schema is available, EXI is used in schema-informed mode, and when a document-analysis-based technique leads to a better result, the compression option is turned on.

The first graph compares EXI to gzipped XML. As shown by the graph, EXI is consistently smaller than gzipped XML regardless of document size, document structure or the availability of schema information. In some cases, EXI is over 10 times smaller than gzipped XML. In addition, EXI works well in cases where gzip has little effect or even makes documents bigger, such as the high-volume streams of small messages typical of geolocation, financial exchange and sensor applications.
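The effect of message size on gzip can be checked with a few lines of Python. The following is a minimal sketch, not part of the EXI measurement framework: the sample documents and the function name size_percent are invented for illustration. It reports the compressed size as a percentage of the original XML size, the same metric used in the graphs.

```python
import gzip


def size_percent(xml_bytes: bytes) -> float:
    """Return the gzipped size as a percentage of the original XML size."""
    return 100.0 * len(gzip.compress(xml_bytes)) / len(xml_bytes)


# Hypothetical sample data: one tiny sensor message and one large,
# highly repetitive document built from the same message.
small_message = b'<t v="21.5"/>'
large_document = b"<log>" + small_message * 1000 + b"</log>"

results = {
    "small message": size_percent(small_message),
    "large document": size_percent(large_document),
}

# Sort by resulting percentage so the best result comes first, as in the graphs.
for name, percent in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {percent:.1f}% of original size")
```

The fixed gzip header and trailer alone are larger than the small message, so its result exceeds 100%, while the repetitive document compresses to a small fraction of its size.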

In the second graph, the same EXI numbers are compared to the ASN.1 PER file sizes. Each EXI-encoded file is smaller than the equivalent ASN.1 PER encoding, and sometimes 20 times smaller. In addition, EXI works well in cases where ASN.1 PER actually increases the size of the document or fails to produce an encoding at all (e.g., due to schema deviations).

3.2. Processing efficiency results

This will be addressed during the Candidate Recommendation phase of the development of the format. The next draft of this document will include processing efficiency results.

3.3. Summary

Property | XML (+gzip) | EXI

MUST support

Directly Readable and Writable | No. The XML format itself satisfies this property, but gzip compression applied to a file format requires creating the intermediate form (XML) first. | Yes. Implementations can read and write EXI streams directly via standard XML APIs, such as DOM, SAX and StAX. At least one current implementation also supports typed APIs for increased performance.

Transport Independence | Yes | Yes. EXI can be used over TCP, UDP, HTTP and various wireless and satellite transports.

Compactness | No. XML and gzip cannot take advantage of schema information, so this format fails in the Schema and Both classes (it does not achieve the compactness typically required by applications that use binary data formats such as ASN.1, CORBA and XDR). By definition, it succeeds in the Document class and fails in the Neither class (because of the different requirements in the Neither class, it would have to be smaller than itself). | Yes. See compactness results (Section 3.1).

Human Language Neutral | Yes | Yes. EXI supports all standard character set encodings.

Platform Neutrality | Yes | Yes. The property cannot be fully evaluated at this time. However, the EXI format specification does not make any particular assumption about the platform architecture. An implementation already exists for several popular server, desktop and mobile platforms, including Java EE/SE, Microsoft .NET, Java Mobile Edition and .NET Compact Framework.

Integratable into XML Stack | Yes | Yes. EXI was designed to integrate well into the XML stack, neither duplicating nor requiring changes to functionality at other layers in the XML stack. It builds on the XML Infoset data model. It implements the same character encodings as text XML and supports the same common interfaces as existing XML parsers and serializers. As such, it can be inserted into existing XML applications with minimal time and cost.

Royalty Free | Yes | Yes. Per the W3C Patent Policy.

Fragmentable | Yes | Yes. EXI can represent any collection of XML fragments extracted from any collection of XML documents. All schema optimization, bit-packing and XML compression algorithms apply equally to fragments.

Streamable | Yes | Yes

Roundtrip Support | Yes. The equivalence is exact in both cases. | Yes. EXI supports lossless equivalence for PSVI, Infoset and lexical applications, such as XML Digital Signatures. The EXI "preserve" option can be used when this property is needed.

Generality | No. XML scores 8/20, gzipped XML 10/20 (see Appendix B). | Yes. EXI scores 19/20 (see Appendix B).

Schema Extensions and Deviations | Yes | Yes. EXI includes schema optimizations that support arbitrary schema extensions and deviations. Applications may specify strict or extensible schema handling and may provide a full schema, partial schema or no schema at all.

Format Version Identifier | Yes. Both XML and gzip include an identifier in the header. | Yes. The EXI header includes the format version.

Content Type Management | Yes | Yes. EXI can be used in various contexts, some of which use a media type and some of which use content encoding, or both.

Self-Contained | Yes | Yes. When schema optimizations are not used, EXI documents are always self-contained.

MUST NOT Prevent

Processing Efficiency | Prevents. See [EXI Measurements]. | Does Not Prevent. Current implementations achieve performance several times faster than XML using both in-memory tests and more realistic scenarios that involve file and network IO. These implementations do not depend on compile-time schema-binding techniques that make dynamically acquiring, loading or updating schemas impractical or impossible.

Small Footprint | Does Not Prevent | Does Not Prevent. TBD in CR phase: check implementations for a variety of small, mobile devices.

Widespread Adoption | Does Not Prevent. Both XML and gzip have been widely adopted and included in many protocol standards. | Does Not Prevent

Space Efficiency | Prevents | Does Not Prevent. TBD in CR phase: check implementations for small, mobile devices.

Implementation Cost | Does Not Prevent | Does Not Prevent. TBD in CR phase.

Forward Compatibility | Does Not Prevent | Does Not Prevent

4. Discussion

5. References

[EXI Measurements]
EXI Measurements, Greg White, Jaakko Kangasharju, Don Brutzman, Stephen Williams editors, World Wide Web Consortium, 25 July 2007. http://www.w3.org/TR/exi-measurements/.
[XBC Use Cases]
XML Binary Characterization Use Cases, Mike Cokus, Santiago Pericas-Geertsen editors, World Wide Web Consortium, 31 March 2005. http://www.w3.org/TR/xbc-use-cases/.
[XBC Properties]
XML Binary Characterization Properties, Oliver Goldman, Dmitry Lenkov editors, World Wide Web Consortium, 31 March 2005. http://www.w3.org/TR/xbc-properties/.
[XBC Measurements]
XML Binary Characterization Measurement Methodologies, Stephen D. Williams, Peter Haggar editors, World Wide Web Consortium, 31 March 2005. http://www.w3.org/TR/xbc-measurement/.
[XBC Characterization]
XML Binary Characterization, Oliver Goldman, Dmitry Lenkov editors, World Wide Web Consortium, 31 March 2005. http://www.w3.org/TR/xbc-characterization/.
[PER]
Information Technology - ASN.1 Encoding Rules: Specification of Packed Encoding Rules (PER) [The ASN.1 PER Standard (ITU-T Rec X.691 | ISO/IEC 8825-2)], International Telecommunication Union (ITU), July 2002. http://www.itu.int/ITU-T/studygroups/com17/languages/X.691-0207.pdf.
[XML 1.0]
Extensible Markup Language (XML) 1.0, Tim Bray et al editors, World Wide Web Consortium, 4 February 2004 (Third Ed). Latest version http://www.w3.org/TR/REC-xml/.
[Efficient XML]
AgileDelta's Efficient XML 3.0, accessed July 2008.
[JAPEX]
Japex Manual, Santiago Pericas-Geertsen, java.net, April 2006. https://japex.dev.java.net/docs/manual.html.

Appendix A. Properties definitions

A.1. Directly Readable and Writable

A format is directly readable and writable if it can be serialized from an instance of a data model and parsed into an instance of a data model without first being transformed to an intermediate representation. The retained data model for EXI is the XML Infoset.
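The role of the intermediate representation can be illustrated with gzipped XML, which the summary table rates as not directly readable and writable. The sketch below is only an illustration using the Python standard library; the sample element is hypothetical. Both directions pass through XML text before reaching the data model.

```python
import gzip
import xml.etree.ElementTree as ET

# Data model instance (a hypothetical sensor reading).
root = ET.Element("reading", {"unit": "C"})
root.text = "21.5"

# Writing: data model -> XML text (intermediate form) -> gzip bytes.
intermediate_xml = ET.tostring(root)
compressed = gzip.compress(intermediate_xml)

# Reading: gzip bytes -> XML text (intermediate form) -> data model.
restored = ET.fromstring(gzip.decompress(compressed))
print(restored.tag, restored.get("unit"), restored.text)
```

A directly readable and writable format would replace each two-step chain with a single serialization or parsing step against the data model.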

A.2. Transport Independence

A format is transport independent if the only assumptions of transport service are "error-free and ordered delivery of messages without any arbitrary restrictions on the message length".
However, a protocol binding can specify how a format is transmitted as payload in a specific transport (e.g., TCP/IP) or messaging (e.g., HTTP) protocol.

A.3. Compactness

The Compactness property measurement represents the amount of compression a particular format achieves when encoding data model items. There are three categories of methods to reduce the size of a data object or data model items:

This property is measured in the EXI testing framework in 4 measurement modes: "Neither" optimization (pure tokenization), "Schema" (schema-based compression), "Document" (data analysis), "Both" (data analysis + schema-based compression).
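The four measurement modes are the combinations of two independent choices: whether schema information is used and whether document analysis is used. A minimal sketch of that mapping follows; the names Mode and mode_for are hypothetical and do not come from the testing framework.

```python
from enum import Enum


class Mode(Enum):
    NEITHER = "Neither"    # pure tokenization
    SCHEMA = "Schema"      # schema-based compression
    DOCUMENT = "Document"  # data analysis
    BOTH = "Both"          # data analysis + schema-based compression


def mode_for(uses_schema: bool, uses_document_analysis: bool) -> Mode:
    """Map the two independent switches onto one of the four measurement modes."""
    if uses_schema and uses_document_analysis:
        return Mode.BOTH
    if uses_schema:
        return Mode.SCHEMA
    if uses_document_analysis:
        return Mode.DOCUMENT
    return Mode.NEITHER


print(mode_for(uses_schema=True, uses_document_analysis=False).value)
```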

A.4. Human Language Neutral

A format is human language neutral if it is not significantly more optimal for processing when its content is in a given language or set thereof, and does not impose restrictions on the languages or combinations of languages that may be used with it. Historically, many data and document formats supported only certain character encodings. XML does not suffer from such limitations, and it is expected that EXI will not limit the usage of particular human languages.
In terms of compactness or processing efficiency, it is not possible to ensure the same performance for a language that can be entirely captured using a single byte per character and for one that requires a multi-byte encoding, but internationalization support equivalent to that of XML is necessary for wide adoption.
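The difference between single-byte and multi-byte content can be made concrete. The following sketch assumes UTF-8 as the encoding and uses three sample words chosen only for illustration; it counts the encoded bytes per character.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes needed per character."""
    return len(text.encode("utf-8")) / len(text)


# Sample words (assumptions for illustration), written with escapes so the
# code points are unambiguous.
samples = {
    "English": "temperature",
    "French": "temp\u00e9rature",   # e with acute accent, 2 bytes in UTF-8
    "Japanese": "\u6e29\u5ea6",     # two CJK characters, 3 bytes each in UTF-8
}

for language, word in samples.items():
    encoded_length = len(word.encode("utf-8"))
    print(f"{language}: {len(word)} characters, {encoded_length} bytes")
```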

A.5. Platform Neutrality

Platform neutrality is the property of formats that are not significantly more optimal for processing on some computing platforms or architectures than on others (e.g. endianness, native structures for programming language). Platform neutrality ensures not only that wide adoption is possible, but also makes the format more resilient to the passing of time.
In some cases, options in the format may be used based on the preferred parameters of the systems involved. Thus, the XBC Working Group proposed 3 possible values:

It must also be noted that allowing too many mechanisms (options or parameters) for optimization may in fact prove to be a pessimisation.

A.6. Integratable into XML Stack

Per the EXI Working Group charter, this property must be seen as a strong requirement. The integration of the EXI format into the stack of existing XML specifications for validation, transformation, querying, APIs, canonicalization, signatures, encryption, etc. is key to wide adoption.

A.7. Royalty Free

The EXI format will be unencumbered and royalty-free, as ensured by the W3C process. This will lead to better adoption of the technology across the industry. A free format is also more likely to have free, open source code for processing it and free tools for building applications which use it. In addition, per the EXI Working Group charter, the EXI format will be proven to have at least one publicly available implementation before becoming a W3C Recommendation.

A.8. Fragmentable

Fragmentability is the ability to encode instances that do not represent the entirety of a document together with sufficient context for the decoder to process them. In addition to this ability to process fragments in isolation, it covers storing one or more parts of a document instance as immediately extractable fragments, so that they can be pulled out with little or no additional processing cost.

A.9. Streamable

Streamability is the ability to generate correct partial output from partial input. This property is needed in memory-constrained environments where it is important to be able to handle data as it is generated to avoid buffering of data inside the processor. Hence it is also characterized by the amount of buffering that needs to be done in the processors. In particular, required buffer space for encoding or decoding must be constant, no matter what the input document is or how it is mapped to the data model. This requirement precludes some serialization techniques (e.g. Gzip compression over the entire XML document).

Particular attention must be paid to the need for lookahead in the format parser, since it is not always available. For some types of sequences it can be beneficial to have the length of the full serialized form of a sequence precede the sequence itself, but the serializer must then buffer the whole sequence before outputting anything. If such sequences can be arbitrarily long, this sacrifices output streamability.
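Producing partial output from partial input can be sketched with the incremental XMLPullParser from the Python standard library, used here as a stand-in for an EXI processor. The element name "reading" and the function names are hypothetical. Each completed element is handed to the caller as soon as its end tag has been received, before the rest of the document arrives.

```python
from xml.etree.ElementTree import XMLPullParser


def _completed_readings(parser):
    """Yield the text of every <reading> element the parser has finished."""
    for _event, element in parser.read_events():
        if element.tag == "reading":
            yield element.text
            # Drop the element's content once it has been delivered. The empty
            # element stays attached to its parent, so memory use is reduced
            # rather than strictly constant.
            element.clear()


def stream_readings(chunks):
    """Consume an iterable of text chunks and yield readings as they complete."""
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        yield from _completed_readings(parser)
    parser.close()
    yield from _completed_readings(parser)


# Hypothetical input split into two network-sized chunks.
chunks = ["<readings><reading>21.5</reading>", "<reading>21.7</reading></readings>"]
for value in stream_readings(chunks):
    print(value)
```

Because stream_readings is a generator, the first reading is available after the first chunk alone has been consumed.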

A.10. Roundtrip Support

A format supports roundtripping if converting a file from XML to that format and back produces an output equivalent to the original input. A format supports roundtripping via XML if converting a file from that format to XML and back produces an output equivalent to the original input.
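What counts as "equivalent" depends on the level at which documents are compared. The sketch below contrasts two levels under stated assumptions: byte-for-byte equality, which a gzip roundtrip preserves, and equality of canonical forms, used here as a rough approximation of Infoset-level equivalence. The sample documents are invented for illustration.

```python
import gzip
from xml.etree.ElementTree import canonicalize

original = '<order id="7" status="new"><item>widget</item></order>'

# Roundtrip through gzip: the output is identical to the input, byte for byte.
restored = gzip.decompress(gzip.compress(original.encode("utf-8"))).decode("utf-8")
exact = restored == original

# A re-serialization that changes only attribute order and quoting style.
reserialized = "<order status='new' id=\"7\"><item>widget</item></order>"

# The two strings differ, but their canonical forms are the same.
equivalent = canonicalize(original) == canonicalize(reserialized)

print("byte-for-byte roundtrip:", exact)
print("canonical forms match:", equivalent)
```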

This property is measured by comparing the data which can be represented in XML with those that can be represented in the EXI format:

A.11. Generality

A format has the property of generality if it is competitive with alternatives across a diverse range of XML documents, applications and use cases. The EXI testing framework covers the XBC use cases. The goal of this set of test cases is to include a range of different document sizes, different uses of schemas and various XML features (comments, whitespace, etc.).

The measurement of this property is defined by the XBC Working Group as a score over 20 items, 1 point per item (see Appendix B).

A.12. Schema Extensions and Deviations

A format supports schema extensions and deviations if it allows applications to encode XML Infosets that are not conformant to the schema or not defined in the schema associated with the document.

A.13. Format Version Identifier

This property refers to the ability to efficiently determine the version of a format from a document instance. It is desirable to access this information as early as possible, so a format that does not make this information available when the processing starts should be considered inefficient as far as this property is concerned.
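Early access to the version can be sketched as a function that inspects only the first few bytes of a stream. The XML branch reads the version from the XML declaration. The EXI branch assumes the header layout of the published EXI Format 1.0 specification (optional "$EXI" cookie, distinguishing bits 10, an options presence bit, a preview bit, then 4-bit groups whose sum plus one is the version number); that layout is an assumption with respect to the draft evaluated here, and the function names are hypothetical.

```python
import re

XML_DECLARATION = re.compile(rb'<\?xml\s+version\s*=\s*["\']([^"\']+)["\']')
EXI_COOKIE = b"$EXI"


def xml_version(head: bytes):
    """Return the version string from an XML declaration, or None."""
    match = XML_DECLARATION.match(head)
    return match.group(1).decode("ascii") if match else None


def exi_version(head: bytes):
    """Return (version, is_preview) from the start of an EXI stream, or None.

    Assumed layout: 2 distinguishing bits, 1 options presence bit,
    1 preview bit, then 4-bit groups (15 means another group follows).
    """
    if head.startswith(EXI_COOKIE):
        head = head[len(EXI_COOKIE):]
    bits = "".join(f"{byte:08b}" for byte in head[:4])
    if not bits.startswith("10"):
        return None
    is_preview = bits[3] == "1"
    version, position = 1, 4
    while position + 4 <= len(bits):
        group = int(bits[position:position + 4], 2)
        version += group
        position += 4
        if group != 15:
            return version, is_preview
    return None  # header truncated before the version was complete


print(xml_version(b'<?xml version="1.0" encoding="UTF-8"?><a/>'))
print(exi_version(b"$EXI\x80"))
```

In both cases the version is known after reading a handful of bytes, before any document content is processed.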

A.14. Content Type Management

This property refers to the definition in the format of one or more media types and/or encodings to be used when transferring documents. It is required for content negotiation, hence its importance for the Web.

The XBC Working Group proposed four degrees of support:

Note: the EXI format specification does not define a dedicated content type. The content type of the original XML document should be used, along with content-coding information specifying that EXI has been used to encode the Infoset.
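The approach described in the note can be sketched as a server-side choice of response headers. This is a simplified illustration, not a complete HTTP negotiation: the content-coding token "exi" and the function name response_headers are assumptions, and quality values (q=) in the request header are ignored.

```python
# Assumed content-coding token for EXI; this document does not define one.
EXI_CODING = "exi"


def response_headers(accept_encoding: str, xml_content_type: str = "application/xml"):
    """Keep the XML content type and signal EXI only if the client accepts it."""
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }
    headers = {"Content-Type": xml_content_type}
    if EXI_CODING in offered:
        headers["Content-Encoding"] = EXI_CODING
    return headers


print(response_headers("gzip, exi"))
print(response_headers("gzip"))
```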

A.15. Self Contained

An XML format is self-contained if the only information that is required to reproduce the data model instance is (i) the representation of the data model instance and (ii) the specification of the XML format. When no external information is known by the receiver, the document needs to be self-contained.

A.16. Processing Efficiency

This property refers to the speed at which a new format can be generated and/or consumed for processing. It covers serialization, parsing and data binding. The XBC Working Group proposed the following criteria for its measurement:

  1. Parsing into a DOM - The time it takes to parse into a DOM memory structure.
  2. Parsing to SAX - The time it takes to parse to SAX events (push or pull).
  3. Parsing to a new proposed interface (optional) - The time it takes to parse into a DOM-like memory structure proposed for binary XML as an improvement of DOM.
  4. Query processing - The time it takes to process standard queries.
  5. Update (creation, insertion, deletion) - The time it takes to modify an instance in a predetermined pattern of operations.
  6. Retrieval - The time it takes to retrieve information from an instance.
  7. XPath streaming - The time it takes to find a series of xpaths and associated data in a stream of data.
  8. Serialization - The time it takes to generate the alternate format from a memory structure including DOM, SAX-related, and an optional proposed interface.
  9. Lifecycle - Using the best available method, create an instance with data, interpret instance to get partial data, and modify or create new instance with some changes. Memory of the instance at each write/read point must not be reused at the next step.

Each measurement should be recorded as a percentage faster than a standard text-based alternative for each type of operation.

The EXI testing framework implements parsing from SAX and serialization through SAX. Alternative APIs can be used.
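A measurement of this kind can be sketched as follows. The sketch times SAX parsing of text XML with the Python standard library to obtain a baseline, and expresses a candidate time relative to it. Interpreting "percentage faster" as the reduction in elapsed time is an assumption, and the function names are hypothetical; the EXI testing framework itself is built on Japex.

```python
import time
import xml.sax


def parse_seconds(document: bytes, repeats: int = 50) -> float:
    """Return the mean time in seconds to parse the document into SAX events."""
    handler = xml.sax.ContentHandler()  # default handler ignores all events
    start = time.perf_counter()
    for _ in range(repeats):
        xml.sax.parseString(document, handler)
    return (time.perf_counter() - start) / repeats


def percent_faster(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return the reduction in elapsed time as a percentage of the baseline."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# Hypothetical document used as the text XML baseline.
document = b"<log>" + b'<t v="21.5"/>' * 1000 + b"</log>"
baseline = parse_seconds(document)
print(f"text XML baseline: {baseline * 1000:.3f} ms per parse")
```

A candidate that parses the same content in a quarter of the baseline time would score percent_faster(baseline, baseline / 4), that is, 75.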

A.17. Small Footprint

This property refers to the size of a processor implementing a new format with respect to that of a processor implementing XML. Ideally, the evaluation of this property would be done through a range of implementations in different languages on various platforms. Since a complete evaluation could not be easily achieved during the development of the format itself, due to the small number of implementations at this time, an alternate solution consists in considering the number and/or complexity of the mandatory features (which impacts the size of the code segment) and the amount of data that must be available to a processor in order to support the format (which impacts the size of the initialized data segment).

A.18. Widespread Adoption

A format is more ubiquitous to the extent that it has been implemented on a greater range and number of devices and used in a wider variety of applications. There is a tradeoff between the cost and complexity of implementing the format and how well it meets applications' needs.

A.19. Implementation Cost

A requirement on XML was "It shall be easy to write programs which process XML documents." A rough estimate of implementation cost can be made by considering how much time it takes a solitary programmer to implement sufficiently robust processing of the format (the so-called Desperate Perl Hacker measure).

The possibility to reuse common APIs (including DOM, SAX, StAX) lowers the cost of implementation of an alternate encoding of the XML Infoset. This property benefits from the Integration in the XML Stack property.

A.20. Space Efficiency

This property refers to the memory requirements of a processor implementing EXI with respect to that of a processor implementing XML. The measurement is a percentage of the dynamic memory costs for equivalent XML processing.

The measurement for this property is by inspection of format specification, logical analysis, and empirical testing on test scenarios. The EXI testing framework measures the heap size in each test case.
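The percentage described above can be sketched with the tracemalloc module of the Python standard library, used here as a stand-in for the heap measurements of the testing framework. The function names and the sample document are hypothetical, and only the XML baseline is measured because no EXI processor is assumed to be available.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def peak_bytes(task) -> int:
    """Run task() and return the peak traced memory in bytes."""
    tracemalloc.start()
    try:
        task()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def percent_of_baseline(candidate_peak: int, baseline_peak: int) -> float:
    """Express a candidate's peak memory as a percentage of the XML baseline."""
    return 100.0 * candidate_peak / baseline_peak


# Hypothetical document; parsing it into a tree gives the XML baseline.
document = b"<log>" + b'<t v="21.5"/>' * 1000 + b"</log>"
xml_peak = peak_bytes(lambda: ET.fromstring(document))
print(f"XML baseline peak: {xml_peak} bytes")
```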

A.21. Forward Compatibility

A format must support the evolution of data models and must allow corresponding implementation of layered standards. Format version and extension points are related to this property. Evolution of XML and its data models could mean additional character encodings, additional element/attribute/body structure, or new predefined behavior similar to ID attributes. Integration of the EXI format into the XML Stack is also related to this property.

Appendix B. Generality evaluation

Criteria | XML | XML+gzip | EXI
Can represent documents without a schema | 1 | 1 | 1
Can represent documents that include elements and attributes not defined in the associated schema (i.e., open content) | 1 | 1 | 1
Can represent any schema-invalid document | 1 | 1 | 1
Can leverage available schema information to improve compactness, processing speed, and resource utilization | 0 | 0 | 1
Can leverage available schema information to improve compactness, processing speed, and resource utilization even when documents contain elements and attributes not defined in the schema | 0 | 0 | 1
Can leverage available schema information to improve compactness, processing speed, and resource utilization for any schema-invalid document | 0 | 0 | 1
Can leverage document analysis to improve compactness | 0 | 1 | 1
Can suppress document analysis to increase speed and reduce resource utilization | 1 | 0 | 1
[optional] Can adjust document analysis to meet application performance and resource utilization criteria | 0 | 1 | 1
Can structure the binary XML stream to increase net compactness when off-the-shelf compression software is built in to the communications infrastructure | 0 | 0 | 1
[optional] Supports high fidelity XML representations that preserve an exact copy of the original XML document, including all whitespace and formatting | 1 | 1 | 0
Supports reduced fidelity XML representations that preserve all data model items, but discard whitespace and formatting to improve compactness | 1 | 1 | 1
Supports reduced fidelity XML representations that preserve all information needed by a particular application, but discard specified information items that are not needed (e.g., comments and processing instructions) to improve compactness | 1 | 1 | 1
Supports reduced fidelity XML representations that preserve the logical structures and values of an XML document, but discard lexical and syntactic constructs to improve compactness | 1 | 1 | 1
Can consistently produce XML representations that are close to the same size or smaller than XML documents compressed using gzip | 0 | 1 | 1
Can consistently produce more compact XML representations than XML documents compressed using gzip | 0 | 0 | 1
Can consistently produce more compact XML representations than binary XML documents created with document analysis suppressed, then compressed using gzip | 0 | 0 | 1
Can consistently produce XML representations that are close to the same size or smaller than the equivalent ASN.1 PER encoding plus 20% | 0 | 0 | 1
Can consistently produce XML representations that are more compact than the equivalent ASN.1 PER encoding plus 20% | 0 | 0 | 1
[optional] Can consistently produce XML representations that are more compact than the equivalent ASN.1 PER encoding plus 20% compressed using gzip | 0 | 0 | 1
Total | 8 | 10 | 19
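The totals in the last row can be recomputed from the individual points. The sketch below transcribes the three score columns of the table in order and sums each column; the variable names are chosen for illustration.

```python
# Points per criterion as (XML, XML+gzip, EXI), in the order of the table.
SCORES = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (0, 0, 1), (0, 0, 1),
    (0, 0, 1), (0, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 1),
    (1, 1, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1), (0, 1, 1),
    (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1),
]

# Sum each column to obtain the score out of 20 for each format.
xml_total, gzip_total, exi_total = (sum(column) for column in zip(*SCORES))

print(f"XML {xml_total}/20, XML+gzip {gzip_total}/20, EXI {exi_total}/20")
```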