Issues Relating to the Creation of a
Binary Interchange Standard for XML

Michael Conner
Noah Mendelsohn
IBM Corporation
August 10, 2003


Introduction

IBM appreciates the opportunity to participate in the W3C Workshop on Binary Interchange of XML Information Item Sets.  This note is our position paper for the workshop.  In addition to this position paper, we have submitted a summary of an experimental technology developed by IBM called Compact Binary XML (CBXML).  We hope that our work on CBXML will contribute to the analysis of requirements and technologies at the workshop.

Characteristics of XML

XML has had an extraordinary impact on the computer industry over the past five years, and we believe that its success results from several specific characteristics of the original XML Recommendation [1]. These include:

·        There is a single interoperable form for XML, so every XML implementation accepts all legal XML documents (presuming that either the UTF-8 or the UTF-16 character encoding is used).  Customers and users trust this aspect of XML:  they count on XML to work, regardless of which vendors’ tools they choose.

·        XML is text, not binary.  The industry has seen increasingly successful focus on text formats such as HTML, notwithstanding the fact that in almost all cases more efficient binary encodings are possible.  XML continues in and benefits from that tradition.   The use of human-readable text encodings brings several benefits to both HTML and XML, including: applicability of existing text tools, lack of dependence on byte-order, flexibility, readability, approachability for early adopters, ease of debugging, suitability as a basis for other text vocabularies, and perhaps also increased security.

·        XML is applicable to a very broad range of information, and thus supports information integration scenarios that were not previously possible using standards-based techniques.  XML provides a unified representation and schema description framework which has been successfully applied to:  “pure data”, such as inventory data that might otherwise be represented as relational; “pure documents” such as technical reports; “data driven documents” such as insurance policies, which combine structured text with typed data such as numerics and dates; “structured messages”, such as those supported by SOAP [2]; and so on.  XML’s support of mixed content, ordered elements, and similar features provides the semantic richness necessary to support these varied scenarios.

·        XML is self-describing: data is explicitly tagged.  Such self-description is very important as a basis for creating evolvable vocabularies, and for supporting robust, loosely coupled interaction (i.e., between relatively separate organizations).  Self-description also facilitates many important data integration tasks.
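
The byte-order point in the list above can be illustrated with a minimal sketch.  It is not drawn from any particular XML tool; the value 1000 and the variable names are assumptions chosen only for illustration.  A fixed-width binary integer has two possible byte layouts, and a reader that guesses wrong silently gets a different number, whereas the UTF-8 text form has a single layout.

```python
import struct

value = 1000  # hypothetical field value, chosen only for illustration

# A 32-bit binary integer has two layouts, depending on byte order.
little = struct.pack("<I", value)  # little-endian layout
big = struct.pack(">I", value)     # big-endian layout

# The UTF-8 text form is the same byte sequence on every platform.
text = str(value).encode("utf-8")

# A reader that assumes the wrong byte order recovers a different number.
misread = struct.unpack(">I", little)[0]

print(little != big)              # the two binary layouts differ
print(misread == value)           # False: the wrong byte order corrupts the value
print(int(text.decode("utf-8")))  # the text form round-trips without a byte-order choice
```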

On the other hand, it’s clear that some of these very characteristics can impact both the size of XML documents and the efficiency of XML processing.  For these reasons, it is appropriate that W3C hold a workshop to evaluate the tradeoffs between maintaining a purely text-based XML standard and introducing one or more optimized binary forms.

Nonetheless, any effort to introduce an alternate representation for XML necessarily undermines to some degree the universal interoperability of XML.  Many of the specific proposals we’ve seen further sacrifice other important characteristics of XML, such as self-description or mixed content.  We thus believe it is appropriate for the XML community to approach any “binary XML” proposals with a healthy reluctance to tinker with the formula that has successfully carried XML so far.

Furthermore, we believe that only some of the claims about XML’s poor performance are well founded.  Some of the widely deployed parsers and validators were designed primarily for correctness and standards-compliance, and only secondarily for performance.  We believe that, for many (but not all) purposes, sufficient performance may be achievable merely by more careful implementation of the current standards, or through profiling of those standards.

What are the Requirements?

The discussion above suggests that any standard for binary XML will necessarily involve significant compromises.  Accordingly, it seems particularly important for the XML community to agree on clearly articulated requirements if such work is to be justified.  We are concerned that there appears to be significant diversity of opinion as to just which usage scenarios are most important, and on what technical priorities are implied.  Among the requirements that have been suggested at one time or another are:

·        Compression to minimize transmission time on slow links (e.g. to cell phones)

·        Asymmetric connection to small devices:  the “client” is presumed to have a slow processor, its server a fast processor.  If data flow is primarily server-to-client, then one can employ schemes in which compression is slow, and decompression fast.

·        High volume transaction processing, including both RPC-based, in which data is modeled in terms of programming language structures and fields, and document-based, in which XML documents such as purchase orders are the fundamental units of exchange

·        Ability to preserve the Infoset [3] information for a transmitted document

·        Ability to preserve the character form of a document (e.g. to preserve single vs. double quotes on attribute values – this can be important to tools that sign the characters, or to systems like CVS or other text based tools.)

·        Ability to efficiently send MIME-typed information (e.g. SOAP+Attachments [4], PASWA [5], MTOM [6])

·        Ability to efficiently encode the XPath/XQuery data model [7], including type information

·        Requirement to send only values (such as integers) as opposed to lexical forms (such as leading zeros on integers) for schema-typed data

·        Willingness to presume agreement on schemas, suggesting that markup need be sent only where it is variable (similarly, one can encode enumerated simple types, etc.)
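
Several of the requirements above pull in different directions, and a short sketch may make the distinctions concrete.  It uses Python's standard xml.etree.ElementTree as a rough stand-in for an Infoset view (it does not expose every Infoset property); the two sample documents and the helper name infoset_view are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in the quote characters
# used around attribute values.
doc_a = b'<item qty="007" unit="kg"/>'
doc_b = b"<item qty='007' unit='kg'/>"

def infoset_view(data):
    """Return an approximate Infoset view: name, attributes, text."""
    root = ET.fromstring(data)
    return (root.tag, sorted(root.attrib.items()), root.text)

# Character form: the byte sequences differ, so a signature computed
# over the characters would differ too.
same_characters = doc_a == doc_b

# Infoset: quote style is not part of the information items.
same_infoset = infoset_view(doc_a) == infoset_view(doc_b)

# Lexical form versus value for schema-typed data: "007" and "7" are
# different strings but denote the same integer.
qty_lexical = ET.fromstring(doc_a).get("qty")
same_lexical = qty_lexical == "7"
same_value = int(qty_lexical) == int("7")

print(same_characters, same_infoset, same_lexical, same_value)
```

A format that preserves only the Infoset would treat the two documents as identical, and a format that sends only typed values would additionally lose the leading zeros.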

We believe that the first step for the XML community is to decide whether any combination of these requirements is sufficiently compelling to suggest that compromising XML’s text-based compatibility might even be worthwhile.  This workshop should be a good venue for that discussion, but we start with the assumption that there is not yet sufficient consensus on requirements to justify a formal W3C investment in a workgroup. 

We in IBM have anecdotal evidence from customers and others that the highest-priority focus areas should be high-volume XML-based transaction processing (not necessarily RPC) and compact representations for transmission to small devices.  Some of these requirements may be met by more efficient implementation of the text-based XML standard, but some may not.  We are open-minded as to whether there will eventually be a need to standardize optimized forms to support these usage scenarios.

Desirable characteristics for a binary XML system

Having surveyed some of the requirements that we have heard others discuss, we outline here the considerations that seem most important to us at IBM:

·        The compression and/or speedup must be significant, say 3-10 times over the best possible text-based implementations (and we note again that few of the currently deployed processors are near optimal in speed.)  Any smaller gain would not justify the loss of compatibility resulting from a second format.

·        The format must be capable of representing at least the Infoset information corresponding to a well-formed XML document.  Necessary capabilities thus include efficient support for mixed content, preservation of element order, repetition of elements with the same name, and so on (all of which are capabilities that are sometimes sacrificed by, for example, RPC systems).

·        The self-describing characteristics of XML should be preserved.  While redundant information may be safely removed or compressed, all markup contained in the Infoset should be conveyed.

·        If there is direct support for datatypes such as integer, then those types must be 100% compatible with the type systems of XML Schema, XPath 2.0, and XQuery. 

·        Efficient support for PASWA/MTOM is desirable, so that any new binary format will provide an efficient ‘on the wire’ representation for SOAP attachments and other binary SOAP data.

·        Consideration should be given to supporting the XPath/XQuery data model, which is essentially a typed superset of Infoset.  Potential scenarios include efficient transmission of data between databases, encoding of query results, and on-disk storage of the data model.  (The data model is only a consideration; there are also reasons to “keep it simple” and stick to the untyped Infoset.)

·        Both encoding and decoding should be efficient in CPU time.  We expect that small devices will be required to transmit as well as receive significant amounts of information.

·        Efficient binding to programming language constructs is desirable, but does not outweigh the need for full Infoset capability.
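
As a rough illustration of how the size portion of the first criterion above might be checked, the following sketch compares a text XML document with an alternate encoding of the same bytes.  Generic zlib compression is used purely as a stand-in; it is not a binary XML format and is not CBXML.  The sample document, the size_ratio helper, and the use of 3.0 as the lower bound of the "3-10 times" range are assumptions for illustration, and the sketch does not measure processing speed.

```python
import zlib

def size_ratio(xml_text: bytes, encoded: bytes) -> float:
    """Return how many times smaller the encoded form is than the text."""
    return len(xml_text) / len(encoded)

# Hypothetical, highly repetitive sample document.
items = "".join(
    f'<item sku="A{i:04d}"><qty>{i % 7}</qty></item>' for i in range(200)
)
xml_text = f"<order>{items}</order>".encode("utf-8")

# Stand-in encoding: generic compression at the highest zlib level.
encoded = zlib.compress(xml_text, 9)

ratio = size_ratio(xml_text, encoded)
meets_lower_bound = ratio >= 3.0  # lower end of the suggested 3-10x range
round_trip_ok = zlib.decompress(encoded) == xml_text

print(f"text: {len(xml_text)} bytes, encoded: {len(encoded)} bytes")
print(f"ratio: {ratio:.1f}x, meets lower bound: {meets_lower_bound}")
print(f"round trip preserved all characters: {round_trip_ok}")
```

A real evaluation would need representative documents rather than a synthetic one, and would need to compare against a well-optimized text implementation, as the first criterion requires.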

CBXML

Accompanying this position paper is a white paper describing CBXML, an experimental system developed within IBM.  We have used CBXML to explore many of the themes this workshop is considering, including optimization of XML serialization and some of the associated tradeoffs.  IBM is not at this time recommending that any particular technology be adopted, but we offer this white paper in the hope that it may facilitate discussion and as a proof point that considerable encoding benefit can be achieved while preserving all the information in the Infoset model.

Summary

IBM believes that wherever possible, implementations of the existing XML 1.x Recommendation should be optimized to meet the needs of customers.  While we expect to see non-standard binary forms used internally within certain vendors’ implementations, including perhaps our own, we are not yet convinced that there is justification to standardize an interchange format other than XML 1.x.  We thus believe that it would be premature for the W3C to launch a formal workgroup, or to recharter an existing group, to develop a Binary XML Recommendation.

Nonetheless, we and others have anecdotal evidence that there are scenarios which are difficult to support efficiently using XML 1.x; if the community can clearly articulate and achieve consensus on a set of requirements that justify the compromises in developing a non-text form of XML, IBM would expect to alter its position and to support such work.  In the meantime, we welcome the ongoing discussion of the tradeoffs, and we offer our experiences with CBXML in the hope that they will facilitate those discussions.


References

[1] http://www.w3.org/TR/1998/REC-xml-19980210
[2] http://www.w3.org/TR/soap12-part1/
[3] http://www.w3.org/TR/xml-infoset/
[4] http://www.w3.org/TR/SOAP-attachments
[5] http://www.gotdotnet.com/team/jeffsch/paswa/paswa61.html
[6] http://www.w3.org/TR/soap12-mtom
[7] http://www.w3.org/TR/xpath-datamodel/