Issues Relating to the
Creation of a
Binary Interchange Standard for XML
Michael Conner
Noah Mendelsohn
IBM Corporation
August 10, 2003
Introduction
IBM appreciates the opportunity to participate in the W3C
Workshop on Binary Interchange of XML Information Item Sets. This note is our position paper for the
workshop. In addition to this position
paper, we have submitted a summary of an experimental technology developed by
IBM called Compact Binary XML (CBXML).
We hope that our work on CBXML will contribute to the analysis of
requirements and technologies at the workshop.
Characteristics of XML
XML has had an extraordinary impact on the computer industry
over the past five years, and we believe that its success results from several
specific characteristics of the original XML Recommendation [1]. These include:
·
There is a single interoperable form for XML, so every
XML implementation accepts all legal XML documents (presuming that either the UTF-8 or
the UTF-16 character encoding is used).
Customers and users trust this aspect of XML: they count on XML to work, regardless of which vendors’ tools
they choose.
·
XML is text, not binary. The industry has seen increasingly successful focus on text
formats such as HTML, notwithstanding the fact that in almost all cases more
efficient binary encodings are possible.
XML continues in and benefits from that tradition. The use of human-readable text encodings brings
several benefits to both HTML and XML, including: applicability of existing
text tools, lack of dependence on byte-order, flexibility, readability,
approachability for early adopters, ease of debugging, suitability as a basis
for other text vocabularies, and perhaps also increased security.
·
XML is applicable to a very broad range of information,
and thus supports information integration scenarios that were not previously
possible using standards-based techniques.
XML provides a unified representation and schema description framework
which has been successfully applied to:
“pure data”, such as inventory data that might otherwise be represented
as relational; “pure documents” such as technical reports; “data driven
documents” such as insurance policies, which combine structured text with typed
data such as numerics and dates; “structured messages”, such as those supported
by SOAP [2]; and so on. XML’s support
of mixed content, ordered elements, and related features provides the semantic richness
necessary to support these varied scenarios.
·
XML is self-describing: data is explicitly tagged. Such self-description is very important as a
basis for creating evolvable vocabularies, and for supporting robust, loosely
coupled interaction (i.e., between relatively separate organizations). Self-description also facilitates many
important data integration tasks.
On the other hand, it’s clear that some of these very
characteristics can impact both the size of XML documents and the efficiency of
XML processing. For these reasons, it
is appropriate that W3C hold a workshop to evaluate the tradeoffs between maintaining
a purely text-based XML standard and introducing one or more optimized binary
forms.
Nonetheless, any effort to introduce an alternate
representation for XML necessarily undermines to some degree the universal
interoperability of XML. Many of the
specific proposals we’ve seen further sacrifice other important characteristics
of XML, such as self-description or mixed content. We thus believe it is appropriate for the XML community to
approach any “binary XML” proposals with a healthy reluctance to tinker with
the formula that has successfully carried XML so far.
Furthermore, we believe that only some of the claims about
XML’s poor performance are well founded.
Some of the widely deployed parsers and validators were designed
primarily for correctness and standards-compliance, and only secondarily for
performance. We believe that, for many
(but not all) purposes, sufficient performance may be achievable merely by more
careful implementation of the current standards, or through profiling of those
standards.
What are the Requirements?
The discussion above suggests that any standard for binary
XML will necessarily involve significant compromises. Accordingly, it seems particularly important for the XML
community to agree on clearly articulated requirements if such work is to be
justified. We are concerned that there
appears to be significant diversity of opinion as to just which usage scenarios
are most important, and as to what technical priorities they imply. Among the requirements that have been suggested
at one time or another are:
·
Compression to minimize transmission time on slow links
(e.g. to cell phones)
·
Asymmetric connection to small devices: the “client” is presumed to have a slow
processor, its server a fast processor.
If data flow is primarily server-to-client, then one can employ schemes
in which compression is slow, and decompression fast.
·
High volume transaction processing, including both
RPC-based, in which data is modeled in terms of programming language structures
and fields, and document-based, in which XML documents such as purchase orders
are the fundamental units of exchange
·
Ability to preserve the Infoset [3] information for a
transmitted document
·
Ability to preserve the character form of a document
(e.g. to preserve single vs. double quotes on attribute values; this can be
important to tools that sign the characters, or to systems like CVS and other
text-based tools)
·
Ability to efficiently send MIME-typed information
(e.g. SOAP+Attachments [4], PASWA [5], MTOM [6])
·
Ability to efficiently encode the XPath/XQuery data
model [7], including type information
·
Requirement to send only values (such as integers) as
opposed to lexical forms (such as leading zeros on those integers) for schema-typed data
·
Willingness to presume agreement on schemas, suggesting
that markup need be sent only where it is variable (similarly, one can encode
enumerated simple types, etc.)
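Several of the requirements above turn on the difference between a document's character form, its Infoset-level content, and the typed values it carries. The following minimal Python sketch illustrates that difference. It uses the standard library's ElementTree parser as a rough stand-in for an Infoset view (it is not a complete Infoset implementation), and the `item` element and `qty` attribute are hypothetical names chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in the quote character used
# around the attribute value (hypothetical element and attribute names).
single = "<item qty='007'/>"
double = '<item qty="007"/>'

a = ET.fromstring(single)
b = ET.fromstring(double)

# After parsing, the element name and attributes are identical, so the
# two documents are equivalent at (approximately) the Infoset level.
same_infoset = (a.tag == b.tag) and (a.attrib == b.attrib)

# The character forms differ, which matters to tools that sign or diff
# the raw characters.
same_characters = (single == double)

# The lexical form "007" and the lexical form "7" denote the same
# integer value; a value-only encoding would not preserve the zeros.
same_value = int(a.get("qty")) == int("7")
same_lexical = a.get("qty") == "7"

print(same_infoset, same_characters, same_value, same_lexical)
```

An encoding that preserves only values would treat `007` and `7` as the same integer, while one that preserves the character form would have to keep both the leading zeros and the choice of quote character.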
We believe that the first step for the XML community is
to decide whether any combination of these requirements is sufficiently
compelling to suggest that compromising XML’s text-based compatibility might
even be worthwhile. This workshop
should be a good venue for that discussion, but we start with the assumption
that there is not yet sufficient consensus on requirements to justify a formal
W3C investment in a workgroup.
We in IBM have anecdotal evidence from customers and others
that the highest-priority focus areas should be high-volume XML-based
transaction processing (not necessarily RPC) and compact representations
for transmission to small devices. Some
of these requirements may be met by more efficient implementation of the
text-based XML standard, but some may not.
We are open-minded as to whether there will eventually be a need to
standardize optimized forms to support these usage scenarios.
Desirable characteristics for a binary XML system
Having surveyed some of the requirements that we have heard
others discuss, we outline here the considerations that seem most important to
us at IBM:
·
The compression and/or speedup must be significant, say
3 to 10 times over the best possible text-based implementations (and we note again
that few of the currently deployed processors are near optimal in speed). Any smaller gain would not justify the loss
of compatibility resulting from a second format.
·
The format must be capable of representing at least the
Infoset information corresponding to a well-formed XML document. Necessary capabilities thus include
efficient support for mixed content, preservation of element order, repetition
of elements with the same name, and so on (all of which are capabilities that
are sometimes sacrificed by, for example, RPC systems).
·
The self-describing characteristics of XML should be
preserved. While redundant information
may be safely removed or compressed, all markup contained in the Infoset should
be conveyed.
·
If there is direct support for datatypes such as
integer, then those types must be 100% compatible with the type systems of XML
Schema, XPath 2.0, and XQuery.
·
Efficient support for PASWA/MTOM is desirable, so that
any new binary format will provide an efficient ‘on the wire’
representation for SOAP attachments and other binary SOAP data.
·
Consideration should be given to supporting the XPath/XQuery
data model, which is essentially a typed superset of Infoset. Potential scenarios include efficient
transmission of data between databases, encoding of query results, and on-disk
storage of the data model. (The data
model is only a consideration; there are also reasons to “keep it simple” and
stick to the untyped Infoset.)
·
Both encoding and decoding should be efficient in CPU
time. We expect that small devices will
be required to transmit as well as receive significant amounts of information.
·
Efficient binding to programming language constructs is
desirable, but does not outweigh the need for full Infoset capability.
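As a rough illustration of how the size side of the first consideration might be measured, the sketch below compares a text XML document with a compressed copy of the same bytes. It is only a sketch under stated assumptions: general-purpose zlib compression stands in for an optimized format (it is not a binary XML encoding and is not CBXML), the sample inventory document is invented for the example, and no attempt is made to measure CPU time.

```python
import zlib

# Build a small, repetitive sample document (invented data; real
# documents will compress differently).
records = "".join(
    f'<item id="{i}"><name>widget</name><qty>{i % 7}</qty></item>'
    for i in range(200)
)
document = f"<inventory>{records}</inventory>".encode("utf-8")

# zlib is a stand-in for "some more compact form" of the same bytes.
compressed = zlib.compress(document, level=9)

# Decompressing returns the original characters exactly, so all markup
# (and therefore all Infoset information) is preserved.
restored = zlib.decompress(compressed)

ratio = len(document) / len(compressed)
print(f"text: {len(document)} bytes, "
      f"compressed: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

A measurement of this kind addresses only document size; the considerations above also call for comparing encoding and decoding speed against a well-optimized text-based processor.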
CBXML
Accompanying this position paper is a white paper describing
CBXML, an experimental system developed within IBM. We have used CBXML to explore many of the themes this workshop is
considering, including optimization of XML serialization and some of the
associated tradeoffs. IBM is not at
this time recommending that any particular technology be adopted, but we offer
this white paper in the hope that it may facilitate discussion and as a proof
point that considerable encoding benefit can be achieved while preserving
all the information in the Infoset model.
Summary
IBM believes that wherever possible, implementations of the
existing XML 1.x Recommendation should be optimized to meet the needs of
customers. While we expect to see
non-standard binary forms used internally within certain vendors’ implementations,
including perhaps our own, we are not yet convinced that there is justification
to standardize an interchange format other than XML 1.x. We thus believe that it would be premature
for the W3C to launch a formal workgroup, or to recharter an existing group, to
develop a Binary XML Recommendation.
Nonetheless, we and others have anecdotal evidence that
there are scenarios which are difficult to support efficiently using XML 1.x;
if the community can clearly articulate and achieve consensus on a set of
requirements that justify the compromises in developing a non-text form of XML,
IBM would expect to alter its position and to support such work. In the meantime, we welcome the ongoing
discussion of the tradeoffs, and we offer our experiences with CBXML in the
hope that they will facilitate those discussions.
References
[1] http://www.w3.org/TR/1998/REC-xml-19980210
[2] http://www.w3.org/TR/soap12-part1/
[3] http://www.w3.org/TR/xml-infoset/
[4] http://www.w3.org/TR/SOAP-attachments
[5] http://www.gotdotnet.com/team/jeffsch/paswa/paswa61.html
[6] http://www.w3.org/TR/soap12-mtom
[7] http://www.w3.org/TR/xpath-datamodel/