Shankar Pal, Jonathan Marsh, Andrew Layman
Microsoft Corporation
{shankarp, jmarsh, andrewl}@microsoft.com
XML
1.0 [1] uses a text representation and has been very
successful as a portable, platform-independent format for data exchange. The format, like most text formats such as
SGML [2] and HTML [3], frequently uses many more bits to encode an element
of information than the theoretical minimum. This fact has led to the idea that
XML could benefit from having an alternate encoding format, a “binary XML,”
that takes fewer bits for the same data and thereby saves transmission
bandwidth or other costs such as parsing time.
The
advantages of the single text form are well-known: It is a ubiquitous format
that all data can be rendered into and that all XML parsers can process,
thereby enabling universal portability of the
data representation. The singularity of the format precludes splintering
into multiple dialects with eventually differing meanings. It is human readable
without the need for special viewers and editors. A wide range of
text-processing tools are available for its manipulation, again without
requiring any special, additional filters or converters.
Against this we need to weigh the potential for reducing the size of the encoding or improving other performance-related measures.
This paper presents evidence that some benefits (such as smaller size) claimed for a binary XML can often be achieved better by other means. More
fundamentally, it argues that attempting to create a standard “binary” form of
XML will lead to either great complexity or a plurality of new forms of XML,
inconsistent with what we understand the goal of XML to be, namely a single,
simple, ubiquitous form for portable exchange of data.
In this section, we consider factors that need to be addressed for
a binary XML standard.
There are no known standards that address only performance requirements. This raises a fundamental, philosophical question of whether a binary representation of XML is an appropriate candidate for standardization. Innovation should be left to vendors, and the convergence of their solutions should lead to a de facto standard. If this does not happen, then imposing a standard upon vendors and users cannot solve the problem either.
A
new, standardized binary XML adds complexity to the XML space. Specifically, it
increases the barrier to entry for working with XML since vendors and users
will have to support two different formats (text and binary) to perform the
same task. Vendors and users already have to deal with the increasing complexity of the XML family of technologies. W3C should find ways to simplify the XML landscape, not impose specialized solutions on the XML community. It is also conceivable that an innovative idea could produce a more efficient binary representation than the standard allows, in which case the standard would only stifle its adoption.
The binary representation must preserve, or be isomorphic to, the Infoset [4] content of the XML value, since this is important for portability. It should not require XML Schema [5] or DTD association for generating the binary encoding. A binary DOM format is not attractive since it does not meet the goal of being isomorphic to the Infoset.
When a schema exists for the data, it should be used for optimizations. For obvious reasons, our preference is for an XML Schema-based approach (rather than a DTD-based one). The post-schema-validation Infoset (PSVI) [5], which adds type information to the Infoset, should
be encoded in the binary representation. This can improve parsing speed
significantly.
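As a rough illustration of how PSVI type information can help, the sketch below (our own Python example, not part of any proposal) encodes a value known from the schema to be an xs:int as four fixed-width bytes, so the receiver recovers a typed value without scanning and converting character data.

    import struct

    def encode_typed_int(value: int) -> bytes:
        # xs:int written as a 4-byte signed integer; the byte order is an arbitrary choice here.
        return struct.pack(">i", value)

    def decode_typed_int(data: bytes) -> int:
        return struct.unpack(">i", data)[0]

    encoded = encode_typed_int(1234567)          # 4 bytes instead of 7 text characters
    assert decode_typed_int(encoded) == 1234567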
One downside of the Infoset-based encoding is that other specifications may change the Infoset (as XML Schema did), and this information would also have to be carried over into the binary format, resulting in continual maintenance of the standard.
One or more criteria could be used as guiding principles for the binary representation. Some of them can be satisfied individually (e.g. low memory footprint or low parse time), but it is impossible to optimize them all at the same time; a global optimum is virtually impossible to find. The following are the most interesting criteria to pursue.
A
binary representation generally has a smaller memory footprint than text XML. Consequently,
it is useful for communicating with small devices such as PDAs and cell phones, and it benefits low-bandwidth connections as well. On a bigger machine,
such as a server, the emphasis shifts to better usage of bandwidth. A server
can exchange more information with clients using the same number of bytes.
Text XML can be compressed using techniques such as
gzip [6] or XMill [7] before transmission on the wire. These can achieve
very good compression. The compressed representation must be decompressed back into text XML on the receiving side before being consumed, so two passes over the data are required for parsing; this conflicts with optimizing for low parse time.
These formats are suitable when a high compression ratio is required (e.g. low
bandwidth connection) and generation and parsing costs are less of a concern
(e.g. the data is archived and not often queried). On a handheld device, it
becomes a tradeoff between smaller memory footprint and higher parsing cost,
which consumes more power.
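A minimal sketch of this pipeline, with made-up document content: the text XML is gzip-compressed before transmission, and the receiver must decompress it back to text before an ordinary parser can consume it, which is the extra pass discussed above.

    import gzip
    import xml.etree.ElementTree as ET

    xml_text = "<orders>" + "<order id='1'><total>9.99</total></order>" * 100 + "</orders>"

    compressed = gzip.compress(xml_text.encode("utf-8"))     # sender compresses text XML
    restored = gzip.decompress(compressed).decode("utf-8")   # receiver decompresses first ...
    root = ET.fromstring(restored)                           # ... then parses as ordinary text XML

    print(len(xml_text), "->", len(compressed), "bytes")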
These compression techniques do not allow streaming
of XML data. That is, the whole XML must be compressed and decompressed, which
makes the overall process costly and infeasible in some scenarios (e.g. small
devices). Gzip and XMill allow chunking, which mitigates the issue to a large
extent. Streaming is, however, useful for the scalability of a data server. With small amounts of data, it may be feasible to store the binary representation and scan it multiple times, even on small devices. However, this approach is infeasible when the data is large and single-pass parsing is desired (e.g. for displaying data). The primary benefit of streaming is the lower memory requirement for parsing/generation of XML, which is a different requirement from the small
footprint of the binary representation.
Often, the gain from hardware-based network compression (e.g. MNP-5) is significant, which dilutes the need for a binary XML representation.
If the binary representation must be stored in files or inside a database server, then optimizing for a low memory footprint may be appropriate. This
reasoning also applies to data caching and messaging applications. Low memory
footprint is desirable when storage and retrieval are the predominant
operations. The increased processing cost may be more than offset by the saving
in the I/O cost. The thresholds are unpredictable, however, so beyond low-bandwidth situations a low-memory-footprint solution is probably not practical.
An
important measure is the parsing cost of the binary form compared to the text
representation. Parsing the binary form can be an order of magnitude faster, or even better,
which saves a great deal of power on small devices. Binary XML parsers can be
more complex than their text counterparts depending upon the complexity of the
encoding, although they can be designed to be as simple as text XML parsers
with a different set of goals in mind. Similarly, the generation of the binary
representation (from, say, tabular data) can be costly depending upon the
encoding, or take roughly the same time as text XML depending upon the factors
being optimized. Parsing
and generation are strongly correlated; when one cost goes up, so does the
other.
Lower parsing/generation cost requires a fairly
simple binary representation that includes enough information to aid efficient
parsing. One such representation is obtained by creating a map from element and
attribute names to numbers. This yields good compression when element and attribute names are long and occur many times in the XML value. What is left is the data, whose binary values are encoded in the binary stream if the schema is known. Furthermore, there is no need for entity resolution or whitespace normalization, which contributes to faster parsing.
It is not hard to imagine that a binary parser can be simpler than a text one.
Low parse/generation cost optimization may not yield much compaction of the
binary representation, and conflicts with optimizations for small footprint.
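The sketch below is our own toy illustration of such a name map; the event model and layout are assumptions, not a concrete format. Element and attribute names are replaced by small integer codes, and the name table travels once with the stream.

    def tokenize_names(events):
        """events: sequence of (kind, name) pairs, e.g. ('start', 'customer')."""
        name_table = {}    # name -> small integer code, sent once with the stream
        stream = []        # encoded stream of (kind, code) pairs
        for kind, name in events:
            code = name_table.setdefault(name, len(name_table))
            stream.append((kind, code))
        return name_table, stream

    events = [("start", "order"), ("attribute", "id"), ("start", "total"),
              ("end", "total"), ("end", "order")]
    table, stream = tokenize_names(events)
    # Long, repeated names cost one table entry plus one small code per occurrence.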
Random access during forward-only parsing can yield
significant speedup in some scenarios (e.g. execution of XPath expressions).
Additional structures must be encoded, which increases generation time and
slows down parsing of the whole XML value. True random access (i.e. not forward-only parsing) causes an increase in the size of the XML and penalizes modifications of larger XML values. Thus, there are tricky tradeoffs to be made about how much to
speed up random access at the cost of slowing down parse/generation of the
whole XML value, since this optimization is largely determined by the workload.
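To make the tradeoff concrete, here is a hypothetical sketch (the record layout is invented for illustration): an auxiliary offset index lets a reader jump to a named part of the encoded value without scanning what precedes it, at the cost of extra bytes and extra work at generation time.

    def build_index(records):
        """records: list of (element_name, encoded_bytes); returns payload plus index."""
        payload = bytearray()
        index = {}                            # element name -> (offset, length)
        for name, blob in records:
            index[name] = (len(payload), len(blob))
            payload += blob
        return bytes(payload), index

    payload, index = build_index([("header", b"\x01\x02"), ("body", b"\x03" * 64)])
    offset, length = index["body"]
    body = payload[offset:offset + length]    # jump straight to "body", skipping "header"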
Parsing/generation speed is a suitable criterion
from a server’s viewpoint. A web or database server sends data out in chunks
and fast generation speed leads to greater scalability. Buffering the entire
data might be OK for small amounts of data, but for large data transfers, the
server’s scalability would be poor.
Many client-side applications prefer faster parsing speed, especially for visual rendering. A low memory footprint might be
desirable if the data is cached at the client and may not be used (e.g. the
user does not proceed beyond the first result of a search query). Thus, the
optimization criterion heavily depends upon the application, and there is no
one solution to make everyone happy.
It is possible to use ingenious techniques to extract the last drop of compression from the binary representation. Typically, the greater the compression, the greater the parsing time. Beyond a certain
point, the parsing/generation cost outweighs the benefits. This point is not
clearly demarcated, and it is very hard to strike the right balance between
these conflicting requirements.
If
the sender and the receiver each have the XML schema used in the encoding, and
the XML schema is strict (no open content), further optimizations are possible.
Consider the case in which tabular data is converted into XML — only the data
needs to be encoded. This can yield very good compression ratios. Furthermore,
the parser will be a validating one, so that additional functionality gets
built in.
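A hedged sketch of this data-only idea under an assumed schema (the column list and types below are our own illustration, agreed out of band by sender and receiver): only the values travel, and the markup is implied by the shared schema.

    import struct

    # Column names and struct type codes agreed out of band (the "schema"); invented for illustration.
    SCHEMA = [("id", "i"), ("qty", "i"), ("price", "f")]
    ROW_FORMAT = ">" + "".join(code for _, code in SCHEMA)

    def encode_row(row):
        return struct.pack(ROW_FORMAT, *(row[name] for name, _ in SCHEMA))

    def decode_row(data):
        values = struct.unpack(ROW_FORMAT, data)
        return dict(zip((name for name, _ in SCHEMA), values))

    wire = encode_row({"id": 7, "qty": 3, "price": 9.99})   # 12 bytes, no tags on the wire
    assert decode_row(wire)["id"] == 7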
The
benefits are large for large amounts of data, and applications can build in
data-only compression. For example, this approach is suitable for web services
in which a WSDL [8] is the point of agreement — the encoded data and the
WSDL together are adequate to decode the data. The WAP binary XML protocol [9] falls into this class (although it is DTD based).
Individual vendors can also provide such solutions. The encoding by itself is
no longer self-describing. Thus, this approach is at odds with portability of
data and works on a single “platform”.
The
schema-driven approach is suitable for inter-process data exchange in which the goal is extensibility of the component architecture rather
than portability of data. By changing the schemas, a different behavior is
obtained from the system. This, however, is an internal need of a specific
application and platform, and outside the purview of a standardization process.
Different
applications benefit from different optimizations for the binary
representation. A server wants faster generation speed, a mid-tier server emphasizes portability of data, and a client desires a small memory footprint over
slow connections. Trying to combine all these requirements and optimizations
into a single, standardized format may eliminate the performance benefits expected
from binary XML.
Multiple
binary representations could be allowed within the standard. This can be done in a way similar to the "encoding" declaration in XML 1.0. There needs to be a standard set of encodings allowed in binary representations, each optimizing one or more facets; the format also has to handle all character encodings of XML for internationalization (I18N). Each encoding is designed to benefit an important and widely used
class of applications. Within one encoding, a different encoding for a subpart
of the content should be avoided. Thus, at database servers, the encoding that optimizes parse/generation speed might be more prevalent than the small-footprint encoding used by clients.
The
net result will be that the sender gets to choose the format to generate while
the receiver must be able to decode multiple representations. Each side must be
prepared to receive and process all binary encodings in addition to XML 1.0,
thereby adding to the complexity of the software.
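The following sketch makes that added complexity tangible; the format identifiers and binary decoders are hypothetical placeholders, not real encodings. Every receiver ends up carrying a dispatcher over text XML 1.0 plus every allowed binary encoding.

    import xml.etree.ElementTree as ET

    def decode_text_xml(payload: bytes):
        return ET.fromstring(payload.decode("utf-8"))

    def decode_small_footprint(payload: bytes):
        raise NotImplementedError("hypothetical binary decoder #1")

    def decode_fast_parse(payload: bytes):
        raise NotImplementedError("hypothetical binary decoder #2")

    # The receiver must be prepared for every allowed encoding, not just the one it prefers.
    DECODERS = {
        "xml-1.0": decode_text_xml,
        "binary-small-footprint": decode_small_footprint,
        "binary-fast-parse": decode_fast_parse,
    }

    def decode_message(format_id: str, payload: bytes):
        if format_id not in DECODERS:
            raise ValueError("unsupported encoding: " + format_id)
        return DECODERS[format_id](payload)

    doc = decode_message("xml-1.0", b"<ok/>")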
Big-endian and little-endian encodings are conflicting requirements. There cannot be agreement on one over the other, and the standard will have to allow both.
On a compatible platform, both sender and receiver
can use the same encoding, and both derive performance benefits. However, in a
cross platform scenario, the performance benefits at the receiver might vanish.
Text XML runs into a similar issue with Unicode strings, which can be handled without much trouble, so byte order may not be a major issue.
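A small sketch of the issue (the choice of a big-endian wire format here is purely an assumption): a receiver on the "wrong" platform either misreads the bytes or pays for a byte swap on every value.

    import struct
    import sys

    value = 0x12345678
    wire = struct.pack(">I", value)             # suppose the wire format is big-endian
    naive = struct.unpack("=I", wire)[0]        # native-order read: wrong on little-endian hosts
    correct = struct.unpack(">I", wire)[0]      # explicit byte order: correct everywhere, at the cost of a swap

    print(sys.byteorder, hex(naive), hex(correct))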
For
different classes of applications, the criterion (minimize footprint or
minimize parse/generate time) for the binary representation is different and
often conflicting. There is no single criterion that optimizes all
applications. Consequently, a binary standard could result in a suite of
allowable representations that clients and servers must be prepared to receive
and process. This is a retrograde step from the portability goals of XML 1.0.
Furthermore, the optimal binary representation depends on the machine and OS architectures on each end; translating between binary representations negates much of the advantage that binary XML has over text.
We
need viable ideas to simplify the picture instead of muddying the waters by introducing more guidelines. We second the technical concerns voiced by Chris Lilley [10]. Furthermore, the standardization work could go on for years, and the ensuing standard could become a burden on vendors and stifle innovation. It is quite likely that factors other than those
addressed in this paper are equally important for some classes of applications.
Until these are clearly understood, it is prudent to continue innovating to find solutions that benefit the community at large.
We generally assume an all-or-nothing model in which
either the entire XML value is converted into a binary representation or it is
not converted at all. There may be intermediate solutions. For example, an interleaved text and binary format that preserves Infoset content, in which large blobs of data (e.g. pictures) are sent as binary attachments, is a more promising direction to follow. Its main strength remains portability, and it can improve parsing speed sufficiently to
satisfy a large class of applications.
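As a hedged sketch of this interleaved direction (the cid: convention and container layout are our own illustrative assumptions, not an existing standard), the document stays text XML while a large opaque blob travels as a separate binary part referenced from the markup.

    # Container layout invented for illustration only.
    xml_part = (
        "<report>"
        "<title>Quarterly summary</title>"
        "<chart href='cid:chart-1'/>"            # reference to a binary attachment
        "</report>"
    )
    attachments = {"chart-1": b"\x89PNG\r\n\x1a\n" + b"\x00" * 32}   # raw image bytes

    message = {"xml": xml_part, "parts": attachments}
    # A receiver parses message["xml"] with any ordinary XML parser and resolves the
    # cid: reference against message["parts"], so the large blob never needs base64
    # encoding and the XML itself stays fully portable text.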
1. "Extensible Markup Language (XML) 1.0 (Second Edition)", October 2000. URL: http://www.w3.org/TR/REC-xml.
2. "Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML)", 1986. URL: http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=16387.
3. "HTML 4.01 Specification", December 1999. URL: http://www.w3.org/TR/html401.
4. "XML Information Set", October 2001. URL: http://www.w3.org/TR/xml-infoset.
5. "XML Schema", May 2001. URL: http://www.w3.org/TR/xmlschema-0 {-1, -2}.
6. gzip compression. URL: http://www.gzip.org/.
7. "XMill: An Efficient Compressor for XML", Hartmut Liefke and Dan Suciu, in Proceedings of SIGMOD, pp. 153-164, 2000. URL: http://www.research.att.com/sw/tools/xmill.
8. "Web Services Description Language (WSDL) 1.1", March 2001. URL: http://www.w3.org/TR/wsdl.
9. "WAP Binary XML Content Format", June 1999. URL: http://www.w3.org/TR/wbxml.
10. "Binary XML problem statement", Chris Lilley, Feb 2003. URL: http://lists.w3.org/Archives/Public/www-tag/2003Feb/0224.