Binary XML
Position paper for The W3C Workshop on Binary Interchange
of XML Information Item Sets
Oracle Corporation
Dmitry Lenkov, editor
Contributors: Alex Yiu, Anish Karmarkar, Anjana Manian, Chandra Sivasankaran, Dmitry Lenkov, Eric Sedlar, Julie Basu, K. Karun, Mark Scardina, Ravi Murthy
As the World Wide Web transitions from being merely a medium for browsing to a medium for commerce, web services, and application integration, XML (eXtensible Markup Language) has emerged as the standard language for markup. Applications across the Internet are increasingly adopting XML as the standard for expressing messages, schemas, and data. Consequently, XML is increasingly used as an exchange wire format for Web-based applications. Several problems arise as a result. First of all, with the rapidly increasing volume of XML data being exchanged for information purposes and for conducting business, the bandwidth of networks and other communication channels is being tested to its limit. Traditional data compression algorithms can achieve good compression rates on large text files but are less effective with small files like those typical of some XML messaging applications. Also, traditional compression algorithms do not assume any knowledge of a document's syntactic or semantic structure. In the case of XML documents, such knowledge provides additional opportunities for compression.
In another area of XML applications, XML documents are stored and saved, then searched and retrieved. Besides size and time efficiency in compressing and decompressing the whole document, preserving the document's structural information becomes very important. It allows applications to perform efficient searches and to retrieve parts of documents rather than whole documents. Traditional compression systems do not retain this structural information.
In a related area of XML applications, XML documents are passed from application to application while being manipulated separately in each. This manipulation needs to be efficient. Typically, an application that receives an XML document as input manipulates the XML data either by using DOM APIs to access an in-memory tree representation of the document produced by the parser, or by using SAX events to build its own representation of the document or its parts from the parsing events passed to the application. When DOM is used, the structural information is preserved in the tree. In the case of SAX events, it has to be gathered by the application and preserved in a proprietary way. Current DOM representations of XML documents are, in most cases, quite expensive in size. In addition, as a result of the large size, manipulations that require copying and moving subtrees of a DOM tree are also expensive in performance. Processing XML documents based on SAX events results in proprietary code and data structures that cannot be easily exchanged among applications; it is also error prone.
Thus there is a need for XML compression and streaming systems that efficiently work in the Internet or direct communication context for the application domains described above.
In order to address the needs presented in the introduction, the appropriate use scenarios based on these needs have to be collected, together with the performance characteristics that binary XML processing and streaming technologies should satisfy. By use scenarios we mean examples of practical utilization of XML technologies in which the performance of XML processing and streaming techniques is important. Given these use scenarios, a set of requirements is presented. The majority of the requirements emphasize how a binary XML format should improve the performance of XML processing and streaming.
The performance of binary XML streaming includes the following elements: the compression ratio, speed of compression, speed of decompression, document manipulation efficiency, efficiency of incremental processing, efficiency of querying documents stored in a persistent store, etc.
Another characteristic that affects all elements of performance when binary streaming is applied to an XML document is whether an XML schema validating the document was available in advance. This paper assumes that we are dealing primarily with schema-aware documents.
In this paper, the following use scenarios are taken into consideration.
Scenario 1. XML documents are passed in inter-application communication. Such documents are typically created and sent once, and received once. This scenario is typical for enterprise web services integration platforms that include inter-application and inter-database communications. This type of communication calls for fast compression and decompression, which are equally important, and for a significant compression ratio. Performance is characterized by the total transmission speed. Any streaming technique offered for this scenario needs to be compared against transmission of XML documents without any compression.
Scenario 2. This scenario is similar to scenario 1; however, there are two differences. In this scenario the compression speed is not very important, so only the total speed of transmission and decompression counts. Also, documents in this scenario can be larger than in scenario 1. Another important characteristic of this scenario is that documents are always decompressed into XML textual format, not into any binary representation like DOM, and are sometimes converted into HTML or other textual formats.
Scenario 3. Compression ratio is very important when large documents are transferred over the Internet and stored in persistent storage for a long time. In this case, storage capacity is a constraining factor, and thus compression ratio is the major concern, while the speed of compression and decompression is secondary. Two kinds of documents are typical for this case. Documents of the first kind have a significant amount of data and a rather small number of tags, for example, books. Documents of the other kind have a large number of tags and a large amount of data, for example, business reports and periodic transaction summaries.
Scenario 4. XML documents are passed from application to application in a pipeline-like architecture. Each application in such a scenario performs certain processing of documents. In this case, the efficiency of operations on the document representation and the size of that representation are most important, while compression and decompression speed is not always important. This scenario often leads to parsing XML documents, or serializing their DOM representations, multiple times, which affects the overall performance of the XML application regardless of how good the compression speed is. So it is important to minimize the number of parses and serializations. It is no less important that the pipeline components of the application perform operations on the documents passed through; the efficiency of these operations is an equally important factor in choosing a streaming technique to support this scenario. The documents typical for this scenario are not very large, but are not necessarily small, and they usually contain many tags.
Scenario 5. Documents are stored in a persistent store. They are compressed / streamed and stored once, but need to be retrieved and searched for particular data multiple times. Thus, the size of compressed documents, the efficiency of decompression, and the efficiency of searches become equally important. Supporting this scenario requires, besides compression, multiple techniques such as selective indexing, data access using groups of precompiled XPath expressions, etc.
Scenario 6. This scenario applies to the processing of very large documents or to the processing of many documents in a multi-threaded environment. There are two main kinds of processing where incremental processing is important: incremental transfer, and lazy (on demand) construction of an in-memory representation, such as DOM, of XML documents.
Based on the scenarios above, Binary XML must be suitable for the following applications:
o Efficiently support a binary format for schema-aware documents and be friendly to schema evolutions that maintain the validity of documents
o Efficiently support XML document transmission and distribution
o Efficiently support certain operations in the persistent store and in memory. These operations should include querying, updating, indexing, access to fragments, and fragment extraction
o Efficiently support XML schema datatypes. Transferring data in its native binary format is typically more compact; e.g., <xsd:hexBinary> data as binary bytes is half the size of the corresponding textual hexadecimal representation
o Efficiently support incremental processing where both partial loading and partial transmission are included
o Efficiently support streaming evaluation of XPath expressions
o Binary XML should not require changes to existing application layers (only the underlying layers need to be changed). The impact on developers should be minimal
o Any Binary XML standard development should be in coordination with other W3C activities in regard to binary formats
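To make the datatype requirement above concrete, the following sketch (plain Python, not tied to any particular binary XML format) shows the factor-of-two saving for xsd:hexBinary data:

```python
import binascii

# 16 raw bytes of binary data, as an application would hold them natively
raw = bytes(range(16))

# The textual xsd:hexBinary form, as it would appear in an XML document
hex_text = binascii.hexlify(raw).decode("ascii")

# The native binary form is exactly half the size of the hexadecimal text
print(len(raw), len(hex_text))  # 16 32
```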
Binary XML formats have been the subject of multiple projects, investigations, and research prototypes. The different approaches developed can be categorized by the groups of techniques they use. The following techniques are used by the approaches described in this section:
o Redundancy elimination through the use of online or offline dictionaries. Most of the approaches below use this technique.
o Use of XML schema information for schema-aware documents [2], [5].
o Separation of tags, structural information, and data into separate buckets that are compacted separately [4], [9].
o Splitting XML documents in pieces (sections) that can be compacted and reconstructed separately. This technique supports incremental processing [5], [6], [11].
o Streaming evaluation of XPath expressions [6].
The most commonly used general-purpose compressor for XML is gzip, based on [1] by Ziv, J., and Lempel, A. Gzip works by checking whether the current data has already been recorded in a temporary buffer (dictionary) and, if so, recording the current data as a reference to the previous data. When similar data are compressed together, high compression ratios can be achieved. Gzip is a plain-text compressor and thus cannot take advantage of the XML document structure. So, theoretically, a method that makes use of the special properties of XML documents could have better performance. However, so far, none of the existing XML-conscious methods on the market wins over gzip in all aspects. Some are slower than gzip, while others have lower compression ratios.
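The dictionary behavior described above can be observed directly with Python's zlib module, which implements the same DEFLATE algorithm as gzip; the document sizes and content here are illustrative only:

```python
import zlib

# A small XML message, typical of request/response messaging
small = b"<order><id>42</id><qty>7</qty></order>"

# A large document with many repeated tags and values
large = b"<orders>" + b"<order><id>42</id><qty>7</qty></order>" * 1000 + b"</orders>"

def ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)

# Small messages barely compress: the dictionary has almost no history
# to reference. Large repetitive documents compress extremely well.
print(f"small: {ratio(small):.2f}, large: {ratio(large):.3f}")
```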
An alternative approach proposed by N. Jasper Larsson and Alistair Moffat [3] is to infer a complete dictionary offline to optimize the choice of phrases so as to maximize compression performance.
XMLZip is a compressor and decompressor for XML documents based on the W3C DOM. XMLZip is written in Java and produces ordinary pkzip/WinZip zip files [7]. XMLZip first parses XML data with a DOM parser, then breaks the structural tree into multiple components: a root component containing all data up to depth N from the root, and one component for each of the subtrees starting at depth N. The root component is then modified, references to each subtree are added onto the root, and finally components are compressed.
XMLZip allows users to choose the depth at compression time, thus allowing users to select the DOM level at which to compress the XML files. This allows continued use of the DOM API without decreased performance. XMLZip only decompresses the portion of the XML tree that needs to be accessed, allowing applications to access data without uncompressing the whole document, reducing execution time and run-time space and memory usage.
XMLZip reduces the size of an XML file while keeping the DOM API accessible. However, XMLZip can only be run on entire XML file objects, and is thus offline-only. Also, XMLZip's compression ratio is not as good as gzip's when measured over an entire document.
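A minimal sketch of XMLZip-style component splitting, assuming a fixed split depth of 1 and a hypothetical `ref` element for the subtree references (the real tool's file format and APIs differ):

```python
import io
import xml.etree.ElementTree as ET
import zipfile

def xmlzip_like(xml_text):
    """Split the DOM tree at depth 1 (XMLZip lets the user pick the depth):
    each subtree becomes its own zip entry, and the root component keeps
    references to the entries so each subtree can be decompressed lazily,
    only when an application actually accesses it."""
    root = ET.fromstring(xml_text)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, child in enumerate(list(root)):
            zf.writestr(f"component{i}.xml", ET.tostring(child))
            root.remove(child)
            # Hypothetical reference element standing in for XMLZip's bookkeeping
            ET.SubElement(root, "ref", {"entry": f"component{i}.xml"})
        zf.writestr("root.xml", ET.tostring(root))
    return buf.getvalue()

archive = xmlzip_like("<library><book>A</book><book>B</book></library>")
print(zipfile.ZipFile(io.BytesIO(archive)).namelist())
```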
Millau [2] was designed as a schema-aware binary XML compression method. Millau encoding is based on the Binary XML format from WAP [8], which reduces the size of XML documents. Millau achieves its compression performance by using the structural and data type information in XML schemas. The Millau format saves space by representing tags and attributes as tokens rather than strings. A custom parser for processing the format is implemented using DOM and SAX adapted to handle Millau streams, including an extended Binary SAX and Binary DOM API which employ tokens instead of strings, resulting in better performance. This approach can easily be extended to handle incremental processing.
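The token-for-tag idea can be sketched as follows; the token table, event representation, and byte layout here are invented for illustration and are much simpler than Millau's actual encoding:

```python
# Hypothetical token table, as might be derived from a schema
TOKENS = {"order": 1, "id": 2, "qty": 3}

def encode(events):
    """Encode a stream of SAX-like (kind, value) events, replacing tag
    strings with one-byte tokens (a sketch, not the real Millau format)."""
    out = bytearray()
    for kind, value in events:
        if kind == "start":
            out.append(TOKENS[value])           # start tag -> token
        elif kind == "end":
            out.append(0x80 | TOKENS[value])    # high bit marks an end tag
        else:                                   # character data
            data = value.encode("utf-8")
            out.append(0)                       # 0 introduces a text run
            out.append(len(data))               # length-prefixed text
            out += data
    return bytes(out)

events = [("start", "order"), ("start", "id"), ("text", "42"),
          ("end", "id"), ("end", "order")]
print(encode(events).hex())
```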
XML-Xpress [5] is a schema specific XML coder. To implement schema-specific encoding, XML-Xpress uses Schema Model Files (SMF). SMF is generated by an automated off-line routine that reads a schema file (such as an .xsd or .dtd file) and creates parsing models that allow the real-time encoders to operate at the highest speeds. The same SMF is provided to both the encoder and the decoder. The real-time encoder can support multiple schemas as long as it has the SMF for each and knows which one to use for each file. The SMF can also include encoding models developed by analyzing sample data. This is especially helpful in achieving maximum compression performance on smaller files.
XML-Xpress can parse and encode documents one at a time, or if input data is known to arrive more slowly, the program can be configured to accept data as it is received (incremental packet level compression). This prevents the performance-degrading latency caused by waiting for entire large files to arrive.
XMILL [4] is designed to take advantage of any regularities in XML data to improve compression performance. XMILL first parses XML data with a SAX parser, then transforms it by splitting the data into three components: element and attribute symbols, document tree structure, and plain text. XMILL employs a path processor that uses a container expression language called "group by enclosing structure". By this method, text within a certain set of element tags is placed into a container. XMILL is based on the following compression principles:
o Separate structure, tags, and data, and compress them separately.
o Group data items by meaning in separate containers. Data is compressed according to container, to increase the likelihood of structural similarity.
o Use different compressors for each container, because different containers may contain different types of data (names, numbers, etc.). For example, after the data is transformed, the result is compressed using a text-based compression program such as gzip.
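The principles above can be sketched in a few lines. Here the container grouping is simply by element name, a stand-in for XMILL's container expression language, and the structure encoding assumes fewer than 255 distinct tags:

```python
import io
import zlib
import xml.etree.ElementTree as ET

def xmill_like(xml_text):
    """Separate tags, tree structure, and text data, group text into
    per-element-name containers, and compress each container independently
    (a sketch of the XMILL principles, not the real implementation)."""
    tags, structure, containers = [], [], {}
    for event, elem in ET.iterparse(io.StringIO(xml_text), ("start", "end")):
        if event == "start":
            if elem.tag not in tags:
                tags.append(elem.tag)
            structure.append(tags.index(elem.tag))  # start-tag token
        else:
            structure.append(0xFF)                  # end-tag marker
            if elem.text and elem.text.strip():
                containers.setdefault(elem.tag, []).append(elem.text.strip())
    parts = {"tags": zlib.compress("\n".join(tags).encode()),
             "structure": zlib.compress(bytes(structure))}
    for tag, values in containers.items():
        # Similar values land in the same container, so each compresses well
        parts[f"data:{tag}"] = zlib.compress("\n".join(values).encode())
    return parts

parts = xmill_like("<po><item>pen</item><item>ink</item><qty>2</qty></po>")
print(sorted(parts))
```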
The major limitation of XMILL is that it only achieves greater compression ratios than traditional text-compression methods if dealing with data larger than approximately 20,000 bytes [4]. Another limitation of XMILL is that it does not support incremental processing.
The paper [9] presents a compression algorithm for XML documents that conform to a given DTD. The algorithm separates the document's structure from its data, taking advantage of the regular structure of XML elements. To do so, it first obtains the parse tree of the document and prunes from it all the leaf nodes containing text (PCDATA). The data is represented as a sequence of leaf nodes in left-to-right order with a fixed delimiter between them. In the second step, further pruning is applied to the parse tree, maintaining a tree representation of only the structure that needs to be encoded so that the decoder can reconstruct the document when given its DTD. The pruned parse tree does not contain sequence nodes, since they can be deduced from the DTD. This approach can easily be extended to support schema-aware documents. However, it does not support incremental processing.
A research team at the CS department of the University of Washington has developed the XML toolkit [6], using a highly scalable XML stream processor and an XML stream index. The toolkit parses XML data and processes a collection of XPath expressions. Two novel technologies were implemented to realize high-throughput XML data processing: all XPath expressions are converted to a single lazy DFA (a Deterministic Finite Automaton that is constructed lazily), and a Stream IndeX (SIX) allows the tool to skip portions of the XML document that are not needed for the particular set of XPath expressions when data is accessed.
The binary XML format produced by the streamer replaces tags and attributes with tokens (integer values), and recognizes some atomic data types, like integers and floats, representing them in binary. This is similar to some other techniques described above. The interesting part is SIX (Stream IndeX). It is a binary stream consisting of pairs of binary offsets into the document stream. The first offset points to the start of the beginning tag of an element, while the second points to the end of the end tag of the same element. These pairs of offsets are created for elements of interest only; which elements are interesting is determined by the input set of XPath expressions. This is a novel indexing technique that allows the data of interest to be extracted from the stream very efficiently.
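A toy illustration of the SIX idea, using regular expressions in place of the real stream processor and ignoring complications such as nested same-name elements and attributes:

```python
import re

def build_six(doc, interesting):
    """Build a SIX-like index: for each element of interest, record the byte
    offset of its start tag and the offset just past its end tag (a sketch;
    the real SIX is produced by the stream processor, not by regex)."""
    index = []
    for name in interesting:
        for m in re.finditer(rf"<{name}[\s>]", doc):
            start = m.start()
            end_tag = f"</{name}>"
            end = doc.index(end_tag, start) + len(end_tag)
            index.append((start, end))
    return sorted(index)

doc = "<log><skip>...</skip><hit>a</hit><skip>...</skip><hit>b</hit></log>"
six = build_six(doc, ["hit"])

# A consumer jumps straight to the recorded offsets, skipping everything else
fragments = [doc[s:e] for s, e in six]
print(fragments)
```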
The interesting thing about this approach is that it not only supports streaming evaluation of XPath expressions, but can also easily be combined with other approaches to satisfy multiple requirements outlined in this document.
Two different technologies for the binary representation of XML documents have been developed at Oracle.
Oracle XML DB [10] implements a proprietary binary format for storing XML documents conforming to XML Schemas in the Oracle database. The binary format exploits the XML Schema information by eliminating tag overheads and storing data in their native datatype representations. The format for storing schema-based XML documents is similar to the relational row (record) storage in tables. This format is also used for transferring XML documents to and from mid-tier applications. In addition to significant reduction in the size of data, the binary format is also suitable for efficient indexing, queries and updates.
The approach is an online dictionary-based approach, similar to other online dictionary approaches. The compression algorithm is based on tokenizing the XML tags. It defines a strategy based on the assumption that XML documents typically have repeated tags, and that tokenizing these tags will therefore result in good compression. However, the compression depends on the type of the input document: the larger the number of repeating tags and the less the text content, the better the compression. Actual measurements show that the compression ratio for well-formatted documents (tags take 50% to 70% of textual space) is between 20% and 50%.
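The online-dictionary tag tokenization can be sketched as follows; the token and stream representations are invented for illustration, and Oracle's actual wire format differs:

```python
def tokenize_tags(tags):
    """Online-dictionary sketch: the first occurrence of a tag is transmitted
    as its full string together with a newly assigned token; every repeat
    sends only the token, so repeated tags shrink to a single small value."""
    dictionary, out = {}, []
    for tag in tags:
        if tag in dictionary:
            out.append(dictionary[tag])           # repeat: token only
        else:
            dictionary[tag] = len(dictionary)     # assign the next token
            out.append((dictionary[tag], tag))    # first use: token + string
    return out

# Repeated tags in a document dominate, so most entries become bare tokens
stream = tokenize_tags(["row", "name", "name", "row", "name"])
print(stream)
```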
The implementation of the algorithm allows an XML document that is exchanged and streamed over the Internet to be compressed and decompressed without waiting for the whole document to arrive across the wire. This allows the implementation to support incremental processing.
The compressed streams generated from SAX events and from a DOM tree are compatible; that is, the compressed stream generated from SAX events could be used to generate a DOM tree, and vice versa.
The big debate over whether XML binary streaming pays off with regard to document size, as well as the performance of document transfer over the wire, has been going on for at least several years. Many big projects have adopted some binary encoded XML formats. The primary advantages of such formats are space savings and speed of transmission. These advantages are often significant in improving the performance of multiple applications. It is clear that the importance of binary XML formats and binary XML streaming will grow with time.
However, there seems to be no convincing case yet showing that any of these formats can be universally applied to all kinds of XML documents. The question is not whether binary streaming is useful for any given kind of XML document, any XML application language, and any scenario of their use, but whether a standard XML binary streaming approach can be useful for all of them. WML is a good example, because binary streaming works well for WML. Generalizing from it might be difficult and could even lead to false conclusions, because the form and fit is not the same for all functions.
There are certain aspects of XML that have significant advantages in their own right and would be lost with a binary format. Binary formats tend to be proprietary, primarily because of possible application-specific optimizations. This is bad when several independent parties, often each behind a firewall, are involved. There is an enormous advantage, in this case, in using a data format created by a third party, in particular a standards organization. Also, XML has succeeded where other file formats have failed because it is auditable: a mere human can pull up the code and read it, without an intermediate reader who could be at fault.
While the industry and academia are gaining more and more experience in creating binary XML formats that deliver better performance for multiple applications (see [13] for example), there is no evidence yet that these formats are converging in the same direction. The questions are:
o Do we have enough experience to develop a usable standard (or standards)?
o If we do, is it possible to develop a single standard sufficient for all application domains, or do we need to develop several standards?
It is clear that Binary XML is, or will be, a very basic XML technology. Thus, if the consensus is that we are ready to develop a standard (or standards), such development should happen as a part of W3C activities.