Binary XML
Position paper for The W3C Workshop on Binary Interchange
of XML Information Item Sets
Oracle Corporation
Dmitry Lenkov, editor
Contributors: Alex Yiu, Anish Karmarkar, Anjana Manian, Chandra Sivasankaran, Dmitry Lenkov, Eric Sedlar, Julie Basu, K. Karun, Mark Scardina, Ravi Murthy
As the World Wide Web transitions from being merely a medium for browsing to a medium for commerce, web services, and application integration, XML (eXtensible Markup Language) has emerged as the standard language for markup. Applications across the Internet are increasingly adopting XML as the standard for expressing messages, schemas, and data. Consequently, XML is increasingly used as an exchange wire format for Web-based applications. Several problems arise as a result. First of all, with the rapidly increasing volume of XML data being exchanged for information purposes and for conducting business, the bandwidth of networks and other communication channels is being tested to its limit. Traditional data compression algorithms can achieve good compression rates on large text files but are less effective with small files like those typical of some XML messaging applications. Also, traditional compression algorithms do not assume any knowledge of a document's syntactic or semantic structure. In the case of XML documents, such knowledge provides additional opportunities for compression.
In another area of XML applications, XML documents are stored and saved, then searched and retrieved. Besides size and time efficiency in compressing and decompressing the whole document, preserving the document's structural information becomes very important. It allows applications to perform efficient searches and to retrieve parts of documents rather than whole documents. Traditional compression systems do not retain this structural information.
In a related area of XML applications, XML documents are passed from application to application while being manipulated separately in each. This manipulation needs to be efficient. Typically, an application that receives an XML document as input manipulates the XML data either by using DOM APIs to access an in-memory tree representation of the document produced by the parser, or by using SAX events to build its own representation of the document or its parts from the parsing events passed to the application. When DOM is used, the structural information is preserved in the tree. In the case of SAX events, it has to be gathered by the application and preserved in a proprietary way. Current DOM representations of XML documents are, in most cases, quite expensive in size. In addition, as a result of the large size, manipulations that require copying and moving subtrees of a DOM tree are also expensive in performance. Processing XML documents based on SAX events results in proprietary code and data structures that cannot be easily exchanged among applications; it is also error prone.
Thus there is a need for XML compression and streaming systems that efficiently work in the Internet or direct communication context for the application domains described above.
In order to address the needs presented in the introduction, the appropriate use scenarios based on these needs have to be collected, together with the performance characteristics that binary XML processing and streaming technologies should satisfy. By use scenarios we mean examples of practical utilization of XML technologies in which the performance of XML processing and streaming techniques is important. Given these use scenarios, a set of requirements is presented. The majority of the requirements emphasize how a binary XML format should improve the performance of XML processing and streaming.
The performance of binary XML streaming includes the following elements: the compression ratio, speed of compression, speed of decompression, document manipulation efficiency, efficiency of incremental processing, efficiency of querying documents stored in a persistent store, etc.
Another characteristic that affects all elements of performance when binary streaming is applied to an XML document is whether an XML schema validating the document was available in advance. This paper assumes that we are dealing primarily with schema-aware documents.
In this paper, the following use scenarios are taken into consideration.
Scenario 1. XML documents are passed in inter-application communication. Such documents are typically created and sent once, and received once. This scenario is typical for enterprise web services integration platforms that include inter-application and inter-database communications. This type of communication calls for fast compression and decompression, which are equally important, and for a significant compression ratio. Performance is characterized by the total transmission speed. Any streaming technique offered for this scenario needs to be compared against transmission of XML documents without any compression.
Scenario 2. This scenario is similar to scenario 1; however, there are two differences. In this scenario the compression speed is not very important, so only the total speed of transmission and decompression counts. Also, documents in this scenario can be larger than in scenario 1. Another important characteristic of this scenario is that documents are always decompressed into XML textual format, not into any binary representation like DOM, and are sometimes converted into HTML or other textual formats.
Scenario 3. Compression ratio is very important when large documents are transferred over the Internet and stored in persistent storage for a long time. In this case, storage capacity is a constraining factor, and thus compression ratio is the major concern, while the speed of compression and decompression is secondary. Two kinds of documents are typical for this case. Documents of the first kind have a significant amount of data and a rather small number of tags, for example, books. Documents of the other kind have a large number of tags and a large amount of data, for example, business reports and periodic transaction summaries.
Scenario 4. XML documents are passed from application to application in a pipeline-like architecture. Each application in such a scenario performs certain processing of documents. In this case, the efficiency of operations on the document representation and the size of that representation are most important, while compression and decompression speed is not always important. This scenario often leads to parsing XML documents, or serializing their DOM representations, multiple times, which affects the overall performance of the XML application regardless of how good the compression speed is. So it is important to minimize the number of parses and serializations. It is no less important that the pipeline components of the application perform operations on the documents passed through; the efficiency of these operations is an equally important factor in choosing a streaming technique to support this scenario. The documents typical for this scenario are not very large, but are not necessarily small, and they usually contain many tags.
Scenario 5. Documents are stored in a persistent store. They are compressed / streamed and stored once, but need to be retrieved and searched for particular data multiple times. Thus, the size of compressed documents, the efficiency of decompression, and the efficiency of searches become equally important. Supporting this scenario requires, besides compression, multiple techniques such as selective indexing, data access using groups of precompiled XPath expressions, etc.
Scenario 6. This scenario applies to the processing of very large documents or to the processing of many documents in a multi-threaded environment. There are two main kinds of processing where incremental processing is important: incremental transfer, and lazy (on demand) construction of an in-memory representation, such as DOM, of XML documents.
Based on the scenarios above, Binary XML must be suitable for the following applications:
o Efficiently support a binary format for schema-aware documents and be friendly to schema evolutions that maintain the validity of documents
o Efficiently support XML document transmission and distribution
o Efficiently support certain operations in the persistent store and in memory. These operations should include querying, updating, indexing, access to fragments, and fragment extraction
o Efficiently support XML schema datatypes. Transferring data in its native binary format is typically more compact; e.g., <xsd:hexBinary> data as binary bytes is half the size of the corresponding textual hexadecimal representation
o Efficiently support incremental processing where both partial loading and partial transmission are included
o Efficiently support streaming evaluation of XPath expressions
o Binary XML should not require changes to existing application layers (only the underlying layers need to be changed). The impact on developers should be minimal
o Any Binary XML standard development should be in coordination with other W3C activities in regard to binary formats
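To make the datatype requirement above concrete, the following sketch (plain Python, not tied to any particular binary XML format) shows the factor-of-two saving for xsd:hexBinary data:

```python
import binascii

# 16 raw bytes of binary data, as an application would hold them natively
raw = bytes(range(16))

# The textual xsd:hexBinary form, as it would appear in an XML document
hex_text = binascii.hexlify(raw).decode("ascii")

# The native binary form is exactly half the size of the hexadecimal text
print(len(raw), len(hex_text))  # 16 32
```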
Binary XML formats have been the subject of multiple projects, investigations, and research prototypes. The different approaches developed can be categorized by the groups of techniques they use. The following techniques are used by the approaches described in this section:
o Redundancy elimination through the use of online or offline dictionaries. Most of the approaches below use this technique.
o Use of XML schema information for schema-aware documents [2], [5].
o Separation of tags, structural information, and data into separate buckets that are compacted separately [4], [9].
o Splitting XML documents in pieces (sections) that can be compacted and reconstructed separately. This technique supports incremental processing [5], [6], [11].
o Streaming evaluation of XPath expressions [6].
The most commonly used general-purpose compressor for XML is gzip, based on [1] by Ziv, J., and Lempel, A. Gzip works by checking whether the current data has already been recorded in a temporary buffer (dictionary) and, if so, recording the current data as a reference to the previous data. When similar data are compressed together, high compression ratios can be achieved. Gzip is a plain-text compressor and thus cannot take advantage of the XML document structure. So, theoretically, a method that makes use of the special properties of XML documents could have better performance. However, so far, none of the existing XML-conscious methods on the market wins over gzip in all aspects. Some are slower than gzip, while others have lower compression ratios.
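The dictionary behavior described above can be observed directly with Python's zlib module, which implements the same DEFLATE algorithm as gzip; the document sizes and content here are illustrative only:

```python
import zlib

# A small XML message, typical of request/response messaging
small = b"<order><id>42</id><qty>7</qty></order>"

# A large document with many repeated tags and values
large = b"<orders>" + b"<order><id>42</id><qty>7</qty></order>" * 1000 + b"</orders>"

def ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)

# Small messages barely compress: the dictionary has almost no history
# to reference. Large repetitive documents compress extremely well.
print(f"small: {ratio(small):.2f}, large: {ratio(large):.3f}")
```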
An alternative approach proposed by N. Jasper Larsson and Alistair Moffat [3] is to infer a complete dictionary offline to optimize the choice of phrases so as to maximize compression performance.
XMLZip is a compressor and decompressor for XML documents based on the W3C DOM. XMLZip is written in Java and produces ordinary pkzip/WinZip zip files [7]. XMLZip first parses XML data with a DOM parser, then breaks the structural tree into multiple components: a root component containing all data up to depth N from the root, and one component for each of the subtrees starting at depth N. The root component is then modified, references to each subtree are added onto the root, and finally components are compressed.
XMLZip allows users to choose the depth at compression time, thus allowing users to select the DOM level at which to compress the XML files. This allows continued use of the DOM API without decreased performance. XMLZip only decompresses the portion of the XML tree that needs to be accessed, allowing applications to access data without uncompressing the whole document, reducing execution time and run-time space and memory usage.
XMLZip reduces the size of an XML file while keeping the DOM API accessible. However, XMLZip can only be run on entire XML file objects, and is thus offline-only. Also, XMLZip's compression ratio is not as good as gzip's when measured over an entire document.
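A minimal sketch of XMLZip-style component splitting, assuming a fixed split depth of 1 and a hypothetical `ref` element for the subtree references (the real tool's file format and APIs differ):

```python
import io
import xml.etree.ElementTree as ET
import zipfile

def xmlzip_like(xml_text):
    """Split the DOM tree at depth 1 (XMLZip lets the user pick the depth):
    each subtree becomes its own zip entry, and the root component keeps
    references to the entries so each subtree can be decompressed lazily,
    only when an application actually accesses it."""
    root = ET.fromstring(xml_text)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, child in enumerate(list(root)):
            zf.writestr(f"component{i}.xml", ET.tostring(child))
            root.remove(child)
            # Hypothetical reference element standing in for XMLZip's bookkeeping
            ET.SubElement(root, "ref", {"entry": f"component{i}.xml"})
        zf.writestr("root.xml", ET.tostring(root))
    return buf.getvalue()

archive = xmlzip_like("<library><book>A</book><book>B</book></library>")
print(zipfile.ZipFile(io.BytesIO(archive)).namelist())
```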
Millau [2] was designed as a schema-aware binary XML compression method. Millau encoding is based on the Binary XML format from WAP [8], which reduces the size of XML documents. Millau achieves its compression performance by using the structural and data type information in XML schemas. The Millau format saves space by representing tags and attributes as tokens rather than strings. A custom parser for processing the format is implemented using DOM and SAX adapted to handle Millau streams, including an extended Binary SAX and Binary DOM API which employ tokens instead of strings, resulting in better performance. This approach can easily be extended to handle incremental processing.
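The token-for-tag idea can be sketched as follows; the token table, event representation, and byte layout here are invented for illustration and are much simpler than Millau's actual encoding:

```python
# Hypothetical token table, as might be derived from a schema
TOKENS = {"order": 1, "id": 2, "qty": 3}

def encode(events):
    """Encode a stream of SAX-like (kind, value) events, replacing tag
    strings with one-byte tokens (a sketch, not the real Millau format)."""
    out = bytearray()
    for kind, value in events:
        if kind == "start":
            out.append(TOKENS[value])           # start tag -> token
        elif kind == "end":
            out.append(0x80 | TOKENS[value])    # high bit marks an end tag
        else:                                   # character data
            data = value.encode("utf-8")
            out.append(0)                       # 0 introduces a text run
            out.append(len(data))               # length-prefixed text
            out += data
    return bytes(out)

events = [("start", "order"), ("start", "id"), ("text", "42"),
          ("end", "id"), ("end", "order")]
print(encode(events).hex())
```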
XML-Xpress [5] is a schema specific XML coder. To implement schema-specific encoding, XML-Xpress uses Schema Model Files (SMF). SMF is generated by an automated off-line routine that reads a schema file (such as an .xsd or .dtd file) and creates parsing models that allow the real-time encoders to operate at the highest speeds. The same SMF is provided to both the encoder and the decoder. The real-time encoder can support multiple schemas as long as it has the SMF for each and knows which one to use for each file. The SMF can also include encoding models developed by analyzing sample data. This is especially helpful in achieving maximum compression performance on smaller files.
XML-Xpress can parse and encode documents one at a time, or if input data is known to arrive more slowly, the program can be configured to accept data as it is received (incremental packet level compression). This prevents the performance-degrading latency caused by waiting for entire large files to arrive.
XMILL [4] is designed to take advantage of any regularities in XML data to improve compression performance. XMILL first parses XML data with a SAX parser, then transforms it by splitting the data into three components: element and attribute symbols, document tree structure, and plain text. XMILL employs a path processor that uses a container expression language called "group by enclosing structure". By this method, text within a certain set of element tags is placed into a container. XMILL is based on the following compression principles:
o Separate structure, tags, and data, and compress them separately.
o Group data items by meaning in separate containers. Data is compressed according to container, to increase the likelihood of structural similarity.
o Use different compressors for each container, because different containers may contain different types of data (names, numbers, etc.). For example, after the data is transformed, the result is compressed using a text-based compression program such as gzip.
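The principles above can be sketched in a few lines. Here the container grouping is simply by element name, a stand-in for XMILL's container expression language, and the structure encoding assumes fewer than 255 distinct tags:

```python
import io
import zlib
import xml.etree.ElementTree as ET

def xmill_like(xml_text):
    """Separate tags, tree structure, and text data, group text into
    per-element-name containers, and compress each container independently
    (a sketch of the XMILL principles, not the real implementation)."""
    tags, structure, containers = [], [], {}
    for event, elem in ET.iterparse(io.StringIO(xml_text), ("start", "end")):
        if event == "start":
            if elem.tag not in tags:
                tags.append(elem.tag)
            structure.append(tags.index(elem.tag))  # start-tag token
        else:
            structure.append(0xFF)                  # end-tag marker
            if elem.text and elem.text.strip():
                containers.setdefault(elem.tag, []).append(elem.text.strip())
    parts = {"tags": zlib.compress("\n".join(tags).encode()),
             "structure": zlib.compress(bytes(structure))}
    for tag, values in containers.items():
        # Similar values land in the same container, so each compresses well
        parts[f"data:{tag}"] = zlib.compress("\n".join(values).encode())
    return parts

parts = xmill_like("<po><item>pen</item><item>ink</item><qty>2</qty></po>")
print(sorted(parts))
```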
The major limitation of XMILL is that it only achieves greater compression ratios than traditional text-compression methods if dealing with data larger than approximately 20,000 bytes [4]. Another limitation of XMILL is that it does not support incremental processing.
The paper [9] presents a compression algorithm for XML documents that conform to a given DTD. The algorithm separates the document's structure from its data, taking advantage of the regular structure of XML elements. To do so, it first obtains the parse tree of the document and prunes from it all the leaf nodes containing text (PCDATA). The data is represented as a sequence of leaf nodes in left-to-right order with a fixed delimiter between them. In the second step, further pruning is applied to the parse tree, maintaining a tree representation of only the structure that needs to be encoded so that the decoder can reconstruct the document when given its DTD. The pruned parse tree does not contain sequence nodes, since they can be deduced from the DTD. This approach can easily be extended to support schema-aware documents. However, it does not support incremental processing.
A research team at the CS department of the University of Washington has developed the XML toolkit [6], using a highly scalable XML stream processor and an XML stream index. The toolkit parses XML data and processes a collection of XPath expressions. Two novel technologies were implemented to realize high-throughput XML data processing: all XPath expressions are converted to a single lazy DFA (a Deterministic Finite Automaton that is constructed lazily), and a Stream IndeX (SIX) allows the tool to skip portions of the XML document that are not needed for the particular set of XPath expressions when data is accessed.
The binary XML format produced by the streamer replaces tags and attributes with tokens (integer values), and recognizes some atomic data types, like integers and floats, representing them in binary. This is similar to some other techniques described above. The interesting part is SIX (Stream IndeX). It is a binary stream consisting of pairs of binary offsets into the document stream. The first offset points to the start of the beginning tag of an element, while the second points to the end of the end tag of the same element. These pairs of offsets are created for elements of interest only; which elements are interesting is determined by the input set of XPath expressions. This is a novel indexing technique that allows the data of interest to be extracted from the stream very efficiently.
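A toy illustration of the SIX idea, using regular expressions in place of the real stream processor and ignoring complications such as nested same-name elements and attributes:

```python
import re

def build_six(doc, interesting):
    """Build a SIX-like index: for each element of interest, record the byte
    offset of its start tag and the offset just past its end tag (a sketch;
    the real SIX is produced by the stream processor, not by regex)."""
    index = []
    for name in interesting:
        for m in re.finditer(rf"<{name}[\s>]", doc):
            start = m.start()
            end_tag = f"</{name}>"
            end = doc.index(end_tag, start) + len(end_tag)
            index.append((start, end))
    return sorted(index)

doc = "<log><skip>...</skip><hit>a</hit><skip>...</skip><hit>b</hit></log>"
six = build_six(doc, ["hit"])

# A consumer jumps straight to the recorded offsets, skipping everything else
fragments = [doc[s:e] for s, e in six]
print(fragments)
```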
The interesting thing about this approach is that it not only supports streaming evaluation of XPath expressions, but can also easily be combined with other approaches to satisfy multiple requirements outlined in this document.
Two different technologies for the binary representation of XML documents have been developed at Oracle.
Oracle XML DB [10] implements a proprietary binary format for storing XML documents conforming to XML Schemas in the Oracle database. The binary format exploits the XML Schema information by eliminating tag overheads and storing data in their native datatype representations. The format for storing schema-based XML documents is similar to the relational row (record) storage in tables. This format is also used for transferring XML documents to and from mid-tier applications. In addition to significant reduction in the size of data, the binary format is also suitable for efficient indexing, queries and updates.
The approach is an online dictionary-based approach, similar to other online dictionary approaches. The compression algorithm is based on tokenizing the XML tags. It defines a strategy based on the assumption that XML documents typically have repeated tags, and that tokenizing these tags will therefore result in good compression. However, the compression depends on the type of the input document: the larger the number of repeating tags and the less the text content, the better the compression. Actual measurements show that the compression ratio for well-formatted documents (tags take 50% to 70% of textual space) is between 20% and 50%.
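The online-dictionary tag tokenization can be sketched as follows; the token and stream representations are invented for illustration, and Oracle's actual wire format differs:

```python
def tokenize_tags(tags):
    """Online-dictionary sketch: the first occurrence of a tag is transmitted
    as its full string together with a newly assigned token; every repeat
    sends only the token, so repeated tags shrink to a single small value."""
    dictionary, out = {}, []
    for tag in tags:
        if tag in dictionary:
            out.append(dictionary[tag])           # repeat: token only
        else:
            dictionary[tag] = len(dictionary)     # assign the next token
            out.append((dictionary[tag], tag))    # first use: token + string
    return out

# Repeated tags in a document dominate, so most entries become bare tokens
stream = tokenize_tags(["row", "name", "name", "row", "name"])
print(stream)
```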
The implementation of the algorithm allows an XML document that is exchanged and streamed over the Internet to be compressed and decompressed without waiting for the whole document to arrive across the wire. This allows the implementation to support incremental processing.
The compressed streams generated from SAX events and from a DOM tree are compatible; that is, the compressed stream generated from SAX events could be used to generate a DOM tree, and vice versa.
The big debate over whether XML binary streaming pays off with regard to document size, as well as the performance of document transfer over the wire, has been going on for at least several years. Many big projects have adopted some binary encoded XML formats. The primary advantages of such formats are space savings and speed of transmission. These advantages are often significant in improving the performance of multiple applications. It is clear that the importance of binary XML formats and binary XML streaming will grow with time.
However, there seems to be no convincing case yet showing that any of these formats can be universally applied to all kinds of XML documents. The question is not whether binary streaming is useful for any given kind of XML document, any XML application language, and any scenario of their use, but whether a standard XML binary streaming approach can be useful for all of them. WML is a good example, because binary streaming works well for WML. Generalizing from it might be difficult and could even lead to false conclusions, because the form and fit is not the same for all functions.
There are certain aspects of XML that have significant advantages in their own right and would be lost with a binary format. Binary formats tend to be proprietary, primarily because of possible application-specific optimizations. This is bad when several independent parties, often each behind a firewall, are involved. There is an enormous advantage, in this case, in using a data format created by a third party, in particular a standards organization. Also, XML has succeeded where other file formats have failed because it is auditable: a mere human can pull up the code and read it, without an intermediate reader who could be at fault.
While the industry and academia are gaining more and more experience in creating binary XML formats that deliver better performance for multiple applications (see [13] for example), there is no evidence yet that these formats are converging in the same direction. The questions are:
o Do we have enough experience to develop a usable standard (or standards)?
o If we do, is it possible to develop a single standard sufficient for all application domains, or do we need to develop several standards?
It is clear that Binary XML is, or will be, a very basic XML technology. Thus, if the consensus is that we are ready to develop a standard (or standards), such development should happen as a part of W3C activities.