Software AG Position on Binary XML

Michael Champion

Trevor Ford

8 August, 2003

This document reflects the authors' best understanding of the interests and experience of Software AG in the "Binary XML" debate, not an authoritative statement of an official company position.

Overview

Software AG develops "native XML" DBMS and middleware infrastructure software, and application-level solutions based on them. Over its nearly 35-year history, the company has found its best market in organizations that require high-performance, "industrial strength" data processing, so the issue of optimizing XML processing lies at the intersection of our historical focus on speed and our more recent focus on XML standards. Our developers and field support people have faced significant challenges in optimizing the performance of the XML processing components of our products, and we welcome this W3C initiative to share thoughts on the causes of the problems and on possibilities for common solutions.

The typical concerns about the inefficiency of XML that we and our customers encounter fall into three main categories: the raw size of XML documents and messages, the cost of XSLT transformation, and the cost of parsing XML text into application data structures.

The Call for Participation notes: "The purpose of the Workshop, then, is to study methods to compress XML documents". This reflects only the first of the issues, and probably the one that is least salient to us. So, while we would welcome new standards and technologies that offered good compression with less processing overhead, we shall assume a somewhat broader scope for the term "binary XML" in this position paper. Efficient processing of XML data (broadly defined to include Infoset-oriented as well as syntax-oriented approaches) is a much more serious short-term issue for our developers and customers than the raw size of XML documents or messages. The biggest processing bottleneck in XML applications is probably XSLT, but that is clearly outside of the scope of this workshop. The other major bottleneck is the parsing of XML documents into application data structures or DOM/JDOM trees.
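
To put a face on the parsing/tree-building cost, here is a minimal sketch of the kind of measurement we mean; it is ours and not taken from any Software AG product, and it simply times how long a stock JAXP parser takes to turn an XML file (path supplied on the command line) into a DOM tree:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class ParseTiming {
        public static void main(String[] args) throws Exception {
            File input = new File(args[0]);              // path to a sample XML document
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

            long start = System.nanoTime();
            Document doc = builder.parse(input);         // tokenize the text and build the DOM tree
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("Parsed " + input.length() + " bytes into a DOM tree rooted at <"
                + doc.getDocumentElement().getNodeName() + "> in " + elapsedMs + " ms");
        }
    }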

So, we would encourage the W3C to think of the issue here not so much as compressed "binary XML" formats but rather as serializations of the XML Infoset that can be more efficiently parsed into useful software objects than is feasible with XML 1.0 text.

We should address one frequently asked question before continuing: given that the typical computer's processor is under-utilized and that Moore's Law predicts processing speed and memory capacity will double every couple of years, why worry about the speed and efficiency of XML parsing? Even if this is a problem now, it might not be by the time the W3C can standardize some more efficient format. This is certainly a factor that must be carefully considered when weighing the results of this workshop, but it must be balanced against several other considerations.
First, use of XML messages and documents appears poised to grow much faster than hardware processing power over the next few years. Various analysts project that the percentage of data transmitted over the Internet as XML will grow from 1-2% to 50-60% between 2002 and 2006. One should take such forecasts with several grains of salt, but it is indisputable that the processing power of the average computer will not grow by a factor of 20 or 30 in that same period; doubling every couple of years yields only a roughly four-fold increase over four years.
Second, XML is likely to find much of its success in environments -- especially mobile, wireless applications -- that are constrained by other factors such as weight and battery life and will probably not see dramatic increases in CPU power or bandwidth in the near future. Whilst the raw technology in a hybrid phone/PDA/internet appliance might be capable of handling the overhead of XML processing, designers are being asked to accommodate many other design goals. It is not a foregone conclusion that the overhead of XML processing will be insignificant in these environments, even if it becomes less significant in server/infrastructure environments.
Finally, the W3C should not become complacent about the continued success of XML as the substrate for the next generation of enterprise and mobile applications; if XML does not evolve to meet real needs, other technologies may grab some of its powerful ideas, synthesize them with their own, and displace it.

Specific Approaches to Optimization

Several diverse but inter-related ideas fall very loosely under this umbrella. We will try to summarize a wide variety of impressions here, some based on research by Software AG, but much of it gleaned from e-mail discussions and presentations (some received under NDA). Quite frankly, it is difficult to keep track of how we know what we think we know and to keep it properly sanitized and attributed, so we will err on the side of non-specificity. We hope that other contributors to the workshop can provide more in the way of hard numbers and specific citations than we offer below, but trust that this overview will be useful in putting things into a larger context.

Compression

The Tamino XML DBMS technology (to the best of the authors' knowledge) effectively uses off-the-shelf compression algorithms to minimize the size of stored XML instances.
Internal studies have shown a difficult tradeoff when compressing XML data for network communication: on fast networks, the time needed to compress the data can outweigh the reduction in transmission time; on slow networks there can be a net gain, but slow networks often have a processor-constrained device at one end, so the overall value proposition of conventional compression is unclear.
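
As a rough illustration of that tradeoff, the sketch below (ours, not the mechanism Tamino actually uses) compresses an XML instance with a stock gzip stream and compares the time spent compressing against the transmission time saved on an assumed link speed supplied on the command line:

    import java.io.ByteArrayOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.GZIPOutputStream;

    public class CompressionTradeoff {
        public static void main(String[] args) throws Exception {
            byte[] xml = Files.readAllBytes(Paths.get(args[0]));    // the XML instance to compress
            double kbitPerSec = Double.parseDouble(args[1]);        // assumed link speed in kbit/s

            long start = System.nanoTime();
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
                gz.write(xml);                                      // off-the-shelf gzip/DEFLATE
            }
            long compressMs = (System.nanoTime() - start) / 1_000_000;

            // 1 kbit/s moves 0.125 bytes per millisecond, so the bytes saved translate
            // into transmission time saved on the assumed link.
            double bytesPerMs = kbitPerSec / 8.0;
            long savedMs = (long) ((xml.length - compressed.size()) / bytesPerMs);

            System.out.println(xml.length + " -> " + compressed.size() + " bytes; compression took "
                + compressMs + " ms, transmission saving is roughly " + savedMs + " ms");
        }
    }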

Simplification

Though out of scope for this workshop, this is worth mentioning. The main reason that SOAP 1.2 specifies that an "XML infoset of a SOAP message MUST NOT contain a document type declaration information item" is performance: "Doing general entity substitution beyond that mandated by XML 1.0 (e.g. &lt;) implies a degree of buffer management, often data copying, etc. which can be a noticeable burden when going for truly high performance. This performance effect has been reported by workgroup members who are building high performance SOAP implementations." [2]. This suggests that a sine qua non of an alternative Infoset serialization that is optimized for parsing efficiency is that general entity references should either be forbidden altogether or resolved in some sort of a canonicalization step before the "binary" format is created. It also suggests that two of the most contentious issues inside the W3C XML community -- alternative serializations and subsetting -- are somewhat complementary!
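
By way of illustration, a parser can be configured up front to enforce the SOAP 1.2 restriction instead of paying for entity handling. The sketch below uses JAXP; the first feature URI is the Apache Xerces spelling and other parsers may use different names:

    import javax.xml.parsers.DocumentBuilderFactory;

    public class NoDoctypeFactory {
        public static DocumentBuilderFactory newFactory() throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Reject any document type declaration outright, mirroring the SOAP 1.2 rule.
            // This feature URI is the Apache Xerces spelling; other parsers may differ.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // For parsers that do accept a DTD, at least avoid fetching external entities.
            dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
            dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            return dbf;
        }
    }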

We don't know of any concrete evidence that a simplified markup text format (i.e., one still containing element and attribute tags to label content, but using more concise/easy-to-parse markup) would significantly reduce XML processing bottlenecks, but the idea has been kicked around on mailing lists.

ASN.1 and similar approaches

As other position papers will no doubt describe, there is a certain amount of overlap between the ideas in the XML Infoset and alternative serializations for it, and the ideas in the ASN.1 abstract data model and alternate encoding schemes for actual messages. There seems to be much innovative work underway to bridge the XML and ASN.1 worlds [3].

To the best of our understanding, the ASN.1 approach hard-codes a specific schema into a data format, eliminating the "self-describing" [1] feature of XML. This works well when there is a fully standardized schema for some class of messages, but breaks down when there is no a priori agreement on the data format between producers and consumers. Whilst it is true that such agreement exists in some important segments of the XML world (SOAP envelopes and standard headers for reliable messaging, security parameters, etc. come to mind), a schema-bound encoding is not something that we can use as the basis for general-purpose storage or transmission of XML Infosets.
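
To make the contrast concrete, here is a deliberately crude sketch (plain Java I/O, not real ASN.1 BER/PER) in which producer and consumer have the same record layout compiled into both sides. Nothing on the wire names the fields, so a consumer without the agreed schema can recover nothing:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class SchemaBoundSketch {
        // Producer and consumer both hard-code the same record layout:
        // orderId (int), currency (string), amount (double).
        static byte[] encode(int orderId, String currency, double amount) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(orderId);           // no element or attribute names on the wire
            out.writeUTF(currency);
            out.writeDouble(amount);
            out.flush();
            return buf.toByteArray();
        }

        static void decode(byte[] wire) throws IOException {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            // The reader must apply exactly the same layout in the same order;
            // there is nothing "self-describing" in the bytes to fall back on.
            System.out.println("orderId=" + in.readInt()
                + " currency=" + in.readUTF()
                + " amount=" + in.readDouble());
        }

        public static void main(String[] args) throws IOException {
            // Compare with the self-describing text form:
            // <order id="42"><amount currency="USD">19.99</amount></order>
            decode(encode(42, "USD", 19.99));
        }
    }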

Unicode Character and String Optimization

It is our impression -- from several conversations with developers inside and outside Software AG who have looked closely at the matter -- that conversion back and forth between the various text encodings and Unicode code points is a significantly expensive part of XML parsing. This suggests that a "binary" format that stores Unicode characters in a fixed-width "UCS" form could be parsed more quickly than one stored in a variable-width "UTF" form. Of course, this would probably "bloat" the overall size of an XML document further, so it would probably only make sense when high-bandwidth channels are available for communication among the components in an XML application.
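
The cost arises before any markup is even recognized. Here is a minimal sketch of ours that times just the conversion from UTF-8 bytes to the UTF-16 characters a Java parser works with internally:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class TranscodeCost {
        public static void main(String[] args) throws Exception {
            byte[] utf8 = Files.readAllBytes(Paths.get(args[0]));   // an XML document stored as UTF-8

            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
            long start = System.nanoTime();
            // Every byte sequence must be validated and widened to the JVM's internal
            // UTF-16 representation before the parser can even start looking for markup.
            CharBuffer chars = decoder.decode(ByteBuffer.wrap(utf8));
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println(utf8.length + " UTF-8 bytes -> " + chars.length()
                + " UTF-16 code units in " + elapsedMs + " ms, before any markup is recognized");
        }
    }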

Hardware Accelerated Processing

There are a number of vendors who are moving XML processing to the hardware layer -- at the level of boxes, boards, and even chips. This has little to do with "binary XML," but we would note that if this approach is technically successful and widely deployed, it could mitigate the need for more efficient serialization formats. On the other hand, a parse-efficient binary standard could aid interoperability among hardware from different vendors.

Hybrid Approaches

It seems that none of these approaches on its own will meet all the requirements for a more compact and efficient XML serialization format. On the other hand, it may be possible to develop hybrid approaches that combine the strengths and cancel out the weaknesses of the alternatives. For example, hardware-accelerated parsing offers little advantage if all the XML processing has to be done on a specific appliance, but could be more generally useful if the hardware emits a standardized "binary" format that applications based on XPath, XSLT, DOM, etc. can efficiently consume. Or perhaps, with widespread adoption of an ASN.1-based encoding for those parts of a document or message that are in a truly standardized format (e.g., someday, SOAP 1.2 with WS-Security headers), the parts of an Infoset that are most relevant to time-critical components such as routers and firewalls could be processed quickly, while the ultimate payload (in XML text format) is delivered to a less time-critical or general-purpose node in the system.

Finally, it may be possible in principle to combine the performance advantages of a binary format with the interoperability and visibility advantages of a text format by associating with the text itself a binary index that describes its parsed structure. For example, a binary "attachment" to an XML document or message might describe the element hierarchy, with offsets/lengths into the text format used to pick up actual values. (Again, this buys processing performance at the cost of additional "bloat".) Applications that understand the binary index can build an Infoset (or application-specific) representation quickly by using the pre-built index, yet applications that don't can simply rebuild the Infoset structure by parsing the text. This essentially applies to document instances a design pattern already used in XML (and other text) database systems, and it could work in an interoperable fashion if the binary structure component were standardized. (We know of at least one company that has submitted a position paper describing this approach and its performance results in some detail.)
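
A deliberately toy sketch of the idea (ours; it indexes only flat <name>value</name> leaves and ignores attributes, namespaces, CDATA, and so on): the "attachment" is a list of element names with offsets and lengths into the untouched text, from which an index-aware consumer slices values directly, while any other consumer just parses the text as usual:

    import java.util.ArrayList;
    import java.util.List;

    public class OffsetIndexSketch {
        // One index entry: an element name plus the offset and length of its text
        // content within the unmodified XML text.
        record Entry(String name, int offset, int length) {}

        // Toy indexer: records only flat <name>value</name> leaves; a real
        // implementation would capture the full hierarchy during parsing.
        static List<Entry> buildIndex(String xml) {
            List<Entry> entries = new ArrayList<>();
            int pos = 0;
            while ((pos = xml.indexOf('<', pos)) >= 0) {
                int end = xml.indexOf('>', pos);
                String tag = xml.substring(pos + 1, end);
                if (!tag.startsWith("/") && !tag.contains(" ")) {
                    int close = xml.indexOf("</" + tag + ">", end);
                    if (close > end) {
                        String content = xml.substring(end + 1, close);
                        if (content.indexOf('<') < 0) {            // leaf elements only
                            entries.add(new Entry(tag, end + 1, content.length()));
                        }
                    }
                }
                pos = end + 1;
            }
            return entries;
        }

        public static void main(String[] args) {
            String xml = "<order><id>42</id><amount>19.99</amount></order>";
            // An index-aware consumer slices values straight out of the text;
            // any other consumer simply parses the same text in the usual way.
            for (Entry e : buildIndex(xml)) {
                System.out.println(e.name() + " = " + xml.substring(e.offset(), e.offset() + e.length()));
            }
        }
    }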

Conclusion

Let's recapitulate by addressing the specific questions raised in the Call for Participation:

  1. What work has your organization done in this area?

    We have no detailed measurements to share, but we have done extensive profiling of our own implementations to find where they are spending their time, and have evaluated a number of compression and proprietary "binary XML" technologies. The results tend to support the general findings noted above: XML parsing/treebuilding/serialization is a significant bottleneck in real applications, and we have reason to believe that these bottlenecks could be minimized if data were exchanged using a constrained, more easily parseable serialization of the XML Infoset. "gzip"-like compression schemes are useful in some situations and architectural components, but are not the whole answer to the question.

  2. What goals do you believe are most important in this area? (e.g. reducing bandwidth usage; reducing parse time; simplifying APIs or data structures, or other goals)

    Reducing parse time and simplifying APIs.

  3. What sort of documents have you studied the most?

    Not applicable. Since our technology has to handle "documents", "data", and "messages", the more generic the solution, the better.

  4. What sorts of applications did you have in mind?

    High-volume messaging applications are the most obvious use case for more efficient serializations of the XML Infoset. This includes business-level messaging applications such as Web services hubs or specialized processing intermediaries (e.g. XML-aware firewalls), but also the internal processing pipelines of XML infrastructure products such as DBMSs.

  5. If you implemented something, how did you ensure that internationalization and accessibility were not compromised?

    We rely on Unicode, which is the basis both for XML syntax / serialization (character encodings) and for Infoset-based tools (code points). Likewise, accessibility seems to be more an Infoset-level feature of "XML", or something supplied by applications, than something inherent in the XML text serialization.

  6. How does your proposal differ from using gzip on raw XML?

    We currently do use a "zip"-like compression scheme. There are, however, obvious potential advantages to a binary serialization format that achieves compression ratios on real XML data comparable to those of the various "zip" approaches while being less processor-intensive and more streaming-friendly.

  7. Does your solution work with any XML? How is it affected by choice of Schema language? (e.g. W3C XML Schema, DTD, Relax NG)

    Our current approach works with XML that has been "canonicalized" to expand all entity references, not arbitrary XML with entity declarations and references. Any approach that requires a specific schema (irrespective of schema language) to operate would probably not be of much use to us. An exception might be for very widely deployed XML data types such as XHTML, SOAP, or [if the various communities ever come together!] RSS. There may be value in optimizing for those "special cases" if they in fact cover the largest portion of the actual volume of data processed by some customers. In any event, we would tend to favor schema-optimized serialization schemes that are independent of a specific schema language, because there is significant innovation in that area outside the W3C.

  8. How important to you are random access within a document, dynamic update and streaming, and how do you see a binary format as impacting these issues?

    They are important, but -- streaming aside -- they seem independent of a possible binary format, because we see random access and dynamic update happening at the Infoset level (via DOM, XQuery extensions, etc.) rather than at the syntax level.

The bottom line for Software AG is that "XML" is a constellation of technologies, including both the XML 1.x syntax and a range of Infoset-based tools such as XPath, XQuery, and XSLT. The commonality, visibility, and interoperability brought about by the text-based syntax certainly offer significant benefits, but so do the Infoset-based specs -- especially since they can work on data that is never serialized as XML text. If standardized, widely usable serializations of the Infoset, interoperable across platforms and vendors, can be developed that eliminate some of these processing bottlenecks, we see the benefits as outweighing the disadvantages in many of our customers' scenarios.

Finally, we are waiting to hear other positions and experiences before coming to a conclusion about the usefulness of a W3C working group to develop alternative XML serialization standards. We are quite skeptical at this point that a single "binary" format can meet the processing needs of both merely well-formed and schema-based documents, and can reduce both "bloat" and processing time. Moreover, innovation and standardization are somewhat antithetical at this stage in the evolution of XML, and there appear to be many innovative ideas that still need to be explored. Perhaps multiple "binary XML" standards are needed. Or perhaps this Workshop will lead to evidence that these are not incompatible goals. We plan to watch and learn.

Notes:

[1] We are well aware that this is not a great term to describe the fact that XML text/data values have labels associated with them, but it is widely used and much more concise than the alternatives that come to mind. XML text is not "self-describing" in any semantic sense, but its syntax and data model of labelled data values do make it easier -- as compared to technologies such as CSV or ASN.1 -- for human readers and programmers to associate a particular element/attribute value with some implicit, explicit, or hard-coded "meaning."

[2] http://lists.w3.org/Archives/Public/www-tag/2002Dec/0119.html

[3] http://asn1.elibel.tm.fr/xml/