Theory, Benefits and Requirements for Efficient Encoding of XML Documents
John C. Schneider
In 1995, our Chief Technologist invented a technology called Knowledge Based Compression (KBC) designed to drastically reduce the bandwidth requirements of structured text messages shared between automated systems . To our knowledge, this was the first data compression technique to leverage shared meta-data (i.e. a schema) to produce highly efficient encodings of structured text messages.
KBC was based on the principals of information theory pioneered by Claude Shannon  and informed by empirical data from one of the world’s largest and most complicated data sharing environments (see  for additional information). It was designed to enable the U.S. DoD and its Allies to share timely, accurate, complete information between hundreds of fixed and mobile automated systems developed independently by different organizations, evolving on different schedules, funded by different budgets and driven by different priorities.
With the advent of XML, it was obvious this technology could be applied to benefit a far broader range of users . Our preliminary findings showed that compression rates of 99% (50 times smaller) could be achieved on XML data (depending on the schema). Unlike conventional compression techniques that depend on character redundancy, schema-aware encodings worked equally well on large messages and high frequency streams of small messages typical of web services and mobile devices. In addition, they could be read and written directly from an appropriately modified XML Information Set  implementation (e.g., DOM, SAX, pull-parser), eliminating the need for a separate compression step.
However, despite a few early contributions with initial support for schema-aware compression   , it was relatively clear the mainstream market was not yet ready to take this leap . Recently, however, there has been a marked increase of activity in this area with broader initiatives under development, like the emerging MPEG-7 BiM standard   and the XML work of the ASN.1 community  .
With this increased interest and activity comes the risk that several non-interoperable XML encodings will evolve independently. Given this risk, it is clear that now is the right time to establish a single global standard for efficiently encoding XML data. As the custodian of XML, the W3C is the right organization to take on this task.
Given our background in this area, we were selected by the DoD to devise an efficient XML encoding that will satisfy their requirements. While the initiatives cited above are a fantastic start, they each lack some important architectural components and some include features that add unnecessary overhead. We are very interested in collaborating with the W3C member organizations to share our knowledge and expertise toward the development of a single, global standard for efficient XML encoding.
AgileDelta has devised uniquely efficient methods for deploying compelling web service applications to almost any programmable device (including Java and Microsoft devices). This technology enables us to implement the full web services stack and an intuitive data binding layer in as little as 7 Kbytes. One of our principal requirements for an efficient XML encoding is that it can be implemented on similarly constrained environments without increasing this code footprint by a significant percentage.
The remainder of this position paper provides background and outlines some of our additional rational and requirements for an efficient XML encoding mechanism.
The field of Information Theory pioneered by Claude Shannon provides a formal mathematical foundation for quantifying the minimum amount of data required to represent a piece of information . Information Theory defines a bit as the amount of data required to differentiate between two equally likely alternatives and states that log2(n) bits of data are required to differentiate between n equally likely alternatives. However, when the alternatives are not equally likely, fewer bits may be used for higher probability alternatives resulting in a more efficient encoding of the information. Specifically, log2(1/ pI) bits are required to represent an alternative with probability pI. It is this principal on which compression schemes employed by popular programs, such as WinZip, are based.
From this principal, Information Theory concludes the amount of data required to represent a piece of information is a function of the amount of knowledge the receiver has about the nature of the data. Specifically, the more knowledge the receiver has about the probable messages being sent, the less data is required to transmit those messages. Paul Revere took advantage of this principal when he interpreted the number of lights in the tower of the Old North Church within the famous context: “one if by land, two if by sea.” This knowledge enabled him to receive two important messages, “the British are coming” and “they are coming by land” encoded in only two bits.
In the context of XML, document recipients may access a wealth of knowledge about the nature of the data being sent. The XML Schema Language   is being widely deployed by industry leaders for validating XML files, enabling content aware XML editors and describing web services APIs. There are simple, inexpensive tools available from several companies that make it easy to develop XML Schemas, including tools that automatically deduce XML Schemas from class definitions or a representative set of XML documents. XML Schemas are appearing for all major XML domains of discourse, including horizontal domains (e.g., XHTML  , SOAP  and SVG ) and many vertical domains  .
XML documents may identify one or more Schemas providing detailed information about the elements, attributes and data types found in the document. In most cases, the set of XML document types processed by an application are known well in advance (e.g., at application development time). Therefore, the appropriate XML Schemas can be deployed with the application (i.e., pre-cached).
In cases where the appropriate XML Schemas are not known in advance or cannot be deployed with the application, they may be downloaded dynamically, for example from a schema repository, and cached for later use.
Most of the mainstream schema-aware encoding proposals either do not address compression of textual content  or separate the compression process into two separate phases   . In one phase, they encode the textual content of the XML data using traditional techniques derived from information theory and in the other phase they encode the structure of the XML data based on the schema without applying the principals of information theory to the resulting value space. We believe the principals of information theory apply equally to both phases and can be applied to unify them and achieve better results more efficiently. Researchers have found that techniques that combined these two phases generally produce better results . In addition, we believe the schema should inform the encoding process without unnecessarily constraining the XML data that can be encoded or losing certain aspects of the original document (see sections 4.1 and 4.6).
This section provides a brief overview of some of the benefits we expect from an efficient XML encoding.
A more efficient XML encoding requires less primary and secondary storage. This is particularly important for small mobile devices and embedded systems where primary and secondary storage are at a premium. In some cases, the size of the data may determine which devices may effectively use that data.
Probably the most obvious benefit of a more efficient XML encoding is reduced bandwidth requirements. Reducing bandwidth requirements may or may not increase the visible performance of a particular network-bound application (e.g., due to network latency); however, it will make more efficient use of available communications resources allowing more users to share those resource effectively. Consequently, those purchasing bandwidth directly or indirectly will achieve a better cost / performance ratio. Consumers and enterprises using the ever increasing number of mobile devices connected to always-on, packet-switched networks that charge for bandwidth by the kilo-byte will realize this savings directly on their monthly bills.
Engineers using efficient XML encodings have reported up to an order of magnitude performance increase over processing the XML text encoding . There are several ways in which a properly designed XML encoding can improve performance. First, the encoding may be read and written directly from an information set implementation (e.g., DOM, SAX, pull-parser) without a separate compression step. Since the encoding must be produced from an internal information set representation, the decoder can depend on certain invariants that allow it to avoid steps such as testing for well-formedness or schema validity. In addition, the encoding can avoid costly string conversion functions such as atoi() by avoiding storing non-string values as strings.
Second, smaller files require fewer CPU cycles to read and process. For example, consider reading data from a file. The computer must read and process each byte of data. The processor repeats a set of instructions for each byte of the data read. Reading twenty times more data requires twenty times more repetitions of these instructions.
Third, space efficiency has indirect impacts on performance that are more subtle. Modern computing devices employ a multi-level memory architecture typically consisting of an internal cache on the CPU, a level two cache on the CPU board, a main memory on the motherboard and a swap device on secondary storage (listed in order of increasing size and decreasing speed). During the execution of any piece of software, portions of its data will be scattered across all of these memory devices. The most recently accessed portions of the data are stored in the smallest, fastest memory devices assuming they are most likely to be reused. This assumption is more likely to be true for smaller data representations since a larger percentage of the entire dataset resides in the fastest memory devices at any given time. Very small data representations can fit entirely in the CPU’s internal cache, resulting in extremely fast processing times. On the other hand, very large pieces of data must be stored on the hard disk and repeatedly swapped in and out of main memory, resulting in an extremely inefficient process called thrashing. This problem is not limited to machines with small memories, but is particularly noticeable on large servers, on which a large number of processes and threads compete for the same limited memory resources. Using a smaller data representation can allow more processes and threads to effectively share the resources of the same machine. This increases server scalability and decreases the cost / performance ratio.
When using devices that are continuously connected to a power source it is easy to forget that work takes energy. However, each instruction executed by the processor, each byte of data transmitted or received over the radio and each byte of data read or written to secondary storage requires energy. For mobile devices, efficient use of resources translates directly to longer batter life. Because a smaller data representation makes more efficient use of available processing, storage and communication resources, it requires less energy and extends battery life.
In environments where efficiency is critical, such as in gaming, mobile, embedded and real-time systems, support for the XML text syntax is often not desirable or even possible. Therefore, these systems generally implement a more efficient and often proprietary data encoding. Consequently, it is not possible for these systems to access the wealth of information available in XML format or provide information to systems that speak XML without an expensive series of proprietary gateways and the associated risk that information fidelity and accuracy will be lost.
Researchers have found that compressed or encoded XML messages are competitive in size and often smaller than hand coded binary formats  making them an attractive alternative to proprietary formats. Providing a more efficient XML encoding will enable efficient systems to access and share information seamlessly with a wide range of XML systems. Per Metcalfe’s Law, the value of information exposed as XML will increase exponentially with the number of systems able to access it.
As discussed above, providing a more efficient XML encoding will enable a broader range of systems to use XML. In addition to interoperability benefits, this enables those systems to leverage the broad range of available XML technologies, development tools, software libraries, and developer expertise, which eliminates the time and cost associated with developing technologies, tools, libraries and expertise for a proprietary or less popular format from scratch. In addition, the quality, cost and tempo of evolution for XML technologies is driven by a competitive mass-market meaning high quality, rapidly evolving technologies can be acquired inexpensively and are often free.
Providing a more efficient XML encoding will enable a broader range of systems to use XML, including gaming, mobile, embedded and real-time systems. The size of the market for XML solutions will increase as the number of systems able to efficiently use XML increases, further stimulating the XML ecosystem.
This section provides a brief overview of some of the requirements we’d like to see satisfied by an efficient XML encoding. These requirements are informed by our technical research and empirical analysis of the characteristics of data exchanged between a diverse array of military systems during several real operational scenarios.
To maximize the benefits to and synergy with the XML eco-system, the efficient XML encoding must be general enough to apply to the broad range of XML applications. For example,
Data and Document Centric Applications: The encoding should work well for both data and document centric applications of XML. In particular, it should work well on documents with a large percentage of text content, support mixed content models and be able to preserve whitespace and XML artifacts like comments and processing instructions. A particularly important use case is XHTML, which will help the technology become a compelling addition to popular web browsers. Popular web browsers will be important for providing a ubiquitous platform for viewing encoded data and building XML applications.
It should be possible to turn support for various document-centric features off on a schema or instance-document basis, such that data-centric applications that do not need them will not be required to incur the associated overhead.
Lossless and Lossy Encoding: The XML encoding should support a lossless encoding of the data for applications that need a high-fidelity representation of the data (e.g., preservation of whitespace, comments and processing instructions). Applications that do not need this level of fidelity should be able to avoid incurring the associated overhead by selecting a lossy encoding mode.
Applications with and without Schemas: We believe the encoding should leverage schema information when it is available, but should not depend on it to operate. It should be possible to provide efficient encodings of untyped data, including entire XML documents or portions of XML documents associated with Schema’s anyType .
Large and Small Documents: The XML data encoding should work equally well on large XML documents, such as database dumps and aircraft maintenance manuals, and small documents such as those typically exchanged by web services or mobile devices. It should not depend solely on character redundancy or shared meta-data (i.e., schemas) to achieve compression. Rather, these techniques should be unified by relating them through their common roots in information theory.
The efficient XML encoding must be directly readable and writable from an appropriately modified information set  implementation (e.g., DOM, SAX, pull-parser). In particular, it should not require a separate compression step after serializing an XML document or a separate decompression step prior to parsing an XML document. Information set implementations should provide a unified information set abstraction to host applications independent of which encoding was used. Applications should be able to request the abstract information set be serialized using a particular encoding where appropriate.
The efficient XML encoding must support streaming. This is required by applications that must generate results responsively before the entire document has been received, read or processed. In addition, it is critical for devices with limited memory capacity, where it may not be possible to store a large portion of the document in memory before processing it.
In particular, the encoding must be formulated such that applications can begin processing a particular information item before information items occurring later in document order have been processed, read or even sent from their source. In addition, applications must be able to decode and examine each information item in the document without remembering information items that occurred earlier in document order.
This requirement is less restrictive than providing full random access to the document.
When bandwidth, processing power and storage capacity are at a premium, applications must be able to send and receive specific portions of an XML document on demand as they are needed instead of transferring the entire document. If a client requires only a small subset of the information items in a large document, transmitting the entire document is highly inefficient and obviates the benefits of the efficient encoding. Therefore the encoding must be capable of representing arbitrary collections of information items extracted from a document.
The efficient XML encoding will likely provide the largest benefits to devices with limited processing and memory resources. Therefore, it is important that the algorithms required to support the encoding have low space and time complexities. In addition, the implementation of these algorithms should be very small.
As mentioned above, AgileDelta has devised uniquely efficient methods for implementing the full web services stack and an intuitive data binding layer in as little as 7K. Adding a standard, efficient XML encoding to this framework should not increase this footprint by a significant percentage.
The efficient XML encoding should be informed by schema information when it is available, but it should not prohibit applications from intentionally encoding information sets that do not conform to the given schema. In particular, the encoding should be capable of representing instances of XML documents for which the Post Schema Validated Information set (PSVI) contain information items with the [Validity] property set to invalid . Such document instances should require more bits to encode due to their decreased probability of occurrence, but should not be prohibited. At the same time, the encoding should not require documents known to be schema-valid to incur any overhead for encoding information items that are schema-invalid (other than possibly a Boolean value indicating whether the document is known to be schema-valid or schema-invalid).
There has been much debate over the usefulness of transmitting or processing documents that are schema-invalid. In general, machines are not capable of inferring the meaning of information items they were not explicitly programmed to understand (the semantic web  is still a distant point on the horizon). However, although we prefer documents to be schema-valid, we have found several real-world scenarios where requiring strict schema-validity is at best disruptive and at worst disabling.
For example, it is sometimes necessary to store or transmit a document in an incomplete state. This occurs frequently with documents that are constructed collaboratively by several cooperating systems or organizations. Before the document is completed, it must be communicated between organizations in a schema-invalid form. Systems should not have to fall-back to the text encoding of an XML document to achieve this goal.
As another example, information exchange requirements often evolve faster than the standards (and the associated schemas) that capture them. One might argue the schema should be flexible enough to account for any possible contingency; however, this is not always possible and the complexity that results from the associated layers of abstraction is often undesirable. Therefore, in practice we’ve found that users will intentionally override validation results when they need to send information not captured in the schema. If the system does not allow them to override the validation results, they will find another way to send the needed information, for example by embedding it in a comment. Unfortunately, this ad-hoc form of capturing deviations from the schema is generally ignored by automated systems. Therefore, we’ve found it is much better to formalize a mechanism for articulating intentional deviations from the schema that can be understood and handled appropriately by receiving systems.
We’ve also found that there are several useful things an application can do with schema-invalid information items. Each application receiving schema-invalid information items may implement a different policy according to their processing requirements. For example, an application may:
Generate an error: Depending on the processing needs of the application, it may have no other option than to generate an error. Higher level application logic may be able to recover from the error and continue processing or may indicate a failure to the user or originating system via a user interface, log or message.
Ask for human intervention: The application may request assistance from a human user. The user may handle the schema deviation manually or establish a rule indicating how the application should treat similar deviations in the future.
Application specific error correction: The application may be programmed to “correct” certain classes of error automatically with no human intervention.
Ignore them: The schema-invalid information items may occur in a section of the document that does not effect the information the application needs to complete its task. Therefore, the application may safely ignore the schema-invalid items and complete its task successfully.
Pass them through: The application may pass the schema-invalid information items through to another application for handling.
Provide them to the end user: In some cases, the schema-invalid information is ultimately presented to a human, who will likely understand the deviation with no problems. Effectively, portions of the document may be used for structured human-to-human communications. The client application knows best which data fall into this category.
Ultimately, it is the receiving application that knows best how the information items in the document will be processed and how to handle deviations from the schema. XML applications today have both the responsibility and the freedom to define and enforce their own policies for handling schema-invalid documents. This ability should not be taken away when using a more efficient XML encoding. The same logic used to handle schema-invalid documents encoded as characters, should be available for processing schema-invalid documents encoded as bits.
This is not to say that client application must perform schema validation on XML encoded documents. On the contrary the encoding can communicate the results of validation performed by the encoder. The client simply reads this information and applies the appropriate error policies.
While the W3C is determining the need for a second standard encoding for XML, it may be worthwhile to invest some time into defining a standard API for an extensible encoder architecture, whereby custom CODECs could be added to XML information set implementations. There are hundreds of legacy data formats that could benefit from plugging into the XML ecosystem. A standard API that allows developers to plug in custom CODECs for creating an XML information set from a custom format or serializing an information set in a custom format would enable more users to take advantage of the rich set of XML technologies, tools, libraries and development expertise. At the same time, it would lower barriers to XML adoption, expanding the XML community and associated market for XML solutions.
Some of the proposed efficient XML encoding methods specify features we consider out of scope for a W3C standard. Some of these are features that could easily be added on top of an efficient XML encoding.
The data format should not incur any additional overhead to support random access to the data over and above that provided by the ability to encode query results (see section 4.4). This matches the current level of random access available to XML application today and should be sufficient.
Dynamic updates should be out of scope for the efficient XML encoding. The ability to incrementally receive, reconstruct and update documents on a client can easily be added on top of an appropriate XML encoding in environments where it is needed. Applications that don’t need this capability should not incur the additional overhead associated with it.
Versioning of XML schemas is going to be critical for XML as the number of systems exchanging XML data increases and those individual systems begin to evolve independently. Currently, the XML standards do not define a mechanism for expressing the version relationship between two or more XML schemas. For example, there is no widely accepted way to express that schema A is a later version of schema B or that schema A is compatible with schema B (i.e., valid instances of schema A are also valid instances of schema B and have the same meaning). Consequently, achieving continuous information interoperability over time between a diverse set of systems maintained by differing organizations according to different schedules and different priorities will be at best expensive and at worst impossible.
That said, we do not believe the XML encoding is the right place to begin addressing this problem. We believe the W3C should take a holistic view of the versioning problem and addresses it across the spectrum of XML technologies so the solution is not narrowly applicable to the efficient XML encoding.
In the interim, applications must continue to address this problem outside the scope of XML standards. The efficient XML encoding should not incur additional overhead to address it at this time.
Mixing of text and binary encodings is not a requirement for us and we do not believe the efficient XML encoding should incur any additional overhead to support this feature.
Given the recent interest and activity around efficient XML encodings, we believe now it is the right time to initiate a W3C working group to investigate the development of a single, global standard for efficiently encoding XML data. We believe this activity will result in several benefits to the current and future XML community. We are interested in sharing our history, expertise and ongoing research and development in this area with W3C members to expedite the development of this standard.
 O. Avaro and P. Salembier. MPEG-7 Systems: Overview. IEEE Transactions on Circuits and Systems for Video Technology, Vol II, No. 6, June 2001.
 R. Cherinka, J. Ricci and J. Schneider. Applying Knowledge Based Compression to Reduce DoD Bandwidth Requirements. MITRE WN 96B0000139, July 1996
 M. Cokus and D. Winkowski. XML Sizing and Compression Study for Military Wireless Data. Proceedings of XML 2002 Conference. December 2002. Available as http://www.idealliance.org/papers/xml02/dx_xml02/html/abstract/06-02-04.html.
 M. Girardot and N. Sundaresan. Millau: An Encoding Format for Efficient Representation and Exchange of XML over the Web. Proceedings of the 9th International World Wide Web Conference, May 2000. Available as http://www9.org/w9cdrom/154/154.html.
 ISO/IEC JTC1/SC29/WG11. Information Technology – Multimedia Content Description Interface – Part 1: Systems. Working draft of Text of ISO/IEC 15938-1/PDAM1. December, 2002.
 M. A. Malloy and J. C. Schneider. Experiences Designing Query Languages for Hierarchically Structured Text Documents. Proceedings of the W3C Workshop on XML Query Languages, December 1998. Available as http://www.w3.org/TandS/QL/QL98/pp/mitre.html.
 P. Sandoz and S. Pericas-Geertsen. Fast Web Services. Proceedings of JavaOne 2003 Conference, June 2003. Available as http://servlet.java.sun.com/javaone/sf2003/conf/sessions/display-3752.en.jsp.
 C. E. Shannon. A mathematica
theory of communication. Bell System Technical Journal, 27:379-423 and
623-656, July and October 1948.
 N. Sundaresan and R. Moussa, Algorithms and Programming Models for Efficient Representation of XML for Internet Applications. Proceedings of 10th International World Wide Web Conference, May 2001. Available as http://www10.org/cdrom/papers/542/.
John Schneider is Chief Technologist at AgileDelta. He has a great enthusiasm for XML, Web services, and the future architectures they enable for sharing information and business logic. He invented one of the first schema based compression schemes for structured text messages and is currently leading the development of efficient XML encoding technologies for the U.S. DoD. He is also leading the ECMAScript for XML (E4X) initiative, creating the first mainstream programming language with native support for XML. He contributed to the design of the XML Query and XML Schema languages and was selected to lead several international technology initiatives related to XML, messaging and data interoperability representing the United States. He has B.S. and M.S. degrees in Computer Science and has been developing new technologies for over 20 years.