Mike Cokus
Scott Renner
Dan Winkowski
The MITRE Corporation
{msc,sar,winkowski}@mitre.org
US Air Force policy is to use XML for data which flows from one system to another, especially for data that flows from one automated process to another (as opposed to data intended for presentation to humans). Some of these automated data flows will necessarily take place over wireless data links, where bandwidth is and will always be a severe constraint. Many of these exchanges are very small, well under 1000 bits.
Our position is:
1. We want to use XML in these bandwidth-limited situations -- all of the usual advantages of web services apply here as well.
2. In our application-level data exchanges -- especially in our wireless data link exchanges -- we can assume that data producers and consumers always have the XML schema for every document that is exchanged.
3. By exploiting this shared schema knowledge in compressing the data, and by using an appropriate binary representation, we may keep the advantages of using XML, while reducing the number of bits that must be transmitted to something close to the information-theoretic optimum. We have both theoretical hand-waving and experimental results in support of this belief.
4. We need a single standard for this schema-based compression, such that any producer with the same schema and valid document will always produce the same binary, and anyone with the same schema and binary will always produce the same source document (or infoset). Otherwise, our configuration management problems will explode. If we ever find ourselves worrying that "XYZ may not be able to handle this bitstream, because he may not have the ABC codec", then we're” hosed”.
5. We believe that a hybrid (combining schema-based and redundancy schemes) approach will be required to support a comprehensive method of XML compression. We need a single standard for hybrid compression methods for use as needed. The identification of the compression method used should be part of the encoded form of the XML document.
6.
We do not want a military-only standard. Those are bad. We want to buy
this capability as commercial off-the-shelf (COTS). We believe there are
civilian applications with similar needs. We want a W3C standard.
In 2002, MITRE undertook a study [1] for the US Air Fore to examine how the verbosity of XML may be mitigated using existing technologies. The study focused on a message class consisting of mostly structured content. The message standard was designed originally for efficient binary transmission and a large binary message sample set was available for comparison. MITRE developed an XML W3C schema that was a faithful reflection of the structure of the message standard and following common practice used descriptive XML element names. The element content was defined using XML datatypes to represent the ASCII or numeric expression of the binary data. A gateway was developed to transform the binary sample message set into XML consistent with the schema definition.
The study examined three basic compression approaches, namely redundancy-based compression, schema-based encoding and hybrid techniques. Redundancy-based compression methods take advantage of the repetition of character groups within the source document. Schema-based methods leverage design knowledge of the structure and content of the source document to produce efficient field encodings. Hybrid compression methods apply a combination of redundancy-based and schema-based methods to the same source document. Metrics on file size optimization were collected. Due to the differing maturity of the applications used in this study metrics on compression/decompression speeds are not reportable.
Two redundancy techniques were examined in the study. WinZip [2] is a PC application based on the Info-ZIP specification implementing the Lempel-Ziv (LZ77) and Huffman algorithms. We also implemented a “WBXML-like” encoding that applied WBXML [3] tag tokenization to the XML source documents based on code pages generated from an XML schema rather than a DTD.
Two schema-based compression methods were also selected. These methods work by encoding the file based on design knowledge about the permitted content and structure of the source document. It is analogous to a compilation stage that applies standard encoding rules to the source file to produce content suitable for machine-to-machine processing. In schema-based compression, the full XML tags are not represented in the encoded form but can be reconstructed using design knowledge during the decoding phase. The technique also applies design knowledge to efficiently encode the data contents and places structural markers only when necessary into the file. Additionally, schema-based methods may allow content to be decoded in streaming mode.
The two schema-based techniques investigated were MPEG-7/BiM and Abstract Syntax Notation One (ASN.1). MPEG-7/BiM [4] is an ISO/IEC standard developed by Moving Picture Experts Group (MPEG). BiM (Binary Format for MPEG-7) is MPEG-7's XML schema-based compression standard. Abstract Syntax Notation One [5], another ISO/IEC standard set, is used for defining messages for tree structured information exchange without regard to the implementation. Recent enhancements to the standard include support for XML.
Hybrid compression methods apply both redundancy-based and schema-based compression on the same source document. Specifically, redundancy-based compression is applied after applying schema-based compression to further reduce the size of the compressed format. Three hybrid approaches were selected, XMill, ASN.1 combined with WinZip, and MPEG-7/BiM combined with WinZip.
XMill [6], an open source research prototype developed by University of Pennsylvania and AT&T Labs, builds on the gzip library to provide an XML-specific compressor. The path encoding options support selective application of type-specific (i.e., based on design knowledge of the source document) encoders to data content. The data and content of the source document are compressed separately using redundancy compression. Since XMill was the only native hybrid application selected for this study, the experiment included tests of each schema-based method followed by redundancy compression to provide a basis for comparison.
Tests were run using two sample sets of messages. The first, "Test Set A", did not have associated an equivalent binary sample due to a technical error. The second, "Test Set B", were produce directly from the binary sample set and therefore results include comparisons to the binary messages. Test Set A, being derived from the same message standard, is similar in structure and content to Test Set B and was included only because resources did not permit application of the ASN.1 techniques to Test Set B.
The message collections were grouped into files consisting of 6853, 500, 200, and 8 messages. The XML translated messages compared to the corresponding binary formats exhibited a ratio of 17:1 in size.
|
Test |
Description |
Byte Size |
Ratio vs. Binary |
Ratio vs. XML |
|
|
|
|
|
|
|
Test Set A XML |
No Equivalent Binary Messages |
|
|
|
|
|
6853 messages |
9,878,662 |
|
|
|
|
8 messages |
6,010 |
|
|
|
|
|
|
|
|
|
Test Set B XML |
Equivalent Binary Messages Available for comparison |
|
|
|
|
|
6853 messages |
11,421,822 |
1694% |
|
|
|
500 messages |
855,297 |
1687% |
|
|
|
200 messages |
339,629 |
1699% |
|
|
|
8 messages |
12,209 |
1598% |
|
|
|
|
|
|
|
|
Test Set A Results |
|
|
|
|
|
|
|
|
|
|
|
WinZip Results for Test Set A with Max Compression |
|
|
|
|
|
|
6853 messages WinZip (Max) |
319,303 |
|
3.23% |
|
|
8 messages WinZip (Max) |
669 |
|
11.13% |
|
|
|
|
|
|
|
WBXML-like Results for Test Set A |
|
|
|
|
|
|
6853 messages, XSD-WBXML |
4,148,070 |
|
41.99% |
|
|
8 messages XSD-WBXML |
2,567 |
|
42.71% |
|
|
|
|
|
|
|
XMill Results for Test Set A |
|
|
|
|
|
|
6853 messages, XMill Default |
116,000 |
|
1.17% |
|
|
6853 messages, XMill, plus path encoder |
89,136 |
|
0.90% |
|
|
|
|
|
|
|
|
8 messages, XMill Default |
578 |
|
9.62% |
|
|
8 messages, XMill, plus path encoder |
1,152 |
|
19.17% |
|
|
|
|
|
|
|
MPEG-7 Results for Test Set A |
|
|
|
|
|
|
6853 messages, BiM (MPEG-7) |
783,403 |
|
7.93% |
|
|
8 messages, BiM (MPEG-7) |
181 |
|
3.01% |
|
|
|
|
|
|
|
ASN.1 (PER) Results for Test Set A |
|
|
|
|
|
|
6853 Messages, PER Aligned |
827983 |
|
8.39% |
|
|
8 Messages, PER Aligned |
58 |
|
0.98% |
|
|
|
|
|
|
|
ASN.1 (PER)/WinzZip (Max) Results for Test Set A |
|
|
|
|
|
|
6853 Messages, PER Aligned, WinZip (Max) |
236,488 |
|
2.40% |
|
|
8 Messages, PER Aligned, WinZip (Max) |
154 |
|
2.61% |
|
|
|
|
|
|
|
MPEG-7/WinzZip (Max) Results for Test Set A |
|
|
|
|
|
|
6853 messages, BiM (MPEG-7), WinZip Max |
231,762 |
|
2.35% |
|
|
8 messages, BiM (MPEG-7), WinZip Max |
181 |
|
3.01% |
|
|
|
|
|
|
|
|
|
|
|
|
|
Test Set B Results |
|
|
|