This document describes measurement aspects, methods, caveats, test data, and test scenarios for evaluating the potential benefits of a candidate binary XML format.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is part of a set of documents produced according to the Working Group's charter, in which the Working Group has been determining Use Cases, characterizing the Properties that are required by those Use Cases, and establishing objective, shared Measurements to help judge whether XML 1.x and alternate binary encodings provide the required properties.
The XML Binary Characterization Working Group has ended its work. This document is not expected to become a Recommendation later. It will be maintained as a WG Note.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
2 Relationship to Use Case and Characterization Documents
3 Considerations for Test Suite Development
4 Test Data
5 Property Measurement Methodology
6 Property Measurement - Detailed Analysis
6.1.2 Type & range
6.1.5 Known Tradeoffs
6.2 Processing Efficiency
6.2.1.1 Processing phase definitions
6.2.1.2 Standard APIs vs. abstract operations
6.2.1.3 Incremental Overhead
6.2.1.4 Measurement Considerations
6.2.2 Type & range
6.2.5 Known Tradeoffs
6.3 Accelerated Sequential Access
6.3.2 Type & range
6.3.5 Known Tradeoffs
6.4 Content Type Management
6.4.2 Type & range
6.4.5 Known Tradeoffs
6.5.2 Type & range
6.5.5 Known Tradeoffs
6.6 Efficient Update
6.6.2 Type & range
6.6.5 Known Tradeoffs
6.7 Embedding Support
6.7.2 Type & range
6.7.5 Known Tradeoffs
6.8.2 Type & range
6.8.5 Known Tradeoffs
6.9 Human Readable and Editable
6.9.2 Type & range
6.9.5 Known Tradeoffs
6.10 Integratable into the XML Stack
6.10.2 Type & range
6.10.5 Known Tradeoffs
6.11 No Arbitrary Limits
6.11.2 Type & range
6.11.5 Known Tradeoffs
6.12 Platform Neutrality
6.12.2 Type & range
6.12.5 Known Tradeoffs
6.13 Random Access
6.13.2 Type & range
6.13.5 Known Tradeoffs
6.14 Round Trip Support
6.14.2 Type & range
6.14.5 Known Tradeoffs
6.15.2 Type & range
6.15.5 Known Tradeoffs
6.16 Small Footprint
6.16.2 Type & range
6.16.5 Known Tradeoffs
6.17 Space Efficiency
6.17.2 Type & range
6.17.5 Known Tradeoffs
This document describes measurement aspects, methods, caveats, test data, and test scenarios for evaluating the potential benefits of a candidate binary XML format. This document relies on the XML Binary Characterization Working Group (XBC WG) documents for Use Cases and Properties. The focus of this document is to provide a basis for later comparison rather than reporting of actual measurements of actual implementations. The examined and potential use cases represent existing uses that might benefit from the use of an XML-like format, if it had certain additional properties. This potential expansion of the XML community depends on the existence, identification, and evolution of solutions that cover the broadest problem footprint in the best fashion. The XBC WG Characterization document represents the working group's consensus of required and useful properties. This document discusses how fulfillment of those properties can be precisely evaluated and how combinations of properties are best compared.
A particular format in a particular application situation may need to incorporate design tradeoffs that lower support for a particular property. Unless otherwise noted, the properties are written as positive requirements that are at least desirable.
Measurement of properties relies directly on Use Case needs. These needs are expressed in application-specific terms and context. The definition of properties in the Properties document, unified by common needs among use cases, provides the identification of measurement points, but additional information remains to be captured from the use cases. The primary additional information areas are the operational scenarios, representative test data, and the thresholds at which an aggregate solution might be worth significant adoption. Representation of these areas must be initially approximated and abstracted. The Use Case document details and summarizes the relationship between properties and use cases. The Characterization document represents decisions about thresholds of acceptability and ranking of properties.
In evaluating efficient formats with regard to the properties defined in XBC Properties and further addressed in this document, it is necessary to consider how properties may be related. Property relationships may be affected by the nature of the properties themselves, the environment, or the capabilities of the efficient format instances being evaluated.
Some combinations of properties may be contradictory, especially with respect to certain design strategies. Some solutions may not support certain properties or simultaneous combinations of properties. Certain properties or combinations are comparable, sometimes only in one direction, to other properties. For instance, a lossless encoder can be compared to lossy encoders in an evaluation of efficiency with the option of lossiness, but not vice versa. In addition, a non-schema solution can be compared to schema-based solutions in all modes, but schema-based methods might not be comparable in property combinations that contraindicate schema-based encoding.
These property combinations and application scenario details must be considered when planning test scenarios and when performing valid and useful format comparisons. The discussion below is intended to outline the goals and potential pitfalls of developing test scenarios for conducting detailed format comparisons. Wherever possible, test scenarios should be abstract (i.e., not tied to any one particular use case).
The purpose of abstract test scenarios is to catalog and unify the variability that is needed in realistic test suites to evaluate efficient formats. A test suite that is representative of the use cases must exercise appropriate combinations of this variability. In addition, testing every combination of property presence and weighting is not feasible with limited resources and is not very useful. Developing abstract test scenarios can help decrease the test suite size while maintaining its relevance.
This approach also makes clear any simultaneous need for certain sets of properties. This correlation can be used to create a small set of property profiles that cluster around certain types of problems. A property profile defines a set of properties that are essential or desirable for a particular abstract test scenario. The goal of defining property profiles is to simplify the number of test cases to be developed and applied. Ideally, abstract test scenarios would also include user use case situational descriptions and address data manipulation patterns. These manipulation patterns should include when and how data is created, read, modified, transferred, and disposed of.
It is also useful to employ abstract variability descriptions of application environments. The ranges listed below describe particular aspects of an application environment as they affect processing of data that could be externalized in an efficient format. This is an initial list and should not be considered complete or minimal. At least one use case addresses each range, although not all possible combinations of values are indicated.
*Refers to the process of being created, then transmitted and modified one or more times. An example would be a form that is routed from person to person, with each person filling in data and signing their portion. This pattern is particularly important when speed and security are needed simultaneously.
Appropriate test data is crucial to understanding performance for all considered uses and circumstances. Data can be structure heavy, with many large tags, or data heavy. Data can be more uniform or more random. Data may benefit from generalized or application-specific compression or coding. Good test data simulates a variety of applications and supports broad testing of solutions.
Most format candidates and implementations will have some tunable parameters that affect which options are enabled and to what degree. It is impractical to test every combination of every parameter in such complex systems. To solve this assessment challenge, suitable edge and midpoint values must be chosen and various combinations iterated. Reports based on testing should highlight average, typical, and worst case performance with explanations as needed.
Test data available to the working group is published on the working group public web site at http://www.w3.org/XML/Binary/2005/03/test-data/
The methodology used includes two levels of property support measurement. The first, basic level provides a succinct screening of formats by thresholding properties. The threshold type is either boolean or trinary. A boolean measurement indicates whether the property is supported and is expected to perform better than, or in some cases the same as, XML 1.x. A trinary measurement records whether the format supports the property, does not prevent (DNP) the property, or prevents the property from being implemented. These thresholds are used with the property rankings contained in the Characterization document to determine the relative importance of properties, which supports candidate format decision making.
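The trinary screening described above can be sketched programmatically. In the minimal example below, the property names and the `passes_basic_screening` helper are illustrative, not part of the methodology; the only rule encoded is that a format fails basic screening if any property is prevented.

```python
from enum import Enum

class Trinary(Enum):
    """Trinary threshold values for basic-level screening (names illustrative)."""
    SUPPORTS = "supports"
    DNP = "does not prevent"
    PREVENTS = "prevents"

def passes_basic_screening(measurements):
    # A format fails screening if any measured property is prevented.
    return all(v is not Trinary.PREVENTS for v in measurements.values())

measurements = {"Compactness": Trinary.SUPPORTS, "Random Access": Trinary.DNP}
print(passes_basic_screening(measurements))  # True: nothing is prevented
```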
The second, detailed measurement level for some properties is useful in detailed comparisons of candidate formats to each other. Valid and useful comparison of formats is difficult for binary XML candidates. This difficulty arises because a large array of properties constrains solutions, which must simultaneously operate well on a broad range of data. Detailed measurement of properties naturally falls into different types and ranges of values. Some properties have one or more boolean membership values; others have categorical levels of compliance, or relative or absolute values. The success of fulfilling a property may depend on the data and usage scenario. Certain measurements, such as expected or actual performance of implementations and size of instances, require careful analysis. In most cases design or configuration tradeoffs for one property will affect many others. In some cases, that influence will be strongly correlated. Additionally, a format may be tunable in hinted or automatic ways to favor different property goals. An example of this would be optimizing for speed vs. compactness with various possible ratios of speed and compactness. It is important to note that both compactness and processing efficiency are affected by the method of support for most other properties. Many other properties are only beneficial when they are supported in ways that allow good compactness and processing efficiency.
This detailing of selected properties is in the Property Measurement - Detailed Analysis section.
The detailed property measurements identified by or submitted to the working group are documented below. Detailed property descriptions may have the following descriptive sections:
A number of properties can be measured independent of other properties. The key properties that are at the root of the need for a successful binary XML format are Compactness and Processing Efficiency. These two properties directly depend on nearly every other property in the sense that most of the other properties are interesting mainly when they are supported while also having good compactness and processing efficiency. For example, it is not useful to have a method of random access if it makes instances bigger and slower than just parsing an XML 1.x document.
The Compactness property measurement represents the amount of compression a particular format achieves when encoding data model items. The degree of compactness achieved with a particular format is highly dependent on the input data model items, strategies enabled, and application characteristics. In test scenarios, these characteristics should vary considerably to emulate all important use cases in order to properly measure the compactness property of each competing format. To objectively compare formats for their ability to represent data model items in a compact manner, competing measurements of various formats must be taken using the same scenario.
A possible disadvantage of any compact encoding might be the additional computation required to generate or interpret and use the encoding. There is a tendency, exhibited by many size minimization strategies, for compactness to be inversely proportional to Processing Efficiency. If compactness is absolutely maximized, processing efficiency will decrease in most cases. Note that for many test scenarios, it is possible to improve both compactness and processing efficiency relative to the use of XML 1.x. It is desirable for a format to support the ability to control its compactness based on the need for processing efficiency, available memory, or other properties. For example, if the format is processed on a high-bandwidth server, the algorithm should be able to be tuned to obtain maximum processing efficiency by sacrificing memory efficiency. On the other hand, if the format is processed on low-bandwidth mobile handsets, the algorithm should be able to obtain maximum compactness by sacrificing processing efficiency. A key need is the ability to balance compactness with processing efficiency in a tunable way. Certain strategies, principally frequency-based dynamic analysis such as gzip compression, are more appropriate when size is the overriding concern. Given the constraints of simultaneously minimal size and processing overhead, methods such as tokenization with dictionary tables might be more successful.
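The compactness vs. processing-efficiency tradeoff can be observed directly with a general-purpose coder. The sketch below uses Python's zlib (a Deflate implementation, standing in here for any tunable compressor) on an illustrative repetitive payload; higher levels typically produce smaller output at greater CPU cost.

```python
import time
import zlib

# A repetitive payload standing in for a structure-heavy XML instance.
data = b"<item id='1'><name>widget</name><qty>10</qty></item>" * 2000

for level in (1, 6, 9):  # fast ... default ... maximally compact
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level={level} size={len(compressed)} time={elapsed:.4f}s")
```

On repetitive data like this, even the fastest level shrinks the instance substantially; the tuning question is how much extra CPU the last few percent of compactness is worth.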
Size efficiency, or compactness, concerns the optimization of the storage or transmission resources needed to represent data model items. Several categories of methods are known to be useful. This section reviews the major categories of methods and related topics, providing background for format analysis.
A data object, which is the representation of data model items, consists of three logical components that usually have a physical representation. These are the data, the structural information, and metadata (including typing). For XML 1.x, the structure and metadata are represented by tag syntax and naming while data is mostly present in attribute values and element text. Some strategies for data representation remove some or all structural and metadata representation and place it in external metadata or embedded in code.
There are three categories of methods to reduce the size of a data object or data model items: compression, decimation, and externalization. Competitive formats may make use of one or more methods from each category. Compression is the transformation of data into corresponding data that takes less storage through the removal or reuse of redundant information and more efficient coding of data. Compression is often paired with decimation, the process of eliminating details that are unused or less important than other components of the original data. The combination is called "lossy compression", as opposed to "lossless compression" or just "compression".
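Decimation paired with compression can be illustrated with a small sketch. Here zlib stands in for a generic lossless coder, and the sample values and the two-decimal-place precision cut are arbitrary choices for demonstration; dropping precision before coding is what makes the result lossy.

```python
import random
import zlib

random.seed(0)
values = [random.uniform(0, 1000) for _ in range(5000)]

# Lossless: serialize at full precision, then compress.
lossless = zlib.compress(",".join(repr(v) for v in values).encode())

# Decimated ("lossy"): round to two decimal places before compressing.
lossy = zlib.compress(",".join(f"{v:.2f}" for v in values).encode())

print(len(lossless), len(lossy))  # the decimated stream compresses smaller
```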
Externalization is the process of representing original data model items as an external representation with varying degrees of reuse, together with a data object that relies on that external instance as a source of redundancy. This external information can be considered shared information between a sender and receiver. In some cases, this information is relatively stable, long-term shared information, while in others, the potentially sharable information is ephemeral. Besides the trivial replacement of an object with a reference, there are two main externalization methods: schema-based (usually long-term shared information) and delta-based (usually short-term shared information). A schema-based method relies on a specification for certain aggregate data types, structure, and/or values. Trivially, this could mean sending and receiving code that simply writes and reads values in a certain order with no explicit structure. In this case, the structure and data type metadata is implicitly present in the code. More sophisticated methods rely on interface definition languages (IDLs) or the reuse of validation schemas such as XML Schema for externalization purposes. The use of these structural and metadata specifications may result in code generation and/or the production of metadata for use by an interpretive engine. Schema-based externalization usually has long-term schema reuse characteristics and relies on long-term redundancy. This is compatible with some programming and lifecycle models, but can conflict with some application needs.
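The trivial "structure implicitly present in the code" case can be sketched with Python's struct module. The record layout below (an order id, quantity, and price, all invented for the example) plays the role of the shared knowledge: nothing but raw values travels on the wire, and both ends must agree on the layout in advance.

```python
import struct

# Layout shared in code: two unsigned 32-bit ints and one 32-bit float,
# little-endian. No names, structure, or typing travel with the data.
RECORD = "<IIf"  # 12 bytes per record

def encode(order_id, quantity, price):
    return struct.pack(RECORD, order_id, quantity, price)

def decode(buffer):
    return struct.unpack(RECORD, buffer)

wire = encode(42, 10, 9.5)
print(len(wire), decode(wire))  # 12 (42, 10, 9.5)
```

An equivalent XML 1.x instance would spend far more bytes on tags than on the three values, which is exactly the redundancy that schema-based externalization moves out of the data object.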
When the externalization method relies in a generalized way on representing differences from a template, parent object, or earlier message, it is called a delta. Delta mechanisms can be implemented at a high, logical-operations level or as a low-level byte or slot difference representation. Deltas can be produced by a computational differencing operation or by recording the location of changes as they happen. Deltas take advantage of both long-term and short-term redundancy.
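A low-level computational differencing operation can be sketched with Python's difflib. The two instances and the opcode-based delta format are illustrative; the point is that only the non-matching spans need to be transmitted, and the receiver reconstructs the update from the shared base.

```python
import difflib

base    = "<order><status>pending</status><qty>10</qty></order>"
updated = "<order><status>shipped</status><qty>10</qty></order>"

# Record only the non-equal opcode spans as the delta.
matcher = difflib.SequenceMatcher(a=base, b=updated)
delta = [(op, i1, i2, updated[j1:j2])
         for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

# The receiver rebuilds the updated instance from the base plus the delta.
parts, cursor = [], 0
for op, i1, i2, replacement in delta:
    parts.append(base[cursor:i1])   # copy the unchanged span from the base
    if op != "delete":
        parts.append(replacement)   # splice in the changed span
    cursor = i2
parts.append(base[cursor:])
rebuilt = "".join(parts)
print(rebuilt == updated)  # True
```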
In many cases, compression benefits from processing as much data as possible at the same time rather than considering individual fragments in isolation. This leads to processing models where a bulk compression or decompression step is performed. Generally, this leads to the data being inaccessible to application logic until all of the data, or at least all of the data up to a certain point, is decompressed.
There are numerous methods of compression which rely on different methods of detecting redundancy and representing data. These methods sometimes have data access pattern needs and are generally good at compressing some data while having limited use on other data. Some popular methods include:
See the Compression FAQ for more information on these methods.
For a given input document, this property is measured as a set of values [Tokenization, Schema, Compression (Data Analysis), Compression+Schema, Delta, Lossy]. Each of these values represents the percent smaller the encoded version of the document is from the original, N/S (not supported) if that method is not supported, or N/A (not applicable) if the method doesn't apply. (Lossy and Lossless are defined by the Round Trip Support Measurement.) Tokenization is the use of the format without the benefit of compression, schema-based encoding, deltas, or lossy compression. Schema represents schema-based encoding methods. In some cases, compression and schema-based encoding will be used together. Data analysis for compression, use of schemas, deltas, and lossy compression should be noted as optional when appropriate. Lossy compression quality must be normalized to some quality value, preferably based on an objective measure. An array of values may be necessary to represent important points on a lossiness spectrum.
XML 1.x would measure as follows: [0%, N/S, N/S, N/S, N/S, N/S].
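Computing such a measurement vector is mechanical once the encodings are in hand. In the sketch below, the hypothetical candidate supports only a compression mode (zlib stands in for it), so every other slot is reported as N/S; the test document is invented.

```python
import zlib

def percent_smaller(original, encoded):
    """Percent by which the encoded form is smaller than the original."""
    return f"{(1 - len(encoded) / len(original)) * 100:.1f}%"

document = b"<records>" + b"<r a='1'>text</r>" * 500 + b"</records>"

# Hypothetical candidate with only a compression mode available.
measurement = {
    "Tokenization": "N/S",
    "Schema": "N/S",
    "Compression": percent_smaller(document, zlib.compress(document)),
    "Compression+Schema": "N/S",
    "Delta": "N/S",
    "Lossy": "N/S",
}
print(measurement)
```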
The amount of compactness an XML format can achieve for a given XML document is a function of the document's size, structure, schema, and regularity. Because XML documents exist with a wide variety of sizes, structures, schemas, and regularity, it is not possible to define a single size threshold or percentage compactness that an XML format must achieve to be considered sufficiently compact for a general purpose W3C standard. The amount of compactness achieved by an XML format will vary from one XML document to the next.
The amount of compactness achieved by an XML format will also vary from application to application. The amount of compactness practical for a given application depends on the optimizations the application can effectively employ. In general, there are two common categories of optimizations XML formats use to improve compactness: schema optimizations (externalization) and document optimizations (compression).
Schema optimizations leverage shared knowledge of a class of XML documents to improve compactness (e.g., by omitting information known by all participants). This shared knowledge may be codified in some form of schema, but may also be embodied in other forms, such as source code. Schema optimizations are particularly effective for applications in which the XML format must be competitive in size with existing or alternate hand designed binary formats. However, schema optimizations cannot be used in applications where it is not practical or possible to codify shared knowledge about the subject XML documents or assume each participant has access to this knowledge.
Document optimizations analyze XML documents to identify patterns and derive smaller representations for the patterns that occur most frequently. Most well known data compression algorithms, such as Deflate, Lempel-Ziv coding, and Huffman coding fall into this category. Document optimizations are particularly effective in applications involving larger XML documents with repetitive structures, however they are not very effective on very small XML documents. In addition, document optimizations cannot be used in applications where it is not practical or possible to allocate the time, memory, or processing resources required to analyze each document.
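The small-document limitation of document optimizations is easy to demonstrate: a Deflate-family coder (zlib here) carries fixed overhead that tiny inputs cannot amortize, while a large repetitive document shrinks dramatically. The sample documents are invented.

```python
import zlib

small = b"<msg><id>7</id></msg>"
large = b"<msg><id>7</id></msg>" * 1000

# Tiny input: the coder's fixed overhead swamps any gain.
print(len(small), len(zlib.compress(small)))
# Large repetitive input: the redundancy is found and removed.
print(len(large), len(zlib.compress(large)))
```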
These two categories of optimizations partition the set of XML applications into the four classes below. Each class defines a metric for determining whether a format is sufficiently compact for that class. To maintain independence from variations in document size, structure, schema and regularity, each metric defines sufficient compactness relative to well known and freely available encoding specifications.
The following table classifies each XBC use case according to this classification scheme. Most use cases include applications that fall into more than one class.
|Metadata in Broadcast Systems||X||X|
|Floating Point Arrays in the Energy Industry||X|
|X3D Graphics Model Compression, Serialization, and Transmission||X||X|
|Web Services for Small Devices||X||X|
|Web Services within the Enterprise||X||X|
|FIXML in the Securities Industry||X|
|Multimedia XML Documents for Mobile Handsets||X||X|
|Intra/Inter Business Communication||X||X|
|XMPP Instant Messaging Compression||X||X||X|
|XML Documents in Persistent Store||X|
|Business and Knowledge Processing||X||X|
|XML Content-based Routing and Publish Subscribe||X|
|Web Services Routing||X||X|
|Military Information Interoperability||X||X||X||X|
|Sensor Processing and Communication||X||X|
|SyncML for Data Synchronization||X||X|
|Supercomputing and Grid Processing||X||X||X||X|
An XML format is sufficiently compact to be a general purpose W3C standard if it is sufficiently compact for each of these four application classes. Formats that do not meet these criteria do not achieve sufficient compactness to satisfy the majority of the binary XML use cases that state compactness as a requirement.
When measuring the Compactness property, the encoder is not permitted to use prior knowledge about the semantics of information items used in the input document. For example, the encoder is not permitted to use specialized codecs to encode the contents of a specific element or attribute in the given instance document based on the name or location of that element or attribute.
Candidate formats will likely have multiple optional methods for achieving compactness depending on the circumstance. Measurement of compactness consists of encoding the same data using the major combinations of methods which are categorized by: tokenization, document optimization (compression), schema-based encoding, the combination of document and schema-based optimization, deltas, and lossy compression. Not all of these will be available or appropriate for every candidate format and scenario combination. In this case, a "lower" level method score can be used for comparison purposes. For instance, if a format does not support deltas, a schema or compression+schema score could be used. When scoring for scenarios that cannot use certain methods, such as schemas or compression, these values may indicate N/A (not applicable). Care must be taken to consider what methods could be used to great benefit beyond traditional models for existing data formats.
After scoring each combination of methods for each appropriate combination of test data and scenario, the best numbers for each test are tabulated to determine an overall score. Weighting of this comparison may be needed to appropriately reflect market impact and the effects of overlapping scenarios.
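Tabulating best-method numbers into a weighted overall score might look like the following sketch. The scores, scenario names, and weights are all hypothetical; the point is only the mechanics of applying market-impact weights to per-scenario results.

```python
# Hypothetical best-method scores (percent smaller) per test scenario for two
# candidate formats, and illustrative market-impact weights summing to 1.0.
best_scores = {
    "format_A": {"web_services": 62.0, "mobile": 55.0, "scientific": 30.0},
    "format_B": {"web_services": 58.0, "mobile": 70.0, "scientific": 35.0},
}
weights = {"web_services": 0.5, "mobile": 0.3, "scientific": 0.2}

for name, scores in best_scores.items():
    overall = sum(scores[s] * w for s, w in weights.items())
    print(name, round(overall, 1))  # format_A 53.5, format_B 57.0
```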
High scores for this property may be at odds with higher scores in the following properties (an example of why each is listed is given):
Processing Efficiency is a measure of the efficiency, and effectively the speed, of processing an instance of a format. Determining the relative speed of different formats in a complete and valid way is difficult, because many variables affect actual speed, including processing library implementation details that are not fundamentally required by the format. Ideally, different formats could be compared based on determination of their best reachable performance levels in all needed situations. In practice, this cannot be done with absolute accuracy. As a result, comparative evaluation must be accomplished by a combination of complexity analysis, processing characterization estimation, format characteristic analysis, fitness for all needed test scenarios, and actual empirical testing. It is important to stress that while empirical testing provides proof of obtaining at least a certain level of performance, by itself it proves little about whether better performance can be obtained for a particular format and test scenario. Complexity analysis tends to provide better proof of the theoretical limits of performance, although it is not infallible in the face of unexpected algorithms. Additionally, in some cases complexity for multiple candidates may be, for example, linear, and relative performance differences may be dominated by format or method details that affect overhead, such as an extra level of indirection. As an example of a subtle but possibly dominant detail, one format may tend to allow better locality of reference in processing than another. With cache memory in modern systems running 25 or more times faster than main memory, a large subset of processing scenarios could perform better for the former format.
Applications use data formats to communicate information or to store data for later use. XML 1.x, and presumably any binary XML candidate, provides an external data representation that is rich and flexible, along with other benefits. The use of XML tends to be better, overall, than more simplistic approaches for many applications. While solving all efficiency problems during the creation of XML was not feasible, ever-advancing experience and research on the problem have provided new insight. A format that solves these problems while retaining the benefits of XML 1.x, and possibly adding new benefits, aids existing applications and offers to greatly expand the range of applications that can justify the use of XML technology.
A key observation about the information technology industry is that often the macroscopic separation of concerns at the operating system, programming language, protocol, service, application framework, or application level constrains problem solving and optimization. In the past it was rare, for instance, for an application developer outside of an operating system vendor to cause changes in an operating system to solve performance problems. (A notable exception to this is the addition of facilities for direct access to SCSI command queuing in operating systems for the largest database vendors.) With respect to formats and performance, it has usually been the case that programming languages have been optimized for in-memory operations on native variables while data formats have been designed without prime consideration of processing complexity. Because of the pervasive need for modularization and network distribution of application components, any overhead crossing the boundary between external format and memory representation is amplified. An application exists to accomplish actual work of some kind. Any operations outside of that work are overhead. While much of the overhead in existing systems exists for a logical reason for a particular environment, when considering candidate formats for binary XML, those reasons are a temporary artifact and immaterial. This means that it is important to analyze the effect of candidate format design decisions on existing and best possible processing complexity. The first step in this analysis is to define the processing phases involved in typical applications and determine variability points.
An application logic step is an operation that finds, traverses, reads, or modifies actual payload data in an instance. Processing that is overhead may include decompression, parsing, implied or required memory allocation or reference attachment, data binding, index maintenance, and schema retrieval and processing. Some candidate methods may involve other operations related to the use of schemas. Parsing is the conversion of a serialized form of data into a more readily usable form, or into events with arguments (SAX et al.).

Data binding can imply several levels. The simplest usable level, "structure without conversion", converts parse events into a data structure that captures all usable data and the usable structure of that data with no conversions. SAX and other parse/event engines are pure parsing engines. A DOM library implementation, when reading an XML 1.x object, parses and produces an application-generic, XML-specific DOM data structure. The use of DOM is, in a semantic sense, equivalent in most cases to an application using SAX or similar for parsing and building an application-specific data structure from parse events. An application-specific data structure may be interpretive, "structure with conversion", or it may represent data values directly in native, 3GL (third generation language) constructs such as objects or structs, "native structure binding". In the case of non-native structures, format details may create overhead in application processing, such as insertion and deletion, which might be a tradeoff for other advantages. It might be that candidate formats have no substantial differences in how they present data to application phases, in which case this analysis would be moot. A survey of possible candidates indicates some methods that may be beneficial.
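The distinction between a pure parsing engine and a generic data-binding layer can be seen in miniature with Python's standard library. The document and the item-counting task are invented; SAX delivers events and lets the application keep only what it needs, while DOM builds the full XML-specific structure before anything is read.

```python
import xml.sax
from xml.dom import minidom

document = b"<order><item>widget</item><item>gadget</item></order>"

# Pure parsing (SAX): events are delivered; no generic tree is built.
class ItemCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.items = 0
    def startElement(self, name, attrs):
        if name == "item":
            self.items += 1

counter = ItemCounter()
xml.sax.parseString(document, counter)
print(counter.items)  # 2

# Parsing plus generic data binding (DOM): a full XML-specific data
# structure is built before the application reads anything.
tree = minidom.parseString(document)
print(len(tree.getElementsByTagName("item")))  # 2
```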
Numerous official, unofficial, and experimental application programming interfaces exist to process XML data. These APIs have provided valuable experience and have been an asset to application development environments. It is expected that any new format would be able to support existing APIs.
It has become apparent that there are certain design flaws in existing APIs in addition to a desire for features that simplify and streamline development. One example of a fundamental flaw that potentially affects performance is the "create object, fill object, link into tree" paradigm of the DOM API. Even if a format exists that supports minimal copying and coherent data, this API forces multiple copies, fragmented representation, and/or data reordering. Additionally, the new industries, data, and application types made possible by a successful binary XML format may require processing that is beyond traditional XML operations. This indicates that new APIs will be experimentally proposed and that valid evaluation of candidate formats must involve an abstract representation of scenario operations that can be translated to the best available API.
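The "create object, fill object, link into tree" paradigm can be seen concretely in Python's minidom, which follows the DOM API; the tiny document is invented. Each node is allocated detached, populated, and only then linked, which is the sequence that forces extra copies and pointer fix-ups regardless of how coherent the source format's data is.

```python
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("order")                # create object
item = doc.createElement("item")                 # create object
item.appendChild(doc.createTextNode("widget"))   # fill object
root.appendChild(item)                           # link into tree
doc.appendChild(root)                            # link into tree
print(doc.documentElement.toxml())  # <order><item>widget</item></order>
```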
One aspect of a format is whether it supports operating efficiently so that processing cost is proportional to the number of application logic steps rather than to the size or data complexity of the instance. It is often desirable for processing complexity to be related to the work needed rather than to the size or complexity of the data. Size refers to the number of bytes taken by the instance. Data complexity refers to the granularity of XML-visible objects such as elements and attributes. A format that supports incremental overhead is fast for a single operation on an instance of any combination of size and complexity. While many applications desire this characteristic, it is not an independent property of the format; rather, it is a meta-property of other properties such as Random Access and Efficient Update. If a format supports incremental overhead in a partial or complete way, then certain properties operate incrementally.
While not measured as an independent property, this section provides some guidance when examining the presence of incremental overhead. The degrees of support for Incremental Overhead are expressed in terms of cost of use vs. size/complexity of an instance. The overhead of moving raw data as an efficient block copy is assumed. After parsing and data binding, data is accessed in an application through three main methods:
The differences in these approaches can be large and are affected by specific choices in format, implementation constraint, and API. All of these choices can affect efficiency. Minor differences are frequently not useful, but algorithmic complexity measures and performance validation can be very indicative. One important tradeoff is native access plus linear or worse overhead relative to size/complexity at one extreme vs. fully intermediated access and little or no overhead relative to size/complexity. Fully valid comparisons of this spectrum of approaches must include algorithmic complexity, logical analysis, characteristic analysis such as modeling locality of reference, and end-to-end and end-to-middle/middle-to-end measurements of available implementations.
The measurement of Incremental Overhead includes a category classification and an indication of algorithmic complexity (in O(n) notation, possibly expressed relative to factors such as parse events P or size S). An example might be: "linear/no cost, O(P)*4, O(S/1000)".
Incremental Overhead degree categories:
Algorithmic complexity relates to the fundamental theoretical performance characteristics of an algorithm. Although particular measurements of different algorithms on the same data may be useful, without understanding the algorithmic complexity of the algorithms involved, the comparison is not known to be valid in all cases. Each algorithm has scaling characteristics that are related to various kinds of overhead, startup, and input/output data related operations. The relationship of the size and complexity of input/output data vs. the performance of the algorithm is represented as a formula that consists of linear and nonlinear factors plus constant factors. Typically, algorithmic complexity is expressed as operations on 'n' which represents the input size, count, or complexity. The following illustrates the value of considering algorithmic complexity with the example of random access support in a format.
Let's assume one wants to access a random element out of an XML document with one million elements. On average, the code will have to examine (parse, read, etc.) 500,000 elements. More generally, the average time it takes to access any element of an n-element document is proportional to n. It might be n/2, a slow implementation might be 2*n, and a fast one n/4, but fundamentally the complexity cost is tied to n.
A format which implements random access, however - in the sense that an index table is included in the format itself - can provide access to the nth element in time proportional to - depending on how the index works - the log of n or even in constant time. Again, there are various constant factors which may vary between implementations.
As n gets larger, it is always bigger than log(n) and bigger than 1, no matter what the constant factors are. Thus one can reason about the relative performance of the format for certain operations without ever measuring any implementations. On the other hand, if one is interested in improving only the constant factor, then one must measure implementations, with all the difficulties that topic involves.
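The reasoning above can be sketched in code. The example below is purely illustrative (the element list, offset table, and cost counters are assumptions, not part of any candidate format): a linear scan must examine on the order of n items, while an embedded index reaches the n-th element directly.

```python
# Illustrative sketch: a linear scan examines on the order of n items,
# while an index table reaches the target element in one step.

def linear_access(elements, target):
    """O(n): examine elements one by one, as a streaming parser must."""
    examined = 0
    for i, element in enumerate(elements):
        examined += 1
        if i == target:
            return element, examined
    raise IndexError(target)

def indexed_access(elements, offsets, target):
    """O(1) here; O(log n) if a sparse offset table must be searched."""
    return elements[offsets[target]], 1

elements = ["<elem id='%d'/>" % i for i in range(1_000_000)]
offsets = list(range(len(elements)))  # index table: position -> offset

_, cost_linear = linear_access(elements, 500_000)
_, cost_indexed = indexed_access(elements, offsets, 500_000)
print(cost_linear, cost_indexed)  # 500001 1
```

With one million elements, the average linear-scan cost is on the order of 500,000 examinations, matching the figure discussed above, regardless of the constant factor of any particular implementation.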
For example, DOM defines an API which supports random access in the sense that nodes do not need to be accessed in order. However, because of how XML is defined, random access via a DOM API still takes time proportional to n - the size of the document. DOM over XML does not support random access in the sense which is used here, namely, better-than-linear access time.
That said, the DOM API most likely could be implemented over a different file format to provide true random access; such an implementation would make use of the index included in the file. This continues down the stack: if the file is stored on tape, which does not support random access, then the benefits of the file format will still not be achieved.
The amount of increase in processing speed is dependent on the input documents used for testing. Therefore, to objectively compare formats for their inherent processing speed, the conditions of competing measurements should be the same. The documents used as test data should vary in size and complexity to generate a set of results. In addition, normal performance profiling steps need to be followed. These include, but are not limited to, constructing a proper test environment with stable machines and software, utilizing a private network, and providing proper "warm-up" for adaptively compiled systems like Java. This requires the use of an appropriate set of test Scenarios, Property Profiles, and Test Data.
Any algorithm used for this measurement should have a theoretical runtime of no more than O(n). However, this measurement alone cannot be used to effectively determine the speed of the algorithm. It is possible that two algorithms with O(n) runtimes could have vastly different performance characteristics if, for example, one algorithm used 100 cycles per byte processed, while the other used 500 cycles per byte processed. Both algorithms would be O(n), but result in vastly different performance measurements.
For a given test scenario, property profile, and test data set, this property is measured in several different ways:
Each measurement is recorded as a percentage faster than a standard text-based alternative for each type of operation.
Measurements must be taken as follows:
This will allow a fair comparison between various alternate formats to determine their processing efficiency differences.
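As a small sketch of the recording convention above (the helper function is an assumption for illustration, not part of any specification), each measurement can be expressed as a percentage faster than the text-based baseline:

```python
# Hypothetical helper illustrating the recording convention: each
# measurement is expressed as a percentage faster than the standard
# text-based XML alternative for the same type of operation.

def percent_faster(candidate_ms: float, baseline_ms: float) -> float:
    """Positive values mean the candidate beat the text baseline;
    negative values mean it was slower."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0

# e.g. a parse taking 60 ms against a 100 ms text-XML baseline:
print(percent_faster(60.0, 100.0))   # 40.0
print(percent_faster(120.0, 100.0))  # -20.0
```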
Processing Efficiency has a correlated relationship with Small Footprint and Space Efficiency and an inverse relationship with Compactness. Additionally, this property can be considered a measurement of the processing efficiency for most other properties.
High scores for this property may be at odds with higher scores in the following properties:
The objective of Accelerated Sequential Access is to reduce the amount of time required to access XML data model items in a document. The fundamental measurement is therefore the average time needed to access an XML data model item. This time can be compared to a baseline measurement of the average time needed to access an XML data model item using an unaccelerated sequential access method like that used to implement SAX.
Not all accelerated sequential access methods use a sequential index and thus incur T(ix). In such cases it is only necessary to compare the average T(sk) for the unaccelerated case against the accelerated one.
If accelerated sequential access supports update of the sequential index we should also take this cost into account.
T(up) - time to update the sequential index.
T(up) should also be added to T(am) for the average number of total updates (nu).
T(am) = T(ix) + ns ( T(sk) ) + nu ( T(up) )
For the baseline, unaccelerated sequential access case we consider only T(sk) for the average total number of seeks (ns).
T(am) = ns ( T(sk) )
Example: For an implementation of accelerated sequential access to XML: T(ix) = 5.00 ms, T(sk) = 3.50 ms, T(up) = 3.00 ms, ns = 1000, nu = 50.
T(am) = 5 + 1000 ( 3.5 ) + 50 ( 3.0 ) = 3655 ms
For unaccelerated sequential access: T(sk) = 4.00 ms.
T(am) = 1000 ( 4 ) = 4000 ms
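The amortized-time model above can be sketched as follows (function names are illustrative assumptions; the symbols follow the text):

```python
# Sketch of the amortized-time model for accelerated sequential access.
# t_ix = index creation/read cost, t_sk = per-seek time,
# t_up = per-update time, ns = number of seeks, nu = number of updates.

def t_am_accelerated(t_ix, t_sk, t_up, ns, nu=0):
    """T(am) = T(ix) + ns * T(sk) + nu * T(up)"""
    return t_ix + ns * t_sk + nu * t_up

def t_am_unaccelerated(t_sk, ns):
    """Baseline: T(am) = ns * T(sk)"""
    return ns * t_sk

# Reproducing the worked example (times in ms):
print(t_am_accelerated(5.0, 3.5, 3.0, ns=1000, nu=50))  # 3655.0
print(t_am_unaccelerated(4.0, ns=1000))                 # 4000.0
```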
Accelerated sequential access may have resource costs which can impact system performance. A more comprehensive model would be needed to take these into account in a full assessment of the comparative benefit of an accelerated sequential access implementation. As an approximation, an implementation which produces lower numbers for the following resource costs will be better in performance than an implementation with the same T(am) but with higher resource costs:
A method is described in the preceding section to measure the effect of Accelerated Sequential Access both in absolute terms (index creation, seek time, update time) and relative to an access method implemented on a format which does not support this property.
This property may be directly measured and compared by running seek and update (if supported) operations over a set of input documents for the Accelerated Sequential Access-capable format and text XML.
Implementation of the Random Access property will, in most cases, eliminate the need for Accelerated Sequential Access in that it subsumes its behavior and performance characteristics. Some Use Cases may specify both properties but only in the sense that Accelerated Sequential Access is seen as essential if and only if Random Access is not supported.
As a format supporting Accelerated Sequential Access will typically require the addition of information (an index) in the document, this property may be a tradeoff against Compactness. Additional cost and complexity is introduced if update is supported, possibly limiting the ability to support the Efficient Update property.
Measures the degree to which a format specifies usable Content Type Management information.
Degrees of support:
Note that there currently is dissent as to whether a binary XML format should be considered to be a content coding (like gzip) or not. Here are the options:
The Deltas property is a representation of arbitrary changes from a particular instance of a base, parent document which, along with that parent document, can be used to represent the new state of the parent. The parent is identified in a globally or locally unique manner. A delta is distinct from a fragment or a computed difference, although the latter could be represented as a delta. This property is somewhat related to support for Efficient Update.
Measurement of this property consists of determining whether the format supports a high-level or low-level delta mechanism and then determining the granularity, compactness, and processing efficiency of the mechanism.
A high-level delta mechanism, represented as the Deltas property, consists of some indication of operation, such as replace ID or delete element, and some representation of the content of the change. This kind of delta can be represented by XML data. The creation and use of this delta requires serialization of an XML representation and high-level interpretation of the operation and data. Both the creation and use of a high-level delta require possibly complex processing and will in many cases result in the size of the delta instance being larger than absolutely necessary. As an example, if a particular high-level delta mechanism can only replace whole nodes of some kind, changes of a few characters might require a delta that includes all surrounding text.
A low-level delta feature of a format could support a fine grained, very low complexity, and efficient method of representing changes to a parent document. This property could be implemented at or below the level of representation of the structure of an XML-like format. As an example, a mechanism could track or represent which characters were inserted, replaced, or deleted relative to a parent along with just the data that changed. This type of mechanism is low complexity because it is implemented using some method that allows efficient traversal of the data for logical or actual construction of the parent plus delta. This access or construction should have complexity on the order of access to ranges of bytes and efficiency similar to the size of "new" data with very small overhead.
Evaluations of candidate formats that implement this property will produce a delta type categorization, a granularity measure, and compactness and processing efficiency performance characteristics.
Delta type categorization:
The measurement for this property is by inspection of the format specification, logical analysis, and empirical testing based on test scenarios that could benefit from Deltas, considering at least the categories listed in the Deltas property description.
This property doesn't depend on other properties. It does have a weak relationship with Efficient Update based on solving similar problems.
A low-level delta can be created through two types of methods. The most efficient method would be to capture changes to a parent in a kind of copy-on-write operation. This could have very low complexity. The other main method is a differencing operation that compares a before and after version of a document and represents the difference as a delta. While the resulting delta might be similar, the computational complexity of the latter might be arbitrarily difficult while the former is minimal and linear.
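As an illustrative sketch of the second (differencing) method, Python's standard difflib module can compute a character-granularity delta that carries only the new data plus copy instructions against the parent; the opcode encoding used here is an assumption for illustration, not part of any candidate format.

```python
import difflib

# Sketch: compute a low-level, character-granularity delta between a
# parent document and its new version, then reconstruct the new state
# from parent + delta. Only changed data is carried in the delta.

def make_delta(parent, updated):
    sm = difflib.SequenceMatcher(a=parent, b=updated)
    delta = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            delta.append(("copy", i1, i2))            # reference parent range
        else:
            delta.append(("insert", updated[j1:j2]))  # carry only new data
    return delta

def apply_delta(parent, delta):
    out = []
    for entry in delta:
        if entry[0] == "copy":
            out.append(parent[entry[1]:entry[2]])
        else:
            out.append(entry[1])
    return "".join(out)

parent = "<doc><a>old text</a></doc>"
updated = "<doc><a>new text</a><b/></doc>"
delta = make_delta(parent, updated)
assert apply_delta(parent, delta) == updated
```

As the text notes, this differencing approach can be computationally expensive for large inputs, whereas capturing changes as they are made (copy-on-write) produces a similar delta with minimal, linear cost.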
The Efficient Update property is concerned with whether a format instance can be modified efficiently without being completely rebuilt. When a format is designed with efficient update as a constraint, it will tend to be apparent that this is possible. When this was not planned for, it is still possible that a processor could implement an efficient update capability. In the latter case, an evaluation of the format must determine if there are features that prevent or assist such implementation. As the property description notes, this property is somewhat related to support for Deltas.
There are three aspects under which this property should be evaluated:
Evaluations of candidate formats that implement this property will produce three percentage values and a standard deviation. For update and retrieval, these are positive or negative percentages of improvement relative to a comparison XML 1.x solution. For compactness, this percentage is overhead over a linear creation of an instance with the same data in the candidate format, along with an estimated (for analytical) or actual (for empirical) standard deviation.
The measurement for this property is by inspection of the format specification, logical analysis, and empirical testing based on test scenarios that call for Efficient Update.
This property doesn't depend on other properties. It does have a weak relationship with Deltas based on solving similar problems.
The ability to support efficient updates in the direct, complete sense tends to imply compactness measures that are not monolithic and a mechanism for growing or shrinking data without requiring repositioning for all data following the change. Solutions for these tradeoffs will likely focus on differing granularity and may be tunable.
Measures the degree to which a format supports embedding of files of arbitrary type within serialized content.
This property is measured along an integer scale from [0,6], where zero indicates no embedding support and six indicates the greatest possible degree of embedding support.
This property is measured by considering which of the following statements is true, based on that format's specification:
The measurement levels resulting from this analysis are:
Measures the degree to which a format is competitive with alternatives across a diverse range of data, applications and use cases.
Generality is, in part, a function of the format's ability to optimize for application-specific criteria and use cases. For example, some applications need to maximize compactness and are willing to give up some speed and processing resources to achieve it, while others need to maximize speed and are willing to give up some compactness to achieve it. Similarly, some applications require all the information contained in a document and are willing to give up some compactness to preserve it. Other applications are willing to discard certain information items in a document to achieve higher compactness.
Generality is also a function of the optimizations the format includes for efficiently representing documents of varying size and structure. For small, highly structured documents, a format informed by schema analysis will generally produce more compact encodings than a format informed solely by document analysis (e.g. generic compression software). For larger, more loosely structured documents, a format informed by document analysis techniques will generally produce more compact encodings than a format solely informed by schema analysis. A format informed by both schema analysis and document analysis will generally produce more compact encodings across a broader range of documents than a format that only includes one of these techniques.
This property is measured along an integer scale in the range [0, 20], where a zero indicates a very specialized format that applies narrowly to a small set of data, applications, and use cases and 20 indicates a very general format that applies to a wide range of data, applications, and use cases.
This property is measured by counting the number of statements below that are true of the format, based on inspection of the format specification and objective analysis of compactness results over a wide range of XML documents with varying size and structure. Statements designated as [optional] will broaden the applicability of a binary XML file format, but are not required for that format to be considered sufficiently general. The statements are organized into sections for readability.
Flexible schema analysis optimizations
Flexible document analysis optimizations
Flexible fidelity optimizations
Competes with frequency based compression
Competes with schema based encodings and hand optimized formats
This property indirectly measures presence of these properties: Compactness, Embedding Support, No Arbitrary Limits, Platform Neutrality, Robustness, Roundtrip Support, Schema Extensions and Deviations, Schema Instance Change Resilience, and Specialized Codecs.
This measurement is a pair of integers <m,n>, each on the scale [0,5]. The first number indicates the degree to which a file in a format may be humanly readable and editable; the second number indicates the degree to which a file in a format must be so. Thus, the greater the difference between the two numbers the greater the degrees of freedom given to the file's creator with respect to this property.
Each item in the following list of statements is evaluated to determine if it is never true, may be true, or is always true of a file created according to the format's specification.
(Note: the first number in the score is therefore always greater than or equal to the second number.)
Measures the ease with which a given format integrates with the rest of the XML Stack of recommendations, based on its orthogonality in specification and the way in which it supports the core assumptions common to XML specifications. Many relevant considerations are presented in the Architecture of the WWW.
This property is measured using a scale derived from the notion that the XML Stack is fundamentally syntax-based and defines several different data models.
The following scale (from lowest to highest support) is used:
The simplest way of integrating well into the XML Stack is obviously to be fully compatible with the XML syntax. This does not, however, mean that the given format must be XML 1.x itself; for instance, it could be a subset allowing only certain tokens or requiring a certain form and encoding (for instance, a canonical version of the SOAP subset of XML). While this would enable optimizations to XML parsers that are normally impossible, it would likely move the complexity over to XML generators, and if it subsets XML it will create problems for applications using the features excluded from the subset.
It must also be noted that some core XML technologies such as signatures and encryption rely directly on the XML syntax. There is therefore a tradeoff in which a format could integrate perfectly well with the XML Family minus these two members.
The degree to which the format avoids inherent limits is characterized as: no inherent limits, few limits (e.g., limits only on unreasonably large names), and many limits (fixed lengths, small tables).
Experience has shown that arbitrary limits in the design of reusable systems must be carefully scrutinized for the probability of future conflicts. As computing limitations have repeatedly been surpassed in short order and technology has been put to innovative uses, decisions that turned out to be short-sighted have led to painful migration. This property provides a rough measure with which to compare different approaches.
The range of this measurement is membership in a category. These categories are: "no inherent limits", "few limits", and "many limits".
The measurement for this property is by inspection of format specification and logical analysis.
Each type of flexibility in a data format can be a tradeoff between efficiency in the expected typical case and the ability to handle cases that are not expected to be encountered. In many cases in the past, seemingly sensible choices have not aged well with increases in computing capacity and new uses of technology.
Measures the degree to which a format is platform neutral as opposed to being optimized for a given platform.
This property measurement is represented as a selection from a range of: "not platform neutral", "platform neutral, single choice", "platform neutral, multiple choices". More than one aspect may need to be measured. For instance, character encodings may have a wide variety of options while scalars may not. It is expected that at least character encoding and scalar encoding are included.
This property is measured along an axis of values that rate its platform neutrality from none to optimal:
This property has a weak link to Implementation Complexity in that if it is supported at its optimal level it will require multiple encoding options that could be costly in implementation terms.
Allowing a format to support a large range of options can enable optimal processing between similar platforms, but the added complexity may in fact have a generally negative impact as it complicates the format. While an assessment of this tradeoff can only be made on a format-by-format basis, it must be noted that allowing too many hooks for optimization may in fact prove to be a pessimisation.
The objective of Random Access is to reduce the amount of time required to access XML data model items in a document. The fundamental measurement is therefore the average time needed to access an XML data model item. This time can be compared to a baseline measurement of the average time needed to access an XML data model item using a sequential access method like that used to implement SAX.
This performance metric does not take into account what may be accessed with random access method and what operations may be supported on what is looked up (for example, can the looked-up item be treated as a sub-document or fragment).
Total time (T(am)) amortizes the cost of T(ra) over the average number of total seeks (ns).
T(am) = T(ra) + ns ( T(lu) + T(sk) )
If random update of the access table is supported we should also take into account this cost.
T(up) - time to update an access table (fixed)
T(up) should also be added to T(am) for the average number of total updates (nu).
T(am) = T(ra) + ns ( T(lu) + T(sk) ) + nu ( T(up) )
For the baseline, sequential access case we consider only T(sk) for the average total number of seeks (ns).
T(am) = ns ( T(sk) )
For an implementation of random access to XML:
T(ra) = 10.00 ms, T(lu) = 0.05 ms, T(sk) = 1.00 ms, T(up) = 1.00 ms, ns = 1000, nu = 50
T(am) = 10 + 1000 ( 0.05 + 1.00 ) + 50 ( 1.00 ) = 1110 ms
For sequential access (T(sk) = 4.00 ms):
T(am) = 1000 ( 4 ) = 4000 ms
In this example, random access is advantageous if the average total number of seeks is over 3. For ns = 3, nu = 0 the random access T(am) is 13.15 and the sequential T(am) is 12 while at ns = 4, nu = 0 it is 14.20 versus 16.
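The model and break-even reasoning above can be sketched as follows (function names are illustrative assumptions; the symbols follow the text):

```python
# Sketch of the amortized-time model for random access.
# t_ra = access-table build/read cost, t_lu = per-lookup time,
# t_sk = per-seek time, t_up = per-update time.

def t_am_random(t_ra, t_lu, t_sk, t_up, ns, nu=0):
    """T(am) = T(ra) + ns * (T(lu) + T(sk)) + nu * T(up)"""
    return t_ra + ns * (t_lu + t_sk) + nu * t_up

def t_am_sequential(t_sk, ns):
    """Baseline: T(am) = ns * T(sk)"""
    return ns * t_sk

# Worked example from the text (times in ms):
print(round(t_am_random(10.0, 0.05, 1.0, 1.0, ns=1000, nu=50), 2))  # 1110.0
print(t_am_sequential(4.0, ns=1000))                                # 4000.0

# Break-even: random access pays off once the seek count exceeds ~3.
assert t_am_random(10.0, 0.05, 1.0, 1.0, ns=3) > t_am_sequential(4.0, ns=3)
assert t_am_random(10.0, 0.05, 1.0, 1.0, ns=4) < t_am_sequential(4.0, ns=4)
```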
Values in time for each of the following:
Random access has resource costs which can impact system performance. A more comprehensive model would be needed to take these into account in a full assessment of the comparative benefit of a random access implementation. As an approximation, an implementation which produces lower numbers for the following resource costs will be better in performance than an implementation with the same T(am) but with higher resource costs.
The random access implementation can be categorized by the embedding or non-embedding of the access table, indicated by one of:
Another simplification made in comparing T(am) for random access and sequential access is the assumption that the random access implementation is able to provide access to the data model items the user wants. If this is not the case, either the implementation of random access will not be useful to that user, its performance notwithstanding, or alternate methods of access would have to be provided and accounted for in the T(am). The access coverage of the data model provided by the random access implementation can be categorized as being in one of the following categories:
It should also be specified whether the implementation does or does not provide alternative access methods to obtain all data model items.
The random access implementation can also be categorized by its support for fragmentation into one of the following categories:
Complete support for the semantics implied by random access includes the ability to do random update and support for stable virtual pointers, as described in the property description for Random Access.
Stable Virtual Pointer support:
Random access is tested by starting with a test scenario and appropriate test data and constructing a realistic pattern of random access and update workload. This workload is performed repeatedly with detailed measurement of each phase of computation along with overall characteristics of random access support and performance as detailed above.
It is important to understand how different random access strategies will perform in general. Care must be taken to account for the effects of cached memory, detailed measurement mechanisms, and other things that affect performance. Memory and architecturally related limits and boundaries should be exercised to determine inefficiency pitfalls.
The presence of this property overrides the need for Accelerated Sequential Access.
Carrying and maintaining random access tables as part of a format instance negatively affects compactness. This can be minimized if only certain indicated information is indexed. Additionally, if random access indexing information is not supported by all processors of the format instance, it may need to be rebuilt for certain transitions.
Measures the degree to which a format supports round-tripping and round-tripping via XML.
These two properties are measured along the same enumerated scale consisting of the following values:
This property is measured by comparing the set of data models which can be represented in XML with those that can be represented in the alternative format.
With regards to Roundtrip Support (XML to binary to XML):
With regards to Roundtripping via XML (binary to XML to binary):
Formats supporting both roundtrip and roundtrip via XML will tend to have the same data model versatility measurement as XML, as that is a measure of the set of data models which they support. Formats with greater data model versatility are more likely to support round-tripping but less likely to support round-tripping via XML, and vice versa.
Measures the degree to which a format supports the creation and inclusion of digital signatures.
This property is measured along an integer scale from [0,8], where zero indicates no support for digital signatures and eight indicates the greatest possible degree of support. (Note that a format with a score of zero is still signable, in that a file consists of a sequence of bytes and any sequence of bytes can be signed.)
This property is measured by assigning the indicated number of points for each of the following statements which is true of the format, based on that format's specification:
A candidate format should be able to be processed by diverse platforms. Many of these platforms have very limited resources for program storage. A format that requires little actual code and few data tables (aka initialized or BSS data) is more widely attractive. Inspection of specifications can be a useful form of analysis. Analysis of actual implementations can also be enlightening when those implementations are optimized by skilled developers.
The detailed measurement for this property will consist of code and initialized data measured, estimated, or projected to a series of platforms that relate to key architectures including 64K StrongARM, Java bytecode, and Intel/AMD Pentium/64bit.
The measurement for this property is by inspection of format specification, logical analysis, survey of implementations and implementers, and projections from one architecture to the others.
This property depends on design choices made in a format that may require large amounts of code or initialized data.
Space Efficiency is the measurement of dynamic memory needed to decode, process, and encode a candidate format. In this case, processing doesn't include any application processing or needs, but may include any format-induced processing or bookkeeping that must be done to adhere to the format. Special consideration must be given to separating out and discounting overhead that a format requires but that accomplishes something an application would likely need to perform anyway.
This is a percentage measurement relative to the expected dynamic memory costs of popular and theoretical XML 1.x processing systems. This may include both DOM and parser event (SAX et al) style processing. Due to the nature of applications in memory-constrained environments, it is the DOM-style measurement that is ranked for this property.
The measurement for this property is by inspection of format specification, logical analysis, and empirical testing on test scenarios.
Some Compactness methods tend to increase memory usage, sometimes dramatically. Reducing Processing Efficiency may also affect dynamic memory needed.
The measurement methodologies are the result of the work of the XBC Working Group contributors: Robin Berjon (Expway), Carine Bournez (W3C), Don Brutzman (Web3D), Mike Cokus (MITRE), Roger Cutler (ChevronTexaco), Ed Day (Objective Systems), Fabrice Desré (France Telecom), Seamus Donohue (Cape Clear), Olivier Dubuisson (France Telecom), Oliver Goldman (Adobe), Peter Haggar (IBM), Takanari Hayama (KDDI), Jörg Heuer (Siemens), Misko Hevery (Adobe), Alan Hudson (Web3D), Takuki Kamiya (Fujitsu), Jaakko Kangasharju (University of Helsinki), Arei Kobayashi (KDDI), Eugene Kuznetsov (DataPower), Terence Lammers (Boeing), Kelvin Lawrence (IBM), Eric Lemoine (Tarari), Dmitry Lenkov (Oracle), Michael Leventhal (Tarari), Don McGregor (Web3D), Ravi Murthy (Oracle), Mark Nottingham (BEA), Santiago Pericas-Geertsen (Sun), Liam Quin (W3C), Kimmo Raatikainen (Nokia), Rich Salz (DataPower), Paul Sandoz (Sun), John Schneider (AgileDelta), Claude Seyrat (Expway), Paul Thorpe (OSS Nokalva), Alessandro Triglia (OSS Nokalva), Stephen D. Williams (Invited Expert).