W3C

XML Binary Characterization Properties

W3C Working Draft 05 October 2004

This version:
http://www.w3.org/TR/2004/WD-xbc-properties-20041005/
Latest version:
http://www.w3.org/TR/xbc-properties
Editors:
Mike Cokus, MITRE Corporation
Santiago Pericas-Geertsen, Sun Microsystems

Abstract

This document describes properties which have been identified as desirable for any serialization of an XML data model. These properties have been derived from requirements induced by use cases collected by the XBC WG [XBC Use Cases]. Properties are divided into two categories: serialization (or format) properties and processor properties. In addition, Section 6 Additional Considerations covers considerations which, because of the difficulty of establishing an accurate measurement, have not been classified as properties but are nonetheless relevant for an accurate comparison between two different proposals.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the First Public Working Draft of the XML Binary Characterization Properties Document. It has been produced by the XML Binary Characterization Working Group, which is part of the XML Activity.

This document is part of a series of documents following from this work on property determination. Further work in the XML Binary Characterization Working Group will focus on establishing measurements for the properties described here, to help judge whether XML 1.x and alternate binary encodings provide the properties required by the previously gathered use cases.

This is a First Public Working Draft and is expected to change. The XML Binary Characterization Working Group does not expect this document to become a Recommendation. Rather, after further development, review and refinement, it will be published and maintained as a Working Group Note.

Comments on this document should be sent to public-xml-binary-comments@w3.org (public archives). It is inappropriate to send discussion emails to this address.

Discussion of this document takes place on the public mailing list public-xml-binary@w3.org (public archives).

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Table of Contents

1 Introduction
2 Design Goals of the XML Serialization
3 XML Data Models
4 Serialization Properties
    4.1 Accelerated Sequential Access
        4.1.1 Definition
        4.1.2 Description
    4.2 Byte Preserving
        4.2.1 Definition
        4.2.2 Description
    4.3 Compactness
        4.3.1 Definition
        4.3.2 Description
    4.4 Data Model Versatility
        4.4.1 Definition
        4.4.2 Description
    4.5 Efficient Update
        4.5.1 Definition
        4.5.2 Description
    4.6 Embedding Support
        4.6.1 Definition
        4.6.2 Description
    4.7 Encryptable
        4.7.1 Definition
        4.7.2 Description
    4.8 Extension Points
        4.8.1 Definition
        4.8.2 Description
    4.9 Fragmentable
        4.9.1 Definition
        4.9.2 Description
    4.10 Hinting
        4.10.1 Definition
        4.10.2 Description
    4.11 Human Readable/Editable/Deducible
        4.11.1 Definition
        4.11.2 Description
    4.12 Integratable into the Web
        4.12.1 Definition
        4.12.2 Description
    4.13 Integratable into XML Family
        4.13.1 Definition
        4.13.2 Description
    4.14 No Arbitrary Limits
        4.14.1 Definition
        4.14.2 Description
    4.15 Processing Speed
        4.15.1 Definition
        4.15.2 Description
    4.16 Random Access
        4.16.1 Definition
        4.16.2 Description
    4.17 Robustness
        4.17.1 Definition
        4.17.2 Description
    4.18 Round Trippable
        4.18.1 Definition
        4.18.2 Description
    4.19 Schema Instance Change Resilience
        4.19.1 Definition
        4.19.2 Description
    4.20 Self Contained
        4.20.1 Definition
        4.20.2 Description
    4.21 Signable
        4.21.1 Definition
        4.21.2 Description
            4.21.2.1 Byte Sequence Preservation
            4.21.2.2 Partial Signatures
            4.21.2.3 Signature Interoperability
    4.22 Specialized codecs
        4.22.1 Definition
        4.22.2 Description
    4.23 Streamable
        4.23.1 Definition
        4.23.2 Description
    4.24 Support for Error Correction
        4.24.1 Definition
        4.24.2 Description
    4.25 Transcodable to XML
        4.25.1 Definition
        4.25.2 Description
    4.26 Transport Independence
        4.26.1 Definition
        4.26.2 Description
    4.27 Support for Open Content Models
        4.27.1 Definition
        4.27.2 Description
    4.28 Verifiable Integrity
        4.28.1 Definition
        4.28.2 Description
    4.29 Version Identification
        4.29.1 Definition
        4.29.2 Description
5 Processor Properties
    5.1 Draconian
        5.1.1 Definition
        5.1.2 Description
6 Additional Considerations
    6.1 Forward Compatibility
    6.2 Free
    6.3 Implementation Cost
    6.4 Single Conformance Class
    6.5 Small Footprint
    6.6 Ubiquitous Implementation
7 References

Appendices

A Acknowledgments (Non-Normative)
B XML Binary Characterization Properties Changes (Non-Normative)


1 Introduction

While XML has been enormously successful as a markup language for documents and data, the overhead associated with generating, parsing, transmitting, storing, or accessing XML-based data has hindered its employment in some environments. Use cases describing situations where some characteristics of XML prevent its effective use are described in another publication of the XBC WG. [XBC Use Cases]

The question has been raised as to whether some optimization of XML is appropriate to satisfy the constraints presented by those use cases. In order to address this question, a compatible means of classifying the requirements posed by the use cases and the applicable characteristics of XML 1.x must be devised. This allows a characterization of the potential gap between what XML 1.x supports and use case requirements. In addition, it also provides a way to compare use case requirements to determine the degree to which a common approach to XML optimization could be beneficial.

Hereinafter, the phrase "XML serialization" will often be preferred over the term "XML" in an attempt to emphasize that XML is a particular way of serializing an XML data model. It is in the interest of this WG to analyze alternative serializations, hence the importance of this distinction.

For the purposes of this document, a property is defined as a unique characteristic of an XML serialization which affects the serialization's utility for some collection of use cases. A consequence of this definition is that a property shall only be regarded as positive or negative in the context of one or more use cases. In other words, a collection of use cases is necessary to understand how a property affects the utility of a serialization.

The properties discussed in this document include existing properties of XML 1.x [XML 1.0] [XML 1.1], as well as properties not attributed to XML 1.x, but required by some use case(s) discussed in [XBC Use Cases]. Whenever possible, property definitions are based on established vocabulary, and reference official documents and definitions recognized by the XML and other relevant communities.

2 Design Goals of the XML Serialization

The XML 1.0 recommendation [XML 1.0] outlines a number of design goals or constraints which resulted in the creation of XML as it is known today. The design goals were:

  1. XML shall be straightforwardly usable over the Internet.

  2. XML shall support a wide variety of applications.

  3. XML shall be compatible with SGML.

  4. It shall be easy to write programs which process XML documents.

  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

  6. XML documents should be human-legible and reasonably clear.

  7. The XML design should be prepared quickly.

  8. The design of XML shall be formal and concise.

  9. XML documents shall be easy to create.

  10. Terseness in XML markup is of minimal importance.

Compatibility with SGML [ISO 8879] was a key design goal during the conception of XML. In fact, XML is regarded as a subset (or profile) of SGML, whose main purpose was to reduce the inherent complexity of SGML. By reducing this complexity, e.g., the number of options, XML became much simpler to implement than SGML, and this in turn resulted in the availability of a myriad of tools and APIs. It is precisely these tools and APIs (as well as the phenomenal growth of the Internet) that have attracted a number of different communities to XML.

3 XML Data Models

A serialization must be defined with respect to a data model (DM). Given that multiple DMs have been identified by the XML community, and in order to broaden the applicability of this document, the XBC WG decided to define properties in a DM-agnostic manner.

The following DMs have been identified to be in scope: XML Infoset [XML Infoset], Post-Schema-Validation Infoset [Schema Part 1] and XQuery 1.0 Data Model [XQuery DM]. Unless otherwise specified, a property whose definition refers to a DM must be considered in the context of each of the DMs listed in this section.

Editorial note: SP, 16 September 2004

The list of DMs that are in scope is not intended to be final at this point. Additional DMs are being discussed by the WG and may appear in future versions of this document.

4 Serialization Properties

4.1 Accelerated Sequential Access

Editorial note: MC, 29 September 2004

There is currently disagreement in the group as to whether this is a conceptually viable property. However, the group agrees that it is important to consider in evaluating use cases and should nonetheless be retained in the document. This property may be migrated to the Additional Considerations section in a future draft.

4.1.1 Definition

Accelerated sequential access is the ability to sequentially stream through an XML file when searching for data model items, locating them more rapidly than is achievable with character-by-character comparison.

4.1.2 Description

Accelerated Sequential Access is similar to the 4.16 Random Access property in its overall objective of reducing the amount of time needed to access data model items, but differs in the method used to accelerate access. In random access, lookup is performed in constant time through the use of a table, while in accelerated sequential access data model items are searched in streaming mode, resulting in a lookup time that is related (yet not necessarily proportional) to the number of data model items in the document.

One approach to supporting this property is through the inclusion in the XML document of an index which allows skipping over content as the document is read. For example, an element index might contain the offset to the start tag of the next peer element. If the application recognizes that a desired data model item is not in the current element, or in the children of the current element, it will be able to restart the match at the offset of the next peer element without inspection of all data model items contained in the current element. A format that enables faster matching via the conversion of strings to tokens can also be considered as supporting the Accelerated Sequential Access property.
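
To make the index idea concrete, the following Python sketch shows how a peer-element offset index could be used to skip whole subtrees during a sequential search. The record layout is invented for this illustration; it is not part of any format proposal.

    # Each index record describes one element: its name, the byte offset
    # of its start tag, and the byte offset of its next peer element
    # (one past the end of its subtree).

    def find_first(index, path):
        """Return the byte offset of the first element matching a simple
        name path such as ("order", "item"), skipping whole subtrees
        whose root name does not match the next step of the path."""
        def index_of_offset(offset, lo, hi):
            for j in range(lo, hi):
                if index[j][1] >= offset:
                    return j
            return hi

        def search(lo, hi, steps):
            i = lo
            while i < hi:
                name, start, next_peer = index[i]
                end = index_of_offset(next_peer, i + 1, hi)
                if name == steps[0]:
                    if len(steps) == 1:
                        return start
                    hit = search(i + 1, end, steps[1:])
                    if hit is not None:
                        return hit
                i = end            # jump to the next peer: children skipped
            return None

        return search(0, len(index), list(path))

    index = [
        ("order",  0, 90),   # <order> subtree occupies bytes 0..89
        ("meta",  10, 40),   #   <meta> subtree, skipped entirely
        ("note",  15, 35),   #     never inspected
        ("item",  40, 80),   #   <item> at byte 40 is the match
    ]
    print(find_first(index, ("order", "item")))   # -> 40

Only the offsets recorded in the index are inspected; the content of skipped subtrees is never touched, which is the source of the acceleration.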

Performance of accelerated sequential access is measured by the time (algorithmic complexity) required to find data model items, the time needed to construct any indexes and special structures, and also the size of those indexes and special structures (impacting memory consumption and bandwidth utilization in the transport). Most implementations will support modification of the XML document; the cost of updating the indexes or special structures is another performance factor which can be measured.

4.2 Byte Preserving

Editorial note: SP, 16 September 2004

This property is still under development.

4.2.1 Definition

A format A is byte-preserving with respect to some other format B if a file in format B when transformed to format A and then back to format B always results in a file that is byte-for-byte identical to the starting file. A format is universally byte-preserving if this definition holds for all B.

4.2.2 Description

Byte preservation is of interest when the precise sequence of bytes in some file must be maintained, yet that format is unsuitable for some purpose, such as transmission. The most common byte-preserving formats are those used by loss-less compression programs, such as gzip. Such formats are typically universally byte-preserving, in that they are byte-preserving with respect to all other file formats.
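
Universal byte preservation can be demonstrated in a few lines of Python using the standard gzip module:

    import gzip

    # gzip is universally byte-preserving: compressing and then
    # decompressing yields a byte-for-byte identical copy, whatever
    # the input format happens to be.
    original = b"<doc a='1'>any bytes at all \x00\xff</doc>"
    round_tripped = gzip.decompress(gzip.compress(original))
    assert round_tripped == original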

Other formats may be byte-preserving with respect to only one or a limited number of formats, but this is not common.

Interestingly, formats are not necessarily byte-preserving with respect to themselves. For example, an XML file can be transformed to another XML file in such a way that all information items are preserved yet the byte sequence differs.

4.3 Compactness

4.3.1 Definition

Compactness refers to the size of the in-memory or otherwise stored representation of an encoding of XML. Compactness is achieved by ensuring that the format includes as little extraneous information as possible. Extraneous information is any information that is not needed in order to process the encoded form completely and properly.

4.3.2 Description

A compact encoding can be achieved in different ways: a number of techniques, such as lossy versus loss-less, schema-based versus non-schema-based, and delta-based versus non-delta-based, among others, have been considered. JPEG is an example of a lossy encoding, in which bits of the original document are thrown away (and cannot be recovered) in order to achieve a compact representation. The same type of lossy encoding could be employed for XML documents in order to achieve compactness.

Alternatively, differing degrees of compactness can be achieved with a loss-less encoding, whereby redundant information is removed. No information is lost; compactness is achieved solely through the removal of redundancy. A loss-less encoding will typically be less compact than a lossy encoding.

Furthermore, a schema-based encoding of an XML document can achieve a degree of compactness by using prior knowledge about the structure and content of a document. A serialization is schema-based if it uses information from the document's schema to achieve a better degree of compactness. This information could be used later as the document is processed or reconstituted. It is worth pointing out that although not self contained, a schema-based encoding is not inherently lossy given that, in principle, a decoder can reproduce the data model using both the encoding and the schema. Thus, as with other techniques, a schema-based encoding can be lossy or loss-less.

Another mechanism to achieve compactness is through a delta-based encoding.  Delta-based encodings are generated by comparing an original document with a secondary, reference document. The resulting document is the delta between the original and the reference document. This type of encoding can be lossy or loss-less. In either case, the original document can be reconstituted by using both the delta and the reference document.
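
The reconstitution step can be illustrated with Python's standard difflib module. Note that ndiff's delta retains common lines, so a production delta encoding aiming at compactness would store only the changed regions; the sketch only demonstrates loss-less reconstitution.

    import difflib

    reference = ["<doc>\n", "  <a>1</a>\n", "</doc>\n"]
    original  = ["<doc>\n", "  <a>1</a>\n", "  <b>2</b>\n", "</doc>\n"]

    # The delta records how the original differs from the reference.
    delta = list(difflib.ndiff(reference, original))

    # Loss-less reconstitution: the original is recovered exactly
    # from the delta plus the reference document.
    restored = list(difflib.restore(delta, 2))
    assert restored == original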

The advantages of a compact representation are:

  1. Storage: Large XML documents can be stored in the compact format, thus saving space.

  2. Transmission: Large XML documents can be transmitted more efficiently when represented in a more compact form, thus saving time.  This is especially important when sending XML over low-bandwidth connections.

A disadvantage of any compact encoding might be the additional time and CPU required to generate the encoding.

4.4 Data Model Versatility

4.4.1 Definition

A format supports a data model if every instance of that data model can be serialized to and deserialized from the format without any information loss. Note that this does not imply that every data model item will be part of the format; it only requires that every data model item be at least computable from the format.

4.4.2 Description

A format that supports several data models is more versatile than a format that has been designed with one data model in mind. A given solution may be better suited to representing a specific data model, but given the number of different data models available for XML (cf. 3 XML Data Models), it is desirable to have the flexibility to support others. Support for a given data model is not necessarily a boolean measure. Some formats may require external information in order to support a certain data model: the amount of information that is needed can be used to measure how suitable a format is at representing the data model in question. Solutions which need less external information will likely create smaller delivery sizes--assuming the external information is factored into the equation.

4.5 Efficient Update

4.5.1 Definition

Efficient Update refers to the efficiency of applying changes to a part of a serialization. This property is important for applications that require the modification/insertion/deletion of specific data model items.

4.5.2 Description

The least efficient case requires the de-serialization of the data prior to the application of the changes, followed by the serialization of the data once it has been inserted, modified or deleted. The most efficient case would apply changes directly on the serialized data, thus avoiding the need to cross data representation boundaries. It is worth pointing out that methods that use incremental update techniques, in order to avoid the aforementioned steps, are not guaranteed to be the most efficient, given the cost associated with tracking all the changes applied to a data item upon retrieval.

There are three aspects under which this property should be evaluated:

  1. Efficiency of update: This is the time required to apply the changes, starting from the original serialization up until the updated serialization is produced.

  2. Efficiency of retrieval: This is the time required to retrieve a (possibly) modified value.

  3. Compactness: This is the additional space required for the application of an update.
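
As an illustration of the least efficient case described above, the following Python sketch de-serializes an entire document, applies a single change, and re-serializes it; the element and attribute names are invented for the example.

    import xml.etree.ElementTree as ET

    doc = b'<order><item sku="1234" quantity="1"/></order>'

    root = ET.fromstring(doc)           # de-serialize the whole document
    for item in root.iter("item"):
        if item.get("sku") == "1234":
            item.set("quantity", "2")   # the actual modification
    updated = ET.tostring(root)         # re-serialize everything

A format with strong support for Efficient Update would instead patch the affected bytes of the serialization directly, avoiding both conversions.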

4.6 Embedding Support

4.6.1 Definition

A format supports embedding to the extent to which it provides for the interchange and management of embedded files of arbitrary format.

4.6.2 Description

A variety of use cases call for the inclusion of files of one type inside another: images, video, and sound embedded within multimedia documents; arbitrary files inside Web service messages; large datasets bundled with metadata. File formats vary in their support for this use.

Formats designed for narrowly constrained purposes, such as GIF, typically make no provision for embedding. While it may be possible to encode some additional data in certain metadata fields within such formats, doing so violates the spirit of the file format and requires tight agreement between the sender and receiver for interchange. Such formats effectively offer no interchange or management support and are not considered to support embedding.

Other formats, such as XML and TIFF, permit embedding simply by virtue of flexibility: they do nothing to prevent file embedding. However, because these formats have no mechanism for distinguishing an embedded file from other types of data, tight agreement is still required between the sender and the receiver for interchange. Such mechanisms are also not easily manageable.

Other formats, such as XSL-FO and PDF, provide specific embedding points. By establishing a general mechanism, they make embedded data interchangeable and manageable because there is an a priori agreement for creating and identifying embedded files.

Finally, there are packaging formats, such as MIME multipart/related and ZIP, which exist solely for the purpose of containing embedded files. Packaging formats generally provide significant management capabilities by supporting metadata, signatures, encryption, and compression of embedded files. They are typically designed specifically for the interchange of these embedded files.
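
A packaging-format sketch using Python's standard zipfile module; the bundle contents are invented for the example:

    import zipfile

    # A ZIP container holds the XML document plus an embedded binary
    # file; per-entry metadata (names, sizes, CRCs, compression) is
    # managed by the packaging format itself.
    with zipfile.ZipFile("bundle.zip", "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("report.xml", "<report><img href='chart.png'/></report>")
        z.writestr("chart.png", b"\x89PNG...")   # placeholder image bytes

    with zipfile.ZipFile("bundle.zip") as z:
        print(z.namelist())   # the container enumerates its embedded files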

Evaluation of a format for embedding support should take into account both interchange and manageability, as described here, as well as support for related properties like 4.3 Compactness, 4.16 Random Access, 4.21 Signable, 4.7 Encryptable and 4.23 Streamable.

4.7 Encryptable

4.7.1 Definition

In principle, every format (viewed as a sequence of bytes) is encryptable. However, there are features specific to a format which make encryption more or less useful: (1) support for content-level encryption, (2) support for non-encrypted meta-data and (3) support for self-contained encrypted data.

4.7.2 Description

Encryption of a raw file is the simplest way of encrypting and as such applies to arbitrary files. Any file can be encrypted using tools such as PGP. The resulting file can be decrypted yielding a bit-for-bit identical copy of the original file. The resulting encrypted file has limited value as it cannot be recognized as containing any given format. For many applications (such as Digital Rights Management or DRM) this is insufficient as it is often necessary to encrypt only a portion of the content.

Understanding of a file's representation opens up further encryption possibilities, allowing only portions to be encrypted, together with the inclusion of additional meta-data accompanying the encrypted portion. The attached meta-data can contain information about what was encrypted (i.e., the original format of the encrypted data and, possibly, information about who holds the key or who is authorized to view the file). This type of file format is more useful since it allows the user to examine and identify the file type and then use the meta-data information to obtain the decryption key (e.g., when purchasing a song with DRM).
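
A sketch of content-level encryption with non-encrypted meta-data, written in Python against the third-party 'cryptography' package; the envelope structure is invented for the example.

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()
    cipher = Fernet(key)

    card_number = b"4111 1111 1111 1111"
    token = cipher.encrypt(card_number)   # only the sensitive content

    # The structure and the key-holder meta-data remain readable, so a
    # receiver can identify the file type and locate the key before
    # decrypting the protected portion.
    envelope = (b'<payment>'
                b'<meta keyHolder="acquirer.example" alg="Fernet"/>'
                b'<card encrypted="true">' + token + b'</card>'
                b'</payment>')
    assert cipher.decrypt(token) == card_number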

Finally, the question of whether encrypted data is self-contained may be of relevance to some applications. Encrypted data is self-contained if it retains its meaning even without the context of the surrounding non-encrypted data.

4.8 Extension Points

Editorial note: SP, 27 September 2004

This property is still under development.

4.8.1 Definition

An extension point is a method for easily extending a format and its implementation. There is a gradient of options: (1) no extension points, (2) a single extension point, (3) extension points in many places, and (4) extension points everywhere.

4.8.2 Description

The extension might be for a new, alternate, or experimental version of the format, an implementation of a layered format, or an application specific or otherwise proprietary extension. An important consideration relative to concerns about interoperability, evolution, and debugging is whether features can be represented by standard convention in XML. Some features that extend existing data models may be part of the initial version of a format while other potential additions may be implemented as extension points.

4.9 Fragmentable

4.9.1 Definition

A format is said to be fragmentable when it supports the ability to encode data model instances that do not represent an entire document, together with sufficient context for the decoder to process them in a minimally meaningful way.

4.9.2 Description

While typical usage of XML involves exchanging entire documents (the special case of external parsed entities notwithstanding), it is sometimes desirable to support the ability to exchange smaller, independently exploitable parts of a document. Cases in which such functionality is desirable include, for instance, the transmission of deltas, some error resilience mechanisms, and the ability to encode the [XQuery DM], which can represent data items that are not documents on their own.

This property is similar to 4.23 Streamable in that processors featuring these properties can deal with small parts of a document, but it is different in that the fragments can be treated independently and in arbitrary orders--unlike 4.23 Streamable where atomic items are treated in document order. This difference incurs additional requirements to support the transmission of the context required to process a fragment (at the very least the set of in-scope namespaces, and possibly also the values of xml:base, xml:space, and xml:lang).
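
A sketch of what a fragment interchange unit might carry, expressed here as a Python structure; the layout is invented, and a real format would define its own encoding for this context.

    # The minimal context a decoder needs to process the fragment in a
    # meaningful way travels alongside the encoded fragment itself.
    fragment_unit = {
        "context": {
            "namespaces": {"inv": "http://example.com/invoice"},
            "xml:base":   "http://example.com/docs/",
            "xml:lang":   "en",
            "xml:space":  "default",
        },
        "fragment": b"<inv:item>...</inv:item>",
    }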

This property can be measured as a boolean indicating whether it is supported or not, and further refined by the level of granularity (how big is the smallest atom that can be encoded) and the genericity (can any item in a data model constitute a fragment) supported by a given format.

In addition to the ability to process fragments in isolation, it is possible to consider storing one or more parts of a document instance as immediately extractable fragments, so that they can be pulled out of an instance with little or no additional processing cost.

4.10 Hinting

Editorial note: SP, 27 September 2004

This property is still under development.

4.10.1 Definition

TBD

4.10.2 Description

TBD

4.11 Human Readable/Editable/Deducible

4.11.1 Definition

A format satisfies this property if the data encoded can be understood and possibly modified without the assistance of either a format specification or implementations thereof.

4.11.2 Description

Situations often arise in system testing and troubleshooting in which it is convenient to be able to examine or repair data using a limited tool set. For example, a text editor is likely available, but a format-specific parser may not be. Similarly, the data may conform to some specification but that specification may not be readily available.

A similar situation arises in archiving applications. For short-term archiving, on the order of a few decades, it is generally reasonable to assume that format specifications and implementations are available. But for long-term archiving, on the order of centuries or longer, it is best to assume that the format specification and implementations will be lost or unusable.

These problems are related; they are both fundamentally about how easy it is to discover the data in a file when very little about that file is known. To that end, a format satisfying this property should:

  • Use natural language as text and avoid the use of magic binary values or compression. (A magic value is one which is assigned arbitrarily and whose meaning cannot be derived except by reference to some external authority, such as a file format specification. For example, if a format used the number '73' to indicate a black-and-white image and '29' to indicate color then it uses a magic number, as there is no decipherable logic to these assignments. Compare this to a format which uses '2' and '24' to indicate black-and-white and color, respectively, and is therefore not using a magic number, since these values are the bit depths of the images.)

  • Use a limited number of encoding mechanisms, and prefer those that are most obvious. For example, use RGB triples to encode color, not both RGB and CMYK.

  • Be self-contained, as external information (such as a referenced schema) may not be available. (cf. 4.20 Self Contained)

  • Maintain locality of data model items. For example, keep element names inline as in XML, and not consolidated into a token dictionary located elsewhere in the file.

This property can also be improved by using identical meaning in different forms, as does the Rosetta Stone, which contains the same text in several languages. However, this might be best considered a property of a document and not of a format.

4.12 Integratable into the Web

Editorial note: SP, 16 September 2004

This property is still under development.

4.12.1 Definition

TBD

4.12.2 Description

TBD

4.13 Integratable into XML Family

Editorial note: SP, 16 September 2004

This property is still under development.

4.13.1 Definition

TBD

4.13.2 Description

TBD

4.14 No Arbitrary Limits

4.14.1 Definition

The evolution of a format (without breaking backwards compatibility) may be hindered by the presence of arbitrary limits such as maximum sizes, bounded lengths, support for a fixed set of character encodings, etc. The degree to which a format supports no inherent limits is characterized as follows: (1) no inherent limits, (2) few but reasonable limits, (3) too many limits.

4.14.2 Description

Arbitrary limits can cause discontinuities in format evolution which translate into application difficulties for unforeseen uses. Whenever possible, a format should avoid arbitrary limits by supporting the encoding of unbounded scalars, unbounded lengths, an open-ended set of character encodings, etc. Failure to do so often results in incompatibility problems between newer and older versions once a bound is deemed inappropriate due to advancements in technology or new and unforeseen use cases of the format.
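
One common technique for avoiding fixed-size limits is a variable-length integer encoding, sketched here in Python in the style of LEB128: an arbitrarily large non-negative scalar occupies exactly as many 7-bit groups as it needs.

    def encode_varint(n: int) -> bytes:
        """Encode an unbounded non-negative integer, 7 bits per byte;
        the high bit of each byte signals that more groups follow."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def decode_varint(data: bytes) -> int:
        n, shift = 0, 0
        for byte in data:
            n |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return n
            shift += 7
        raise ValueError("truncated varint")

    assert decode_varint(encode_varint(10**30)) == 10**30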

4.15 Processing Speed

4.15.1 Definition

This property refers to the speed at which a given format can be generated and/or consumed for processing.

4.15.2 Description

There are three broad areas of processing speed with regard to an XML format:

  1. Serialization: The generation of the format from a data model.

  2. Parsing: The reading of the format in order to process and extract various pieces of information contained in it.

  3. Data Binding: The creation of an application data model from the data contained in the format.

It is sometimes desirable for an XML format to allow all areas to be performed in a more efficient manner than is currently possible with XML. For example, it should be possible to serialize a message faster than using XML. Furthermore, parsing the resulting format should be faster than parsing XML.

Processing Speed should be considered in an end-to-end manner, from application accessible data on one end to application accessible data on the other end. In other words, it is desirable to have a process that is efficient not only in parsing, but in generation, transmission and data binding. However, not all applications need symmetric speed. Some applications may require efficient parsing without specific needs as to how long it takes to generate the format. Other applications may have the opposite concerns.
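
A toy end-to-end measurement in Python, timing each area separately so that asymmetric needs can be compared; this is a single run with no warm-up, whereas a real benchmark would repeat and average.

    import time
    import xml.etree.ElementTree as ET

    doc = b"<root>" + b"<item>42</item>" * 10_000 + b"</root>"

    t0 = time.perf_counter()
    root = ET.fromstring(doc)                     # parsing
    t1 = time.perf_counter()
    values = [e.text for e in root.iter("item")]  # crude data binding
    t2 = time.perf_counter()
    out = ET.tostring(root)                       # serialization
    t3 = time.perf_counter()

    print(f"parse {t1 - t0:.4f}s  bind {t2 - t1:.4f}s  serialize {t3 - t2:.4f}s")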

4.16 Random Access

4.16.1 Definition

Random access is the ability to look up the location of data model items in a document, as opposed to searching for data items by sequential traversal of XML events or XML structures. Although the objective of Random Access is to reduce the amount of time (algorithmic complexity) needed to access data model items, the property is characterized by its support for a random access method to the data.

4.16.2 Description

Even though the random access method is intended to reduce lookup time, a cost will be associated with the structures which support the lookup. The cost in terms of processing time will be at least that of a sequential scan of the input document, with additional costs in terms of storage, memory consumption, and bandwidth utilization. The random access property will be of interest in those use cases where that cost is less significant than the access cost to the document without random access.

Many of the implementation characteristics of random access in XML are related to the ability to build a table for efficient access to data items in the document. The following example illustrates a simple form of a random access table. The byte position and length are obtained for each data item in the document using a token scanner; this information is later stored in a table and associated with the document. Given a scheme for looking up a data item in the table, an application could directly read any data item in a memory-mapped byte array (using the byte position and length) or it could perform a sequential byte traversal over the byte stream.
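
The following Python sketch mimics that example with a deliberately naive token scanner; it ignores nesting, attributes and namespaces, all of which a real scanner would handle, and in practice the document would typically be a memory-mapped byte array rather than an in-memory bytes object.

    import re

    data = b"<catalog><name>pen</name><price>1.50</price></catalog>"

    PATTERN = re.compile(rb"<(\w+)>([^<]*)</\1>")

    def build_access_table(data):
        """One sequential scan records the byte position and length of
        each element's text content, keyed by element name."""
        table = {}
        for m in PATTERN.finditer(data):
            table.setdefault(m.group(1), []).append(
                (m.start(2), len(m.group(2))))
        return table

    table = build_access_table(data)
    pos, length = table[b"price"][0]
    print(data[pos:pos + length])   # b'1.50' -- read without traversal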

The access table may contain information other than the position and length of the data model item. Some additional information will be indispensable in almost all cases, since these two fields only support look-up of data items by position. Examples of other information associated with each entry in the access table include its kind, its schema-described data type, as well as other schema-described properties such as optionality and namespace scope (alternately, schema-aware documents could simply reference schema information which is available in the schema itself, stored outside of the document). This additional information may be more or less useful to specific types of applications; as stated before, the cost of construction and access, and the size of the table, will be balanced against the potential benefit to specific types of applications.

The access table may be complete, selective, on-demand, or heuristic. A complete access table would provide addressing information for all data items in the document. A selective access table would provide addressing information for some of the data items in the document. For example, a random access system can provide addressing information only for elements. An on-demand access table can provide addressing information for specific infoset items that have been requested by the application. Finally, a heuristic random access system can provide addressing information for data items based on some criteria such as prior experience with related documents.

The type of access table has bearing on the time needed to construct it, the size of the access table, and the time needed to obtain addressing information for a data item. These constraints are the trade-offs made against the level of addressability into the document the table offers.

The actual construction of the access table and the format of the lookup results is largely an implementation decision but some design features will better support different approaches to enabling random access to data model items efficiently. For example, a selective implementation of random access to data items which supports lookup by element name might be best implemented as a hash table. An on-demand implementation that uses a sequence of XPath statements to specify data items of interest may return an array of nodesets whose order is determined by the input sequence.

Another characteristic of data item lookup is whether it is guaranteed to return a unique data item or whether multiple data items may match a lookup and, in that case, whether multiple items, an iterator, or a single selected item will be returned. This characteristic will again have a bearing on performance issues, and trade-offs may reasonably be made according to the needs of the application.

In allowing direct access into a document, the random access method can potentially have some impact on the binary format's fragmentability property. A format which supports both random access and fragmentability must have a mechanism for providing the context of a data item. The context should include active namespaces and, possibly, hierarchical context such as enclosing and ancestor elements. This context information also has a bearing on the support of schema-awareness allowed through the random access method (namespaces are needed to resolve the schema; hierarchical context may be needed to determine the relevant content model). The context is naturally preserved when the index or access table is absolutely independent, from the access point of view, from the infrastructure which supports fragmentation. This design is most feasible when the content of the document is relatively static and primarily accessed for reading. In a more dynamic situation the random access method and fragmentation infrastructure may need to be interrelated and dependent. In this case, the inclusion of context information may impact either or both the speed of random access and the storage required by the access table. A format supporting random access should specify whether it supports full, partial, or no fragmentability.

A fundamental distinction in the implementation of the Random Access property is whether the format contains the access table or is simply guaranteed not to be incompatible with the implementation of random access methods. In the first case the access table is bound to the XML document instance, enabling it to be transported and accessed by different processing nodes while retaining all information necessary for random access. In this case, a standard will have to explicitly specify the format of the access table. In the second case, transportability of random access information embedded in the document is not required, but the ability to build an access table external to the document (or multiple documents) is. For example, in a persistent store (e.g., a database), an index can be built over all documents in the store but without the additional storage overhead associated with including an access table in each document. Conditions incompatible with the random access property of a binary XML format in this scenario may include such things as the impossibility of constructing an index due to, for example, a compression scheme that does not store data sequentially, or prohibitive processing time requirements for the construction of an index dictated by the complexity associated with the format.

It is also possible for these two approaches to coexist; the document may, optionally, contain an access table and also guarantee efficient indexability. A consequence of allowing the embedded access table to be optional is that the document must be complete and capable of being correctly processed without the access table. It must also be possible to construct and add the embedded access table to an XML document which does not contain one. Some mechanism must exist to enable applications to identify whether the access table is present or not, and to negotiate the format when documents are interchanged. The notion could be extended to support different types of access tables appropriate to different application scenarios.

Two operations are associated with random access: random read (extract) and random update. The random read operation extracts information from an XML document. This information can contain one or several values and/or one or several data model items (subtrees). The random update operation allows data items in the XML document to be inserted, deleted or replaced using random access addressing mechanisms to locate the position in the document where the update is to be made.

There are multiple implementation techniques for the random update operation. They can either update documents directly by modifying the document representation itself, or do it indirectly by storing new parts of the document in separate tables. In any case, for interoperability, random update must allow the updated document to be written out to an XML 1.x representation with the updates in place.

Any write process presents challenges with respect to synchronization of updates when updates may come simultaneously from multiple threads or processes. A random access update implementation either automatically or optionally synchronizes updates or assumes the user takes responsibility for synchronizing data requests.

Other characteristics of random update include whether it can enforce XML well-formedness and/or schema compliance of updated data, and whether such enforcement takes place each time an update is made, or on request, or when the XML is written out.

Random access is tightly associated with the efficiency of read and update operations. Generally speaking, it is possible to build access tables for an XML document in its textual representation. However, this would require a processor to maintain all structural information (references to each item and property) as well as all data type information (for schema-aware documents) outside of the document itself. Such information very often takes significantly more space than the document itself; additionally, most random access operations would require too many reads and writes in this case.

4.17 Robustness

4.17.1 Definition

A format is said to be robust if it allows processors to insulate erroneous sections in a way that does not affect the processability of the remaining sections.

4.17.2 Description

Processors should minimize the size of the sections affected by errors (within some limited boundary of the error positions) and be able to unequivocally interpret the remaining sections without compromising the partial format integrity. From an application's point of view, the result is a partially complete data model instance with sporadic information losses caused by the errors.

The robustness property of a format can often be translated into (i) the processor's ability to detect errors if there are any and (ii) the format's facility for skipping ahead to a position where the processor can resume document processing. To support the processor's ability to detect errors, dedicated redundancy such as a cyclic redundancy check (CRC) is added to the format by so-called channel coding algorithms.
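
A minimal framing sketch in Python: each fragment is prefixed with its length and a CRC-32, so a processor can insulate a corrupted fragment and resume at the next one. The frame layout is invented for the example; a corrupted length field would additionally require resynchronization markers, omitted here.

    import struct
    import zlib

    def frame(fragment: bytes) -> bytes:
        return struct.pack(">II", len(fragment), zlib.crc32(fragment)) + fragment

    def read_frames(stream: bytes):
        pos = 0
        while pos + 8 <= len(stream):
            length, crc = struct.unpack_from(">II", stream, pos)
            body = stream[pos + 8 : pos + 8 + length]
            pos += 8 + length
            if zlib.crc32(body) == crc:
                yield body         # good fragment
            # else: skip the bad fragment and resume at the next frame

    stream = frame(b"<a/>") + frame(b"<b/>")
    corrupted = stream[:9] + b"X" + stream[10:]  # flip a byte inside <a/>
    print(list(read_frames(corrupted)))          # only b'<b/>' survives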

There are applications whose constraints do not permit, or cannot afford the time for, data re-transmission; such applications must do their best to recover as much information as possible from the document even in the face of errors.

One such application is found in multimedia broadcasting systems that broadcast data to wireless devices. Broadcast media must be resilient to errors and must remain continuously processable on the devices even with sporadic corruption caused by transmission errors.

Another example is business document exchange over unreliable networks. Documents may consist of data of different levels of significance. Some portion of the document may be absolutely crucial for performing the business while other portions may be merely informative. It is often the case that businesses can move forward as soon as the crucial data is processed, even with errors found in the informative part.

The robustness property is primarily concerned with bit-stream-level errors and is generally agnostic of errors found at the document level, such as grammatical or semantic errors. Such document-level errors are better accounted for by 5.1 Draconian.

This property can be measured by the degree to which processors can confine the sections of the format affected by errors to the smallest possible neighborhood of the error positions. There are two different levels of robustness along this dimension:

  1. The first level is fragment-wise robustness, where processors can skip a whole fragment if errors were found therein, but can still process subsequent fragments correctly.

  2. The second level is not being able to restrict the affected section at all, with any error generally resulting in the immediate, fatal termination of the current processing.

When the format in question supports fragment-wise robustness, another relevant measurement for consideration is support for self-contained fragments. In general, self-containment of fragments makes it easy to implement error-checking algorithms and to skip over the affected sections or fragments--in general minimizing the size of the unprocessed data. For example, some formats that use breadth-based document ordering make fragments non-self-contained, so that errors in one fragment may pervasively affect the corresponding data model instance. Therefore, for those formats supporting fragment-wise robustness, there are two further levels of robustness: support for self-contained fragments or the lack thereof.

4.18 Round Trippable

4.18.1 Definition

Round-trippable is the property of being able to produce an XML instance or fragments from the unmodified intermediate representation which is equivalent to the original input.

4.18.2 Description

In the course of processing, an XML document may be taken from its serialized form to an intermediate representation and back again to its serialized form without the application of any changes. It may be necessary to determine, given the original and final serializations, whether the unmodified instances or fragments are equivalent. An XML format which is round-trippable supports this operation.

Equivalence verification is the process of being able to verify that two XML instances or fragments are equivalent. Equivalence verification is necessary to prove that round-tripping has been successful. Round-tripping is proof that no significant information is lost in the conversion to an intermediate representation. The same process of equivalence verification may also be used to test the equivalence of any set of instances or fragments without regard to how they were produced.

Equivalence may be exact or lossless. Exact equivalence means that an exact copy of the input XML instance or fragment can be produced from the intermediate format and verified. Lossless equivalence means that an XML instance or fragment can be output and verified which is identical to the input XML instance or fragment in all aspects which are significant for the equivalence algorithm and can only differ in those aspects which are not significant.

The process that takes XML instances or fragments and converts them to a representation suitable for equivalence verification is known as canonicalization. The representation is known as a canonical form. Canonicalization applies only to lossless equivalence: if equivalent instances or fragments are always exactly equivalent, canonicalization is not needed. Two types of lossless equivalence have been defined in the context of XML digital signatures [XML Digital Signature].

Round-tripping involves conversion of the serialized XML instance to a representation supporting one or more data models. Round-tripping will be possible only if the intermediate data model supports everything in the input that is necessary for equivalence. Lossless round-tripping of XML 1.x through the data models in 3 XML Data Models is possible, but exact round-tripping is not.

Canonicalization is necessary for equivalence verification of XML 1.x when round-tripped through any of the aforementioned data models. Canonicalization of XML 1.x for digital signature verification is an oft-cited performance problem. An XML format which assures full compatibility between its serialization and supported data models, and is capable of guaranteeing exact equivalence, could eliminate the need for canonicalization. Short of full support for exact equivalence, the performance problem associated with canonicalization could be ameliorated with better compatibility between the serialization and the data model, reducing the amount of processing needed to obtain the canonical form.
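
Python's standard library (3.8 and later) ships a Canonical XML implementation that makes the idea concrete: two serializations of the same data model instance differ byte-wise but share one canonical form.

    import xml.etree.ElementTree as ET   # canonicalize() needs Python 3.8+

    # Attribute order and quoting differ, so the byte sequences differ.
    a = '<doc b="2" a="1"><e/></doc>'
    b = "<doc a='1' b='2'><e></e></doc>"
    assert a != b

    # After canonicalization the serializations are identical, which is
    # what makes lossless equivalence verifiable.
    assert ET.canonicalize(a) == ET.canonicalize(b)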

4.19 Schema Instance Change Resilience

Editorial note: MC, 21 September 2004

This property is still under development.

4.19.1 Definition

A serialization format is Schema Instance Change Resilient if areas of interest that do not change or, with less flexibility, are changed only in restricted ways, such as additions, remain valid to receivers expecting earlier or later versions even after other areas have changed. An area of interest may be defined as a subtree, an XPath, or other infoset items.

4.19.2 Description

It is very common for a data format instance definition to change over time. A format supports limited Schema Instance Change Resilience if only the schema or related metadata needs to be redistributed when changes occur. A format is fully flexible if any change is supported, and less so if the supported changes are restricted, to additions for example. Full support means that receivers need changes only when directly affected by modifications, such as the removal of an infoset item.

This property is related to 4.20 Self Contained and may be fulfilled with or without self containment. A non-self-contained solution might rely on loadable schema meta-data or delta parent instances.

There are three categories of serializations with respect to Schema Instance Change Resilience: "requires schema-related updates for any change", "does not require schema-related updates for certain changes, such as additions", and "does not require schema-related updates".

4.20 Self Contained

4.20.1 Definition

A serialization of an XML data model instance is self contained if the only information required to reproduce the data model instance is:

  1. the representation of the data model instance;

  2. the serialization format specification.

This is a ternary property with the following possible values:

  1. not self contained;

  2. support for self containment upon request;

  3. always self contained.

4.20.2 Description

Applications in which the receiver is unable to request or receive additional information besides the serialization of the data model instance require that this property have the third value, "always self contained". This is desirable in applications in which it would be difficult, impractical, or costly to access additional resources.

An example of the case given above can happen when there is a significant time lag between generation and consumption of the serialization such as in archiving applications. To ensure that the receiver is capable of reproducing the data model instance from the archived serialization, the serialization format must be self contained. Accordingly, no additional information is required which might no longer be available at the time of consumption of the archived serialization.

Another example of the case given above can occur when deploying intermediary applications that are inserted between the sender and the receiver of data model instances without the prior knowledge of either. For instance, XML firewalls and load balancing applications must efficiently inspect the content of data model instances in order to make decisions. A serialization format that is "always self contained" is helpful in such situations. However, the applications may still be able to function in some cases if they can be pre-configured with the additional knowledge external to the format definition that is required to process the serializations.

XML Schema-based serialized representations allow the efficient representation of data model instances in bitstreams due to knowledge of the structure and datatypes defined in the XML Schema. Such a representation requires the receiver to have access to the XML Schema definitions of the represented XML document in order to access the contained data model instance.

An XML Schema-based serialized representation is therefore not self contained: besides the serialized representation of the data model instance and the format specification, the XML Schema definition is also required to reproduce the data model instance. However, an XML Schema-based serialized representation can be "self contained" if the relevant information of the XML Schema itself is contained in the representation; an optional feature to include the schema would be an example of "support for self containment upon request".

The encryption (see 4.7 Encryptable) of a data model instance is related to the Self Contained property. An encrypted representation is "not self contained" for the purpose of protecting the encrypted information (i.e., the key for decryption is not contained in the representation). However, encryption is considered out of scope for this discussion: it is applicable to any representation format and thus orthogonal to the Self Contained property, which is discussed in this section with respect to the data model serialization format itself.

4.21 Signable

4.21.1 Definition

A format is signable to the extent to which it makes the creation and validation of digital signatures both straightforward and interoperable.

4.21.2 Description

In principle any file format is signable, in that the bytes which compose any file may be fed to a digital signature algorithm. Signatures, however, are only useful when they can be both created and verified, and verification requires that the verifier be able to determine precisely which bytes were signed, by whom, and how. Formats vary in how amenable they are to specifying and maintaining this information, and this in turn can be a measure of how "signable" they are.

4.21.2.1 Byte Sequence Preservation

Other things being equal, file formats which define a one-to-one relationship between each possible data model instance and the serialization of that instance are more easily signed because they require that processors maintain the byte sequences exactly. A text file can be said to operate this way, in that a change to any byte of the file results in a different text file.

Other formats define a one-to-many relationship between the data model instance and possible serializations. Such formats permit (or even invite) processors to modify the byte sequences. XML is such a format; for example, a processor could replace each character entity in an XML document with a numeric character reference, thereby encoding the same information with a significantly different byte sequence. The ability to sign XML then requires the development of a canonicalization algorithm which defines a particular serialization of each data model instance that is used for signature purposes.

Finally, a format which has a one-to-many relationship between data model instances and serializations but also defines a canonical serialization might be considered as falling in between the two extremes; signing and verifying is more work than if there is only one serialization but it saves the effort of developing the canonical format itself.
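
The consequence for signing can be sketched in Python: signing the canonical form rather than the raw bytes keeps the signature stable across harmless re-serialization. HMAC stands in here for a real public-key signature such as those of [XML Digital Signature], and the key is a placeholder.

    import hashlib
    import hmac
    import xml.etree.ElementTree as ET   # canonicalize() needs Python 3.8+

    secret = b"shared-secret"            # placeholder key material

    def sign(doc: str) -> str:
        # Sign the canonical serialization, not the raw bytes.
        c14n = ET.canonicalize(doc).encode("utf-8")
        return hmac.new(secret, c14n, hashlib.sha256).hexdigest()

    original     = '<doc b="2" a="1"/>'
    reserialized = "<doc a='1' b='2'></doc>"   # same data model instance
    assert sign(original) == sign(reserialized)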

4.21.2.2 Partial Signatures

It is often desirable to sign only a portion of a file, such as in electronic document use cases in which multiple signatures are attached to a single document. This capability concerns both which portion of the file is signed and which portion is not signed (and may therefore be modified without breaking the signature). In such use cases, the signed portion is determined at the semantic (i.e., schema) level. For example, a signature may be applied to page one of a multi-page document but not to any other pages.

It is critical that such signatures be calculated over all portions of the file which encode information relevant to the semantic construct; otherwise, portions not included may be modified, making the signature insecure. For example, consider an XML document in which a sub-tree uses a namespace prefix. If the prefix declaration is outside the sub-tree and therefore not covered by the signature, then the declaration can be altered--thus changing the meaning of the signed portion--without breaking the signature.

Other things being equal, then, formats which place all bytes representing the encoding of semantic constructs in a contiguous range are more signable because those ranges are more easily determined and specified. Formats which permit such ranges to be created but do not guarantee them are less signable because the application must either determine all ranges which must be signed or arrange for that information to be placed in a self-contained sub-tree.

Finally, there are formats which will never place semantic constructs in contiguous ranges but scatter that information into tables and other mechanisms used to achieve compactness or other format properties. For example, a format may place element names in a vocabulary index table. That table may contain names of some elements in the signed region and others which are not; one must then determine how much of the table to sign and how to permit subsequent modifications to only break the signature when necessary. Such formats are least signable with respect to partial signatures.

4.21.2.3 Signature Interoperability

Signers must be able to communicate to signature verifiers which bytes were signed, by whom, and how. Other things being equal, formats which make no provisions for recording this information are less signable because they require additional agreement between the parties involved in order to make signatures interoperate.

Other formats may provide syntax for encoding this information in the file format itself. Such formats are more signable because interoperable signatures can be created simply by reference to the format itself; no additional agreements with verifiers are required.
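
For instance, a format might reserve a trailer that records the signer, the algorithm, and the covered byte ranges. The Python sketch below illustrates the idea; the record layout and trailer syntax are purely hypothetical.

    import hashlib
    import json

    def attach_signature(data: bytes, signer: str, ranges) -> bytes:
        # Self-describing signature record: a verifier learns who signed
        # which bytes and how from the file itself. (A real scheme would
        # use public-key signing rather than a bare digest.)
        h = hashlib.sha256()
        for start, end in ranges:
            h.update(data[start:end])
        record = {"signer": signer, "alg": "sha256",
                  "ranges": ranges, "digest": h.hexdigest()}
        return data + b"\n<!--SIG " + json.dumps(record).encode() + b" -->"

    signed = attach_signature(b"<doc>payload</doc>", "alice", [(0, 18)])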

4.22 Specialized codecs

4.22.1 Definition

This is a property of formats that are able to associate processor extensions (known as plugins or codecs) with specific parts of a document in order to encode and decode those parts more efficiently than the processor's default approach would.

4.22.2 Description

Some specific vocabularies contain data structures that can benefit from special treatment, typically to obtain higher compression ratios or much faster processing. However, it would be highly impractical to include all such special cases in the default format specification, since every processor would then have to implement them while only a small subset of users would ever exercise the functionality.

Therefore, a format may include the ability to reference predefined extensions to the processor (both encoder and decoder) that are tailored to a specific need and can therefore encode certain parts of the document in an optimized manner. This requires the format to be able to flag segments as having been encoded with additional software, and the processors to provide a means by which they can be extended.
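
A hedged sketch of such a mechanism follows, in Python; the segment layout and codec names are invented. Each segment is flagged with the identifier of the codec that produced it, so a processor lacking that codec detects the fact rather than silently misreading the bytes.

    import zlib

    # Hypothetical registry of (encode, decode) extensions.
    CODECS = {
        "identity": (lambda b: b, lambda b: b),
        "deflate":  (zlib.compress, zlib.decompress),
    }

    def encode_segment(codec_id: str, payload: bytes) -> bytes:
        encode, _ = CODECS[codec_id]
        return codec_id.encode() + b"\x00" + encode(payload)

    def decode_segment(segment: bytes) -> bytes:
        codec_id, body = segment.split(b"\x00", 1)
        # A KeyError here means the required extension is not installed.
        _, decode = CODECS[codec_id.decode()]
        return decode(body)

    payload = b"<samples>" + b"0.0 " * 1000 + b"</samples>"
    assert decode_segment(encode_segment("deflate", payload)) == payload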

The presence of this property makes the format more suitable to a larger set of uses, and less likely to accumulate very specialized features of use to only a small fraction of the format's user base. However, it also carries a high cost in terms of interoperability, as it requires all participants involved in the exchange to support the additional software that knows how to decode the specific extension.

While in practice there are subtleties as to the ways in which this property can be supported, it is most practically measured as a boolean indicating whether or not a format supports it.

4.23 Streamable

4.23.1 Definition

For a serialization format, streamability is a property of the processors (serializer, parser) that convert a data model instance into the format and vice versa. A processor is streaming if it is able to generate correct partial output from a partial input. A format is streamable if it is possible to implement streaming processors for it.

Streamability (both input and output) must be considered relative to a data model. Once the data model is fixed, streamability is defined to be a boolean property: a format/processor is either streamable or not streamable.

4.23.2 Description

Streamability is needed in memory-constrained environments where it is important to be able to handle data completely as it is generated to avoid buffering of data inside the processor. It is crucial when the document is generated piecemeal by some outside process, possibly with indefinitely long breaks between consecutive parts. Examples of the former requirement are provided by the Web Services for Small Devices and the Multimedia XML Documents for Mobile Handsets Use Cases. Examples of the latter requirement are provided by the Binary XML in Broadcast Systems and the XMPP Instant Messaging Compression Use Cases.

A precise definition can be derived by assuming that the data model consists of atomic components, which are assembled into documents in some structured manner. The serialization of a document expressed in the data model is then simply a traversal of its atomic components in some defined order with each applicable component being translated to the output stream. Output streamability is the ability to create a correct initial sequence of the output stream from a partial traversal. Input streamability is the ability to create the corresponding partial traversal from such an initial sequence so that the application can process the results of this partial traversal as if it were traversing the complete document.

Streamability is also characterized by the amount of buffering that needs to be done in the processors. Buffer space is measured in the number of items in the input (for the serializer, the atomic components of the data model; for the parser, the elements, e.g. bytes, of the stream). A requirement for streamability is that both processors be implementable such that they require only constant buffer space, no matter what the input document is or how it is mapped to the data model.

Another important consideration not captured by the above is the need for lookahead in the parser. If the parser is required to look ahead in the input stream to determine where the atom currently being read ends, and it is possible that the lookahead is not available (due to e.g. the serializer concurrently streaming the output), streamability is lost.

A simple example of a non-streamable format is obtained by considering an XML format where the text content of an element is required to be a single length-prefixed string. In this case the serializer, if invoked through an output API that permits text content to be passed piecemeal (e.g. SAX used for output), will need to buffer all text content before being able to output any of it.
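
The contrast can be sketched in Python as follows (an informal illustration, not part of any actual format): the length-prefixed variant must buffer every piece it receives, whereas a chunked variant, reminiscent of HTTP chunked transfer coding, emits each piece immediately and needs only constant buffer space.

    def serialize_text_buffered(chunks):
        # Non-streamable: a single length prefix forces the serializer
        # to buffer all pieces before emitting anything.
        text = b"".join(chunks)
        yield len(text).to_bytes(4, "big")
        yield text

    def serialize_text_chunked(chunks):
        # Streamable: each piece is emitted as soon as it arrives, with
        # a zero-length chunk marking the end of the text content.
        for chunk in chunks:
            if chunk:
                yield len(chunk).to_bytes(4, "big") + chunk
        yield (0).to_bytes(4, "big")

    pieces = [b"some ", b"text ", b"content"]
    print(b"".join(serialize_text_buffered(pieces)))
    print(b"".join(serialize_text_chunked(pieces)))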

The buffer space requirement precludes some serialization techniques, e.g. compression over the whole document. This shows a trade-off between the streamability and efficient data model transfer properties. For the same reason, building an element index for accelerated sequential access on the fly may not be possible. A streamability requirement also eliminates SOAP with Attachments and XOP as solutions to the Embedding External Data in XML Documents Use Case.

While streamability is a boolean property, it is sometimes possible to compare two formats as to how streamable they are. These methods of comparison are not intended to be quantifiable or even precise. Estimation of a format's streamability is further hampered by the characterization of streamability through the processors, since a format may permit widely differing processor implementations.

Since streamability is measured relative to a data model, it is possible to consider the applicable data models relative to which the compared formats are streamable. A format streamable relative to more data models than another can be said to be the more streamable of the two. Buffer space requirements provide another consideration. While streamability permits a constant-space buffer, a comparison of the required buffer sizes for two formats provides a way to compare their streamability.

4.24 Support for Error Correction

Editorial note: MC, 21 September 2004

This property is still under development.

4.24.1 Definition

This property requires that error correcting codes can be applied to the representation of XML data model instances. Error correcting codes applied to a representation a) enable a section in which errors have distorted the representation to be located, and b) enable the undistorted section to be recovered from the erroneous one.

Representation formats of XML data model instances can be categorized into three classes: (0) no partitioning of the representation is possible, (1) partitioning of the representation is possible, and (2) partitioning according to the importance of the information is possible.

4.24.2 Description

Error correction requires that the representation of the information contain redundancy which allows the information to be recovered even if errors, for instance during transmission, have corrupted the representation. The redundancy serves on the one hand to detect that an error has occurred and, in certain circumstances, to correct it.

Various algorithms exist that insert redundancy so that a decoder is capable of detecting and potentially correcting errors. These techniques are called Forward Error Correction (FEC), since they do not require further backward communication between the receiver and the original sender. Examples of block-based FEC algorithms include Hamming and Reed-Solomon codes; an example of a continuously operating FEC is the turbo code.

In general, error correction requires methodologies known as channel coding. These algorithms are usually applied separately from those of source coding, which aims at an efficient, redundancy-free representation of the information. Handling them separately enables the channel coding, i.e. the insertion of redundancy into the representation, to be tuned to the expected channel characteristics. For instance, a channel with large error bursts might require interleaving, i.e. a defined reordering of bits, so that the error bursts are distributed over a larger section of the bitstream. This enables forward error correction mechanisms to be applied even on channels with error bursts, such as wireless channels.
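
As an informal illustration of interleaving, consider the Python sketch below (real systems interleave at the bit level with fixed hardware depths; the symbols and burst here are invented):

    def interleave(symbols, depth):
        # Block interleaver: read out every depth-th symbol, so symbols
        # adjacent on the channel mostly come from positions `depth`
        # apart in the data (len(symbols) must be divisible by depth).
        return [s for c in range(depth) for s in symbols[c::depth]]

    def deinterleave(symbols, depth):
        # The inverse transpose.
        return interleave(symbols, len(symbols) // depth)

    data = list(range(12))
    tx = interleave(data, 3)
    tx[4:7] = ["X", "X", "X"]          # a burst of three channel errors
    print(deinterleave(tx, 3))
    # -> [0, 'X', 2, 3, 'X', 5, 6, 'X', 8, 9, 10, 11]: the burst is
    #    spread out, leaving at most one error per block of three for a
    #    forward error correcting code to repair.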

To support error correction, a representation of XML data model instances has to interface with common channel coding algorithms. To allow such interfacing, the representation format shall permit partitioning of the representation according to the importance of the represented information, so that unequal error protection can be applied. For instance, in EPG data, rights information might be ranked as more important than the names of actors. Accordingly, being 4.9 Fragmentable is a prerequisite of support for error correction.

4.25 Transcodable to XML

4.25.1 Definition

TBD

4.25.2 Description

TBD

4.26 Transport Independence

Editorial note: MC, 24 September 2004

This property is still under development.

4.26.1 Definition

(See Description)

4.26.2 Description

The format should be independent of the transport service. However, the format should state its assumptions (if any) about the characteristics of the transport service. For example, OMG's GIOP (General Inter-ORB Protocol) requires that the transport service provide error-free and ordered delivery of messages, without any arbitrary restrictions on message length. A protocol binding specifies how the format is transmitted as payload in a specific transport protocol (e.g., TCP/IP) or messaging protocol (e.g., HTTP).
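
As an informal sketch of what a binding must add, the following Python fragment frames a payload for a TCP stream; TCP delivers ordered, error-free bytes but has no message boundaries, so this hypothetical binding contributes a four-byte length prefix to delimit each document.

    import socket
    import struct

    def send_document(sock: socket.socket, payload: bytes) -> None:
        # The binding, not the format, delimits messages on the stream.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_document(sock: socket.socket) -> bytes:
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        return recv_exact(sock, length)

    def recv_exact(sock: socket.socket, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            buf += chunk
        return buf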

4.27 Support for Open Content Models

Editorial note: MC, 24 September 2004

This property is still under development.

4.27.1 Definition

A format implements Support for Open Content Models if it can directly and efficiently support the inclusion of arbitrary XML as the value of elements.

4.27.2 Description

An important distinction exists between those possible solutions that require explicit indication of what elements may be extended and those solutions that allow extension anywhere.

The level of Support for Open Content Models is one of: "Supports arbitrary variable sections", "Supports predefined variable sections", "Does not support variable sections".

The degree of Support for Open Content Models is one of: "Supports variable sections with normal efficiency", "Supports variable sections with degraded efficiency".
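
The distinction can be sketched as follows (a hypothetical Python encoder; the element names, codes, and wire layout are invented): elements known to the schema take a compact code, a declared wildcard slot falls back to a generic, less efficient encoding, and anything else is rejected by the closed content model.

    # Hypothetical schema knowledge.
    SCHEMA_CODES = {"name": 0x01, "price": 0x02}   # schema-informed codes
    WILDCARD_SLOTS = {"extension"}                 # predefined open slots

    def encode_element(name: str, body: bytes) -> bytes:
        if name in SCHEMA_CODES:
            # Normal efficiency: a one-byte code from the schema.
            return bytes([SCHEMA_CODES[name]]) + body
        if name in WILDCARD_SLOTS:
            # Degraded efficiency: generic self-describing encoding.
            return b"\xff" + name.encode() + b"\x00" + body
        raise ValueError(f"{name!r} not allowed by closed content model")

    print(encode_element("price", b"42"))
    print(encode_element("extension", b"<any-xml/>"))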

4.28 Verifiable Integrity

Editorial note: MC, 21 September 2004

This property is still under development.

4.28.1 Definition

TBD

4.28.2 Description

TBD

4.29 Version Identification

4.29.1 Definition

An implementation should be able to efficiently identify the version of an instance of the format.

4.29.2 Description

It is considered best practice to reliably and efficiently identify versions of a format.

The possible value for this property is one of the following: version indication, no version indication, external fragment version indication, internal fragment version indication.
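
For example, internal version indication might be realized as a fixed header, as in this hypothetical Python sketch (the magic number and layout are invented):

    import struct

    MAGIC = b"BX"   # hypothetical two-byte format identifier

    def read_version(header: bytes) -> tuple:
        # A fixed-offset major/minor pair lets an implementation
        # identify the version before parsing anything else.
        magic, major, minor = struct.unpack("!2sBB", header[:4])
        if magic != MAGIC:
            raise ValueError("not an instance of this format")
        return (major, minor)

    print(read_version(b"BX\x01\x02...rest of document..."))   # (1, 2)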

5 Processor Properties

5.1 Draconian

5.1.1 Definition

A processor is said to be draconian when it does not continue normal processing upon discovering an error, as opposed to trying to recover from it.

5.1.2 Description

Draconian is a property frequently discussed in relation to XML, whose specification requires all processors to be draconian; it is contrasted with most HTML processors, which will try their best to interpret authorial intent and continue processing in the face of severe format errors.

Draconian processors are usually valued for several reasons. Because they need not implement the complex heuristics required to recover from arbitrary errors, their implementation complexity decreases and their reliability increases.

There are cases in which one does not want a processor to be draconian. For instance, one might value a tool able to read a document with erroneous sections and attempt to fix them, as some XML editors do. Often, though, such tools are not considered to be "normal" processors.

There are three different levels of draconianism in processors. The first category is strict and integral, and will reject an entire document for a single error (Integral Draconian). The second type is more lenient: it will reject an entire fragment if it is found to be in error, but will process other, separate fragments in the document (Compartmented Draconian). This requires the ability to fragment a document into smaller, sufficiently independent parts, and to skip forward to the next fragment; it may result in partial documents. Finally, the last category covers processors that are not draconian and do their best to recover from errors and continue normal processing (Non-Draconian). The sketch below contrasts the first and last categories.
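
Both extremes can be observed in Python's standard library, as the following sketch shows: the XML parser is integrally draconian, while the HTML parser keeps delivering events after malformed markup.

    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    # Integral Draconian: one well-formedness error rejects the document.
    try:
        ET.fromstring("<doc><a>ok</a><b>missing close</doc>")
    except ET.ParseError as err:
        print("rejected:", err)

    # Non-Draconian: processing continues despite the stray end tag.
    class Collector(HTMLParser):
        def handle_data(self, data):
            print("data:", data)

    Collector().feed("<p>first<p>second</i>third")   # no exception raised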

6 Additional Considerations

6.1 Forward Compatibility

Forward compatibility supports the evolution to new versions of a format. XML has changed very little while data models have continued to evolve. Additional specifications have added numerous conventions for linking, schema definition, and security. A format must support the evolution of the data models and must allow corresponding implementation of layered standards. Evolution of XML and its data models could mean additional character encodings, additional element/attribute/body structure, or new predefined behavior similar to ID attributes. Examples might be more refined intra-document pointers, type or hinting attribute standards, or support for deltas. An implementation should indicate how certain classes of model changes might be implemented. This resilience relates to properties like 4.8 Extension Points, 4.14 No Arbitrary Limits and 4.29 Version Identification.

6.2 Free

Free, in the context of an XML format, means that the right to create and use the format, and to build tools and applications for it, is completely unencumbered and royalty-free.

If the format is unencumbered and royalty-free, it can be recommended by the W3C and stands a better chance of adoption across the industry. These conditions can positively affect the potential for ubiquitous use of the format. A free format is also more likely to have free, open source code for processing it and free tools for building applications which use it, especially when its 6.3 Implementation Cost is low. This is another factor in the potential for ubiquitous use of the format.

6.3 Implementation Cost

A requirement on XML was "It shall be easy to write programs which process XML documents." Implementation cost is this requirement applied to an alternate representation format of a data model for XML. This property covers the implementation of a generic tool chain, but not any application-specific processing code.

A low implementation cost may contribute to ubiquity in that if tools to process the format need to be implemented as a part of an application (e.g. because they do not exist for the target platform), a low-cost format is more likely to be adopted. To fulfill this requirement the format needs to be easy enough to implement so that this additional implementation is not an impediment to the actual application development. However, low implementation cost is not necessary to achieve ubiquity.

A rough estimate of implementation cost can be made by considering how much time it takes for a solitary programmer to implement sufficiently robust processing of the format (the "Desperate Perl Hacker" measure). A proposed upper limit in the case of XML was a couple of hacking sessions (though this limit has proven to be too optimistic). An alternate format needs to do at least as well as XML, and preferably better, to fulfill this requirement.

Another factor to consider is the kind of code that must be written to implement the format. If either input or output requires sophisticated algorithms (e.g. specialized compression not ubiquitously available), the format's implementation cost increases. If the format has several optional features and, for example, special size-decreasing serialization possibilities, the number of possible code paths required in the processors increases. This makes the processors harder to test comprehensively, and hence contributes to their fragility, requiring more time for a robust implementation.

Finally, it should be possible to have the format processing be XML-compatible at as low a level as possible. This helps by making it possible to utilize existing XML-based infrastructure to a larger extent. If very high-level tools need to be re-implemented because of the requirements of the format, its implementation cost increases.

6.4 Single Conformance Class

Editorial note: MC, 29 September 2004

This additional consideration is still under development.

There should be exactly one conformance class. All features are required to be supported by an implementation, but not required to be used in a specific usage instance.

Any full implementation must demonstrate interoperability by passing a test suite that exercises each feature.

6.5 Small Footprint

Editorial note: MC, 29 September 2004

This has been identified as a potential consideration, pending further investigation.

6.6 Ubiquitous Implementation

Editorial note: MC, 29 September 2004

This additional consideration is still under development.

Ubiquitous means "everywhere". A ubiquitous serialization format would be supported for the widest possible range of computing devices, applications, and use cases.

In evaluating a serialization format in terms of ubiquity, small, resource-constrained devices may be given particular attention, as such devices are projected to soon number in the billions. A serialization format used on such devices would literally be present everywhere.

Ubiquity may be directly related to 6.3 Implementation Cost. A low implementation cost is likely to create an environment where a very large community of developers will be willing and able to create a critical mass of tools, with a resulting feedback and amplification effect that leads the marketplace toward ubiquitous implementation. On the other hand, a high-cost, complex serialization format that can meet the environmental constraints of a large number of devices, applications, and/or use cases could also stand a good chance of becoming ubiquitous due to economically-motivated industry commitment. XML 1.0 is considered by most to be a good example of a low cost/complexity format leading to ubiquity. Counterexamples, where ubiquity was obtained by trading off ease of implementation to a greater or lesser extent, include PDF and MP3. While each of these certainly has significant implementation costs, each also addresses the needs of its user domain so well that it has attained ubiquity in that domain.

7 References

XBC Use Cases
XML Binary Characterization Use Cases (See http://www.w3.org/TR/2004/WD-xbc-use-cases-20040728/.)
XML 1.0
Extensible Markup Language (XML) 1.0 (See http://www.w3.org/TR/REC-xml/.)
XML 1.1
Extensible Markup Language (XML) 1.1 (See http://www.w3.org/TR/xml11/.)
ISO 8879
Standard Generalized Markup Language (SGML) (See http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=16387.)
XML Infoset
XML Information Set (See http://www.w3.org/TR/xml-infoset/.)
Schema Part 1
XML Schema Part 1: Structures (See http://www.w3.org/TR/xmlschema-1/.)
XQuery DM
XQuery 1.0 and XPath 2.0 Data Model (See http://www.w3.org/TR/xpath-datamodel/.)
XML Digital Signature
XML-Signature Syntax and Processing (See http://www.w3.org/TR/xmldsig-core/.)

A Acknowledgments (Non-Normative)

The editors would like to thank the contributors (to be completed)

B XML Binary Characterization Properties Changes (Non-Normative)

2004-05-07   SP   Document Created.