This document is the result of a study to identify desirable properties in an XML format. An XML format is a format that is capable of representing the information in an XML document. The properties have been derived from requirements induced by use cases collected in the [XBC Use Cases] document. Properties are divided into two categories: algorithmic and format. Besides these two categories, Section 6 Additional Considerations lists additional considerations which, because of the difficulty of establishing an accurate measurement, have not been listed as properties but are nonetheless relevant for an accurate comparison between different proposals.
This document is an editors' copy that has no official standing.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
2 Design Goals for XML
3 Syntax vs. Model
4 Algorithmic Properties
4.1 Processing Efficiency
4.2 Small Footprint
4.3 Space Efficiency
5 Format Properties
5.1 Accelerated Sequential Access
5.3 Content Type Management
5.5 Directly Readable and Writable
5.6 Efficient Update
5.7 Embedding Support
5.9 Explicit Typing
5.10 Extension Points
5.11 Format Version Identification
5.14 Human Language Neutral
5.15 Human Readable and Editable
5.16 Integratable into XML Stack
5.17 Localized Changes
5.18 No Arbitrary Limits
5.19 Platform Neutrality
5.20 Random Access
5.22 Roundtrip Support
5.23 Schema Extensions and Deviations
5.24 Schema Instance Change Resilience
5.25 Self Contained
5.27 Specialized codecs
5.29 Support for Error Correction
5.30 Transport Independence
6 Additional Considerations
6.1 Forward Compatibility
6.2 Implementation Cost
6.3 Royalty Free
6.4 Single Conformance Class
6.5 Widespread Adoption
While XML has been enormously successful as a markup language for documents and data, the overhead associated with generating, parsing, transmitting, storing, or accessing XML-based data has hindered its employment in some environments. Use cases describing situations where some characteristics of XML prevent its effective use are described in another publication of the XBC WG [XBC Use Cases].
The question has been raised as to whether some optimization of XML is appropriate to satisfy the constraints presented by those use cases. In order to address this question, a compatible means of classifying the requirements posed by the use cases and the applicable characteristics of XML must be devised. This allows a characterization of the potential gap between what XML supports and use case requirements. In addition, it also provides a way to compare use case requirements to determine the degree to which a common approach to XML optimization could be beneficial.
For the purpose of this document, a property is defined as a unique characteristic of an XML format which affects the format's utility for some collection of use cases. A consequence of this definition is that a property shall only be regarded as positive or negative in the context of one or more use cases. In other words, a collection of use cases is necessary to understand how a property affects the utility of a format.
The XML 1.0 recommendation [XML 1.0] outlines a number of design goals or constraints which resulted in the creation of XML as it is known today. The design goals were:
XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.
Compatibility with SGML [ISO 8879] was a key design goal during the conception of XML. In fact, XML is regarded as a subset (or profile) of SGML, whose main purpose was to reduce the inherent complexity of SGML. By reducing the complexity, e.g., the number of options, XML became much simpler to implement than SGML, and this in turn resulted in the availability of a myriad of tools and APIs. It is precisely these tools and APIs (as well as the phenomenal growth of the Internet) that have attracted a number of different communities to XML.
The [XML 1.0] recommendation defines the XML language using a BNF grammar. Although a number of data models have been built on top of XML, as a syntactically defined language, XML is in itself data model agnostic. As stated earlier, an XML format is a format which is capable of representing the information in an XML document. Information, however, is in the eye of the beholder; what constitutes information, as opposed to just data, depends on the data model on which an XML processor is based.
The XML infoset was an attempt to establish a separation between data and information in a way that would suit most common uses of XML. In fact, many of the existing data models are defined by referring to XML infoset items. However, the XML infoset does not establish a sound separation between information and data for all applications of XML. For example, the XML infoset recommendation does not regard the use of single or double quotes to delimit an attribute value as information, yet there are applications like XML editors for which this distinction matters.
The discussion on which is the right data model or what constitutes data versus information is unlikely to end anytime soon (if ever). Thus, it was the decision of the XBC WG to leave it out of this document to avoid premature exclusion of potential uses of XML formats not captured in any of the existing data models. There are, however, properties such as 5.22 Roundtrip Support which can be used to tighten the relationship between XML and an alternative XML format, but that do so without diving into the controversial data model discussion.
This property refers to the speed at which a new format can be generated and/or consumed for processing with respect to that of XML.
There are three broad areas of processing with regard to an XML format:
Serialization: The generation of the format from a data model.
Parsing: The reading of the format in order to process and extract various pieces of information contained in it.
Data Binding: The creation of an application data model from the data contained in the format.
It is sometimes desirable for an XML format to allow all three areas of processing to be performed more efficiently than is currently possible with XML. For example, it should be possible to serialize a message faster than with XML. Furthermore, parsing the resulting format should be faster than parsing XML.
Processing efficiency should be considered in an end-to-end manner, from application accessible data on one end to application accessible data on the other end. In other words, it is desirable to have a process that is efficient not only in parsing, but in generation, transmission and data binding. However, not all applications need symmetric speed. Some applications may require efficient parsing without specific needs as to how long it takes to generate the format. Other applications may have the opposite concerns.
This property refers to the size of a processor implementing a new format with respect to that of a processor implementing XML.
Establishing the exact footprint of an implementation of a format is impractical due to the number of different programming languages and platforms that are currently available. However, given the specification of a format it is possible to determine if the format enables the implementation of processors whose footprints are smaller than the typical XML processor for a similar application. This can be accomplished by considering the number and/or complexity of the features that are required (which impacts the size of the code segment) and the amount of data that must be available to a processor in order to support the format (which impacts the size of the initialized data segment).
Perhaps the best example is that of XML versus SGML. By simply inspecting the corresponding specifications it is possible to estimate, given the reduced number of options and features, that the footprint of a typical XML processor will be smaller than that of an SGML processor. (In fact, many experts in the field view this property of XML 1.x as one of the key reasons for its success, so it is only natural to consider it when evaluating alternate formats.)
This property refers to the memory requirements of a processor implementing a new format with respect to that of a processor implementing XML.
A format should be processable on a wide variety of platforms. Small devices such as mobile handsets, for instance, have a limited amount of memory compared to desktop PCs or servers. The amount of dynamic memory that a format requires in order to process an instance may hinder its application on certain platforms. XML is currently supported in devices that have much less memory than a PC. Thus, it is imperative for an alternate format to enable the implementation of processors whose memory requirements are smaller (or at least not higher) than those of the typical XML processor.
In many cases, space efficiency is inversely proportional to processing efficiency. I.e., the desired level of space efficiency is often achieved by increasing the processing time. Therefore, it is desirable for a format to enable the implementation of processors whose space efficiency could be configured based on the available memory. For example, if the format is processed on a high-end server, a processor should support maximum processing efficiency at the expense of memory efficiency. On the other hand, if the format is processed on a low-end mobile handset, a processor should support maximum space efficiency at the expense of processing efficiency.
Accelerated sequential access is the ability to sequentially stream through an XML file when searching for data model items more rapidly than the average seek time using character-by-character comparison.
Accelerated Sequential Access is similar to 5.20 Random Access in its overall objective of reducing the amount of time needed to access data model items, but differs in the method used to accelerate the access. In random access, lookup is performed in constant time through the use of a table, while in accelerated sequential access data model items are searched in streaming mode, resulting in a lookup time that is related (yet not necessarily proportional) to the number of data model items in the document.
One approach to supporting this property is through the inclusion in the XML document of an index which allows skipping over content as the document is read. For example, an element index might contain the offset to the start tag of the next peer element. If the application recognizes that a desired data model item is not in the current element, or in the children of the current element, it will be able to restart the match at the offset of the next peer element without inspection of all data model items contained in the current element. A format that enables faster matching via the conversion of strings to tokens can also be considered as supporting the Accelerated Sequential Access property.
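The element index described above can be sketched as follows. This is only an illustration under stated assumptions: the flattened record layout, the `next_peer` field, and the names `Record` and `find_child` are hypothetical and are not taken from any particular format.

```python
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    name: str       # element name
    next_peer: int  # position of the next peer element, or the end of the parent


# Hypothetical flattened form of <a><b><c/><d/></b><e><f/></e></a>
DOC: List[Record] = [
    Record("a", 6),  # 0: root, its range ends at position 6
    Record("b", 4),  # 1: next peer is "e" at position 4
    Record("c", 3),  # 2
    Record("d", 4),  # 3: last child of "b", so it points at the end of "b"
    Record("e", 6),  # 4
    Record("f", 6),  # 5
]


def find_child(doc: List[Record], parent: int, name: str) -> Tuple[Optional[int], int]:
    """Find the first direct child of `parent` called `name`.

    Returns (position or None, number of records inspected).
    """
    end = doc[parent].next_peer
    position = parent + 1
    inspected = 0
    while position < end:
        inspected += 1
        if doc[position].name == name:
            return position, inspected
        # Skip the whole subtree of this element without reading it.
        position = doc[position].next_peer
    return None, inspected


print(find_child(DOC, 0, "e"))
```

In this sketch, locating "e" under the root inspects two records ("b" and "e") rather than the four that a record-by-record scan would visit, at the cost of storing one extra offset per element.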
Performance of accelerated sequential access is measured by the time (algorithmic complexity) required to find data model items, the time needed to construct any indexes and special structures, and also the size of those indexes and special structures (impacting memory consumption and bandwidth utilization in the transport). Most implementations will support modification of the XML document; the cost of updating the indexes or special structures is another performance factor which can be measured.
Compactness refers to the size of the in-memory or otherwise stored representation of an XML format. Compactness is achieved by ensuring that a format includes as little extraneous information as possible. Extraneous information is any information that is not needed in order to process the format completely and properly.
A compact encoding can be achieved in different ways: a number of different techniques such as lossy/loss-less, schema-based/non-schema-based, delta-based/non-delta-based, among others, have been considered. JPEG files, for example, use a lossy encoding in which bits of the original document are thrown away (and cannot be recovered) in order to achieve a compact representation. The same type of lossy encoding could be employed for XML documents in order to achieve compactness.
Alternatively, differing degrees of compactness can be achieved with a loss-less encoding, whereby only redundant information is removed, so that no information is lost. A loss-less encoding would typically be less compact than a lossy encoding.
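One way redundant information might be removed without loss is to replace repeated names with small integer tokens that refer to a vocabulary table. The sketch below is a minimal illustration of that idea only; the function names `tokenize` and `detokenize` are hypothetical, and no particular format is implied.

```python
from typing import Dict, List, Tuple


def tokenize(names: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each name with an index into a vocabulary table."""
    table: List[str] = []      # token -> name
    index: Dict[str, int] = {}  # name -> token
    tokens: List[int] = []
    for name in names:
        if name not in index:
            index[name] = len(table)
            table.append(name)
        tokens.append(index[name])
    return table, tokens


def detokenize(table: List[str], tokens: List[int]) -> List[str]:
    """Recover the original sequence of names; nothing is lost."""
    return [table[token] for token in tokens]


element_names = ["order", "item", "item", "item", "order"]
table, tokens = tokenize(element_names)
print(table, tokens)
```

Each distinct name is stored once in the table, and the original sequence can be rebuilt exactly from the table and the token list.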
Furthermore, a schema-based encoding of an XML document can achieve a degree of compactness by using prior knowledge about the structure and content of a document. A format is schema-based if it uses information from the document's schema to achieve a better degree of compactness. This information could be used later as the document is processed or reconstituted. It is worth pointing out that although not self contained, a schema-based encoding is not inherently lossy given that, in principle, a decoder can reproduce the data model using both the encoding and the schema. Thus, as with other techniques, a schema-based encoding can be lossy or loss-less.
Another mechanism to achieve compactness is through a delta-based encoding. Delta-based encodings are generated by comparing an original document with a secondary, reference document. The resulting document is the delta between the original and the reference document. This type of encoding can be lossy or loss-less. In either case, the original document can be reconstituted by using both the delta and the reference document.
The advantages of a compact representation are:
Storage: Large XML documents can be stored in the compact format, thus saving space.
Transmission: Large XML documents can be transmitted more efficiently when represented in a more compact form, thus saving time. This is especially important when sending XML over low-bandwidth connections.
A disadvantage of any compact encoding might be the additional time and CPU required to generate the encoding.
A format integrates into the media type and encoding infrastructure if it defines one or more media types and/or encodings for itself as well as the way in which they should be used.
The media type and encoding infrastructure provides for a common and simple way of identifying the contents of a document and the content coding with which it is transmitted. It is fundamental to the functioning of the Web and enables powerful features such as content negotiation. While required for the Web, these mechanisms are not specific to it and are typically reused in many other situations.
It is therefore desirable that formats meant to be used on the Web define (and preferably register) the media type and/or encoding that one is to use when transmitting them.
There are multiple ways in which an alternate XML format could define how media types and encodings are to be used with it. Several options of note and their associated trade-offs are:
The alternate XML serialization is considered to just be a content coding. In this case it may have a media type (as gzip does with 'application/gzip' in addition to the 'gzip' content coding) but the principal way of using it is to keep the original media type of the XML content and only change the content coding. The upside of this approach is that the existing content dispatching system is untouched, that the media type information is fully useful, and that the content coding infrastructure is put to good use. The downside is that there is philosophical and technical dissent as to whether an alternate XML serialization is an encoding in the way that gzip is, a discussion that needs to involve considerations concerning the 5.22 Roundtrip Support, 5.5 Directly Readable and Writable, and 5.16 Integratable into XML Stack properties. With this approach content negotiation is fully possible. The behaviour of fragment identifiers does not need to be re-specified.
The alternate XML format is not a mere content coding but requires the definition of one or more media types. This case subdivides into two options:
There is only the alternate XML format's media type. Any content sent using that format must have that media type. The upside of this approach is that it is simple. The downside is that all media type information of the original XML content is lost, so that another system must be defined to provide that information, or new media types must be defined for all possible content (application/binxhtml, image/binsvg, etc.). With this approach, content negotiation is entirely impossible (or rather, totally useless) unless new media types are defined for all things XML. The behaviour of fragment identifiers becomes impossible to specify, or has to be re-specified for all the new media types.
A new media type suffix is defined in the manner that it was done for XML content (e.g., "+bix") to be used for all content expressed using the alternate XML serialization. The upside of this approach is that it is simple and that the diversity of media types is maintained. The downside is that it requires much more intrusive modifications to systems that rely on existing media types. With this approach, content negotiation is possible, but with lesser power. The behaviour of fragment identifiers has to be re-specified to map back to the one in +xml types.
A delta is a representation of arbitrary changes from a particular instance of a base, parent document which, along with that parent document, can be used to represent the new state of the parent. The parent is identified in a globally or locally unique manner. A delta is distinct from a fragment or a computed difference, although the latter could be represented as a delta.
A common need is to convey changes of a potentially large existing object with minimal processing and data representation. Overall compression of an object can help minimize data transmitted, but there are always size and change combinations where this is of minimal use when replication is needed. A delta is similar to a fragment in that it contains a subset of an overall object. The primary difference is that a fragment is a source independent, contiguous range of data. A delta is an efficient record of one or more changes to the original document. A delta captures changes to the parent efficiently, represents the changes efficiently, and is efficiently usable by a receiver along with the original, parent document. The receiver uses the combination of the original parent document and one or more deltas to operate as if the receiver had a single object that was the end result of all changes. One operating mode that has convenient characteristics is to append deltas to copies of the parent object, either in memory or as a file. Some existing document interchange formats make use of appended delta instances, the Adobe Portable Document Format (PDF) being the most notable [PDF Reference].
An important use for deltas is to factor out redundancy in a way that is related to schema-based redundancy removal. The concept of a delta has been used in similar ways in the past, an example of which is the minimization of later packets in SLIP/PPP protocols by referring to prior packets. The simple equivalent to a parent and its delta is a complete copy of the before and after instances. This avoids losing any data, but the knowledge of what has changed must be created by comparison. A delta instance can be created by a similar differencing process, but this is a high complexity operation. Examples of the differencing approach are RFC 3229 and the experimentally registered VCDIFF algorithm.
A delta may be more efficient if nearby changes are localized in the format. A delta may be the mechanism used to localize changes.
Some examples where deltas are required or useful are:
Efficient and repeated replication of large objects and their changes among distributed nodes.
Transaction logging of changes to objects, allowing for rollback or replay.
Efficient representation of messages of all sizes in an application protocol or other communication by reusing redundancy in structure, invariants, or common values.
Rapid and efficient creation of new instances of an object based on a template which may be large or otherwise should be shared.
There are at least two major ways that a delta-like facility has been created. These could be referred to as high-level change operations and low-level change tracking. Efficiency, granularity, and time complexity of these methods will vary. XML 1.x does not include explicit support for deltas.
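As a rough illustration of what a low-level, byte-oriented delta might look like, the sketch below reconstructs a new document from a parent plus a list of copy and insert instructions. The instruction set and the name `apply_delta` are assumptions made for this example; they are loosely in the spirit of differencing encodings and are not defined by this document.

```python
from typing import List, Tuple, Union

CopyOp = Tuple[str, int, int]  # ("copy", offset in parent, length)
InsertOp = Tuple[str, bytes]   # ("insert", new bytes)


def apply_delta(parent: bytes, delta: List[Union[CopyOp, InsertOp]]) -> bytes:
    """Rebuild the new state from the parent document and a delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += parent[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown delta operation: {op[0]!r}")
    return bytes(out)


parent = b'<price currency="USD">10.00</price>'
start = parent.index(b"10.00")
delta = [
    ("copy", 0, start),                                 # unchanged prefix
    ("insert", b"12.50"),                               # the changed value
    ("copy", start + 5, len(parent) - (start + 5)),     # unchanged suffix
]
print(apply_delta(parent, delta))
```

Only the changed value travels in the delta; everything else is expressed as references into the parent, which the receiver is assumed to already hold.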
A format is directly readable and writable if it can be serialized from an instance of a data model and parsed into an instance of a data model without first being transformed to an intermediate representation.
Formats that are directly readable and writable generally make more efficient use of available memory and processor resources than those that are not. In addition, they sometimes have better streaming characteristics.
The parser for a directly readable format can parse the format into an instance of the data model in one logical step. Likewise, the serializer for a directly writable format can serialize an instance of the data model in one logical step. In contrast, a parser for a format that is not directly readable must transform the original format into the intermediate format before it parses the intermediate format into an instance of the data model. Likewise, the serializer for a format that is not directly writable must serialize an instance of the data model into the intermediate format before it transforms the intermediate format to the target format. Unless the order and organization of items in the intermediate format correspond closely to order and organization of corresponding items in the target format, the required transformations will negatively impact streaming.
An example of a format that is not directly readable and writable is a gzipped XML stream. To create an instance of a data model from a gzipped XML stream, the stream must be decompressed to XML format, then parsed into the data model. Likewise, to create a gzipped XML stream from an instance of a data model, the data model must be serialized to XML format, then compressed. The compression and decompression steps require additional processor and memory resources above and beyond that required to parse and serialize the XML format. In addition, the two step process limits streaming.
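The two-step process for gzipped XML can be made concrete with a short sketch using the Python standard library. The function names are hypothetical, and the ElementTree element stands in for whatever data model an application actually uses.

```python
import gzip
import xml.etree.ElementTree as ET


def write_gzipped_xml(root: ET.Element) -> bytes:
    xml_bytes = ET.tostring(root, encoding="utf-8")  # step 1: data model -> XML
    return gzip.compress(xml_bytes)                  # step 2: XML -> gzip stream


def read_gzipped_xml(blob: bytes) -> ET.Element:
    xml_bytes = gzip.decompress(blob)                # step 1: gzip stream -> XML
    return ET.fromstring(xml_bytes)                  # step 2: XML -> data model


message = ET.Element("msg")
ET.SubElement(message, "body").text = "hello"

blob = write_gzipped_xml(message)
restored = read_gzipped_xml(blob)
print(restored.find("body").text)
```

In both directions the complete intermediate XML serialization is materialized in `xml_bytes` before the second step begins, which is the extra memory and processing cost the paragraph above describes.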
Efficient Update refers to the property of being able to efficiently apply changes to a part of a format instance. This property is important for applications that require the modification/insertion/deletion of specific data model items in a way more efficient than a complete deserialize/modify/serialize cycle.
The least efficient case requires the deserialization of the data, prior to the application of the changes, followed by the serialization of the data, once the data has been inserted, modified, or deleted. The most efficient case would apply changes directly on the serialized data, thus avoiding the need to cross data representation boundaries. A format could have characteristics that allow it to be modified efficiently, in place, without being completely rebuilt and potentially without moving substantial amounts of data. This is direct support for Efficient Update. XML can be modified in-place, but data after the modification must be moved. XML + gzip cannot be modified in place at all. As an existence proof, a serialized DOM where nodes are allocated in a file in a malloc-like way and each object allocated uses file-relative pointers, would completely support efficient update.
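The observation that XML can be modified in place, but that data after the modification must be moved, can be illustrated with a small sketch. The helper `update_in_place` is hypothetical; it reports how many trailing bytes would have to be shifted when a value is overwritten in a serialized buffer.

```python
def update_in_place(buf: bytearray, offset: int, old_len: int, new: bytes) -> int:
    """Overwrite buf[offset:offset + old_len] with `new`.

    Returns the number of bytes after the edit that have to move.
    """
    tail_len = len(buf) - (offset + old_len)
    buf[offset:offset + old_len] = new
    # A same-length replacement leaves the tail where it is; any other
    # replacement shifts every byte that follows the edit.
    return 0 if len(new) == old_len else tail_len


doc = bytearray(b"<qty>10</qty><note>ok</note>")
offset = doc.index(b"10")

moved_same = update_in_place(doc, offset, 2, b"25")    # same length
moved_grow = update_in_place(doc, offset, 2, b"1000")  # value grows
print(moved_same, moved_grow, bytes(doc))
```

For a large document with an edit near the beginning, the second case moves nearly the whole serialization, which is the cost that a format with direct support for efficient update would try to avoid.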
If direct update is not possible or efficient, efficient support for the 5.4 Deltas property would allow an application to use the original instance along with one or more deltas to serve as an efficient update mechanism. While the production of a low-level delta for some formats is cheap and the use of a stack of low-level deltas can be relatively cheap, this method requires an ever growing stack of changes and indirection layers. This can be inefficient in certain scenarios.
This property is concerned with the ability of a format to be modified without being rebuilt. The 5.4 Deltas and 5.12 Fragmentable properties, by contrast, are concerned with how application semantics would be used to actually represent changes from one instance to the next.
There are three aspects under which this property should be evaluated:
Efficiency of update: This is the time and complexity required to apply the changes, starting from the original serialization up until the updated serialization is produced.
Efficiency of retrieval: This is the time required to retrieve a (possibly) modified value.
Compactness: This is the additional space required for the application of an update or the typical overhead of supporting different kinds of changes to a format instance. In the existence proof example, inserting a new element might be efficient because it might just result in an append to the file while inserting characters in a large text might cause a new chunk to be allocated at the end of the file and the old chunk to become an unused block. While the block could be reused just like with malloc, mitigating the cost, it is still a potential inefficiency.
A format supports embedding to the extent to which it provides for the interchange and management of embedded files of arbitrary format.
A variety of use cases call for the inclusion of files of one type inside another: images, video, and sound embedded within multimedia documents; arbitrary files inside Web service messages; large datasets bundled with metadata. File formats vary in their support for this use.
Formats designed for narrowly constrained purposes, such as GIF, typically make no provision for embedding. While it may be possible to encode some additional data in certain metadata fields within such formats, doing so violates the spirit of the file format and requires tight agreement between the sender and receiver for interchange. Such formats effectively offer no interchange or management support and are not considered to support embedding.
Other formats, such as XML and TIFF, permit embedding simply by virtue of flexibility: they do nothing to prevent file embedding. However, because these formats have no mechanism for distinguishing an embedded file from other types of data, tight agreement is still required between the sender and the receiver for interchange. Such mechanisms are also not easily manageable.
XML falls somewhere between these first two cases. It is flexible enough to allow the embedding of files, but only if those files consist entirely of character data. Embedding binary data requires an additional agreement as to how it is encoded as character data, e.g., via base64 encoding. This also imposes a penalty on both compactness and processing speed.
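The compactness penalty of carrying binary data as base64 character data can be seen in a short sketch. The `attachment` element and its `encoding` attribute are a hypothetical convention of the kind that sender and receiver would have to agree upon; they are not defined by XML itself.

```python
import base64
import xml.etree.ElementTree as ET

payload = bytes(range(256)) * 3  # 768 bytes of arbitrary binary data

# Binary data must be turned into characters before it can be embedded.
encoded = base64.b64encode(payload).decode("ascii")

attachment = ET.Element("attachment", {"encoding": "base64"})
attachment.text = encoded
document = ET.tostring(attachment)

# The receiver has to know that the content is base64 in order to decode it.
decoded = base64.b64decode(ET.fromstring(document).text)

print(len(payload), len(encoded))
```

Base64 represents every 3 bytes as 4 characters, so the embedded form is about one third larger than the original, in addition to the encoding and decoding work at each end.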
Other formats, such as XSL-FO and PDF, provide specific embedding points. For example, XSL-FO defines the instream-foreign-object element for embedding objects which are in a non-XSL-FO namespace. By establishing a general mechanism, they make embedded data interchangeable and manageable because there is an a priori agreement for creating and identifying embedded files.
Finally, there are packaging formats, such as MIME multipart/related and ZIP, which exist solely for the purpose of containing embedded files. Packaging formats generally provide significant management capabilities by supporting metadata, signatures, encryption, and compression of embedded files. They are typically designed specifically for the interchange of these embedded files.
Evaluation of a format for embedding support should take into account both interchange and manageability, as described here, as well as support for related properties like 5.2 Compactness, 5.20 Random Access, 5.26 Signable, 5.8 Encryptable and 5.28 Streamable.
A format is encryptable to the extent to which it makes the encryption and decryption of a file straightforward and interoperable.
In principle any file format is encryptable in that the bytes which compose any file may be fed to an encryption algorithm. Encryption capabilities, however, are most useful when the encryptor and decryptor can agree upon which algorithm was used and which portions of the file are encrypted. Formats vary in how amenable they are to specifying and maintaining this information, and this in turn can be a measure of how "encryptable" they are.
It is often desirable to encrypt only a portion of a file. In the most basic use of this capability a file may contain unencrypted data regarding the encryption algorithm and parameters used for the remainder of the file. This can promote interoperability, as described below.
In other situations it is desirable to leave certain metadata (e.g., SOAP headers or XMP packets) unencrypted but encrypt the remainder of the document in order to permit certain routing or query functions to be performed by intermediaries. In the case of compound documents it is sometimes desirable to leave the metadata of each embedded document unencrypted while encrypting the remainder of the document.
Other things being equal, formats which place all bytes representing the encoding of data model constructs (such as SOAP headers) in a contiguous byte range better support partial encryption because those ranges are more easily determined and specified. We say such formats are "more encryptable". Formats which permit such ranges to be created but do not guarantee them are less encryptable because the application must either determine all ranges which must be encrypted or arrange for that information to be placed in a contiguous byte range.
Finally, there are formats which will never place data model constructs in contiguous ranges but scatter that information into tables and other mechanisms used to achieve compactness or other format properties. For example, a format may place element names in a vocabulary index table. That table may contain names of some elements in the encrypted region and others which are not; one must then determine how much of the table to encrypt. Such formats are least encryptable with respect to partial encryption.
Encryptors must be able to communicate to decryptors which portions of a file are encrypted and by what mechanism. Other things being equal, formats which make no provisions for recording this information are less encryptable because they require additional agreement between the parties involved in order to make encryption interoperate.
Formats may provide syntax for encoding this information in the file format itself. Such formats are more encryptable because interoperable encryption support can be created simply by reference to the format itself; no additional agreements with decryptors are required.
Explicit typing is a property of an XML format in which datatype information of data model items is intrinsically a part of the format.
Datatype information is used to constrain and type validate XML input and to enable interpretation of data model items in the document as a specified type.
In XML 1.x, datatype information is not an intrinsic part of the format. Common usage is to express datatype information in a separate document such as an XML Schema. XML applications that require datatype information would interpret the associated schema document in order to identify the datatype of specific data model items. Type information may also be conveyed in XML through the use of some additional markup in the content of the XML document; for example, by adding an attribute such as "xsi:type" with a type value to the element for which type information is needed.
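The use of "xsi:type" markup to convey type information can be sketched as follows. The sample document and the `CONVERTERS` table are assumptions made for the example. Matching on the literal "xs:" prefix is a deliberate simplification; a real processor would resolve the prefix to its namespace.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

DOCUMENT = """<reading xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <value xsi:type="xs:int">42</value>
  <label xsi:type="xs:string">42</label>
  <ratio xsi:type="xs:double">0.5</ratio>
</reading>"""

# Simplified mapping from a few built-in type names to Python types.
CONVERTERS = {"xs:int": int, "xs:double": float, "xs:string": str}


def typed_value(element: ET.Element):
    """Interpret an element's text according to its xsi:type attribute."""
    type_name = element.get(f"{{{XSI}}}type")
    convert = CONVERTERS.get(type_name, str)  # untyped content stays a string
    return convert(element.text)


values = {child.tag: typed_value(child) for child in ET.fromstring(DOCUMENT)}
print(values)
```

The elements "value" and "label" carry the same characters but are interpreted differently, an integer in one case and a string in the other, purely because of the additional markup.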
The use of a separate document for datatype information, requiring processing a schema and mapping the schema against the data model items in the document instance, may cause degradation of performance. Infrastructure issues such as how to obtain the schemas and how to assure that they are the correct versions, are difficult problems to solve and have no standardized solutions. Putting datatype markup in the document instance may address some of the problems with using external documents if all types are built-in but is not an XML language-independent, reliable, nor standardized method. If types are user-defined the schema is actually still required for interpretation of the type. Conveying type information through markup may also be an inefficient method for encoding type information in the document.
Explicit typing as an intrinsic part of an XML format can be schema-dependent or schema-independent. A scheme for schema-dependent explicit typing might put type information into the instance but might still rely on a schema for interpretation. A schema-independent scheme for explicit typing would be fully self-contained and therefore enable XML applications to perform type validation and data binding without the overhead of schema processing, but with the requirement that the type system used is universally understood and with the limitation that the type system is not extensible. Schema-dependent explicit typing can still offer some of the advantages of schema-independent explicit typing in that processing of instances without the schema may be possible if the set of types used are all universally understood, or if it is possible to perform useful processing without knowing the schema-based definition of extended types. This achieves partial self-containment. Another possibility is to embed schema information for extended types in the instance itself, making this approach schema-dependent but still fully self-contained.
It is possible to represent type information much more compactly as part of the format than it is through the use of XML markup in the document instance. Explicit typing offers a way to include type information which has the advantages of being XML language-independent, reliable and standard. An explicit typing scheme is required in schema-less formats that represent primitive datatypes natively. For example, a schema-independent XML format which represented 32-bit floating point numbers as a 32-bit sequence of bits would need to tell the parser that the sequence of bits should be interpreted as a floating point number.
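As a rough illustration of the floating point example above, the following Python sketch prefixes each value with a one-byte type tag so that a decoder can interpret the bits without consulting a schema. The tag values and the names `encode_item` and `decode_item` are assumptions made for this illustration only; they are not drawn from any existing format.

```python
import struct

# Hypothetical one-byte type tags; a real format would define its own.
TYPE_FLOAT32 = 0x01
TYPE_INT32 = 0x02
TYPE_STRING = 0x03


def encode_item(value):
    """Encode a value as a type tag followed by its native representation."""
    if isinstance(value, float):
        return bytes([TYPE_FLOAT32]) + struct.pack(">f", value)
    if isinstance(value, int):
        return bytes([TYPE_INT32]) + struct.pack(">i", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return bytes([TYPE_STRING]) + struct.pack(">I", len(data)) + data
    raise TypeError(f"unsupported type: {type(value).__name__}")


def decode_item(buf):
    """Decode one item, using the tag to decide how to read the bits."""
    tag, body = buf[0], buf[1:]
    if tag == TYPE_FLOAT32:
        return struct.unpack(">f", body[:4])[0]
    if tag == TYPE_INT32:
        return struct.unpack(">i", body[:4])[0]
    if tag == TYPE_STRING:
        (length,) = struct.unpack(">I", body[:4])
        return body[4:4 + length].decode("utf-8")
    raise ValueError(f"unknown type tag: {tag}")
```

Without the tag, the same four bytes could equally be read as an integer or as a floating point number; the tag is what makes this sketch schema-independent, at the price of a fixed, non-extensible set of types.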
An extension point is a method for easily extending a format and its implementation.
The extension might be for a new, alternate, or experimental version of the format, an implementation of a layered format, or an application specific or otherwise proprietary extension. Formats may support extension points in various ways and to different degrees. Some formats do not allow extension points in any predefined way. Some formats allow for a single extension point. Other formats allow multiple extension points but may restrict what items may be extended, such as adding just new attribute types or character encoding. The most flexible formats support extension points in all important ways and on arbitrary data items. An important consideration relative to concerns about interoperability, evolution, and debugging is whether features can be represented by standard convention in XML. Some features that extend existing data models may be part of the initial version of a format while other potential additions may be implemented as extension points. An example would be the addition of item metadata such as typing or encoding. Another example would be the addition of new tokens for support of a new kind of data model item.
This property refers to the ability to efficiently determine the version of a format from a document instance.
It is considered a best practice to reliably and efficiently identify versions of a format. XML 1.x supports this notion as part of the optional XML declaration in the document's prolog (if absent, XML version 1.0 is assumed). It is desirable to access this information as early as possible, so a format that does not make this information available when the processing starts should be considered inefficient as far as this property is concerned.
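For XML 1.x, early version identification can be sketched as follows. The function name `xml_version` is hypothetical, and the sketch assumes an ASCII-compatible encoding (optionally preceded by a UTF-8 byte order mark); it inspects only the first bytes of the document and falls back to 1.0 when no XML declaration is present.

```python
import re

# Matches an XML declaration at the very start of the document.
_XML_DECL = re.compile(rb"^<\?xml\s+version\s*=\s*([\"'])([^\"']+)\1")
_UTF8_BOM = b"\xef\xbb\xbf"


def xml_version(data: bytes) -> str:
    """Return the declared XML version, or "1.0" if none is declared."""
    if data.startswith(_UTF8_BOM):
        data = data[len(_UTF8_BOM):]
    match = _XML_DECL.match(data)
    if match is None:
        return "1.0"  # no declaration: XML 1.0 is assumed
    return match.group(2).decode("ascii")
```

An alternate format could make the same determination cheaper still, for instance by placing a version field at a fixed offset, but that is a design choice outside the scope of this sketch.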
A format is said to be fragmentable when it supports the ability to encode instances that do not represent the entirety of a document together with sufficient context for the decoder to process them in a minimally meaningful way.
While typical usage of XML involves exchanging entire documents (the special case of external parsed entities notwithstanding), it is sometimes desirable to support the ability to exchange smaller, independently exploitable parts of a document. The presence of this property largely facilitates a variety of other properties of a format and processing tasks that may be performed on top of a document such as the transmission of deltas, error resilience mechanisms, improved access times, or the prioritized transmission of document parts.
This property is similar to 5.28 Streamable in that processors featuring these properties are able to deal with small parts of a document, but it is different in that the fragments can be treated independently and in arbitrary order, unlike 5.28 Streamable, where atomic items are processed in document order. This difference incurs additional requirements to support the transmission of the context required to process a fragment (at the very least the set of in-scope namespaces, and possibly also the values of xml:base, xml:space, and xml:lang). Several standardization efforts within the W3C refer to the ability to fragment XML documents, notably XQuery Update and XML Fragment Interchange [XML Fragment Interchange].
In addition to the ability to process fragments in isolation, it is possible to consider storing one or more parts of a document instance as immediately extractable fragments, so that they can be pulled out with little or no additional processing cost. This can be achieved, for example, by supporting localized versions of any tokenization tables, namespaces, or other redundancy reduction measures that may have been employed in the document. This ability to encode self-contained subtrees is also useful to facilitate document composition.
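The need to transmit context alongside a fragment can be illustrated with textual XML. In the sketch below, the `FragmentContext` class and the `parse_fragment` function are hypothetical; they carry the in-scope namespaces, xml:base and xml:lang, and re-establish them by wrapping the fragment in a synthetic element before parsing. The sketch assumes the fragment is a single element.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, Optional
from xml.sax.saxutils import quoteattr


@dataclass
class FragmentContext:
    """Context a decoder needs in order to process a fragment in isolation."""
    namespaces: Dict[str, str] = field(default_factory=dict)  # prefix -> URI
    xml_base: Optional[str] = None
    xml_lang: Optional[str] = None


def parse_fragment(fragment: str, ctx: FragmentContext) -> ET.Element:
    """Parse a single-element fragment using the supplied context."""
    attrs = []
    for prefix, uri in ctx.namespaces.items():
        name = f"xmlns:{prefix}" if prefix else "xmlns"
        attrs.append(f"{name}={quoteattr(uri)}")
    if ctx.xml_base is not None:
        attrs.append(f"xml:base={quoteattr(ctx.xml_base)}")
    if ctx.xml_lang is not None:
        attrs.append(f"xml:lang={quoteattr(ctx.xml_lang)}")
    # The wrapper exists only to put the context back in scope.
    wrapper = f"<wrapper {' '.join(attrs)}>{fragment}</wrapper>"
    return ET.fromstring(wrapper)[0]
```

A fragment such as `<po:item>widget</po:item>` cannot be parsed on its own because the prefix `po` is unbound; with a context that maps `po` to a namespace URI it parses normally.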
A format has the property of generality if it is competitive with alternatives across a diverse range of XML documents, applications and use cases.
To be successful as a global standard, a format must be valuable for a wide range of XML documents, applications and use cases. XML documents vary in size from tens of bytes to tens of gigabytes. They vary in structure from highly structured data to semi-structured and loosely structured documents. Some XML applications require strict schema validity, while others deal more flexibly with schemas or do not use schemas at all. Some XML applications require preservation of insignificant whitespace, comments, processing instructions, etc., while others ignore these items or actually prohibit some of them. Some binary XML use cases need to optimize for compactness at the expense of speed, some need to optimize for speed at the expense of compactness and some require a balance between these two extremes.
Formats that are competitive with alternative solutions across a wide range of XML documents, applications and use cases are more general and more likely to succeed as a global standard. On the other hand, formats that are optimized for specific data, applications or use cases at the expense of others are more specialized and are not likely to succeed as a global standard.
A format is human language neutral if it is not significantly more optimal for processing when its content is in a given language or set thereof, and does not impose restrictions on the languages or combinations of languages that may be used with it.
Historically, it has often been a property of many data and document formats that they only supported a small subset of existing human languages (often due to supporting a limited legacy character encoding), and therefore were unusable in a large set of situations. More recent formats such as XML do not suffer from similar limitations. While it is impossible for a format to perform identically, in terms of compactness or processing efficiency, for a language that can be entirely captured using a single byte per character and for one that requires a multi-byte encoding, not favouring one over the other ensures better internationalization support, resiliency to the passing of time, and makes wider adoption possible.
A format is humanly readable and editable to the extent to which a person can understand and modify it without either a specification of the format or implementations of that specification. For example, many persons are capable of reading and editing XML files without having read the XML specification and without the assistance of XML-specific software applications.
Situations often arise in system development, testing, and troubleshooting in which it is convenient to be able to create, examine or repair data using a limited tool set. That is, a text editor is likely available, but a format-specific parser may not be. Similarly, the data may conform to some specification but that specification may not be readily available.
For example, HTML is a humanly readable and editable format. As such, many HTML files have been created or updated merely by reading other HTML files and using no more than basic text editors. The mechanism by which the humanly readable and editable quality of HTML aided its rapid adoption is sometimes called the "view source" effect, in reference to the web browser menu item which permits the underlying source HTML of any page to be viewed.
A similar situation arises in archiving applications. For short-term archiving, on the order of a few decades, it is generally reasonable to assume that format specifications and implementations are available. But for long-term archiving, on the order of centuries or longer, it is best to assume that the format specification and implementations will be lost or unusable and the documents will be deciphered by people.
Whether for short-term web page authoring or long-term archiving, this property is fundamentally about how easy it is to understand and modify the data in a file when little about that file is known. To that end, a format satisfying this property should attempt to make the first guess of someone attempting to read or modify the file correct. Specifically:
Use a regular and explicit structure, such as element tags defined by XML.
Use natural language as text and avoid the use of magic binary values or compression.
(A magic value is one which is assigned arbitrarily and whose meaning cannot be derived except by reference to some external authority, such as a file format specification. For example, if a format used the number '73' to indicate the use of UTF-8 and '29' to indicate UTF-16 then it uses a magic number, as there is no decipherable logic to these assignments. Compare this to a format which instead uses the strings 'utf-8' and 'utf-16'.)
Use a limited number of encoding mechanisms, and prefer those that are most obvious. For example, use RGB triples to encode color, not both RGB and CMYK.
Be self-contained, as external information (such as a referenced schema) may not be available. (cf. 5.25 Self Contained)
Maintain the same order and position of information items as they have in the data model being represented. For example, keep element names inline as in XML, and not consolidated into a token dictionary located elsewhere in the file. Likewise, elements should be serialized in the same order as they appear in the model instead of being stored in, say, alphabetical order.
XML as a data format is surrounded by a large body of specifications that provide additional features (validation, transformation, querying, APIs, canonicalization, signatures, encryption, rendering, etc.) considered to form the XML Stack. A format is said to integrate well into the XML Stack if it can easily find its place into the large body of XML-related technologies, with minimal effort in defining new or modified specifications.
One of the great powers of XML is that whenever a technology is added to its core set (the XML Stack) it becomes instantly available to the others, thereby leading to better-than-linear increments in the quality and usefulness of the system. The value of a new format can therefore be in part measured by how well it integrates into the XML Stack so as to be able to reuse as much as possible of the existing functionality.
It should be noted that a vital factor in making this orthogonality in specifications possible has been the syntax-based nature of XML that enables a loose coupling between various systems that may then be based on a data model reused from another specification, on a data model of their own, or directly on the XML 1.x syntax.
A format exhibits localized changes if a single change to an information item in a data model instance produces a corresponding change in the format that is limited to a single small range of bytes. A format also exhibits localized changes if multiple nearby data model changes cause small nearby changes in the format. This property refers to changes in a new complete instance relative to an instance that exists prior to changes. This is distinct from but related to the 5.4 Deltas property which represents the ability to produce an instance that consists only of the changes relative to the original instance. In a delta, the changes are naturally packed closely together. Some use cases that require localized changes may be served by the ability to work with deltas.
It is possible to create a format where a minor difference between two logical data model instances causes widespread differences in the byte representation of the resulting format instances. It is also possible to create formats where most differences, which could be considered changes, would be more localized. The default definition of "nearby logical data model changes" would be the depth-first nature of XML. Other interpretations are possible, such as changes to siblings versus deep children.
This property measures whether changes in the data model being represented are reflected with relatively small and similarly coherent changes in the format. An individual change should produce a small contiguous byte range change while a group of data model changes should produce format instance changes that are relatively similar in distance. The 5.4 Deltas property provides for dense format instances that localize changes that are arbitrarily far apart. Some applications may benefit from localized changes but be as well or better served by deltas, making this property more important if deltas are not supported.
As an example, XML supports localized changes well while gzipped XML does not. Gzip, being a streaming dictionary-based compressor, will tend to produce a different byte stream after the point of a difference in input.
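The contrast between plain XML and compressed XML can be observed with a small experiment. The sketch below uses Python's `zlib` module (the DEFLATE algorithm underlying gzip) on a synthetic document; the document content and the helper name `diff_positions` are assumptions made for illustration.

```python
import zlib


def diff_positions(a: bytes, b: bytes):
    """Return the offsets at which two byte strings differ."""
    longest = max(len(a), len(b))
    return [
        i for i in range(longest)
        if i >= len(a) or i >= len(b) or a[i] != b[i]
    ]


records = "".join(f"<item id='{i}'>value {i * 7}</item>" for i in range(200))
original = f"<doc>{records}</doc>".encode("utf-8")
# Change a single character near the start of the document.
modified = original.replace(b"value 7<", b"value 8<", 1)

raw_diff = diff_positions(original, modified)
packed_diff = diff_positions(zlib.compress(original), zlib.compress(modified))

print("plain XML bytes changed:", len(raw_diff))
print("compressed bytes changed:", len(packed_diff))
```

The plain representation changes in exactly one byte, whereas the compressed streams differ in more than one place, because the checksum and the encoding of subsequent data both depend on the changed input.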
The No Arbitrary Limits property refers to the degree to which a format imposes limits on quantities such as sizes, lengths, maximum number of character encodings, etc. Formats normally establish these limits by allocating a maximum number of bits for the storage of those quantities.
Arbitrary limits can cause discontinuities in format evolution which translate into application difficulties for unforeseen uses. A format should try to avoid imposing limits on quantities such as string lengths, table sizes, etc. to the extent to which those decisions may result in incompatibility problems with subsequent revisions of the format designed to address advancements in technology or new uses. It is worth noting that in many cases there is a tradeoff between flexibility and ease of use as well as between flexibility and performance.
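One well-known way to avoid fixing the number of bits allocated to a quantity is a variable-length integer, in which each byte carries seven bits of the value and one continuation bit. The sketch below is offered as an illustration of the tradeoff rather than as a recommendation; the function names are hypothetical.

```python
def encode_length(n: int) -> bytes:
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_length(buf: bytes, pos: int = 0):
    """Decode an integer starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Small values cost a single byte and there is no built-in maximum, but decoding requires a loop rather than a single fixed-width read, which is an instance of the tradeoff between flexibility and performance mentioned above.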
Platform neutrality is the property of formats that are not significantly more optimal for processing on some computing platforms or architectures than on others.
It is naturally impossible for a format to perform identically on all computer platforms and architectures, but in many cases it is possible to optimize a format for processing on a given platform to the detriment of several others. Some platform neutral formats may have set endiannesses or word-lengths, but these have been chosen to correspond to the format's needs and not to match a given platform's specificities, e.g., by being defined around the native structures of a given programming language. Platform neutrality ensures not only that wide adoption is possible, but also makes the format more resilient to the passing of time. In some cases, options in the format may be used based on the preferred parameters of the systems involved.
Random access is the ability to look up the location of data model items in a document, as opposed to searching for data items by sequential traversal of XML events or XML structures. Although the objective of Random Access is to reduce the amount of time (algorithmic complexity) needed to access data model items, the property is characterized by its support for a random access method to the data.
Even though the random access method is intended to reduce lookup time, a cost will be associated with the structures which support the lookup. The cost in terms of processing time will be at least the time required for a sequential scan of an input document as well as the additional costs in terms of storage, memory consumption, and bandwidth utilization. The random access property will be of interest in those use cases where that cost is less significant than the access cost to the document without random access.
Many of the implementation characteristics of random access in XML are related to the ability to build a table for efficient access to data items in the document. The following example illustrates a simple form of a random access table. The byte position and length are obtained for each data item in the document using a token scanner; this information is later stored in a table and associated with the document. Given a scheme for looking up a data item in the table, an application could directly read any data item in a memory-mapped byte array (using the byte position and length) or it could perform a sequential byte traversal over the byte stream.
The access table may contain information other than the position and length of the data model item. Some additional information will be indispensable in almost all cases since these two properties will only support look-up of data items by position. Examples of other information associated with each entry in the access table include its kind, its schema-described data type, as well as other schema-described properties such as optionality and namespace scope (alternately, schema-aware documents could simply reference schema information which is available in the schema itself stored outside of the document). This additional information may be more or less useful to specific types of applications; as stated before, the cost of construction, access and size of the table will be balanced against the potential benefit to specific types of applications.
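A minimal sketch of such a table, built over textual XML, might look as follows. The `Entry` structure and the functions `build_access_table` and `read_item` are hypothetical, and the token scanner is deliberately simplistic: it assumes no comments, CDATA sections, processing instructions, or `>` characters inside attribute values.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    kind: str    # "start-tag" or "text"
    name: str    # element name; empty for text
    offset: int  # byte position in the document
    length: int  # length in bytes


# End tags are matched so they can be skipped; start tags and text are kept.
TOKEN = re.compile(rb"</[^<>]*>|<([A-Za-z_][\w:.-]*)[^<>]*>|([^<>]+)")


def build_access_table(doc: bytes) -> List[Entry]:
    """Scan the document once and record position and length of each item."""
    table = []
    for m in TOKEN.finditer(doc):
        size = m.end() - m.start()
        if m.group(1) is not None:
            name = m.group(1).decode("ascii")
            table.append(Entry("start-tag", name, m.start(), size))
        elif m.group(2) is not None and m.group(2).strip():
            table.append(Entry("text", "", m.start(), size))
    return table


def read_item(doc: bytes, entry: Entry) -> bytes:
    """Read one data item directly, without traversing the document."""
    return doc[entry.offset:entry.offset + entry.length]
```

Note that the table is built by one sequential scan, which is the minimum construction cost mentioned earlier; the benefit appears only when items are subsequently read more than once or out of order.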
The access table may be complete, selective, on-demand, or heuristic. A complete access table would provide addressing information for all data items in the document. A selective access table would provide addressing information for some of the data items in the document. For example, a random access system can provide addressing information only for elements. An on-demand access table can provide addressing information for specific data items that have been requested by the application. Finally, a heuristic random access system can provide addressing information for data items based on some criteria such as prior experience with related documents.
The type of access table has bearing on the time needed to construct it, its size, and the time needed to obtain addressing information for a data item. These constraints are the trade-offs made against the level of addressability into the document as offered by the table.
The actual construction of the access table and the format of the lookup results is largely an implementation decision but some design features will better support different approaches to enabling random access to data model items efficiently. For example, a selective implementation of random access to data items which supports lookup by element name might be best implemented as a hash table. An on-demand implementation that uses a sequence of XPath statements to specify data items of interest may return an array of nodesets whose order is determined by the input sequence.
Another characteristic of data items lookup is whether it is guaranteed to return a unique data item or if multiple data items may match a lookup and, in this case, whether multiple, iterative, or a single selected item will be returned. This characteristic will again have a bearing on performance issues and trade-offs may reasonably be made according to the needs of the application.
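The selective, name-based lookup described above can be sketched with a hash table that maps each element name to the byte offsets of its start tags. The function name `index_by_name` and its `wanted` parameter are assumptions for illustration; returning a list per name makes explicit that a lookup may match several data items.

```python
import re
from collections import defaultdict

# Matches start tags only; end tags and declarations do not begin with a name.
START_TAG = re.compile(rb"<([A-Za-z_][\w:.-]*)[^<>]*>")


def index_by_name(doc: bytes, wanted=None):
    """Map element names to the byte offsets of their start tags.

    If `wanted` is given, only those names are indexed (a selective table).
    """
    index = defaultdict(list)
    for m in START_TAG.finditer(doc):
        name = m.group(1).decode("ascii")
        if wanted is None or name in wanted:
            index[name].append(m.start())
    return dict(index)
```

An application that needs a unique result could take the first offset in the list, while one that needs all matches could iterate over it; which behaviour is appropriate depends on the application.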
In allowing direct access into a document, the random access method can, potentially, have some impact on the binary format's 5.12 Fragmentable property. A format which supports both random access and fragmentability would have to have a mechanism for providing the context of a data item. The context should include active namespaces and, possibly, hierarchical context such as enclosing and ancestor elements. This context information also has a bearing on the support of schema-awareness allowed through the random access method (namespaces are needed to resolve the schema; hierarchical context may be needed to determine the relevant content model). The context is naturally preserved when the index or access table is absolutely independent, from the access point of view, from the infrastructure which supports fragmentation. This design is most feasible when the content of the document is relatively static and primarily accessed for reading. In a more dynamic situation, the random access method and fragmentation infrastructure may need to work together to achieve the desired goal. In this case, the inclusion of context information may impact either or both the speed of random access and the storage required by the access table. A format supporting random access should specify whether it supports full, partial, or no fragmentability.
A fundamental distinction in the implementation of the Random Access property is whether the format contains the access table or it is simply guaranteed not to be incompatible with the implementation of random access methods. In the first case the access table is bound to the XML document instance, enabling it to be transported and accessed by different processing nodes while retaining all information necessary for random access. In this case, a standard will have to explicitly specify the format of the access table. In the second case, transportability of random access information embedded in the document is not required but the ability to build an access table external to the document (or multiple documents) is. For example, in a persistent store (e.g., a database), an index can be built over all documents in the store but without the additional storage overhead associated with including an access table in each document. Conditions incompatible with the random access property of a binary XML format in this scenario may include such things as the impossibility to construct an index due to, for example, a compression scheme that does not store data sequentially, or the prohibitive processing time requirements for the construction of an index dictated by the complexity associated with the format.
It is also possible for these two approaches to coexist; the document may, optionally, contain an access table and also guarantee efficient indexability. A consequence of allowing the embedded access table to be optional is that the document must be complete and capable of being correctly processed without the access table. It must also be possible to construct and add the embedded access table from an XML document which does not contain the access table. Some mechanism must exist to enable applications to identify whether the access table is present or not, and to negotiate the format when documents are interchanged. The notion could be extended to support different types of access tables appropriate to different application scenarios.
Two operations are associated with random access: random read (extract) and random update. The random read operation extracts information from an XML document. This information can contain one or several values and/or one or several data model items (subtrees). The random update operation allows data items in the XML document to be inserted, deleted or replaced using random access addressing mechanisms to locate the position in the document where the update is to be made.
There are multiple implementation techniques for the random update operation. They can either update documents directly by modifying the document representation itself or do it indirectly by storing new parts of the document in separate tables. In any case, for interoperability, random update must allow the updated document to be written out to an XML 1.x representation with the updates in place.
Any write process presents challenges with respect to synchronization of updates when updates may come simultaneously from multiple threads or processes. A random access update implementation either automatically or optionally synchronizes updates or assumes the user takes responsibility for synchronizing data requests.
Other characteristics of random update include whether it can enforce XML well-formedness and/or schema compliance of updated data, and whether such enforcement takes place each time an update is made, or on request, or when the XML is written out.
The Random Access property is tightly associated with the efficiency of read and update operations. Generally speaking, it is possible to build access tables for an XML document in its textual representation. However, this would require a processor to maintain all structural information (references to each item and property) as well as all data type information (for schema-aware documents) outside of the document itself. Such information, very often, takes significantly more space than the document itself; additionally, most of the random access operations require too many reads and writes in this case.
A format that includes embedded random access indexing abilities may support a further variation that could be called a stable virtual pointer that is valuable for several uses and needed by certain use cases. A stable virtual pointer is an index entry that is created through any manner that results in an absolute location in the data model instance. This index entry has a permanent ID relative to the file and has the property that any update to the data model will result in the virtual pointer still pointing to the same location. Location in this case means the same element, character, or position relative to other items in the data model instance. Characters, elements, attributes, and other objects can be inserted, updated, or removed and the virtual pointer continues to point to the same relative position. The format might support references to this stable virtual pointer. Internal references might be specially tagged and should have low complexity of usage for an application. A stable virtual pointer allows efficient construction, manipulation, and usage of data models other than tree structured ones, and also allows stable references both internally and externally which would be complex and costly to maintain otherwise.
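The behaviour of a stable virtual pointer can be approximated, for illustration only, over a plain character buffer instead of a full data model instance. In the sketch below the class `PointerBuffer` and its methods are hypothetical; each pointer keeps a permanent ID, and every insertion or deletion adjusts the stored offsets so that the pointer continues to designate the same item.

```python
class PointerBuffer:
    """A text buffer whose pointers survive insertions and deletions."""

    def __init__(self, text: str):
        self.text = text
        self._pointers = {}  # permanent ID -> current offset
        self._next_id = 0

    def create_pointer(self, offset: int) -> int:
        """Register a pointer at an offset and return its permanent ID."""
        pid = self._next_id
        self._next_id += 1
        self._pointers[pid] = offset
        return pid

    def resolve(self, pid: int) -> int:
        """Return the current offset designated by a pointer ID."""
        return self._pointers[pid]

    def insert(self, offset: int, s: str) -> None:
        self.text = self.text[:offset] + s + self.text[offset:]
        for pid, pos in list(self._pointers.items()):
            if pos >= offset:
                self._pointers[pid] = pos + len(s)  # shift right

    def delete(self, offset: int, length: int) -> None:
        self.text = self.text[:offset] + self.text[offset + length:]
        for pid, pos in list(self._pointers.items()):
            if pos >= offset + length:
                self._pointers[pid] = pos - length  # shift left
            elif pos > offset:
                self._pointers[pid] = offset        # target was deleted
```

What should happen to a pointer whose target is deleted is a policy decision; collapsing it to the start of the deleted range, as done here, is only one possible choice.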
A format is said to be robust if it allows processors to insulate erroneous sections in a way that does not affect the ability to process the remaining sections.
Processors exhibiting robustness minimize the size of the sections affected by errors (within some limited boundary of the error positions) and are able to unequivocally interpret the remaining sections without compromising the partial format integrity. From an application's point of view, the result is a partially complete data model instance with sporadic information losses caused by the errors.
The robustness property of a format can often be translated into (i) the processor's ability to detect errors if there are any and (ii) the format's facility to permit skipping over to the position where the processor can resume document processing. To support the processor's ability to detect errors, dedicated redundancy such as a cyclic redundancy check (CRC) may be added to the format in the so-called channel coding algorithms.
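The two ingredients named above, error detection and the ability to skip to a resumption point, can be sketched with length-prefixed sections that each carry a CRC-32. The framing layout and the function names `frame` and `read_sections` are assumptions made for this illustration. The sketch tolerates corruption in a section payload only; a corrupted length field would require resynchronization markers, which are not shown.

```python
import struct
import zlib


def frame(sections):
    """Prefix each section with its length and CRC-32 (big-endian)."""
    out = bytearray()
    for payload in sections:
        out += struct.pack(">II", len(payload), zlib.crc32(payload))
        out += payload
    return bytes(out)


def read_sections(stream: bytes):
    """Return each section's payload, or None where the CRC check fails."""
    pos = 0
    results = []
    while pos + 8 <= len(stream):
        length, crc = struct.unpack_from(">II", stream, pos)
        payload = stream[pos + 8:pos + 8 + length]
        intact = len(payload) == length and zlib.crc32(payload) == crc
        results.append(payload if intact else None)
        pos += 8 + length  # skip to the next section regardless of errors
    return results
```

From the application's point of view the result is a partially complete sequence of sections, with the damaged ones marked as lost and the remainder processed normally.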
There are applications that have certain constraints that do not permit or afford the time of data re-transmission and are required to do their best in an attempt to recover as much information as possible out of the document even in the face of errors.
One such application is found in multimedia broadcasting systems that broadcast data to wireless devices. Broadcast media must be resilient to possible errors and continuously allow processing on the devices even with sporadic corruption caused by transmission errors.
Another example is business document exchange over unreliable networks. Documents may consist of data of different levels of significance. Some portion of the document may be absolutely crucial for performing the business while others may be merely informative. It is often the case that businesses can move forward as soon as the crucial data is processed even if errors are found in the informative parts.
The property of robustness is primarily concerned with bit-stream level errors and is generally agnostic of such types of errors as are found at the document level, such as grammatical or semantic errors. It does not cover the behaviour of lax parsers such as most HTML browsers that rely on complex and non-standard heuristics in order to continue processing in the face of severe errors so as to interpret authorial intent and provide bug for bug compatibility with previous implementations. Rather, robustness corresponds to draconian processing that may be compartmentalized to fragments of a document.
A format supports roundtripping if converting a file from XML to that format and back produces an output equivalent to the original input.
A format supports roundtripping via XML if converting a file from that format to XML and back produces an output equivalent to the original input.
In the course of processing, a file may be converted between XML and an alternate representation, or vice versa, one or more times. Roundtrip support measures the degree to which the original input and the final output of this process are equivalent, assuming no other changes to the file.
A format may support roundtripping to various degrees. Exact equivalence means that an exact copy of the input can be produced from the intermediate format. Lossless equivalence means that an XML instance or fragment can be output and verified which is identical to the input XML instance or fragment in all aspects which are significant and can only differ in those aspects which are not significant.
A format generally supports lossless equivalence if it directly supports the same data models as XML, as this means each element of the data model instance has a representation both in that format and in XML. Otherwise, a file may contain an element with no representation and which is therefore lost during conversion.
For example, if a format supported nested attributes on attributes, this information would have no direct equivalence when converting to XML's single-level attributes. This is not to say that such nested attributes could not be encoded in XML according to some encoding scheme, but that any such encoding scheme would not qualify as direct support for the data model. Support for roundtripping is defined to require direct data model support.
Relationship to Canonicalization: Equivalence verification is necessary to prove that roundtripping has been successful. If a format supports exact equivalence then verification is trivial. Otherwise, the data must first be converted to a canonical form. XML 1.x canonicalization is an expensive operation. Short of support for exact equivalence, a format which supports a more efficient canonicalization algorithm is preferable to one that does not.
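The difference between exact and lossless equivalence can be illustrated with a minimal sketch. It assumes that Canonical XML 2.0, as implemented by `xml.etree.ElementTree.canonicalize` in the Python standard library (3.8 or later), is an acceptable stand-in for the definition of which aspects are significant. The sample documents and the `equivalent` helper are illustrative only.

```python
import xml.etree.ElementTree as ET


def equivalent(xml_a: str, xml_b: str) -> bool:
    """Compare two documents by their canonical (C14N 2.0) serializations."""
    return ET.canonicalize(xml_a) == ET.canonicalize(xml_b)


# The same information serialized twice: attribute order, quoting style and
# whitespace inside the start tag differ, which C14N treats as insignificant.
original = "<pixel  b=\"30\" a='10'><red>10</red></pixel>"
roundtripped = '<pixel a="10" b="30"><red>10</red></pixel>'
# A document whose content really changed.
changed = '<pixel a="10" b="30"><red>11</red></pixel>'

exactly_equal = original == roundtripped              # byte-for-byte comparison
losslessly_equal = equivalent(original, roundtripped)  # comparison after C14N
```

Here the byte-for-byte comparison fails while the canonical comparison succeeds, which is the situation in which a format offers lossless but not exact equivalence.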
Support for schema extensions and deviations is the ability to represent information items that were either not defined in the schema associated with the input document or do not conform exactly to the associated schema definitions. The phrase open content has been used to refer to one form of schema extension, wherein an instance document is permitted to include elements and attributes beyond those defined by the schema. All non-schema-based formats exhibit this property. A format that prohibits applications from intentionally encoding information sets that do not conform to the given schema does not exhibit the schema extensions and deviations property.
In the pursuit of more space efficient encodings of data, one strategy is the use of a schema of some kind to inform the encoder and decoder. There are several different ways that this information can be used to minimize data in an instance. Some strategies result in removing some or all self-describing structure, type information, and identifiers in a way that often makes evolution of the schema, encoding applications, and decoding applications rigid and difficult. Other strategies allow more flexibility and more loose coupling. Self-contained, fully self-describing, methods exhibit this property completely.
This property illustrates the tradeoff between space efficiency and flexibility. Flexibility can include mismatch in versions between schemas or encoder/decoder versions, parts of a logical schema that evolve independently in a collection of applications or partners, and the general ability to evolve gracefully with loose coupling. Schema-informed strategies are one of several types of methods for minimizing data size. Other methods generally provide full support for arbitrary instances but may have other tradeoffs. While more than one method may be employed in a format, the primary method can be seen as a major decision in a format.
An important distinction exists between possible solutions that require explicit prior indication of what elements may be extended and those solutions that allow extension anywhere. With some format designs, this indication is by schema metadata. These formats would not satisfy the schema extensions and deviations property because these definitions are effectively part of the schema. Schema extensions and deviations may or may not be encoded relative to another schema. This property is different from 5.10 Extension Points in that it refers to dynamic extensions to a schema or the case in which no schema is used at all.
From some perspectives, the ability to handle arbitrary instances is potentially not part of what would be called "valid XML" because it is not described by a schema or not described in detail. Other perspectives include environments, application architectures, and development situations where infrastructure requiring schemas for validation and/or encoding is onerous or impossible to use. The ultimate authority for validity of a data model instance lies with the developer and application that works with the data object. Particular schema specifications represent typical validation needs factored out into a common language and potentially for data-driven validation engines. These will never be complete or sufficient for every case and may be difficult to fit to some needs. In some cases, the need for flexibility and other solution aspects contraindicates schemas and formats that require any use of schemas.
A format exhibits schema instance change resilience if areas of interest that do not change or, with less flexibility, are changed only in restricted ways, such as additions, remain valid to receivers expecting earlier or later versions even after other areas have changed. An area of interest may be defined as a subtree, an XPath, or other data model items.
It is very common for a data format instance definition to change over time. A format supports limited Schema Instance Change Resilience if only the schema or related metadata needs to be distributed. A format is fully flexible if any change is supported and less so if restricted, to additions for example. Full support means that changes are only needed when directly affected by modifications, such as removing a data model item.
This property is related to 5.25 Self Contained and may be fulfilled with or without self containment. A non-self-contained solution might rely on loadable schema meta-data or delta parent instances.
There are three categories of serializations with respect to Schema Instance Change Resilience: (i) requires schema-related updates for any change; (ii) does not require schema-related updates for certain changes, such as additions; and (iii) does not require schema-related updates.
An XML format is self-contained if the only information that is required to reproduce the data model instance is (i) the representation of the data model instance and (ii) the specification of the XML format.
For applications in which the receiver is unable to request or receive additional information, it is important that the document instances are self contained. This is desirable in applications in which it would be difficult, impractical, or costly to access additional resources.
An example of such application is an archiving application where there is a significant time lag between generation and consumption of the data. To ensure that the receiver is capable of reproducing the data model instance from the archived format, the format must be self contained. Accordingly, no additional information is required which might no longer be available at the time of consumption.
Another example would be an infrastructure consisting of intermediary applications that are placed between senders and receivers without their prior knowledge. For instance, XML firewalls and load balancing applications must efficiently inspect the content of document instances in order to make decisions. A format that is always self contained is helpful in such situations. However, applications may still be able to operate if they can be pre-configured with the additional knowledge external to the format definition that is required. In addition, formats that provide optional support for this property can still be used if this feature is negotiated during the initial handshake.
Schema-based formats allow for the efficient representation of data model instances due to knowledge of the structure and datatypes defined in the schema. Thus, a schema-based format requires the receiver to have access to the schema definitions for the decoding process to be successful. A schema-based format is considered not self contained unless the schema information is also stored as part of the format (an optional feature to include the schema would be an example of an optionally self-contained format). The partial use of schema information would lead to several partially self-contained formats.
The following example illustrates the general concept of schema-based versus non-schema-based encodings. Consider the following simple XML fragment that might be used to describe a pixel of data:
    <red>10</red>
    <green>20</green>
    <blue>30</blue>
One way of describing this fragment in binary would simply be to serialize the three values as binary bytes:
    10,20,30
The receiver would need to know in advance what each of these bytes meant. This could be done by means of a simple schema:
    byte 1 = red
    byte 2 = green
    byte 3 = blue
This would be fine until something changed on either side of the link. For example, one side might now define a new color component (alpha):
    byte 1 = red
    byte 2 = green
    byte 3 = blue
    byte 4 = alpha
If the other side was not immediately made aware of this change (a very common real-world occurrence), it would be expecting three bytes, but would receive four, and would therefore not know what to do with the last byte. Most likely, it would be ignored.
A more serious issue would arise if alpha was inserted at a different position than the end:
    byte 1 = red
    byte 2 = alpha
    byte 3 = green
    byte 4 = blue
In this case, the processor that was not aware of the change would likely interpret the newly added byte in the wrong way.
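The misinterpretation can be made concrete with a small sketch of a purely positional encoding. The schema tuples and the helper names `encode` and `decode` are assumptions chosen for illustration; they are not part of any proposed format.

```python
# Hypothetical positional schemas: the sender has been upgraded and inserts
# alpha as the second byte; the receiver still uses the three-byte layout.
OLD_SCHEMA = ("red", "green", "blue")
NEW_SCHEMA = ("red", "alpha", "green", "blue")


def encode(schema, pixel):
    """Serialize one byte per schema entry, in schema order."""
    return bytes(pixel[name] for name in schema)


def decode(schema, data):
    """Assign bytes to names purely by position.

    zip() stops at the shorter input, so trailing bytes that the schema does
    not mention are silently ignored.
    """
    return dict(zip(schema, data))


message = encode(NEW_SCHEMA, {"red": 10, "alpha": 40, "green": 20, "blue": 30})
seen_by_old_receiver = decode(OLD_SCHEMA, message)
# The old receiver reads alpha (40) as green and green (20) as blue.
```

Nothing in the message allows the outdated receiver to detect the problem: the bytes decode without error, only to the wrong values.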
The format is self-contained if everything that the receiver needs to know to decode the contents is included. The XML instance at the beginning of the example illustrates this. So does the binary case if the schema is included (most likely in an optimized binary form) with the message contents:
    byte 1 = red
    byte 2 = alpha
    byte 3 = green
    byte 4 = blue
    10,40,20,30
By moving some information from the schema to the message format, improved resilience to change can be achieved. For example, we could define a simple type system for the color values and add these identifiers to the message instance:
    Schema:
    '1' = <red>, 1 byte
    '2' = <green>, 1 byte
    '3' = <blue>, 1 byte
    Message instance:
    '1',10,'2',20,'3',30
This allows for two things that could not have been done before:
The elements can be transmitted in any order;
If an additional element is added at one transfer end that the other end does not know about, it can take an appropriate action. For example, if '4' was added as the type code for alpha in the previous example, the processor without this information could simply discard a '4' element because it did not know what to do with it and still function normally as it was before.
Additional tradeoffs can be made by adding length information to allow resilience to content length changes and additional type information using well-known type codes to allow a message instance to be interpreted to some degree without the need for a schema.
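A minimal sketch of such a tagged encoding with length information is given below. The one-byte tag, one-byte length layout and the names `encode_tlv`, `decode_tlv` and `TAGS` are assumptions made for this illustration only.

```python
# Receiver's schema: it knows tags 1-3 but has never heard of tag 4 (alpha).
TAGS = {1: "red", 2: "green", 3: "blue"}


def encode_tlv(fields):
    """Serialize (tag, value_bytes) pairs as tag, length, value triples."""
    out = bytearray()
    for tag, value in fields:
        out += bytes([tag, len(value)]) + value
    return bytes(out)


def decode_tlv(data, tags):
    """Decode known tags and skip unknown ones by using the length field."""
    result, pos = {}, 0
    while pos < len(data):
        tag, length = data[pos], data[pos + 1]
        value = data[pos + 2 : pos + 2 + length]
        pos += 2 + length
        if tag in tags:
            result[tags[tag]] = int.from_bytes(value, "big")
        # An unknown tag is discarded; its length tells us where it ends.
    return result


# Elements arrive out of order, and the unknown alpha element is two bytes long.
message = encode_tlv([(3, b"\x1e"), (4, b"\x00\x28"), (1, b"\x0a"), (2, b"\x14")])
decoded = decode_tlv(message, TAGS)
```

The receiver recovers red, green and blue regardless of their order and steps over the alpha element, at the cost of two extra bytes per element compared with the purely positional encoding.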
A format is signable to the extent to which it makes the creation and validation of digital signatures both straightforward and interoperable.
In principle any file format is signable, in that the bytes which compose any file may be fed to a digital signature algorithm. Signatures, however, are only useful when they can be both created and verified, and verification requires that the verifier be able to determine precisely which bytes were signed, by whom, and how. Formats vary in how amenable they are to specifying and maintaining this information, and this in turn can be a measure of how "signable" they are.
Other things being equal, file formats which define a one-to-one relationship between each possible data model instance and the serialization of that instance are more easily signed because they require that processors maintain the byte sequences exactly. A text file can be said to operate this way, in that a change to any byte of the file results in a different text file.
Other formats define a one-to-many relationship between the data model instance and possible serializations. Such formats permit (or even invite) processors to modify the byte sequences. XML is such a format; for example, a processor could replace each character entity in an XML document with a numeric entity reference and have encoded the same information but with a significantly different byte sequence. The ability to sign XML then requires the development of a canonicalization algorithm which defines a particular serialization of each data model instance that is used for signature purposes.
Finally, a format which has a one-to-many relationship between data model instances and serializations but also defines a canonical serialization might be considered as falling in between the two extremes; signing and verifying is more work than if there is only one serialization but it saves the effort of developing the canonical format itself.
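The effect of a one-to-many serialization on signatures can be sketched as follows. A SHA-256 digest stands in for the value that would actually be signed, and Canonical XML 2.0 as provided by `xml.etree.ElementTree.canonicalize` (Python 3.8 or later) stands in for the canonical serialization; both choices, and the sample documents, are assumptions of this example.

```python
import hashlib
import xml.etree.ElementTree as ET


def raw_digest(xml_text: str) -> str:
    """Digest of the bytes exactly as serialized."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def canonical_digest(xml_text: str) -> str:
    """Digest of the canonical serialization of the same document."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two serializations of the same information: the character reference is
# written in decimal in one and in hexadecimal in the other, and the
# attribute quoting differs.
doc_a = '<note lang="en">caf&#233;</note>'
doc_b = "<note lang='en'>caf&#xE9;</note>"

raw_match = raw_digest(doc_a) == raw_digest(doc_b)
canonical_match = canonical_digest(doc_a) == canonical_digest(doc_b)
```

A signature computed over the raw bytes would break when an intermediary re-serializes the document, whereas one computed over the canonical form survives, at the price of running the canonicalization on both sides.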
It is often desirable to sign only a portion of a file, such as in electronic document use cases in which multiple signatures are attached to a single document. This capability concerns both which portion of the file is signed and which portion is not signed, so that modifications to the latter will not break a signature. In such use cases, the signed portion is determined at the semantic (i.e., schema) level in the data model. For example, a signature may be applied to page one of a multi-page document but not to any other pages.
It is critical that such signatures are calculated over all portions of the file which encode information relevant to the semantic data model construct; otherwise, portions not included may be modified and the signature is insecure. For example, consider an XML document in which a sub-tree uses a namespace prefix. If the prefix declaration is outside the sub-tree and therefore not covered by the signature then the declaration can be altered —thus changing the meaning of the signed portion— without breaking the signature.
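The namespace example can be reproduced in a few lines. The element names and namespace URIs below are hypothetical; the sketch only shows that the bytes of the sub-tree stay identical while its expanded name, and hence its meaning, changes.

```python
import xml.etree.ElementTree as ET

# Bytes assumed to be covered by the partial signature.
SIGNED_SUBTREE = "<p:amount>100</p:amount>"


def wrap(namespace_uri: str) -> str:
    """Place the sub-tree in a parent that declares the prefix.

    The declaration lives on the parent element, outside the signed bytes.
    """
    return f'<order xmlns:p="{namespace_uri}">{SIGNED_SUBTREE}</order>'


def expanded_name(document: str) -> str:
    """Return the child's name as ElementTree reports it: "{namespace}local"."""
    root = ET.fromstring(document)
    return root[0].tag


original = wrap("urn:example:usd")
tampered = wrap("urn:example:eur")

name_before = expanded_name(original)
name_after = expanded_name(tampered)
```

A verifier that checks only the bytes of `SIGNED_SUBTREE` would accept both documents, even though the element now belongs to a different namespace.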
Other things being equal, then, formats which place all bytes representing the encoding of semantic data model constructs in a contiguous byte range are more signable because those ranges are more easily determined and specified. Formats which permit such ranges to be created but do not guarantee them are less signable because the application must either determine all ranges which must be signed or arrange for that information to be placed in a self-contained sub-tree contiguous byte range.
Finally, there are formats which will never place semantic constructs in contiguous ranges but scatter that information into tables and other mechanisms used to achieve compactness or other format properties. For example, a format may place element names in a vocabulary index table. That table may contain names of some elements in the signed region and others which are not; one must then determine how much of the table to sign and how to permit subsequent modifications to only break the signature when necessary. Such formats are least signable with respect to partial signatures.
Signers must be able to communicate to signature verifiers which bytes were signed, by whom, and how. Other things being equal, formats which make no provisions for recording this information are less signable because they require additional agreement between the parties involved in order to make signatures interoperate.
Other formats may provide syntax for encoding this information in the file format itself. Such formats are more signable because interoperable signatures can be created simply by reference to the format itself; no additional agreements with verifiers are required.
This is a property of formats that are able to associate processor extensions (known as plugins or codecs) with specific parts of a document in order to encode and decode them more optimally than the processor's default approach would.
Some specific vocabularies contain data structures that can benefit from special treatment, typically in order to obtain higher compression ratios or much faster processing. However, it would naturally be highly impractical to include all such specific cases in the default format specification so that all processors would have to implement them, while only a small subset of users would exercise the functionality.
Therefore, a format may include the ability to reference predefined extensions to the processor (both encoder and decoder) that are tailored to a specific need and can therefore encode certain parts of the document in an optimized manner. This requires the format to be able to flag segments as encoded with additional software, and the processors to be able to read these segments via the use of extensions.
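One possible shape for such a mechanism is sketched below. The segment layout (a one-byte codec identifier followed by a four-byte length), the `CODECS` registry and the use of `zlib` as the "specialized" codec are all assumptions made for illustration; no particular format is being described.

```python
import zlib

# Hypothetical codec registry: identifier -> (encode, decode).
CODECS = {
    0: (lambda b: b, lambda b: b),        # default: store the bytes unchanged
    1: (zlib.compress, zlib.decompress),  # stand-in for a specialized codec
}


def encode_segment(codec_id: int, payload: bytes) -> bytes:
    """Flag a segment with the codec used to encode it."""
    encoded = CODECS[codec_id][0](payload)
    # Assumed layout: 1-byte codec id, 4-byte big-endian length, data.
    return bytes([codec_id]) + len(encoded).to_bytes(4, "big") + encoded


def decode_segments(stream: bytes):
    """Decode each segment with the codec named in its header."""
    pos, out = 0, []
    while pos < len(stream):
        codec_id = stream[pos]
        length = int.from_bytes(stream[pos + 1 : pos + 5], "big")
        data = stream[pos + 5 : pos + 5 + length]
        pos += 5 + length
        if codec_id not in CODECS:
            # The interoperability cost: a receiver without the extension
            # cannot read the segment.
            raise ValueError(f"no codec installed for id {codec_id}")
        out.append(CODECS[codec_id][1](data))
    return out


samples = b"0.0 " * 200  # repetitive data that benefits from the special codec
stream = encode_segment(0, b"<title>t</title>") + encode_segment(1, samples)
```

The decoder recovers both segments only because codec 1 is installed on the receiving side; a receiver lacking it fails on the second segment.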
Though the presence of this property makes the format in general more suitable to a larger set of uses and less likely to include very specialized features of use to only a small fragment of the format's user base, it also carries a high cost in terms of interoperability as it requires all participants involved in the exchange to support the additional software that knows how to decode the specific extension.
While in practice there are subtleties as to the ways in which this property can be supported, it is more accurately measured as a boolean indicating whether a format supports it or not.
For a format, streamability is a property of the processor (serializer, parser) that converts a data model instance into the format and vice versa. A processor is streamable if it is able to generate correct partial output from partial input. A format is streamable if it is possible to implement streamable processors for it. (Note: in some industries this property is referred to as incremental processing.)
Streamability is needed in memory-constrained environments where it is important to be able to handle data as it is generated to avoid buffering of data inside the processor. It is crucial when a document is generated piecemeal by some outside process, possibly with indefinitely long breaks between consecutive parts. Examples of the former requirement are provided by the Web Services for Small Devices and the Multimedia XML Documents for Mobile Handsets use cases. Examples of the latter requirement are provided by the Metadata in Broadcast Systems and the XMPP Instant Messaging Compression use cases.
A precise definition can be derived by assuming that the data model consists of atomic components, which are assembled into documents in some structured manner. The serialization of a document expressed in the data model is then simply a traversal of its atomic components in some defined order with each applicable component being translated to the output stream. Output streamability is the ability to create a correct initial sequence of the output stream from a partial traversal. Input streamability is the ability to create the corresponding partial traversal from such an initial sequence so that the application can process the results of this partial traversal as if it were traversing the complete document.
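Output streamability as defined above can be sketched with a generator-based serializer. The event vocabulary (`start`, `text`, `end`) is an assumed, deliberately simplified model of the atomic components; attributes and namespaces are left out.

```python
from xml.sax.saxutils import escape


def serialize(events):
    """Translate each atomic component to output as soon as it is traversed.

    Nothing is buffered: every event yields its output chunk immediately.
    """
    for kind, value in events:
        if kind == "start":
            yield f"<{value}>"
        elif kind == "text":
            yield escape(value)
        elif kind == "end":
            yield f"</{value}>"


TRAVERSAL = [
    ("start", "log"),
    ("start", "entry"),
    ("text", "a < b"),
    ("end", "entry"),
    ("end", "log"),
]

full_output = "".join(serialize(TRAVERSAL))
# A partial traversal produces a correct initial sequence of the full output.
partial_output = "".join(serialize(TRAVERSAL[:3]))
is_prefix = full_output.startswith(partial_output)
```

Because each chunk depends only on the current event, the serializer runs in constant buffer space and the output of any partial traversal is a prefix of the complete serialization.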
Streamability is also characterized by the amount of buffering that needs to be done in the processors. Buffer space is measured in the number of items in the input: for a serializer, these are the atomic components of the data model; for a parser, the elements (e.g., bytes) of the stream. A requirement for streamability is that both processors be implementable such that they only require constant buffer space, no matter what the input document is or how it is mapped to the data model.
Another important consideration not captured by the above is the need for lookahead in the parser. If the parser is required to look ahead in the input stream to determine where the atom currently being read ends, and it is possible that the lookahead is not available (e.g., due to the serializer concurrently streaming the output), streamability is lost.
Examples of non-streamable formats can be had by considering subsequences of the atomic component traversal during serialization. For some types of sequences it can be beneficial to have the length of the full serialized form of the sequence precede the actual sequence, so the serializer must buffer the whole sequence before outputting anything. If such sequences can be arbitrarily long, this sacrifices output streamability. A concrete example for XML would be having an index at each element start that indicates where the element ends in order to support the 5.1 Accelerated Sequential Access property.
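The buffering cost of a leading length field can be sketched by comparing two hypothetical serializers. The function names, the four-byte length field and the zero-byte end marker are assumptions of this example; the terminated variant further assumes that items never contain the end marker.

```python
def length_prefixed(items):
    """Emit the total length first, then the data.

    The length is only known once every item has been seen, so the whole
    sequence must be buffered before the first byte can be output.
    """
    buffered = b"".join(items)
    yield len(buffered).to_bytes(4, "big")
    yield buffered


def terminated(items, end=b"\x00"):
    """Emit each item as it arrives, followed by an end marker."""
    for item in items:
        yield item
    yield end


def items_consumed_before_first_output(serializer, items):
    """Count how many input items a serializer reads before producing output."""
    consumed = 0

    def counting():
        nonlocal consumed
        for item in items:
            consumed += 1
            yield item

    next(serializer(counting()))  # pull only the first output chunk
    return consumed


ITEMS = [b"a", b"bb", b"ccc"]
needed_with_prefix = items_consumed_before_first_output(length_prefixed, ITEMS)
needed_with_marker = items_consumed_before_first_output(terminated, ITEMS)
```

With the length prefix the serializer consumes all three items before emitting anything, whereas the terminated form emits output after the first item; for arbitrarily long sequences only the latter keeps buffer space constant.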
The buffer space requirement precludes some serialization techniques, e.g., compression over the whole document. This shows a trade-off between streamability and the 5.2 Compactness property. The example above indicates that building an element index for accelerated sequential access on the fly may not be possible in all cases.
Streamability (both input and output) is always considered relative to a data model. Once the data model is fixed, streamability is defined to be a boolean property: a format/processor is either streamable or not streamable.
This property requires that error correcting codes can be applied to the representation of XML data model instances. Error correcting codes applied to a format make it possible (a) to identify a section in which format errors can be located and (b) to recover the undistorted section from the erroneous one.
Representation formats of XML data model instances can be categorized in three classes: (i) no partitioning of the representation is possible, (ii) partitioning of the representation is possible and (iii) partitioning according to the importance of the information is possible.
Error correction requires that redundancy is contained in the format to allow for recovery even if errors occurred during transmission. The redundancy serves on the one hand to identify that an error has occurred and, on the other hand, in certain circumstances to correct the error.
Various algorithms exist that insert redundancy so that a decoder is capable of detecting and potentially correcting errors. These techniques are called forward error correction (FEC) since they do not require further backward communications between the receiver and the original sender. Examples of block based FEC algorithms include Hamming and Reed-Solomon codes; an example of a continual FEC is the Turbo Code.
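As an illustration of block-based forward error correction, the following sketch implements the classic Hamming(7,4) code, which adds three parity bits to four data bits and corrects any single-bit error in the resulting seven-bit codeword. The function names are chosen for this example.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # the syndrome is the 1-based position
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_pos


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
corrupted = codeword.copy()
corrupted[4] ^= 1  # a single bit error at position 5 during transmission
recovered, position = hamming74_decode(corrupted)
```

The receiver both locates the error and restores the original data without contacting the sender. Two or more bit errors within one codeword exceed what this code can correct.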
In general the error correction requires methodologies known as channel coding. Usually these algorithms are applied separately from those of source coding for efficient, redundancy free information representation. Handling them separately enables the adaptation of the channel coding, i.e., the insertion of redundancy into the representation tuned to the expected channel characteristics. For instance a channel with large error bursts might require applying interleaving, i.e., a defined re-sorting of bits, so the error bursts are distributed over a larger section of the bit stream. This enables forward error correction mechanisms to be also applied in case of channels with error bursts such as wireless channels.
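A simple block interleaver shows how a burst is spread out. In this sketch symbols are written row by row into a block and read out column by column; the block dimensions and the assumption that each row forms one protected codeword are illustrative.

```python
def interleave(symbols, rows, cols):
    """Write row by row into a rows x cols block, read column by column."""
    if len(symbols) != rows * cols:
        raise ValueError("symbol count must equal rows * cols")
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(symbols, rows, cols):
    """Invert interleave(): the same operation on the transposed block."""
    return interleave(symbols, cols, rows)


# Twelve symbols, labelled by their original index, arranged as three
# codewords (rows) of four symbols each.
original = list(range(12))
sent = interleave(original, rows=3, cols=4)

# A burst corrupts three consecutive symbols on the channel.
burst = set(sent[3:6])
# After de-interleaving, count how many corrupted symbols land in each row.
errors_per_row = [sum(1 for i in burst if i // 4 == r) for r in range(3)]
```

The three-symbol burst ends up as one error in each of the three codewords, which a code able to correct one error per codeword can repair; without interleaving the same burst would concentrate three errors in a single codeword.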
To support error correction, an XML format has to interface with common channel coding algorithms. For this interfacing, the format shall allow partitioning of the representation according to the importance of the represented information, so as to support unequal error protection. For instance, in EPG data, rights information might be ranked as more important than names of actors. Accordingly, being 5.12 Fragmentable is a prerequisite to support error correction.
A format is transport independent if the only assumptions of transport service are error-free and ordered delivery of messages without any arbitrary restrictions on the message length.
Formats should be independent from the transport service. A format must state its assumptions (if any) about characteristics of the transport service in addition to error-free and ordered delivery of messages without any arbitrary restrictions on the message length. Protocol binding specifies how a format is transmitted as payload in a specific transport (e.g., TCP/IP) or messaging (e.g., HTTP) protocol.
Forward compatibility supports the evolution to new versions of a format. XML has changed very little while data models have continued to evolve. Additional specifications have added numerous conventions for linking, schema definition, and security. A format must support the evolution of data models and must allow corresponding implementation of layered standards. Evolution of XML and its data models could mean additional character encodings, additional element/attribute/body structure, or new predefined behavior similar to ID attributes. Examples might be more refined intra-document pointers, type or hinting attribute standards, or support for deltas. An implementation should indicate how certain classes of model changes might be implemented. This resilience relates to properties like 5.10 Extension Points, 5.18 No Arbitrary Limits and 5.11 Format Version Identification.
A requirement on XML was "It shall be easy to write programs which process XML documents." This property covers the implementation of a generic tool chain, but not any application-specific processing code.
A low implementation cost may contribute to 6.5 Widespread Adoption in that if tools to process the format need to be implemented as a part of an application (e.g., because they do not exist for the target platform), a low-cost format is more likely to be adopted. To fulfill this requirement the format needs to be easy enough to implement so that this additional implementation is not an impediment to the actual application development. However, low implementation cost is not necessary to achieve widespread adoption.
A rough estimate of implementation cost can be made by considering how much time it takes for a solitary programmer to implement sufficiently robust processing of the format (the so-called Desperate Perl Hacker measure). A proposed upper limit in the case of XML was a couple of hacking sessions (though this limit has proven to be too optimistic). An alternate format needs to do at least as well as XML, and preferably better, to fulfill this requirement.
Another factor to consider is the kind of code that must be written to implement the format. If either input or output require sophisticated algorithms (e.g., specialized compression not ubiquitously available), this increases the format's implementation cost. If the format has several optional properties and, e.g., size-decreasing special serialization possibilities, the number of required possible code paths in the processors increases. This makes the processors harder to test comprehensively, and hence contributes to their fragility, requiring more time for a robust implementation.
A lowering of implementation cost for an alternate representation can be achieved by making its processing be XML-compatible at as low a level as possible. This helps by making it possible to utilize existing XML-based infrastructure to a larger extent. However, an incompatible interface at any level may allow more efficient handling than an existing one that may be unnatural for the format, thus justifying a higher implementation cost.
Free in the context of an XML format means that it is free to create and use the format, and that the right to build tools and applications for the format is completely unencumbered and royalty-free.
If the format is unencumbered and royalty-free it can be recommended by the W3C and stands a better chance for adoption across the industry. These conditions can positively affect the potential for ubiquitous use of the format. A free format is also more likely to have free, open source code for processing it and free tools for building applications which use it, especially when its 6.2 Implementation Cost is low. This is another factor in the potential for ubiquitous use of the format.
There must be a single conformance class. Thus, a compliant implementation must support all features defined in the specification —yet it is not required to employ them all in every usage instance.
All implementations must demonstrate interoperability with a test suite testing each feature. Having only a single conformance class tends to lower implementation cost, decrease complexity for application development and support ubiquitous implementation. In the past, lite versions of standards have been created to support devices of limited capability. The need for these provisos at the standard level is rapidly decreasing as even commodity devices are quickly growing in capability.
A format is more ubiquitous to the extent it has been implemented on a greater range of computing devices (i.e., different architectures), on a greater number of computing devices (i.e., millions of devices implementing a single architecture), and used in a wider variety of applications.
While ubiquity can be measured for existing formats, it cannot be used as a point of comparison with new formats which are, by definition, not yet ubiquitous. Ubiquity may be predicted by 6.2 Implementation Cost. A low implementation cost format is likely to create an environment where a very large community of developers will be willing and able to create a critical mass of tools, with a resultant feedback and amplification effect that leads the marketplace toward ubiquitous implementation. XML is considered by most to be a good example of a low cost/complexity format leading to ubiquity.
On the other hand, a high implementation cost format that can meet the environmental constraints of a large number of devices, applications, and/or use cases could also stand a good chance of becoming ubiquitous due to economically-motivated industry commitment. Counterexamples where ubiquity was obtained by trading off, to a greater or lesser extent, ease of implementation include PDF and MPEG-3. While each of these certainly has significant costs of implementation, each also addresses the needs of its user domain so well that each has attained ubiquity in that domain.