Efficient XML Interchange (EXI) Extended String

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document has been produced by the Efficient XML Interchange Working Group as part of the W3C XML Activity. The goals of the Efficient XML Interchange (EXI) Format are discussed in the Efficient XML Interchange (EXI) Format document. The authors of this document are the members of the Efficient XML Interchange Working Group.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Please send comments about this document to the public-exi@w3.org mailing list (Archives).

1. Introduction

The EXI String datatype representation is a length-prefixed sequence of characters. The length usually indicates the number of characters in the string.
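As a rough illustration of the length-prefixed layout, the Python sketch below encodes a string as its character count followed by one code point per character. It assumes the Unsigned Integer layout of the EXI Format specification (7-bit groups, least significant group first, high bit set on every octet except the last) and no restricted character set. The function names are hypothetical, and string table handling is left out.

```python
def encode_unsigned(value: int) -> bytes:
    """Encode a non-negative integer as an EXI-style Unsigned Integer."""
    if value < 0:
        raise ValueError("Unsigned Integer cannot be negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # next 7-bit group, least significant first
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more octets follow
        else:
            out.append(low7)         # last octet has the high bit cleared
            return bytes(out)


def encode_plain_string(text: str) -> bytes:
    """Length prefix (number of characters), then each code point."""
    out = bytearray(encode_unsigned(len(text)))
    for ch in text:
        out += encode_unsigned(ord(ch))
    return bytes(out)
```

For example, `encode_plain_string("hi")` yields the three octets `02 68 69`: the length 2 followed by the code points of `h` and `i`.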

Moreover, EXI uses a string table to assign "compact identifiers" to some string values (e.g., value content items). Each value content item is assigned to two partitions: a "local" value partition and the global value partition. When a string value is found in the global or "local" partition, it is represented using a compact identifier. When a string table is in use, the length prefix therefore takes on a slightly extended meaning:

When a string value is found in the "local" value partition, the String value may be represented as zero (0) encoded as an Unsigned Integer followed by the compact identifier of the String value in the "local" value partition.
When a string value is found in the global value partition, the String value may be represented as one (1) encoded as an Unsigned Integer followed by the compact identifier of the String value in the global value partition.
When a string value S is not found in the global or "local" value partition, its string literal is encoded as a String with the length L + 2 (incremented by two) where L is the number of characters in the string value.

Note:

The compact identifier is encoded as an n-bit unsigned integer (7.1.9 n-bit Unsigned Integer), where n is ⌈ log₂ m ⌉ and m is the number of entries in the associated partition.
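The three cases above, together with the compact identifier width, can be sketched in Python as follows. This is only an illustration: the fields are returned symbolically rather than as bits, the function names are hypothetical, checking the "local" partition before the global one is an assumption, and updates to the string table after a miss are omitted.

```python
def compact_id_width(m: int) -> int:
    """Return n = ceil(log2(m)) for a partition holding m >= 1 entries."""
    if m < 1:
        raise ValueError("partition must contain at least one entry")
    return (m - 1).bit_length()


def encode_value(value, local_partition, global_partition):
    """Return the symbolic fields emitted for one string value.

    ("uint", v)       -> Unsigned Integer v
    ("nbit", v, n)    -> n-bit Unsigned Integer v (compact identifier)
    ("chars", s)      -> the characters of the string literal
    """
    if value in local_partition:      # assumption: local partition is checked first
        width = compact_id_width(len(local_partition))
        return [("uint", 0), ("nbit", local_partition.index(value), width)]
    if value in global_partition:
        width = compact_id_width(len(global_partition))
        return [("uint", 1), ("nbit", global_partition.index(value), width)]
    # Miss: string literal with the length incremented by two.
    return [("uint", len(value) + 2), ("chars", value)]
```

With a "local" partition of three entries the compact identifier takes 2 bits, and with a global partition of five entries it takes 3 bits. A five-character literal that misses both partitions is prefixed with the length 7.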

In many use cases the String representation is sufficient. Some use cases, though, require an extended mechanism to represent string values by

leveraging well-known values from XML Schema (or, respectively, EXI grammars)
exchanging or pre-populating initial string values
splitting longer string values into consecutive strings to increase the likelihood of string table hits

Note:

Partitions containing value items are initially empty.

This document provides an extended String representation by specifying a user-defined datatype representation, namely exi:estring, which extends the range of pre-defined String length identifiers.

2. Concept

Table 2-1 describes the meaning of the length identifier for the user-defined datatype representation exi:estring.

Length identifiers zero (0) and one (1) remain as is and refer to the "local" and global value partitions. What changes is that when a string value S is not found in the global or "local" value partition, its string literal is encoded as a String with the length L + 6 (incremented by six), where L is the number of characters in the string value.

Length identifier two (2) is described in 3. Grammar String, length identifier three (3) is described in 4. Shared String, and length identifier four (4) is described in 5. Split String.

Table 2-1. Extended String Length Identifier

Editorial note
Does it make sense to leave space for future extensions by undefined entries (see length identifier five (5))?

Length Identifier (Unsigned Integer)    Effect
------------------------------------    ------
0                                       see "local" value partition
1                                       see global value partition
2                                       grammar string (e.g., xsd:enumeration)
3                                       shared string (e.g., EXI Options)
4                                       split string
5                                       < undefined >
L + 6                                   String literal: encoded as a String with the length L + 6 (incremented by six) where L is the number of characters in the string value.
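From a decoder's point of view, the length identifier ranges described in this section amount to a small dispatch. The Python sketch below is one possible reading: the function name and the returned labels are hypothetical, and treating the undefined identifier five (5) as an error is an assumption, since this draft leaves its handling open.

```python
# Meaning of the reserved length identifiers of exi:estring.
RESERVED = {
    0: "local-hit",       # compact identifier in the "local" value partition
    1: "global-hit",      # compact identifier in the global value partition
    2: "grammar-string",
    3: "shared-string",
    4: "split-string",
}


def classify_length_identifier(identifier: int):
    """Map a decoded length identifier to (kind, literal_length)."""
    if identifier < 0:
        raise ValueError("length identifier is an Unsigned Integer")
    if identifier in RESERVED:
        return (RESERVED[identifier], None)
    if identifier == 5:
        # Assumption: an undefined identifier is rejected.
        raise ValueError("length identifier 5 is undefined")
    # String literal: the identifier is L + 6.
    return ("literal", identifier - 6)
```

An identifier of 6 therefore denotes the empty string literal, and an identifier of 11 denotes a literal of five characters.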

3. Grammar String

When the CH event type matched while processing a string value is schema-informed and the production's event code has length one (1), grammar strings may be available for representing the string value. One or more grammar strings can be associated with each schema datatype^XS2. When a string matches such a CH event type, only the grammar strings (if any) associated with the schema datatype^XS2 in effect at that time can be used to represent the string value.

In XML Schema, a datatype^XS2 definition can be given a list of grammar strings inside its annotation element in the form shown below. The variable m is the number of grammar strings defined for the datatype^XS2, and S_i is the i-th grammar string.

<xsd:simpleType>
  <xsd:annotation>
    <xsd:appinfo>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value=" S₀ " />
        <xsd:enumeration value=" S₁ " />
        ...
        <xsd:enumeration value=" Sₘ₋₁ " />
      </xsd:restriction>
    </xsd:appinfo>
  </xsd:annotation>
  ...
</xsd:simpleType>

When a grammar string is used to represent a string value, the value is encoded as an n-bit Unsigned Integer, where n = ⌈ log₂ m ⌉ and m is the number of items in the list of grammar strings associated with the schema datatype^XS2 in effect.
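A minimal Python sketch of this n-bit encoding follows. It assumes that the encoded integer is the zero-based position i of the grammar string S_i in the list, which this draft implies but does not state outright. The function name is hypothetical, the result is shown as a string of bits for readability, and the preceding length identifier two (2) is not included.

```python
def encode_grammar_string(value: str, grammar_strings: list) -> str:
    """Return the n-bit index of value as a string of '0'/'1' characters."""
    m = len(grammar_strings)
    if m < 1:
        raise ValueError("no grammar strings are associated with the datatype")
    n = (m - 1).bit_length()              # n = ceil(log2(m))
    index = grammar_strings.index(value)  # raises ValueError if value is absent
    if n == 0:
        return ""                         # a single entry needs no bits
    return format(index, "0{}b".format(n))
```

With the three grammar strings `["red", "green", "blue"]`, n is 2 and `"blue"` is encoded as the bits `10`. A datatype with a single grammar string needs zero bits.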

Editorial note
Do we only use enumerated values that match String? In the case of preserveLexicalValues, other datatypes might also be useful.

Editorial note
What about other existing string entries such as uri, prefix, and local-name string table partitions?

Editorial note
[DB] proposed to use other strings from XML schema.

Editorial note
Do we want to sort the list in alphabetic order?

Editorial note
How do we want to define grammar strings for derived datatypes?

4. Shared String

This section describes how well-known values can be leveraged for string value encoding.

@TODO describe how the fixed list of entries is created.

<header>
  <lesscommon>
    <uncommon>
      <string>foo</string>
      <string>bla</string>
    </uncommon>
  </lesscommon>
</header>

Editorial note
Do we really want to define a solution/proposal that makes use of string values in the EXI Options? By not integrating the EXI Options (i.e., out-of-band), strings are shared without any overhead. The idea would be to extend the EXI Options schema in a future version to allow such `string` entries.

5. Split String

This section defines how a string can be split into multiple consecutive strings. On the one hand, it MAY reduce memory and/or buffer requirements; on the other hand, it MAY increase the likelihood of string table value hits.

Similar to the List datatype, string values are encoded as a length-prefixed sequence of string values. The length is encoded as an Unsigned Integer and each value is encoded according to the exi:estring datatype.
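The Python sketch below illustrates how splitting can turn repeated fragments into string table hits. Several points are assumptions rather than part of this draft: the fixed-size splitting strategy, the single simplified table that stands in for the "local" and global value partitions, adding each missed part to that table, and emitting length identifier four (4) before the part count. All names are hypothetical and the fields are returned symbolically.

```python
def split_chunks(value: str, chunk_len: int) -> list:
    """Split value into consecutive parts of at most chunk_len characters."""
    if chunk_len < 1:
        raise ValueError("chunk_len must be positive")
    return [value[i:i + chunk_len] for i in range(0, len(value), chunk_len)]


def encode_split(value: str, chunk_len: int, table: list) -> list:
    """Return symbolic fields for a split string; table is updated in place."""
    parts = split_chunks(value, chunk_len)
    # Assumption: identifier 4 announces the split string, then the part count.
    fields = [("uint", 4), ("uint", len(parts))]
    for part in parts:
        if part in table:
            # Simplified hit: position in the single stand-in table.
            fields.append(("hit", table.index(part)))
        else:
            # Miss: exi:estring literal with the length incremented by six.
            fields.append(("literal", len(part) + 6, part))
            table.append(part)  # assumption: misses are added to the table
    return fields
```

Starting from an empty table, `encode_split("abcabcxy", 3, table)` emits the first `abc` as a literal, the second `abc` as a hit, and `xy` as a literal, so the repeated fragment is written out only once.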

Editorial note
What is a good approach to limit recursive use of exi:estring constructs?