W3C

Efficient XML Interchange (EXI) Extended String

Editor's Draft 04 November 2016

This version:
??publoc??
Editors:
Takuki Kamiya, Fujitsu Laboratories of America, Inc.
Daniel Peintner, Siemens AG

Abstract

This document describes an extended String datatype representation that can be used with the EXI 1.0 format by leveraging strings from an XML schema, pre-populating initial string values, and/or splitting strings into consecutive character sequences.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document has been produced by the Efficient XML Interchange Working Group as part of the W3C XML Activity. The goals of the Efficient XML Interchange (EXI) Format are discussed in the Efficient XML Interchange (EXI) Format document. The authors of this document are the members of the Efficient XML Interchange Working Group.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Please send comments about this document to the public-exi@w3.org mailing list (Archives).


1. Introduction

The EXI String datatype representation is a length-prefixed sequence of characters. The length usually indicates the number of characters in the string.

Moreover, EXI uses a string table to assign "compact identifiers" to some string values (e.g., value content items). Each value content item is assigned to two partitions, a "local" value partition and the global value partition. When a string value is found in the global or "local" partition, it is represented using a compact identifier. In the case of a string table the length prefix gets a slightly extended meaning.

Note:

The compact identifier is encoded as an n-bit unsigned integer (7.1.9 n-bit Unsigned Integer), where n is ⌈ log₂ m ⌉ and m is the number of entries in the associated partition.
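
As a non-normative illustration of the formula in the note above, the following Python sketch computes the bit width n for a partition with m entries. The function name compact_id_bits is an assumption made for this example and does not appear in the EXI specification.

```python
def compact_id_bits(m: int) -> int:
    """Return n = ceil(log2(m)), the bit width of a compact identifier.

    m is the number of entries in the associated partition (m >= 1).
    Integer arithmetic is used to avoid floating-point rounding.
    """
    if m < 1:
        raise ValueError("a partition needs at least one entry")
    # (m - 1).bit_length() equals ceil(log2(m)) for every m >= 1.
    return (m - 1).bit_length()
```

Under this sketch a partition with a single entry needs zero bits, a partition with five entries needs three bits, and a partition with eight entries also needs three bits.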

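In many use cases the String representation is sufficient. Some use cases, though, require an extended mechanism to represent string values by leveraging strings from an XML schema, pre-populating initial string values, and/or splitting strings into consecutive character sequences.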

This document provides an extended String representation by specifying a user-defined datatype representation, namely exi:estring, which extends the range of pre-defined String length identifiers.

2. Concept

Table 2-1 describes the meaning of the length identifier for the user-defined datatype representation exi:estring.

Length identifiers zero (0) and one (1) remain as is and refer to the "local" and global value partitions, respectively. What changes is that when a string value S is not found in the global or "local" value partition, its string literal is encoded as a String with the length L + 6 (incremented by six), where L is the number of characters in S.

Length identifier two (2) is described in 3. Grammar String, length identifier three (3) is described in 4. Shared String, and length identifier four (4) is described in 5. Split String.

Table 2-1. Extended String Length Identifier
Editorial note 
Does it make sense to leave space for future extensions by undefined entries (see length identifier five (5))?
Length Identifier (Unsigned Integer)    Effect
0                                       see "local" value partition
1                                       see global value partition
2                                       grammar string (e.g., xsd:enumeration)
3                                       shared string (e.g., EXI Options)
4                                       split string
5                                       < undefined >
6                                       String literal: encoded as a String with the length L + 6 (incremented by six) where L is the number of characters in the string value.
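
The following Python sketch illustrates, non-normatively, how the length identifiers of Table 2-1 might be interpreted and how a string literal could be written with the L + 6 length prefix. It assumes the EXI 1.0 Unsigned Integer layout (seven bits per octet, least significant group first, high bit set when more octets follow), assumes that each character is written as its Unicode code point, and produces byte-aligned output for readability. All function and constant names are assumptions made for this example.

```python
RESERVED_IDENTIFIERS = 6  # identifiers 0..5 are reserved by Table 2-1


def encode_unsigned_integer(value: int) -> bytes:
    """Encode a non-negative integer as an EXI Unsigned Integer."""
    if value < 0:
        raise ValueError("an Unsigned Integer cannot be negative")
    out = bytearray()
    while True:
        group = value & 0x7F          # lowest seven bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # more octets follow
        else:
            out.append(group)         # last octet
            return bytes(out)


def encode_string_literal(s: str) -> bytes:
    """Encode a string that missed both partitions: prefix L + 6, then code points."""
    out = bytearray(encode_unsigned_integer(len(s) + RESERVED_IDENTIFIERS))
    for ch in s:
        out += encode_unsigned_integer(ord(ch))
    return bytes(out)


def describe_length_identifier(identifier: int) -> str:
    """Map a decoded length identifier to its effect per Table 2-1."""
    if identifier < 0:
        raise ValueError("a length identifier cannot be negative")
    if identifier >= RESERVED_IDENTIFIERS:
        return f"string literal of {identifier - RESERVED_IDENTIFIERS} character(s)"
    effects = {
        0: '"local" value partition',
        1: "global value partition",
        2: "grammar string",
        3: "shared string",
        4: "split string",
        5: "undefined",
    }
    return effects[identifier]
```

With these assumptions, encode_string_literal("abc") yields the octets 09 61 62 63: the prefix 9 is the length 3 incremented by six, and a decoder reading the identifier 9 recovers a literal of three characters.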

3. Grammar String

When the CH event type matched while processing a string value is schema-informed and the production's event code is of length 1 (one), grammar strings may be available for representing the string value. One or more grammar strings can be associated with each schema datatype [XS2]. Given the schema datatype [XS2] in effect at the time of processing a string that matched such a CH event type, it is the grammar strings (if any) associated with that schema datatype [XS2] that can be used for representing the string value.

In XML Schema, a datatype [XS2] definition can be given a list of grammar strings inside its annotation element in the form shown below. The variable m is the number of grammar strings defined for the datatype [XS2], and S_i is the i-th grammar string.

<xsd:simpleType>
  <xsd:annotation>
    <xsd:appinfo>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="S_0" />
        <xsd:enumeration value="S_1" />
        ...
        <xsd:enumeration value="S_m-1" />
      </xsd:restriction>
    </xsd:appinfo>
  </xsd:annotation>
  ...
</xsd:simpleType>

When a grammar string is used to represent a string value, the value is encoded as an n-bit Unsigned Integer where n = ⌈ log₂ m ⌉ and m is the number of items in the list of grammar strings associated with the schema datatype [XS2] in effect.
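
A minimal, non-normative Python sketch of this lookup follows. It assumes that the n-bit value is the zero-based position of the string in the list and that the list keeps the order of the xsd:enumeration elements; neither point is settled by this draft. The function name grammar_string_code is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def grammar_string_code(
    value: str, grammar_strings: Sequence[str]
) -> Optional[Tuple[int, int]]:
    """Return (index, n) when value is a grammar string, otherwise None.

    index is the assumed zero-based position of value in the list and
    n = ceil(log2(m)) is the bit width for a list of m grammar strings.
    """
    if value not in grammar_strings:
        return None  # caller falls back to another representation
    index = list(grammar_strings).index(value)
    n = (len(grammar_strings) - 1).bit_length()
    return index, n
```

For a datatype with the three grammar strings "red", "green" and "blue", the value "green" would be represented by index 1 written in 2 bits, while a value such as "pink" is not a grammar string and has to be represented by one of the other length identifiers.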

Editorial note 
Do we only use enumerated values that match String? In case of preserveLexicalValues, other datatypes might also be useful.
Editorial note 
What about other existing string entries such as the uri, prefix, and local-name string table partitions?
Editorial note 
[DB] proposed to use other strings from XML schema.
Editorial note 
Do we want to sort the list in alphabetic order?
Editorial note 
How do we want to define grammar strings for derived datatypes?

4. Shared String

This section describes how well-known values can be leveraged for string value encoding.

@TODO describe how the fixed list of entries is created.

<header>
 <lesscommon>
   <uncommon>
    <string>foo</string>
    <string>bla</string>
   </uncommon>
 </lesscommon>
</header>
Editorial note 
Do we really want to define a solution/proposal that makes use of string values in the EXI options? By not integrating the EXI options (i.e., out-of-band), strings are shared without any overhead. The idea would be to extend the EXI Options schema in a future version to allow such string entries.

5. Split String

This section defines how a string can be split into multiple consecutive strings. On the one hand it MAY reduce memory and/or buffer requirements, and on the other hand it MAY increase the likelihood of string table value hits.

Similar to the List datatype, string values are encoded as a length-prefixed sequence of string values. The length is encoded as an Unsigned Integer and each value is encoded according to the exi:estring datatype.
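
The structure of a split string can be sketched, non-normatively, in Python as shown below. The fixed chunk size is only one possible splitting strategy and is an assumption of this example, as are the names split_string and model_split_string; the draft does not prescribe where a string is split. The sketch models the logical items (length identifier, count, parts) rather than the octets written to the stream.

```python
from typing import List, Tuple

SPLIT_STRING_IDENTIFIER = 4  # length identifier for split strings (Table 2-1)


def split_string(value: str, chunk_size: int) -> List[str]:
    """Cut value into consecutive parts of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]


def model_split_string(value: str, chunk_size: int) -> Tuple[int, int, List[str]]:
    """Return (length identifier, number of parts, parts).

    Each part would in turn be encoded according to exi:estring, so a part
    that already sits in a value partition can become a compact identifier.
    """
    parts = split_string(value, chunk_size)
    return SPLIT_STRING_IDENTIFIER, len(parts), parts
```

Concatenating the parts in order restores the original string value, so a decoder only needs the count and the parts themselves.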

Editorial note 
What is a good approach to limit recursive use of exi:estring constructs?

A References

Efficient XML Interchange (EXI) Format 1.0 (Second Edition)
Efficient XML Interchange (EXI) Format 1.0 (Second Edition), John Schneider, Takuki Kamiya, Daniel Peintner, Rumen Kyusakov, Editors. World Wide Web Consortium. The latest version is available at http://www.w3.org/TR/exi/. (See http://www.w3.org/TR/2014/REC-exi-20140211/.)