Copyright © 2016 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document describes an extended String datatype representation that can be used with the EXI 1.0 format by leveraging strings from an XML schema, pre-populating initial string values, and/or splitting strings in consecutive character sequences.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document has been produced by the Efficient XML Interchange Working Group as part of the W3C XML Activity. The goals of the Efficient XML Interchange (EXI) Format are discussed in the Efficient XML Interchange (EXI) Format document. The authors of this document are the members of the Efficient XML Interchange Working Group.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Please send comments about this document to the public-exi@w3.org mailing list (Archives).
The EXI String datatype representation is a length-prefixed sequence of characters. The length usually indicates the number of characters in the string.
Moreover, EXI uses a string table to assign "compact identifiers" to some string values (e.g., value content items). Each value content item is assigned to two partitions, a "local" value partition and the global value partition. When a string value is found in the global or "local" partition, it is represented using a compact identifier. When a string table is in use, the length prefix therefore takes on a slightly extended meaning.
When a string value is found in the "local" value partition, the String value may be represented as zero (0) encoded as an Unsigned Integer followed by the compact identifier of the String value in the "local" value partition.
When a string value is found in the global value partition, the String value may be represented as one (1) encoded as an Unsigned Integer followed by the compact identifier of the String value in the global value partition.
When a string value S is not found in the global or "local" value partition, its string literal is encoded as a String with the length L + 2 (incremented by two) where L is the number of characters in the string value.
Note:
The compact identifier is encoded as an n-bit unsigned integer (7.1.9 n-bit Unsigned Integer), where n is ⌈ log₂ m ⌉ and m is the number of entries in the associated partition.
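The three cases above, together with the compact identifier width from the note, can be sketched as follows. This is a minimal illustration and not a conformant EXI encoder: the function names, the use of Python lists as partitions, and the symbolic tuple output are assumptions made for the example. Python's `len` counts Unicode code points, which is taken here as the "number of characters".

```python
def compact_id_bits(m: int) -> int:
    """Width n = ceil(log2(m)) for a partition with m >= 1 entries."""
    # (m - 1).bit_length() equals ceil(log2(m)) without floating-point error.
    return (m - 1).bit_length()


def encode_string_value(value, local_partition, global_partition):
    """Return a symbolic (kind, length_prefix, payload, id_bits) tuple.

    The partitions are plain lists here (an assumption of this sketch).
    """
    if value in local_partition:
        # Length prefix 0, then the compact identifier in the "local" partition.
        return ("local", 0, local_partition.index(value),
                compact_id_bits(len(local_partition)))
    if value in global_partition:
        # Length prefix 1, then the compact identifier in the global partition.
        return ("global", 1, global_partition.index(value),
                compact_id_bits(len(global_partition)))
    # Miss: the literal is written with length L + 2.
    return ("literal", len(value) + 2, value, None)
```

For example, a three-character value that misses both partitions is given the length prefix 5, whereas a hit in a two-entry "local" partition needs only the prefix 0 and a 1-bit identifier.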
In many use cases the String representation is sufficient. Some use cases, though, require an extended mechanism to represent string values by:
leveraging well-known values from an XML schema (or, respectively, EXI grammars)
exchanging or pre-populating initial string values
Note:
Partitions containing value items are initially empty.
splitting longer string values into consecutive strings to increase the likelihood of string table hits
This document provides an extended String representation by specifying a user-defined datatype representation, namely exi:estring, that extends the range of pre-defined String length identifiers.
Table 2-1 describes the meaning of the length identifier for the user-defined datatype representation exi:estring.
Length identifiers zero (0) and one (1) remain unchanged and refer to the "local" and global value partitions. What changes is that when a string value S is not found in the global or "local" value partition, its string literal is encoded as a String with the length L + 6 (incremented by six).
Length identifier two (2) is described in 3. Grammar String, length identifier three (3) is described in 4. Shared String, and length identifier four (4) is described in 5. Split String.
Length Identifier (Unsigned Integer) | Effect |
---|---|
0 | see "local" value partition |
1 | see global value partition |
2 | grammar string (e.g., xsd:enumeration) |
3 | shared string (e.g., EXI Options) |
4 | split string |
5 | < undefined > |
6 | String literal: encoded as a String with the length L + 6 (incremented by six) where L is the number of characters in the string value. |
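A decoder's reading of the length identifier in the table above could look like the sketch below. The function and constant names are hypothetical. Two choices in the sketch are assumptions rather than statements of this draft: identifier 5 is rejected because the table marks it as undefined, and every value of 6 or more is read as a literal of L = identifier - 6 characters, which follows from the L + 6 rule.

```python
# Meaning of the fixed exi:estring length identifiers (from the table above).
ESTRING_IDS = {
    0: "local",
    1: "global",
    2: "grammar",
    3: "shared",
    4: "split",
}
LITERAL_OFFSET = 6  # a literal of L characters is written with length L + 6


def interpret_length_identifier(ident: int):
    """Map a decoded Unsigned Integer to (kind, literal_length_or_None)."""
    if ident < 0:
        raise ValueError("length identifier must be non-negative")
    if ident in ESTRING_IDS:
        return (ESTRING_IDS[ident], None)
    if ident < LITERAL_OFFSET:
        # Identifier 5 is undefined; rejecting it is an assumption of this sketch.
        raise ValueError(f"undefined length identifier: {ident}")
    return ("literal", ident - LITERAL_OFFSET)


def literal_length_identifier(value: str) -> int:
    """Length identifier an encoder writes for a string table miss."""
    return len(value) + LITERAL_OFFSET
```

With these definitions the literal "foo" is written with length identifier 9, and a decoder reading 9 expects three characters to follow.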
When the CH event type that matched to process a string value is schema-informed and the production's event code is of length 1 (one), grammar strings may be available for representing the string value. One or more grammar strings can be associated with each schema datatype [XS2]. Given a schema datatype [XS2] in effect at the time of processing a string that matched such a CH event type, it is the grammar strings (if any) associated with that schema datatype [XS2] that can be used for representing the string value.
In XML Schema, a datatype [XS2] definition can be given a list of grammar strings inside its annotation element in the form shown below. The variable m is the number of grammar strings defined for the datatype [XS2], and S_i is the i-th grammar string.
<xsd:simpleType>
  <xsd:annotation>
    <xsd:appinfo>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="S_0" />
        <xsd:enumeration value="S_1" />
        ...
        <xsd:enumeration value="S_m-1" />
      </xsd:restriction>
    </xsd:appinfo>
  </xsd:annotation>
  ...
</xsd:simpleType>
When a grammar string is used to represent a string value, the value is encoded as an n-bit Unsigned Integer, where n = ⌈ log₂ m ⌉ and m is the number of items in the list of grammar strings associated with the schema datatype [XS2] in effect.
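The width calculation can be illustrated with a short sketch. It assumes that the n-bit value is the zero-based position i of the matching grammar string S_i in the list, which this draft does not state explicitly; the function name and the bit-string output are likewise only illustrative.

```python
def grammar_string_code(value, grammar_strings):
    """Return (n, i, bits) for a value found in the grammar string list.

    n    = ceil(log2(m)) with m = len(grammar_strings)
    i    = zero-based position of the value (assumed to be what is encoded)
    bits = i written as an n-bit binary string ("" when n is 0)
    """
    m = len(grammar_strings)
    n = (m - 1).bit_length()         # integer form of ceil(log2(m)), m >= 1
    i = grammar_strings.index(value)  # raises ValueError if the value is absent
    bits = format(i, f"0{n}b") if n else ""
    return (n, i, bits)
```

For a datatype with the three grammar strings "red", "green" and "blue", n is 2 and "green" would be written as the two bits 01, regardless of the length of the string itself.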
Editorial note:
Do we only use enumerated values that match to String? In case of preserveLexicalValues other datatypes might be useful also.
Editorial note:
What about other existing string entries such as uri, prefix, and local-name string table partitions?
Editorial note:
[DB] proposed to use other strings from XML schema.
Editorial note:
Do we want to sort the list in alphabetic order?
Editorial note:
How do we want to define grammar strings for derived datatypes?
This section describes how well-known values can be leveraged for string value encoding.
@TODO describe how the fixed list of entries is created.
<header>
  <lesscommon>
    <uncommon>
      <string>foo</string>
      <string>bla</string>
    </uncommon>
  </lesscommon>
</header>
Editorial note:
Do we really want to define a solution/proposal that makes use of string values in the EXI options? By not integrating the EXI options (i.e., out-of-band), strings are shared without any overhead. The idea would be to extend the EXI Options schema in a future version to allow such string entries.
This section defines how a string can be split into multiple consecutive strings. On the one hand it MAY reduce memory and/or buffer requirements, and on the other hand it MAY increase the likelihood of string table value hits.
Similar to the List datatype, string values are encoded as a length-prefixed sequence of string values. The length is encoded as an Unsigned Integer and each value is encoded according to the exi:estring datatype.
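A rough sketch of the resulting layout is given below. It rests on several assumptions that go beyond the text: the split is announced by length identifier 4 followed by the part count, only the global value partition is modelled (as a Python list), each literal miss is appended to that partition in the way EXI 1.0 adds new string values, and the choice of where to split is left to the caller. All names are hypothetical.

```python
def encode_split_string(parts, global_partition):
    """Return a symbolic token list for a string given as consecutive parts.

    parts            -- the consecutive character sequences of one string value
    global_partition -- list modelling the global value partition (mutated)
    """
    tokens = [("id", 4), ("count", len(parts))]  # split marker, then the length
    for part in parts:
        if part in global_partition:
            # Hit: identifier 1 followed by the compact identifier.
            tokens.append(("id", 1))
            tokens.append(("global-index", global_partition.index(part)))
        else:
            # Miss: literal with length L + 6, then remember it for later hits.
            tokens.append(("id", len(part) + 6))
            tokens.append(("chars", part))
            global_partition.append(part)
    return tokens
```

Splitting "http://example.org/item/42" into "http://example.org/" and "item/42" lets the first part hit the table whenever another value with the same prefix has been seen before, so only the seven characters of the second part are written as a literal.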
Editorial note:
What is a good approach to limit recursive use of exi:estring constructs?