This document describes requirements for some important aspects of the character model for W3C specifications. The two aspects discussed are string identity matching and string indexing. Both aspects are considered to be vital for the seamless interaction of many components of the current and future web architecture.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is being published as a Working Group note in order to capture and preserve historical information. It contains requirements elaborated in 1998 for aspects of the character model for W3C specifications. It was developed and extensively reviewed by the Internationalization Working Group, and is being published by its successor, the Internationalization Core Working Group, part of the W3C Internationalization Activity. The wording of the 1998 version remains unchanged (except for correction of a small number of typographic errors), but the links to references have been updated prior to this publication.
Comments on this document can be sent to firstname.lastname@example.org (publicly archived), but it should be borne in mind that the note is being published to preserve historical information, and the viewpoints expressed in the document should be considered in that light.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Since [RFC 2070], [ISO 10646]/[Unicode] (hereafter denoted as UCS, Universal Character Set) has served as a common reference for character encoding in W3C specifications (see [HTML 4.0], [XML 1.0], and [CSS2]). This choice was motivated by the fact that the UCS:
As long as data transfer on the WWW was primarily unidirectional (from server to browser), and the main purpose was rendering, the direct use of the UCS as a common reference posed no problems.
However, from early on, the WWW included bidirectional data transfer (forms,...). Recently, purposes other than rendering are becoming more and more important. The WWW has traditionally been seen as a collection of applications exchanging data based on protocols. It can however also be seen as a single, very large application [Nicol]. The second view is becoming more and more important due to the following developments:
In this context, some properties of the UCS become relevant and have to be addressed. It should be noted that such properties also exist in legacy encodings, and in many cases have been inherited by the UCS in one way or another from such legacy encodings. In particular, these properties are:
This means that in order to ensure consistent behavior on the WWW, some additional specifications, based on the UCS, are necessary.
This document is written as part of the work of the I18N WG to provide internationalization guidelines for the authors of W3C specifications. Because of the importance of consistent behavior for the WWW, it should be expected that the resulting guideline components will become mandatory for W3C specifications.
The specifications that will be developed based on this document have a very wide range of potential users, which are listed below in three categories. For some of the users listed here, a short description of what they do and how the requirements described in this document are thought to apply to them is given in the Appendix. A need for specifications in the areas addressed by this document has directly been expressed, in particular at the Query Language Meeting in April 1998 in Brisbane (see the W3C member-only link to the meeting report), by the following W3C Working Groups or specifications:
Within the W3C, it may in addition be useful for:
Outside of the W3C, it may in addition be useful for things such as:
The following sections 2-4 each discuss the requirements for a particular aspect of the WWW character model. Each section in its first subsection briefly describes the problem addressed. The following subsections then discuss the various requirements. Section 2 is devoted to the requirements for string identity matching. Section 3 expands on string identity matching and discusses subrequirements for early uniform normalization, one way to address string identity matching. Section 4 discusses the requirements for string indexing. An appendix gives additional information about some of the users of the specification resulting from this document. A glossary gives additional explanations for some of the terms used in this document.
This document addresses only those parts of the character model that need exact specification and are extremely time-critical. To see exactly which parts are addressed, please see the first subsection of each of the following sections. A more general model, e.g. in the sense of the reference processing model in [RFC 2070], and general guidelines, e.g. similar to those in [RFC 2130] and [RFC 2277] for the work of the IETF, are not discussed here. Nevertheless, something like the reference processing model in [RFC 2070], which requires applications to behave as if they used the UCS, is assumed as a base.
For each problem, this document lists various requirements. Ideally, all requirements would be met equally well, and the degree to which they are being met could be measured equally well. However, some of the requirements take the form of more general design objectives, for which it is difficult to measure the degree to which they have been met. Also, some requirements conflict with each other. Where such conflicts are known, the conflict and a preference (i.e. which requirement has greater weight) are indicated.
String identity matching is a subset of the more general problem of string matching. String matching in general can be done with various degrees of specificity, from very approximate matching, such as regular expressions or phonetic matching for English, to more specific matches, such as case-insensitive or accent-insensitive matching. This document deals only with string identity matching. Two strings match as identical if they contain no user-identifiable distinctions. For more details on the meaning of user-identifiable distinctions, see the following explanations as well as subsection 2.3 and subsection 2.4. Any kind of less specific matching is not discussed in this document.
At various places in the WWW infrastructure, strings, and in particular identifiers, are compared for identity. If different places use different definitions of string identity matching, this results in undesired unpredictability. Such comparisons are unproblematic if the expectations of the users and the results of a simple binary comparison coincide, or can be made to coincide. For ASCII, such a coincidence is established and assumed, including some degree of user education, e.g. about the differences between the digit 0 and the uppercase letter O. For the full repertoire of the UCS, however, the aforementioned coincidence between user expectations and binary comparisons is not a priori guaranteed.
In order to ensure consistent behavior on the WWW, a character model for W3C specifications must make sure that the gap between user expectations and internal operation is bridged. A character model for W3C specifications must therefore specify how the problem of string identity matching is handled. The requirements for such a specification are listed in the following subsections. Please note that with the exception of subsection 2.7 and subsection 2.8, the following subsections assume the character processing model of [RFC 2070], i.e. they assume that applications behave as if they used the UCS internally. The section ends with subsection 2.10, which lays out some alternatives and motivates section 3.
In order to fulfill its purpose, a specification of string identity matching must not contain any ambiguities.
While in some cases, the addition of version numbers might help to make the specification unambiguous, carrying version numbers as parameters is in many cases highly undesirable and should therefore be avoided.
Typical examples where a gap between user expectations and internal operation can occur in the UCS are the duplicate encodings defined as canonical equivalences in [Unicode]. As an example, the UCS allows us to encode "ü" either as a single codepoint (U+00FC, LATIN SMALL LETTER U WITH DIAERESIS) or as the codepoint for "u" (U+0075, LATIN SMALL LETTER U) followed by the codepoint U+0308 (COMBINING DIAERESIS). Such equivalences are artifacts of the encoding method(s) chosen for the UCS.
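The "ü" example can be made concrete with a minimal sketch. It assumes Python's standard `unicodedata` module and uses Unicode Normalization Form C (NFC) purely as a stand-in for whatever normalization a character model might eventually specify; this document does not itself name a normalization form, and the helper name `identical_after_nfc` is hypothetical.

```python
import unicodedata

# The two encodings of "ü" described above.
precomposed = "\u00FC"       # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "\u0075\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS


def identical_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC.

    NFC is an assumption made for this sketch, not a choice made by
    this document.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain codepoint-by-codepoint comparison sees two different strings.
binary_equal = precomposed == decomposed

# After normalization the duplicate encodings collapse into one form.
normalized_equal = identical_after_nfc(precomposed, decomposed)

print(binary_equal, normalized_equal)
```

The binary comparison reports a difference that no user can see, while the normalized comparison matches what a user would expect.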
It is expected that the canonical equivalences specified in the Unicode standard will be an excellent starting point for defining the range of things to be identified as duplicate encodings. This will make sure that the experience of the Unicode Technical Committee with respect to character equivalences is fully leveraged. Whether any changes are necessary will have to be examined more closely. If such changes consist only of additions of equivalences, implementations of W3C specifications would collectively conform to conformance clause C9 given in [Unicode, p. 3-2]:
"A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
Additions may include some presentation forms.
Another category in which encoding differences are invisible to the user is that of the various control codes. W3C standards mostly deal with structured text (as opposed to plain text). It should therefore in most cases be possible to rely on explicit markup rather than on in-stream control codes.
String identity matching shall not treat as equivalent cases that can clearly be distinguished by a user because the difference may be significant in many cases. Examples are:
These differences can be handled by the (mainly native) users of the characters in question, and can at least be identified by users not familiar with the characters in question. Such similarities are explicitly not considered for string identity matching, because they do not need a coordinated solution for the entirety of the WWW.
Various forms of equivalence testing are needed for operations such as searching and sorting. But such operations will not be based on string identity matching. Also, it is felt that such operations do not need to behave uniformly across the web; that on the contrary, it is beneficial to have competition (e.g. for search engines and their user interfaces), that this has already been taken care of elsewhere (e.g. the work of ISO and Unicode on default and tailorable sorting), and that the requirements of language-dependence and user-configurability are stronger than the needs for consistent behavior.
It is impossible to predict what characters might be added to the UCS in the future. String identity matching should be specified so as to try to minimize the impact of future additions to the UCS on the specification and its implementations.
One category of additions that warrants particular attention, both because it has occurred relatively frequently in the past and because it affects string identity matching directly, is the addition of new precomposed forms for which decomposed equivalents are already available.
Because of the increased integration of the WWW, selecting different ways to solve the string identity matching problem for different components of the WWW would produce a fragmentation of users' and implementers' expectations, and the need for constant attention to minute differences that are rarely visible. Applicability to a broad range of W3C specifications and the widest number of components of the WWW means that a solution has to be feasible for all kinds of different systems, and different subsystems of larger applications, with different resources available. This in particular includes very small systems, and systems that do not have continuous network access.
Many components of the WWW have to work with data without access to the actual characters. This includes all kinds of schemes that make use of encryption techniques as well as schemes where the character encoding is in general left undefined, such as URIs [URI]. For things such as URIs, it should be possible to test two strings for identity even if their character encoding is unknown, given of course that in both cases the same character encoding has been chosen. Also, it should be possible to test two strings for identity if the actual data cannot be accessed directly because it is encrypted. Even in cases where the character encoding is known, and the data is accessible, treating data as opaque is often desirable, because an identity check might occur in an architectural component that has (or the implementers of which have) completely different concerns than internationalization. Examples of such components are firewalls and passwords.
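The following sketch illustrates why opaque comparison pushes normalization towards the producer of the data. It assumes a SHA-256 digest as a stand-in for any transformation (encryption, hashing) after which the characters are no longer accessible, and NFC as an assumed normalization form; the function name `opaque_digest` is hypothetical.

```python
import hashlib
import unicodedata


def opaque_digest(text: str, normalize_first: bool) -> bytes:
    """Return a SHA-256 digest of the UTF-8 encoded text.

    When normalize_first is true, the text is normalized to NFC (an
    assumption of this sketch) before it becomes opaque.
    """
    if normalize_first:
        text = unicodedata.normalize("NFC", text)
    return hashlib.sha256(text.encode("utf-8")).digest()


precomposed = "\u00FC"  # "ü" as one codepoint
decomposed = "u\u0308"  # "ü" as base letter plus combining diaeresis

# A component that sees only digests can do nothing but a binary check.
match_without_normalization = (
    opaque_digest(precomposed, False) == opaque_digest(decomposed, False)
)
match_with_normalization = (
    opaque_digest(precomposed, True) == opaque_digest(decomposed, True)
)

print(match_without_normalization, match_with_normalization)
```

Once the data is opaque, the comparing component has no opportunity to apply equivalences itself, so the two digests only agree if normalization happened beforehand.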
An often cited maxim of Internet engineering is "be liberal in what you accept; be conservative in what you send". The use of the appropriate kind of equivalence at the receiving end easily allows you to "be liberal in what you accept". However, without any kind of indication of the preferred way of encoding or the preferred character variant, there is no way to "be conservative in what you send". This means that potential benefits cannot be realized.
Several upcoming W3C specifications depend on a clear and uniform specification for string identity matching. Therefore, no time should be lost in preparing the string identity matching specification.
For a specification for string identity matching, the following issues have to be addressed:
The arguments for why early normalization may be needed, even if only in some cases, can be listed as follows:
be conservative in what you send
It therefore seems appropriate to address the requirements of early normalization in particular. This is done in the next section.
As discussed in subsection 2.10, there is a high probability that early normalization may become necessary, even if only for some selected cases. Early normalization means that data is normalized as close to its origin, or as close to its conversion to the UCS, as possible. This eliminates duplicate representations and other ambiguities. The actual string identity check can therefore be done without taking such ambiguities into account. In order for this to work, however, early normalization has to be uniform, i.e. all components of the WWW that normalize have to do so in one specific way.
In order for W3C specifications to attribute the responsibility for early uniform normalization to specific components, guidelines on where early uniform normalization should occur must be provided. Ideally, uniform normalization would occur at the time of data creation, e.g. by a keyboard driver. However, W3C specifications do not deal directly with things such as keyboard drivers. This means that more appropriate locations for requiring early uniform normalization have to be defined. As an example, it could be required that text transmitted via certain protocols, or text exposed in certain APIs, is normalized.
It should be noted that text is transmitted on the WWW in many encodings not based on the UCS. In these cases, uniform normalization ideally occurs when data is transcoded (or assumed to be transcoded according to the reference processing model of [RFC 2070]) from legacy encodings (such as [ISO 8859] or [ISO 6937]) to the UCS.
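Normalizing at the point of transcoding might look like the sketch below. It assumes Python's built-in codecs for the legacy side, takes NFC as an assumed normalization form, and uses a hypothetical function name `to_normalized_ucs`. ISO 6937 is not used here because Python ships no codec for it.

```python
import unicodedata


def to_normalized_ucs(data: bytes, source_encoding: str) -> str:
    """Transcode bytes to the UCS and normalize in the same step.

    NFC is an assumption of this sketch; this document does not
    choose a normalization form.
    """
    text = data.decode(source_encoding)
    return unicodedata.normalize("NFC", text)


# "Zürich" arriving in ISO 8859-1, where "ü" is the single byte 0xFC.
from_latin1 = to_normalized_ucs(b"Z\xfcrich", "iso-8859-1")

# The same word arriving as UTF-8 with a decomposed "ü".
from_utf8 = to_normalized_ucs("Zu\u0308rich".encode("utf-8"), "utf-8")

# Downstream components can now use a plain binary comparison.
print(from_latin1 == from_utf8)
```

Because both inputs pass through the same normalizing transcoder, later components do not need to know which encoding the text originally arrived in.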
Ideally, early uniform normalization will spread out from the WWW to other parts of the information infrastructure. For example, early uniform normalization may only be specified for text actually sent out by a server, but the task of normalization may be transferred from the server to the document provider, and from there further to the editor tool and even to the keyboard driver. Such a transfer is indeed highly desirable in many cases, because avoiding the generation of unnormalized data is often easier than normalizing such data later.
A wide range of text on the WWW will have to be normalized. This is easier to do if uniform normalization occurs towards the more popular representation than if a not so widely used representation is used as the normal form. It may also provide a bit more time, in that we are just defining what might happen naturally anyway instead of having to fight uphill from day one. Existing standards (such as the canonical ordering behavior for combining characters [Unicode, page 3-9]) should also be considered.
The views of experts on character coding, especially of members of the Unicode Technical Committee and of ISO/IEC JTC1/SC2/WG2 should be sought, with the goal of achieving a broad consensus. This requirement cannot, however, take precedence over all other requirements, especially Requirement 2.9, "The string identity matching specification shall be prepared quickly".
Where choices are available, early uniform normalization should be specified in a way which permits easy and compact implementations. It should however be remembered that the main benefit in terms of implementation simplification is achieved due to the concept of early uniform normalization itself, by relieving a large part of the WWW infrastructure of the need to consider equivalences when making comparisons, and by locating normalization at those places in the WWW architecture where most information on actually occurring codepoint combinations and most internationalization implementation expertise and concern are available.
To help in developing, understanding, implementing, and testing early uniform normalization, reference software shall be developed and provided to the public under W3C copyright. This software will cover all cases, whereas at a given point in the infrastructure (e.g. a transcoder or a keyboard driver), only some cases may have to be taken into account.
To help in developing, understanding, implementing, and testing early uniform normalization, test cases shall be developed and provided to the public under W3C copyright.
On many occasions, in order to access a substring or a character, it is necessary to index characters in a string/sequence/array of characters. Where character indices are exchanged between components of the WWW, there is a need for a uniform definition of string indexing in order to ensure consistent behavior. In the simplest cases, this boils down to questions such as "at which position in a given string is a given character?", "which character is at a given position in a given string?", and even simpler, "what is the length of a given string?".
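Even the length question has several defensible answers, depending on the unit that is counted. The sketch below is only an illustration: the three units shown (codepoints, UTF-16 code units, UTF-8 bytes) are assumptions chosen for the example, not units this document prescribes, and `length_in_units` is a hypothetical helper.

```python
def length_in_units(text: str) -> dict:
    """Report the length of text in three different counting units."""
    return {
        "code_points": len(text),  # Python strings index by codepoint
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# 'a', one character outside the Basic Multilingual Plane (U+10000),
# then "u" followed by COMBINING DIAERESIS.
sample = "a\U00010000u\u0308"

lengths = length_in_units(sample)
print(lengths)

# The position of "u" also depends on the unit being counted.
position_in_code_points = sample.index("u")
position_in_utf16_units = len(sample[:position_in_code_points].encode("utf-16-le")) // 2
print(position_in_code_points, position_in_utf16_units)
```

Two components that exchange an index for this string without agreeing on the unit would point at different characters.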
Note: In many cases, it is highly preferable to use non-numeric ways of identifying substrings. The specification of string indexing for the WWW should not be seen as a general recommendation for the use of string indexing for substring identification. As an example, in the case of translation of a document from one language to another, identification of substrings based on document structure can be expected to be much more stable than identification based on string indexing.
Note: Because of the wide variability of scripts and characters, different operations may be required to work at different levels of aggregation or subdivision. String indexing as discussed in this section is only intended to provide a base for such operations; it cannot address all levels concurrently.
The issue of indexing origin, i.e. whether the first character in a string is indexed as character number 0 or as character number 1, will not be addressed here.
This is the basic functional requirement for indexing. It means that the specification has to be without options.
The basic consistency test is the following:
The requirement is fulfilled if the test is successful for all strings of characters and all combinations of systems.
Tools and programs are supposed to hide most of the indexing values from the end users. However, the fact that direct editing/manipulation was possible was one of the (unexpected) reasons for the success of the WWW. Also, in the complex infrastructure of the WWW, it is impossible to define a clear and strict boundary between what is manipulated by programs and what is seen and manipulated by the users. Therefore, it is highly desirable that something seen as one single character by the user is indeed counted as one character. However, there may be cases where for the same characters, there are differences in the perceptions of users using various languages, or even of users using one and the same language. In this case, an ideal solution is not possible. Preference should be given to a solution which, although not corresponding to user expectations, can be understood by as many users as possible (e.g. "treat each character in the Klingon alphabet as occupying two index positions").
This requirement may be in conflict with requirement 4.6 (because user expectations and actual encoding might be different). Because neither requirement is absolute, no indication of relative priorities has been given here.
Because of the variability of what a "character" can mean in different scripts and to different people (for the same script), string indexing should permit the designation of characters at various levels of resolution appropriate for the task at hand. This can in principle be achieved by indexing on the finest granularity possible, or by indexing of subelements. Although subelement indexing might not be defined in the first version of the character model, and might not be implemented everywhere, the necessary precautions for syntax extensibility and fallbacks should be taken care of and defined up-front wherever applicable.
It is impossible to predict what characters might be added to the UCS in the future. String indexing should be specified so as to try to minimize the impact of future additions to the UCS on the specification and its implementations.
One category of additions that warrants particular attention, both because it has occurred relatively frequently in the past and because it may affect string indexing directly, is the addition of new precomposed forms for which decomposed equivalents are already available.
Indexing into a string of characters is a very frequent operation. Ease of implementation is therefore crucial. If string indexing is based on early uniform normalization, then this may help to make implementation easier.
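As a rough illustration of how normalization could simplify indexing, the sketch below indexes by codepoint after normalizing to NFC. Both choices are assumptions of the example rather than decisions taken in this document, and `char_at` is a hypothetical helper.

```python
import unicodedata


def char_at(text: str, index: int) -> str:
    """Return the codepoint at index after normalizing text to NFC.

    Zero-based indexing is used only because it is Python's convention.
    """
    return unicodedata.normalize("NFC", text)[index]


precomposed = "Z\u00FCrich"  # 6 codepoints
decomposed = "Zu\u0308rich"  # 7 codepoints, same text for the user

# Without normalization, the same index designates different things.
raw_precomposed = precomposed[2]  # "r"
raw_decomposed = decomposed[2]    # the combining diaeresis

# With normalization applied first, both strings give the same answer.
normalized_precomposed = char_at(precomposed, 2)
normalized_decomposed = char_at(decomposed, 2)

print(raw_precomposed == raw_decomposed)
print(normalized_precomposed == normalized_decomposed)
```

This only covers sequences that have a precomposed equivalent. Combining sequences without one still occupy several codepoints after NFC, so normalization alone does not guarantee that every user-perceived character counts as one.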
Several upcoming W3C specifications depend on a clear character model and in particular on clear definitions for string indexing. It is therefore crucial that no time is lost.
This appendix gives some additional details about users of the specification that will result from the requirements in this document. This is intended to give some very short background to readers not familiar with some of the work of the W3C, as well as to make sure that the requirements of these groups are well understood.
Note: The specifications discussed below are still in progress. The summaries are based on the current state, as publicly known. Changes may occur at any time.
This glossary does not provide exact definitions of terms but gives some background on how certain words are used in this document.