NormalizationProposal

Proposal

Unicode canonical equivalence potentially affects a number of Recommendation-track documents. A consistent approach to the normalization of canonically equivalent text sequences is necessary to ensure the global accessibility of W3C technologies.

The Internationalization WG proposes the following:

  1. Core specifications, such as XML, HTML, and CSS, MUST define whether canonically equivalent sequences are considered identical or not.
    1. They SHOULD require canonically equivalent sequences to be considered identical, since users often cannot control how their keystrokes are converted into Unicode code points (either initially or due to conversion from a legacy encoding by a separate process).
    2. If canonical equivalence is not required, the specification MUST include a health warning and a recommendation to use consistent code point sequences, preferably NFC.
  2. Specifications that perform string matching for equality MUST specify that the comparison is done as if all strings were converted to Unicode Normalization Form C prior to the comparison. Note that actual normalization is not required and that a variety of performance-boosting strategies are available here.
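
To illustrate item 2, here is a minimal sketch of matching strings "as if" they were normalized to NFC, without unconditionally normalizing them. It is written in Python and assumes Python 3.8 or later for unicodedata.is_normalized(); the function name nfc_equal is ours, purely for illustration.

    import unicodedata

    def nfc_equal(a: str, b: str) -> bool:
        """True if a and b match under Unicode canonical equivalence."""
        if a == b:
            return True                 # fast path: identical code points
        # Quick check: two strings already in NFC can be compared directly,
        # so the (rare) normalization step is skipped for most real content.
        if unicodedata.is_normalized("NFC", a) and unicodedata.is_normalized("NFC", b):
            return False
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    assert nfc_equal("\u00e9", "e\u0301")   # precomposed vs. decomposed 'é'

The quick check is one of the performance-boosting strategies alluded to above: because the overwhelming majority of content is already in NFC, the normalization step is almost never taken.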

Impact:

  • We would want an XML 1.0 sixth edition to define canonical equivalence as a requirement.
  • We want HTML 5 to define canonical equivalence as a requirement (?)
  • Failing that, CSS Selectors and any other specification that does string matching would be required to implement canonically equivalent matching.

Background

Unicode encodes a number of compatibility characters whose purpose is to enable interchange with legacy encodings and systems. In addition, Unicode uses characters called "combining marks" in a number of scripts to compose glyphs (visual textual units) that have more than one logical "piece" to them. In both cases, the same semantic "character" can be encoded by more than one Unicode character sequence.

A trivial example is the character 'é' (a Latin small letter 'e' with an acute accent). It can be encoded as U+00E9 or as the sequence U+0065 U+0301. Unicode defines a concept called 'canonical equivalence', under which two strings may be said to be semantically equal even if they do not use the same Unicode code points (characters) in the same order to encode the text. Unicode also defines a canonical decomposition and several normalization forms, so that software can transform any canonically equivalent strings into the same code point sequence for comparison and processing purposes.
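
The following sketch (in Python, using the standard unicodedata module) demonstrates this equivalence concretely:

    import unicodedata

    precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed  = "e\u0301"   # U+0065 'e' + U+0301 COMBINING ACUTE ACCENT

    print(precomposed == decomposed)                                # False: different code points
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: same NFC form
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True: same NFD form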

Unicode also defines an additional level of decomposition, called 'compatibility decomposition', for which additional normalization forms exist. Many compatibility characters, unsurprisingly, have compatibility decompositions. However, characters that share a compatibility decomposition are not considered canonically equivalent. An example of this is the character U+2460 (CIRCLED DIGIT ONE). It has a compatibility decomposition to U+0031 (the digit '1'), but it is not considered canonically equivalent to the number '1' in a string of text.
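
A short sketch of the difference, again in Python: the canonical forms (NFC/NFD) leave the circled digit alone, while the compatibility forms (NFKC/NFKD) fold it to the plain digit.

    import unicodedata

    circled_one = "\u2460"    # U+2460 CIRCLED DIGIT ONE

    print(unicodedata.normalize("NFC", circled_one) == circled_one)   # True: no canonical decomposition
    print(unicodedata.normalize("NFKC", circled_one))                 # '1': compatibility decomposition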

For the purposes of this document, we are discussing only canonical equivalence and canonical decompositions.

Unicode has two conformance clauses about the handling of canonical equivalence that concern us here: C6 and C7. C6 says that implementations shall not assume that two canonically equivalent sequences are distinct [there are legitimate reasons why a process might not conform to C6]. C7 says that *any* process may replace a character sequence with its canonically equivalent sequence.

Normalization and the W3C

Unicode canonical equivalence affects W3C specifications because most W3C document formats are based on Unicode: the formats are either defined directly as a sequence of Unicode code points ("characters") or derived from core language specifications such as XML, HTML, CSS, RDF, and so forth. For an introduction to Unicode terminology as used at the W3C, see the tutorial Character Model for the World Wide Web: Fundamentals.

Example: XML

This document uses XML as the primary example, since XML is in turn used by many other document formats and since it is the specification that most directly addresses these issues.

XML uses the Universal Character Set (Unicode) as one of its basic foundations. An XML document is, in fact, a sequence of Unicode code points, even though the actual bits and bytes of a serialized XML file might use some other character encoding; processing is always specified in terms of Unicode code points (logical characters). Like other W3C Recommendations, XML says nothing about canonical equivalence in Unicode. There are recommendations to avoid compatibility characters (which include some canonical equivalents), and the 'Name' production prevents the various named tokens from starting with certain characters, including combining marks.

Most implementations of XML assume that distinct code point sequences are actually distinct, which is not in keeping with Unicode requirement C6 [1]. That is, if I define one element <!ELEMENT &#xE9; EMPTY> and another element <!ELEMENT &#x65;&#x301; EMPTY>, they are usually treated as separate elements, even though both define an element that looks like <é/> in a document, and even though, according to Unicode, any text process is allowed to convert one sequence into the other. For that matter, a transcoder might produce either sequence when converting a file from a non-Unicode legacy character encoding.
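
The sketch below demonstrates this behavior with a typical non-normalizing parser (Python's ElementTree is used here for illustration; most XML parsers behave the same way):

    import unicodedata
    import xml.etree.ElementTree as ET

    # Two child elements whose names render identically but use
    # different (canonically equivalent) code point sequences.
    doc = "<root><\u00e9/><e\u0301/></root>"
    names = [child.tag for child in ET.fromstring(doc)]

    print(names[0] == names[1])                        # False: compared code point by code point
    print(unicodedata.normalize("NFC", names[0]) ==
          unicodedata.normalize("NFC", names[1]))      # True: canonically equivalent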

One might think that this would be a serious problem. In practice, however, most software systems consistently use a single Unicode representation for most languages/scripts, even though multiple representations are theoretically possible in Unicode. This form is typically very similar to Unicode Normalization Form C ("NFC"), in which as many combining marks as possible are combined with base characters into single code points. (NFC also specifies the order in which combining marks that cannot be combined appear; Unicode normalization forms do not guarantee the absence of combining marks, as some languages/scripts cannot be encoded at all except via combining characters.) As a result, few users encounter issues with Unicode canonical equivalence. A recent survey of the Web concluded that over 99% of all content is in NFC.

However, some languages and their writing systems have features that expose or rely on canonical equivalence. For example, some languages use multiple combining marks, and the order of those marks can vary (see the sketch below). Other languages use multiple accent marks, and their input systems may or may not precompose characters depending on the keyboard layout, operating system, fonts, or the software used to edit the text; Vietnamese is an example. Since canonically equivalent text is (supposed to be) visually indistinguishable, users typically don't care that, for example, their Mac uses a different code point sequence than their neighbor's Windows computer. These languages are sensitive to canonical equivalence and rely on consistent normalization in order to be usable with a technology such as XML. Further, many of these Ur-technologies are now used in combination: a site might use XML for data interchange, XSLT to extract the data into an HTML page for presentation, CSS to style that page, and AJAX for the user to interact with the page.
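
A small sketch of the combining-mark-order case (Python again; the base letter and marks are arbitrary examples, not a specific Vietnamese form):

    import unicodedata

    a = "o\u0323\u0301"   # o + COMBINING DOT BELOW + COMBINING ACUTE ACCENT
    b = "o\u0301\u0323"   # the same marks typed in the opposite order

    print(a == b)                                  # False: different code point order
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))         # True: canonical ordering makes them identical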

CharMod Efforts: A basic history

With this potential problem in mind, eleven (!) years ago the I18N WG started work on a specification to address the use of Unicode in W3C specs. This work is collectively called the "Character Model for the World Wide Web", or 'CharMod' for short [2]. Initially, the WG's approach was to recommend what was termed "early uniform normalization" (EUN). Under EUN, virtually all content and markup was supposed to be in a single normalization form, specifically NFC. The WG identified the need for both the document format (e.g. a file called 'something.xml') and the parsed contents of the document (individual elements, attributes, or content within the document) to be in NFC. This was called "fully normalized" content.

For this recommendation to work, tools, keyboards, operating environments, text editors, and so on would need to normalize data either on input (as with most European languages) or when processing, saving, or interacting with the resulting data. Specifications were expected to require normalization whenever possible. In cases where normalization wasn't required at the document format level, denormalized documents would be 'valid', but could cause a validator to issue warnings, or could be normalized by tools at creation time. The benefit of this approach was that specifications and certain classes of implementation could mostly assume that users had avoided the problems of canonical equivalence by always authoring documents in the same way. Under this approach there would be less need, for example, for CSS Selectors to consider normalization, since both the style sheet (or other source of the selector) and the document tree being matched would use the same code point sequence for the same canonically equivalent text; the user agent could just compare strings. In the few cases where this wasn't so, the user would be responsible for fixing it, and in general the user would have had to construct the problem deliberately in the first place (since their tools and formats would normally have used NFC).

The I18N WG worked on this approach for a number of years, while languages with normalization risks began to develop appreciable computing support and a nascent Web presence. CharMod contained strong recommendations towards normalization, but not requirements (with a notable exception that we'll cover in a moment). Since it didn't matter so long as content remained normalized, and since the most common languages were normalized by default, specifications generally didn't require any level of normalization (although they "should" have done so), implementations generally ignored normalization, tools did not implement it, and so forth.

There was one interesting exception to the recommendations in CharMod: string identity matching required (MUST) the use of normalization (requirement C312). This nod to canonical equivalence was also ignored by most specs and implementations, and thus by most content. It should be noted that CSS Selectors is a string identity matching case, not merely one of the "SHOULD" cases.

From being a mostly theoretical problem, normalization has become something that can be demonstrated in real usage scenarios in real languages. While only a quite small percentage of total content is affected, the problem quite directly impacts specific languages [3]. The I18N WG is engaged in finding out exactly how prevalent the problem is. It is possible that, despite having become a real problem, it is still so strictly limited that it is best dealt with via means other than spec-level requirements.

In early 2005, the I18N WG decided that EUN was not tenable as an approach: so many different system components, technologies, and so forth would be affected by a "requirement" to normalize; some technologies (such as keyboarding systems) were not up to the task; and, as a result, content would not be normalized uniformly. The decision was made to shift the focus from EUN towards a policy something like the following:

  1. Recommend the use of normalized content ("EUN") as a "best practice" for content authors and implementers.
  2. Since content might not be normalized, require specifications affected by normalization to address normalization explicitly.

Surprisingly, none of the requirements are actually changed by this shift in focus. Note that this did not mean that normalization would be required universally; it only meant that WGs would be asked to consider the impact or, in some cases, to change their specifications.

In 2008, at the TPAC, the I18N WG reviewed the current WD with an eye towards finally completing the normalization portion of the work. (A separate working group in the Internationalization Activity had been chartered between 2005 and 2008 to do this work; that working group expired with no progress, and "I18N Core" inherited the unfinished work.) I18N's review revealed that the document in its current state was not sufficient for advising spec, content, or implementation authors about when and how to handle the new "late(r)" normalization. The same review produced general acknowledgement that, based on real content, there was now a significant need for normalization to be handled by W3C specs.

At the very end of 2008, the I18N WG also reviewed the Selectors-API draft produced by WebApps. In reviewing this document, the WG noted that Selectors, upon which the API is based, did not address normalization. Other recent REC-track documents had also been advised about normalization and had ended up requiring the use of NFC internally. However, in the case of Selectors-API, the selectors in question were defined by CSS3 Selectors, which was in a late working draft state. The CSS WG responded to this issue, and a long thread developed on our combined mailing lists, in a wiki, and elsewhere.

Over the past two-plus weeks, the I18N WG has solicited advice and comments from within its own community, from Unicode, and from the various CSS (style), XML, and HTML communities. We have embarked on a full-scale review of what position makes the most sense for the W3C to hold. In our most recent conference call (11 February), we asked members and our Unicode liaison to gather information on the overall scope of the problem on the Web today. We are also gathering information on the impact of different kinds of normalization recommendation. We had expected to complete our review at the 11 February concall, but feel we need an additional week.

There are a few points of emerging consensus within I18N. In particular, if normalization is required, the requirement could probably be limited to identifier tokens, markup, and other formal parts of document formats. Content itself should generally not be required to be normalized (a recommendation should certainly be made, and normalization, of course, is always permitted by users or by some process; see Unicode C7), in part because there are use cases for denormalized content.

The other emerging consensus is that canonical equivalence needs to be dealt with once and for all: WGs should not have the CharMod sword hanging over them, and implementers and content authors should get clear guidance. During this review, I18N is considering all possible positions, from "merely" making normalization a best practice to advocating the retrofitting of normalization onto our core standards (as appropriate; see above).

One of the oft-cited reasons why normalization should not be introduced is implementation performance. Assuming, for the moment, that documents are allowed to be normalized for processing purposes, our experience suggests that the overall performance impact can be limited. There are strategies for checking and normalizing data that are very efficient, in part owing to the relative rarity of denormalized data even in the affected languages; one such strategy is sketched below. This document will not attempt to outline the performance cases for or against normalization, except to note that performance is an important consideration and must be addressed.
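
As an illustration, the following sketch normalizes lazily: it handles the common cases (pure ASCII, or text already in NFC) with a cheap check and only pays for normalization on the rare remainder. It is Python, assumes Python 3.8+ for unicodedata.is_normalized(), and the helper name to_nfc_lazy is ours, not from any specification.

    import unicodedata

    def to_nfc_lazy(s: str) -> str:
        """Return s in NFC, allocating a new string only when necessary."""
        if s.isascii():                           # ASCII text is always in NFC
            return s
        if unicodedata.is_normalized("NFC", s):   # quick check, no allocation
            return s
        return unicodedata.normalize("NFC", s)    # rare slow path

Because denormalized data is rare even in the affected languages, the slow path is almost never taken, which is why the overall performance impact can be kept small.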