I18N-ACTION-40: CSS Selectors and Normalization (new response)

In several recent Internationalization WG Teleconferences, we discussed the question of normalization in CSS Selectors. Following is a proposed response.

---
This is a personal summary meant to help address ACTION-39 and ACTION-40 from our calls. This is a last draft based on our most recent WG consensus position. The question is whether CSS Selectors [3] should apply Unicode Normalization when matching elements, attributes, and other text values. For example, if I write a selector:

  .\C5land { color : green }  /* U+00C5 is Å */

Which of the following does it select:

<p class="&#xC5;land">same form</p>
<p class="A&#x30a;land">combining ring above; i.e. matched by NFC</p> 
<p class="&#x212b;land">angstrom sign; a canonical (singleton) equivalent of U+00C5, i.e. also matched by NFC</p>

We looked at various solutions as a Working Group.

Historically, as in CharMod, this WG has championed "early uniform normalization" ("EUN"), that is, a recommendation that documents be authored, saved, stored, and interchanged in a consistent normalization form. Generally this is recommended to be Unicode Normalization Form C (aka "NFC"). However, beginning six years ago, this WG recognized that EUN, as a strategy, had failed. While this WG continues to recommend the creation and interchange of documents in a consistent normalization form, we don't believe that it should be a normative requirement for any REC nor should document formats rely on EUN.

Content authors are thus responsible for writing out files using a consistent normalization form. As a spec, CSS cannot directly specify how tools, keyboards, input methods, etc. work. The most CSS can say is "pre-normalization of all input is assumed". The DOM, JavaScript, CSS, HTML, and other document formats do nothing to ensure a match: strings are compared code point by code point. In the above example, the selector matches only the first item in the HTML fragment shown. Interoperability is thus dependent on everyone using the same normalization form at all times.
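
To make the current behavior concrete, here is a minimal sketch in Python (used purely for illustration; the variable names are hypothetical and this is not how any user agent is implemented). It compares the selector's identifier against each class value from the fragment above with no normalization applied:

```python
# Identifier produced by the selector .\C5land (U+00C5 followed by "land").
selector_ident = "\u00c5land"

# The three class attribute values from the HTML fragment above.
class_values = {
    "precomposed": "\u00c5land",   # &#xC5;land
    "combining": "A\u030aland",    # A&#x30a;land
    "angstrom": "\u212bland",      # &#x212b;land
}

# Plain code point comparison: no normalization of either side.
matches = {name: value == selector_ident for name, value in class_values.items()}

for name, matched in matches.items():
    print(name, matched)
```

Only the precomposed value compares equal, even though all three render identically for most users.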

When we consider the problem of selectors, however, the I18N WG feels strongly that normalization remains an important consideration. In most cases, the differences between normalized and non-normalized input are invisible to the user or are, at least, very difficult to discern. Selectors may not reside in the same file as the content they match (since the stylesheet or a JavaScript script can be an entirely separate file) and may only exist at runtime, given dynamic page assembly. A number of real world cases were cited by WG members for comparison issues arising from string denormalization.

Therefore, we would like to recommend that:

CSS define normatively that "two selectors compare as equal if they are canonically equivalent Unicode strings", that is, that the comparison must be done as if one of the canonical Unicode Normalization Forms (either NFC or NFD) had been applied to both prior to the comparison. This does not require that identifiers be interned, normalized, or use any particular normalization form, only that the comparison occur between normalized code point sequences.
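
As an illustration of the "as if" comparison described above, the following is a rough Python sketch, not proposed spec text. The helper name `canonically_equal` is hypothetical; `unicodedata.normalize` is the Python standard library function. Both operands are normalized only for the duration of the comparison, and the stored strings are left untouched. Note that in the Unicode character data U+212B ANGSTROM SIGN has a canonical (singleton) decomposition to U+00C5, so it compares equal under either canonical form.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings as if both had been normalized to NFD.

    NFC would work equally well; the choice of NFD here is arbitrary.
    The inputs themselves are never modified or stored in normalized form.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


selector_ident = "\u00c5land"   # from .\C5land

print(canonically_equal(selector_ident, "\u00c5land"))   # precomposed
print(canonically_equal(selector_ident, "A\u030aland"))  # combining ring above
print(canonically_equal(selector_ident, "\u212bland"))   # angstrom sign

# A compatibility-only equivalent (U+FB01 LATIN SMALL LIGATURE FI vs "fi")
# is not unified by a canonical comparison.
print(canonically_equal("\ufb01", "fi"))
```

Under this comparison all three class values from the earlier example match the selector, while strings that are only compatibility equivalents still compare as different.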

We would be happy to work with the CSS WG to refine appropriate language for the spec. Please let us know if you have comments or concerns or wish to schedule a joint meeting to discuss this further.

/// .sig etc.

---

Does this capture our thoughts?

Addison


[1] http://www.w3.org/2011/05/04-i18n-minutes.html

[2] http://www.w3.org/wiki/I18N/CanonicalNormalization

[3] http://www.w3.org/TR/css3-selectors


Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 25 May 2011 04:47:30 UTC