CaseForCaselessCSS

From Internationalization
Revision as of 17:33, 3 January 2013 by Aphillip

The Case for Caseless Comparison in CSS

The Problem

Case-sensitivity in CSS is ill-defined, as noted by this CSS blog entry and in various discussions between CSS and I18N.

Historically, users could write a style sheet like the following and expect it to work with the element shown:


   P.GREEN { COLOR: GREEN }

   <p class="green">I'm green</p>

When character encoding support in CSS was ill-defined, document authors almost always limited their style sheets (and matching values in HTML or XML documents) to use ASCII identifiers for interoperability reasons: there was no guarantee that non-ASCII values would work or work reliably. Thus, the need for a clear definition of "case-insensitive" comparison between the style sheet's tokens and those in the styled document was limited to the 26 letters in the ASCII alphabet.

International support has improved, though, and the artificial practice of restricting style sheets to ASCII is now merely customary rather than imposed by a practical technical limitation. CSS style sheets can now declare a character encoding, for example, and there is much better definition for how characters are interpreted, escaped, and so forth in both the style sheet and in the documents to be styled.

In addition, more content and more style sheets are defined, sometimes automatically, from user data in a world that is centered on Unicode rather than on (for example) Latin-1. As international support has developed a solid foundation, we must again consider the inexact definitions in CSS today and resolve what implementations should do.

Consider the following items:


  P.GRÜN { COLOR:GREEN } # Ü = U+00DC
  p.grün { COLOR:GREEN }
  P.Grün { COLOR:GREEN }
  p.grÜn { COLOR:GREEN }
  p.grÜn { COLOR:GREEN } # Ü = U+0055 U+0308

  <p class="GRÜN">
  <p class="grün">
  <p class="grün"> <!-- ü = U+0075 U+0308 -->
  <p class="grÜn"> <!-- Ü = U+0055 U+0308 -->

Here the German word for "green" is used with varying case distinctions. What should happen? What does happen?

The German case might be considered as "trivial", since it is possible for a careful content author to avoid accents in German. However, other "bicameral scripts" (scripts that have an upper/lowercase distinction) also exist. For example, users whose language uses the Cyrillic or Greek script would tend to want user-defined keywords in their own languages. What should happen in these languages? Should case distinctions be applied to class IDs in these languages (but not to those in ASCII)? For that matter, why should a word like "grün" require special treatment?

The Internationalization WG (I18N) and the CSS WG have been discussing this issue and this page lays out I18N's response to questions and concerns regarding case-insensitive comparison.

Normalization: A Sidebar

Unicode Normalization has been one of the "thorny issues" considered by the Internationalization WG, going back many (at least 14) years: cf. [CharReq].

The I18N WG for many years recommended and tried hard to mandate "early uniform normalization" to Unicode Normalization Form C ("EUN"). For detailed description, see: [CharModNorm].

However, beginning about five years ago, the I18N WG recognized that EUN was impossible to achieve. A last-ditch effort was undertaken to consider Unicode normalization as an aspect of the web platform by defining "late normalization" as part of comparisons (such as those in CSS Selectors). See: [Normalization Proposal 2009].

After extensive discussions with other working groups and internally, the current position that I18N has taken, and which is the de facto state of nearly all Web technologies, is that normalization is the responsibility of content authors. Health warnings about normalization are necessary, because the resulting behavior is that strings with different normalization states do not match each other, even if the two are visually indistinguishable and/or represent the same "canonical" string in Unicode.
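
To make the health warning concrete, the following sketch compares the precomposed and decomposed spellings of "grün" used in the earlier example. It assumes a JavaScript engine that provides String.prototype.normalize (ES2015 or later); the variable and function names are illustrative only.

```javascript
// Two spellings of "grün" that render identically.
const precomposed = "gr\u00FCn";   // ü as a single code point, U+00FC
const decomposed  = "gru\u0308n";  // u (U+0075) followed by combining diaeresis (U+0308)

// List the code points of a string as "U+XXXX" labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(codePoints(precomposed).join(" ")); // U+0067 U+0072 U+00FC U+006E
console.log(codePoints(decomposed).join(" "));  // U+0067 U+0072 U+0075 U+0308 U+006E

// Without normalization the two strings do not match...
const matchesAsWritten = precomposed === decomposed;

// ...but they do once both are converted to the same normalization form.
const matchesAfterNFC =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");

console.log(matchesAsWritten, matchesAfterNFC); // false true
```

Because the Web platform leaves normalization to content authors, a selector written with one spelling will not select a class attribute written with the other.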

Since this is enshrined throughout the Web platform, introducing normalization into CSS comparisons would thus be a breaking change, since presumably page authors who have chosen different normalizations have done so deliberately, and since it would reopen items such as CSS Selectors.

Current Discussion

The current discussion about case sensitivity in CSS has focused on determining what current implementations do and what the "right" definition for case(less) matching should be. Some important emails and note sets include:

I18N TPAC minutes

John Daggett's notes: [1] [2] [3] <-- includes tests

I18N basically has parsed the discussion into these questions:

  • To what items do case-insensitive or case-sensitive comparison apply?
  • For items that are case-insensitive, is the case-insensitivity strictly for backwards compatibility of ASCII-only style sheets/document sets, or is it a feature that authors can rely on?
  • What is the exact definition of case-insensitive comparison?
  • What is the exact definition of case-sensitive comparison?

What items are case-sensitive/case-insensitive?

David Baron produced this list of items that might be matched on a case/caseless basis:

  • comparison of CSS identifiers used for counter styles, e.g.,
     list-style-type: decimal;
     list-style-type: DECIMAL;
     list-style-type: decımal;
     list-style-type: DECİMAL;
   and the same for:
     content: counter(foo, decimal);
  • comparison of CSS user identifiers, e.g., named counters:
     counter-increment: grün;
     content: counter(GRÜN);
  • HTML tag names, attribute names, and attribute values, e.g.:
     <input> vs <ınput> etc.
     <select multiple> vs <select multıple> etc.
     <input type="radio"> vs <input type="radıo"> etc.
  • user-defined identifiers in HTML, e.g., style sheet titles
  (defined as case sensitive):
   <link rel="stylesheet" title="grun" href="...">
   <link rel="stylesheet" title="grün" href="...">
   <link rel="stylesheet" title="GRUN" href="...">
   <link rel="stylesheet" title="GRÜN" href="...">
   (do these group as a single style sheet set or multiple?)

Types of Comparison

Case-sensitive

HTML5 defines case sensitive comparison as:

 "Comparing two strings in a case-sensitive manner means comparing 
 them exactly, code point for code point."

This seems completely straightforward. Effectively, this is a call to strcmp: if the (encoding-normalized) bytes are the same, the two strings "match". This version of case-sensitive comparison side-steps the issue of normalization and, if it weren't for legacy compatibility, might be a suitable way to "fix" CSS case-dependent matching.

Unfortunately, switching to this form of comparison breaks examples such as "P.GREEN" shown above.
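
A minimal sketch of the HTML5 definition, assuming JavaScript strings as input. The function name equalsCaseSensitive is hypothetical. In JavaScript the built-in === operator gives the same answer; the loop is spelled out only to mirror the "code point for code point" wording.

```javascript
// Compare two strings exactly, code point for code point.
function equalsCaseSensitive(a, b) {
  const cpA = Array.from(a); // Array.from iterates by code point, not by UTF-16 code unit
  const cpB = Array.from(b);
  if (cpA.length !== cpB.length) return false;
  return cpA.every((ch, i) => ch === cpB[i]);
}

console.log(equalsCaseSensitive("green", "green"));          // true
console.log(equalsCaseSensitive("GREEN", "green"));          // false
console.log(equalsCaseSensitive("gr\u00FCn", "gru\u0308n")); // false
```

The second call shows why a pure case-sensitive rule breaks legacy style sheets, and the third shows that normalization differences also prevent a match.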

ASCII Case Insensitive (ACI)

ASCII case-insensitive matching (ACI) is defined in HTML5 as:

Comparing two strings in an ASCII case-insensitive manner means comparing 
them exactly, code point for code point, except that the characters in 
the range U+0041 to U+005A (i.e. LATIN CAPITAL LETTER A to LATIN CAPITAL 
LETTER Z) and the corresponding characters in the range U+0061 to U+007A 
(i.e. LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are considered to 
also match.

Coding ACI is relatively straightforward, since there are only 26 ASCII letter values. Here's a sample written in JavaScript (and intended to be illustrative rather than efficient):

// Map the ASCII capital letters A-Z to a-z; leave every other code unit unchanged.
function mapCaseACI(caseStr)
{
   var out = "";
   for (var i = 0; i < caseStr.length; i++) {
       var nextChar = caseStr.charCodeAt(i);
       if (nextChar >= 0x41 && nextChar <= 0x5A) {
          // U+0041..U+005A: the matching lowercase letter is 0x20 higher
          out += String.fromCharCode(nextChar + 0x20);
       } else {
          out += String.fromCharCode(nextChar);
       }
   }

   return out;
}
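
As a usage sketch, an ACI comparison can be built by folding both operands and comparing the results. The helper names foldACI and equalsACI are hypothetical, and the fold is restated compactly here so that the example runs on its own.

```javascript
// Lowercase only A-Z (U+0041..U+005A); leave every other character untouched.
function foldACI(str) {
  return str.replace(/[A-Z]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0x20));
}

// Two strings match under ACI when their folded forms are identical.
function equalsACI(a, b) {
  return foldACI(a) === foldACI(b);
}

console.log(equalsACI("P.GREEN", "p.green"));     // true
console.log(equalsACI("GR\u00DCN", "gr\u00FCn")); // false: Ü and ü are outside ASCII
```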

ACI is used extensively in HTML5 for cases where the namespace is strictly limited to ASCII characters (such as the names of elements). This makes sense and doesn't introduce non-ASCII character considerations. International users are not affected, since non-ASCII values are not permitted in the places where HTML5 applies ACI.

ACI might be the right choice if case insensitivity is strictly for compatibility with existing style sheets in legacy browsers. However, it's clear from John Daggett's (and Addison's) test cases that this is not actually the case: a variety of non-ASCII case-insensitive implementations exist.

The Internationalization WG feels that, if case-insensitivity is a feature of CSS, then international users should be considered on par with ASCII-only users.

The CSS namespace is, in some places, constrained to ASCII and these places can (and should) use ACI for comparison.

In other places, case-sensitivity may be called for.

However, the Internationalization WG doesn't consider ACI as an appropriate solution for international users when they have an unconstrained non-ASCII namespace to work with.

For example, although HTML element and attribute names are pre-defined, all-ASCII, and ASCII case-insensitive, XML documents can also be styled using CSS. In XML, the elements <p> and <P> are different elements. XML, further, allows non-ASCII element and attribute names. In the case of XML, case-sensitive comparison of element and attribute names is a requirement. So the matching rules for element and attribute names depend on the document format and are defined externally to CSS.
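
The dependence on document format can be sketched as a comparison that is chosen by the host language. The function name elementNameMatches and the isHtmlDocument flag are hypothetical and stand in for whatever an implementation uses to distinguish HTML from XML documents.

```javascript
// Choose the comparison for element names based on the document format.
function elementNameMatches(selectorName, elementName, isHtmlDocument) {
  if (isHtmlDocument) {
    // HTML: names are ASCII-only and compared ASCII case-insensitively.
    const foldACI = (s) => s.replace(/[A-Z]/g, (ch) => ch.toLowerCase());
    return foldACI(selectorName) === foldACI(elementName);
  }
  // XML: names are compared exactly, so <p> and <P> are different elements.
  return selectorName === elementName;
}

console.log(elementNameMatches("P", "p", true));  // true
console.log(elementNameMatches("P", "p", false)); // false
```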

Unicode Casefolding

Unicode casefolding and case-insensitive matching are a bit more complex.

Most characters in bicameral scripts have simple case foldings that are universally applicable and that function much like ASCII casefolding: there is a one-to-one mapping between uppercase, title case, and corresponding lowercase code points.

However, there are a few special cases.

First, there are some characters which have case mappings that require more than one Unicode character. A well-known example is U+00DF (LATIN SMALL LETTER SHARP S). The "sharp S" (best known for its use in German) has an uppercase mapping of two capital letters, "SS", and a case fold mapping in Unicode of two lowercase letters, "ss". So a German word like "groß" would have a case-folded match to the string "gross" (or "GROSS").
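
The sharp S behavior can be observed with JavaScript's built-in case mappings, which also show why comparing lowercased strings is not the same as case folding. The helper foldSharpS is a hypothetical, single-entry stand-in for a real fold table; it relies on toLowerCase for every other character, which is an assumption made only to keep the sketch short.

```javascript
const word = "gro\u00DF"; // "groß"

// Full uppercase mapping: ß expands to two letters.
console.log(word.toUpperCase()); // GROSS

// Lowercasing is not case folding: "GROSS" lowercases to "gross", while
// "groß" stays "groß", so comparing lowercase forms misses the match.
console.log("GROSS".toLowerCase() === word.toLowerCase()); // false

// A full case fold maps ß to "ss" (CaseFolding.txt: 00DF; F; 0073 0073).
function foldSharpS(str) {
  return str.toLowerCase().replace(/\u00DF/g, "ss");
}

console.log(foldSharpS("GROSS") === foldSharpS(word)); // true
```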

Unicode also recognizes specific mappings that are "contextual". A well-known example is the Turkic languages, such as Turkish or Azerbaijani, which make use of both dotted and dotless forms of the letter "I". These special case mappings are generally only applied to content in those languages. Unicode defines both the default case fold mapping and a special case mapping for these languages.
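
JavaScript's case mapping functions illustrate the difference between the default and the Turkic behavior. Note that these are case mappings rather than case folds, so this is only an analogy to what CaseFolding.txt defines. The locale-sensitive call at the end depends on the engine's locale support, so no output is claimed for it.

```javascript
// Default (language-neutral) mappings used by toLowerCase/toUpperCase:
console.log("I".toLowerCase() === "i");      // true: capital I maps to dotted i
console.log("\u0131".toUpperCase() === "I"); // true: dotless ı uppercases to plain I
console.log("\u0130".toLowerCase().length);  // 2: İ becomes i followed by U+0307

// Turkic tailoring: in Turkish, capital I lowercases to dotless ı (U+0131).
// Whether the tailoring is applied depends on the engine's locale support.
console.log("I".toLocaleLowerCase("tr"));
```

Under the default mappings "I" and "ı" end up paired with different lowercase letters than a Turkish reader would expect, which is why the Turkic rules are applied only to content known to be in those languages.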

All casefold mappings are defined in the Unicode Character Database (UCD), specifically in the file CaseFolding.txt (UCD 6.2 version).

The Internationalization Working Group has recommended using the "C" (common) and "F" (full) case fold mappings for CSS.

There has been some question about the complexity of implementing this. It turns out that, if normalization is excluded (see above sidebar for the reason that it should be excluded), the implementation of the casefold is as simple as having a lookup table. The table isn't even very large.
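
The lookup-table approach might be sketched as follows. This is not the sample implementation mentioned on this page; it is a minimal illustration that assumes the "code; status; mapping; # name" line format of CaseFolding.txt and embeds a short excerpt rather than the whole file. The names CASE_FOLDING_EXCERPT, buildFoldTable, and foldCF are hypothetical.

```javascript
// A few entries in CaseFolding.txt format. A real implementation would
// load the complete file; this excerpt covers only the letters used below.
const CASE_FOLDING_EXCERPT = `
0047; C; 0067; # LATIN CAPITAL LETTER G
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
004E; C; 006E; # LATIN CAPITAL LETTER N
004F; C; 006F; # LATIN CAPITAL LETTER O
0052; C; 0072; # LATIN CAPITAL LETTER R
0053; C; 0073; # LATIN CAPITAL LETTER S
0055; C; 0075; # LATIN CAPITAL LETTER U
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
`;

// Build a lookup table (code point -> folded string) from the C and F entries.
function buildFoldTable(text) {
  const table = new Map();
  for (const line of text.split("\n")) {
    const data = line.split("#")[0].trim(); // drop comments and blank lines
    if (data === "") continue;
    const [code, status, mapping] = data.split(";").map((field) => field.trim());
    if (status !== "C" && status !== "F") continue; // skip S (simple) and T (Turkic)
    const folded = mapping
      .split(/\s+/)
      .map((hex) => String.fromCodePoint(parseInt(hex, 16)))
      .join("");
    table.set(parseInt(code, 16), folded);
  }
  return table;
}

// Fold a string by replacing each code point that has a table entry.
function foldCF(str, table) {
  return Array.from(str, (ch) => {
    const folded = table.get(ch.codePointAt(0));
    return folded === undefined ? ch : folded;
  }).join("");
}

const table = buildFoldTable(CASE_FOLDING_EXCERPT);
console.log(foldCF("GR\u00DCN", table) === foldCF("gr\u00FCn", table));  // true
console.log(foldCF("GROSS", table) === foldCF("gro\u00DF", table));      // true
console.log(foldCF("GRU\u0308N", table) === foldCF("gr\u00FCn", table)); // false: no normalization
```

The last comparison fails by design: the fold is applied without normalization, so the decomposed and precomposed spellings remain distinct.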

Test Cases and Demo

I wrote a sample implementation of both ACI and Unicode C+F matching in JavaScript. The results can be seen here:

UAX#31

Recommendation of the Internationalization Working Group

The Internationalization Working Group recommends that CSS adopt Unicode C+F case folding without normalization as the basis for matching items that are user defined.

For tokens that are defined in an ASCII-only namespace, ACI can be used.