Case for Unicode Caseless Matching in HTML5

HTML5 References to "compatibility caseless"

The above bug relates to just two locations that I can find in the current HTML5 editor's copy, which I'll list just below. One issue is that it is a little odd to create a caseless matching scheme that is used so infrequently in HTML.

Section 2.4.9 Hash name reference.

Says this:

Return the first element of type type that has an id attribute whose value is a case-sensitive match for s or a name attribute whose value is a compatibility caseless match for s

This section refers to "hash name references", which URI folks will know as "fragment identifiers".

Quick and dirty test: HACK

Section 4.10.5.1.17 Radio button groups.

Says this about radio buttons in a group:

They both have a name attribute, their name attributes are not empty, and the value of a's name attribute is a compatibility caseless match for the value of b's name attribute.

Why Not "Compatibility Caseless"?

The definition of "compatibility caseless matching" is found in The Unicode Standard, v6.2 on page 120, definition D146.

D146 A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))

This match is more complex than regular casefold matching because the compatibility decomposition may produce a string that can itself be casefolded further. Notice that three separate decompositions (one canonical and two compatibility) and two separate casefolds are required on both the source and the target string.
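The D146 pipeline can be sketched in Python. This is an illustrative sketch, not a reference implementation: it assumes that `str.casefold()` is an acceptable stand-in for toCasefold and that the Unicode data shipped with the interpreter is recent enough. The function names are hypothetical.

```python
import unicodedata


def compat_fold(s: str) -> str:
    """Apply NFKD(toCasefold(NFKD(toCasefold(NFD(s))))) as in D146."""
    s = unicodedata.normalize("NFD", s)    # canonical decomposition
    s = s.casefold()                       # first casefold
    s = unicodedata.normalize("NFKD", s)   # first compatibility decomposition
    s = s.casefold()                       # second casefold
    return unicodedata.normalize("NFKD", s)  # second compatibility decomposition


def compatibility_caseless_match(x: str, y: str) -> bool:
    """Hypothetical helper: compare two strings under D146."""
    return compat_fold(x) == compat_fold(y)


# U+3392 SQUARE MHZ has no casefold of its own, but its compatibility
# decomposition is "MHz", which the second casefold then lowers.
print(compatibility_caseless_match("\u3392", "mhz"))  # True
# U+2163 ROMAN NUMERAL FOUR matches the plain letters "iv".
print(compatibility_caseless_match("\u2163", "iv"))   # True
```

The SQUARE MHZ case shows why the second casefold is needed: the uppercase letters only appear after the first compatibility decomposition.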

Unicode compatibility decompositions handle a wide range of textual variation: for example, "squared" ideographs, circled numbers and letters, unit symbols like the Kelvin sign, Roman numerals, and East Asian width (wide or narrow) variations. The Hangul script used for Korean has a special relationship to Unicode decomposition as well. While some of these transformations can be helpful in finding matches, in most cases the application of compatibility decomposition is destructive: it removes some information from the text in question.

There is no apparent case for applying this kind of "destructive" process when trying to perform fragment or radio button group identification. Indeed, it seems harmful to apply it to fragment identifiers, as names that the user might reasonably consider distinct become unified.
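To make the unification concrete, here is a small Python sketch that applies only the NFKD step to a few pairs of names. The names are invented for illustration; the point is that each pair is distinct as written but identical after compatibility decomposition.

```python
import unicodedata


def nfkd(s: str) -> str:
    """Compatibility decomposition only (no casefolding)."""
    return unicodedata.normalize("NFKD", s)


# Hypothetical fragment identifiers that an author might treat as distinct.
pairs = [
    ("x\u00B2", "x2"),         # SUPERSCRIPT TWO vs. DIGIT TWO
    ("note\u2460", "note1"),   # CIRCLED DIGIT ONE vs. DIGIT ONE
    ("\uFF21\uFF22", "AB"),    # fullwidth letters vs. ASCII letters
]

for a, b in pairs:
    # Each pair differs as written but collides after NFKD.
    print(repr(a), repr(b), a != b, nfkd(a) == nfkd(b))
```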

Why Plain "Caseless"?

The normal caseless comparison in Unicode (D144, p.119, TUS6.2.0) applies Unicode C+F casefolding to the source and target strings before comparing them.
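As a rough Python sketch of D144, assuming again that `str.casefold()` stands in for toCasefold; the function name is hypothetical:

```python
def caseless_match(x: str, y: str) -> bool:
    """Hypothetical helper: toCasefold(X) == toCasefold(Y), with no normalization."""
    return x.casefold() == y.casefold()


# Full casefolding maps U+00DF LATIN SMALL LETTER SHARP S to "ss".
print(caseless_match("Stra\u00DFe", "STRASSE"))  # True
# No compatibility decomposition is applied, so these stay distinct.
print(caseless_match("x\u00B2", "x2"))           # False
```

Unlike the compatibility variant, this comparison needs a single pass over each string and leaves compatibility characters untouched.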

Existing implementations match these strings in an inconsistent fashion. As a result, users get consistent behavior only when the case differences in fragment identifiers or radio button group names are confined to ASCII letters. For that reason, it has been proposed that only ASCII case-insensitive (ACI) matching be used for these values.
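For comparison, a minimal Python sketch of ACI matching, which folds only the letters A to Z and leaves every other character alone. The helper name is hypothetical.

```python
import string

# Translation table that lowers only the ASCII letters A-Z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_case_insensitive_match(x: str, y: str) -> bool:
    """Hypothetical helper: compare after folding ASCII letters only."""
    return x.translate(_ASCII_LOWER) == y.translate(_ASCII_LOWER)


print(ascii_case_insensitive_match("Section-A", "section-a"))    # True
# U+00C9 and U+00E9 are outside ASCII, so ACI treats them as different...
print(ascii_case_insensitive_match("\u00C9t\u00E9", "\u00E9t\u00E9"))  # False
# ...while Unicode casefolding treats them as the same.
print("\u00C9t\u00E9".casefold() == "\u00E9t\u00E9".casefold())        # True
```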

However, throughout HTML ACI has only been applied to ASCII-only namespaces. This is a good policy, from I18N's point of view, as it makes casefold matching consistent and users don't develop inconsistent expectations. I18N recommends (Charmod-Norm) that Unicode namespaces not apply normalization but do perform untailored (language-insensitive) Unicode casefold matching when casefold matching is indicated.

Note that some languages, specifically the Turkic languages and Lithuanian, are disadvantaged by this, as they use language-specific casefold mappings. These casefolds are documented in Unicode's "SpecialCasing.txt" file and/or CLDR. Users of these languages may be confused about why casefold matching works for other languages' case pairs but not their own, particularly for auto-generated links.
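The Turkic case can be illustrated with a short Python sketch. It assumes `str.casefold()` as the untailored fold; `turkic_casefold` is a hypothetical helper that covers only the dotted and dotless i pairs, not a complete language tailoring.

```python
# Turkic tailoring: "I" folds to dotless i (U+0131), and
# capital I with dot above (U+0130) folds to "i".
_TURKIC = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkic_casefold(s: str) -> str:
    """Hypothetical helper: apply the two Turkic mappings, then the default fold."""
    return s.translate(_TURKIC).casefold()


upper, lower = "DI\u015E", "d\u0131\u015F"  # Turkish "DIŞ" / "dış"

# Untailored folding turns "I" into "i", so the pair does not match.
print(upper.casefold() == lower.casefold())            # False
# With the Turkic mappings applied first, the pair matches.
print(turkic_casefold(upper) == turkic_casefold(lower))  # True
```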