Case folding

Case Folding: An Introduction

One of the most common things that software developers do is "normalize" text for the purposes of comparison, and one of the most basic ways developers are taught to do this is to compare text in a "case insensitive" fashion. In other cases, developers want to compare strings in a case sensitive manner. Unicode defines upper, lower, and title case properties for characters, plus special cases that affect specific languages' use of text.

Many developers believe that a case-insensitive comparison is achieved by mapping both strings being compared to either upper- or lowercase and then comparing the resulting bytes. The existence of functions such as 'strcasecmp' in some C libraries, for example, and common examples in programming books reinforce this belief:

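  /* a typical caseless comparison; str_toupper() is a hypothetical whole-string helper, since C's toupper() converts one character at a time */
  if (strcmp(str_toupper(foo), str_toupper(bar)) == 0) { /* strings match */ }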

Alas, this model of case insensitive comparison breaks down with some languages. It also fails to consider other textual differences that can affect comparison. Unicode Normalization, for instance, may be needed to even out differences in how the same text is encoded, notably in non-Latin text.

This document introduces case folding and case insensitivity, provides some examples of how they are implemented in Unicode, and gives a few guidelines for spec writers and others who wish to reference comparison using case folding.

More About the Problem

Simply mapping [a-z] to [A-Z] works for most simple ASCII-only text documents. However, it begins to break down as we explore other languages that use additional characters. It also doesn't take into account the fact that case mappings in some languages are not always algorithmic or static.

For example, if you case folded [a-z] -> [A-Z], a string like "Dürst" or "résumé" might end up looking a bit odd: "DüRST" or "RéSUMé".

A better solution would be to case fold all of the characters.
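
The difference can be sketched in a few lines of Python. This is an illustration only: ascii_only_upper is a hypothetical helper written for this example, and Python's built-in str.upper() stands in for "case mapping all of the characters" using Unicode's default, language-neutral rules.

```python
import string

# ASCII-only table: maps a-z to A-Z and leaves every other character untouched.
ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_only_upper(text: str) -> str:
    """Hypothetical helper: uppercase only the letters a-z."""
    return text.translate(ASCII_UPPER)


# The naive mapping skips the accented letters.
print(ascii_only_upper("Dürst"))   # DüRST
print(ascii_only_upper("résumé"))  # RéSUMé

# Unicode-aware case mapping handles every cased character.
print("Dürst".upper())             # DÜRST
print("résumé".upper())            # RÉSUMÉ
```

Note that str.upper() is not language sensitive, so it does not address the locale-specific cases described later in this document.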

Variations in Handling Case Distinction

In some cases, there are alternatives to case folding. For example, Germans sometimes use "ue" to represent "ü" when the u-umlaut character is not available (note: this document calls the umlaut mark a diaeresis). So you might have a string such as "DUERST" where the original string was "Dürst".

But this isn't a general solution. While it works for German, in Finnish the characters marked with a diaeresis cannot be converted to the base letter followed by 'e': the diaeresis doesn't have the same interpretation in that language.

In other languages, accents can be eliminated or ignored in upper case. So, in French, for example, uppercased text might omit the accents, with "résumé" becoming "RESUME". But note that this removes information and leads to strings becoming equal that were not originally equal. Here are four different French words that famously have the same base characters:

  cote (rating)
  coté (highly regarded)
  côte (coast)
  côté (side)
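
The loss of information can be demonstrated with a short Python sketch. The strip_accents helper is hypothetical and written only for this illustration: it decomposes each character and drops the combining marks, which approximates "omitting the accents" but is not a linguistically correct operation for any particular language.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose (NFD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["cote", "coté", "côte", "côté"]

# Uppercasing alone keeps the four words distinct.
print(len({w.upper() for w in words}))             # 4

# Uppercasing and dropping the accents collapses them into one string.
print({strip_accents(w).upper() for w in words})   # {'COTE'}
```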

Other letters do not have a single uppercase equivalent. For example, the German language uses the "sharp-s" character in words like "groß". This letter's uppercase equivalent is the two-letter sequence 'SS' ("GROSS").
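
A brief Python illustration of the sharp-s behavior, using only built-in string methods (which apply Unicode's default, language-neutral case mappings):

```python
word = "groß"

# Uppercasing changes the length of the string: one letter becomes two.
print(word.upper())                  # GROSS
print(len(word), len(word.upper()))  # 4 5

# The mapping does not round-trip: lowercasing gives "gross", not "groß".
print(word.upper().lower())          # gross
print(word.upper().lower() == word)  # False

# Case folding maps both forms to the same string, so they compare equal.
print(word.casefold() == "GROSS".casefold())  # True
```

Code that assumes case mapping preserves string length, or that lowercasing undoes uppercasing, will mishandle strings like this one.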

Turkish i/I etc.

One interesting problem is posed by the Turkic languages (such as Turkish and Azerbaijani), which have two distinct forms of the letter 'i'. One form has a dot above and the other does not.

In Western European languages, the letter 'i' (U+0069) upper cases to a dotless 'I' (U+0049). In Turkish, this letter upper cases to a dotted upper case letter 'İ' (U+0130). Similarly, 'I' (U+0049) lower cases to 'ı' (U+0131), which is a dotless lowercase letter i.
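
Python's built-in str.upper() and str.lower() are not language sensitive, so the Turkic behavior has to be layered on top. The sketch below does this with two small translation tables; upper_tr and lower_tr are hypothetical helpers that cover only the four characters discussed above. A production system would normally rely on a library with full locale support instead.

```python
# Turkic-specific pairs, applied before the default language-neutral mapping.
TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> İ, ı -> I
TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})  # I -> ı, İ -> i


def upper_tr(text: str) -> str:
    """Hypothetical helper: uppercase using the Turkic dotted/dotless rules."""
    return text.translate(TR_UPPER).upper()


def lower_tr(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkic dotted/dotless rules."""
    return text.translate(TR_LOWER).lower()


print("istanbul".upper())      # ISTANBUL  (default rules)
print(upper_tr("istanbul"))    # İSTANBUL  (Turkic rules)
print(lower_tr("DİYARBAKIR"))  # diyarbakır

# Under the default rules, lowercasing U+0130 yields 'i' followed by
# U+0307 (combining dot above), so the result is longer than the input.
print(len("\u0130".lower()))   # 2
```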

Other Aspects of the Problem

The case folding problem is actually a special case of the broader problem of normalizing Unicode text for comparison. For those scripts and writing systems (like Latin) that use case distinctions, case folding can be an important part of normalizing text for comparison purposes. But many scripts do not have case distinctions, and text in some scripts (including Latin) can be represented in Unicode in more than one way. For more information, see CharMod-Norm.
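
As a small illustration of text that "can be represented in more than one way", the following Python sketch compares two encodings of the same word. It uses the standard unicodedata module; the nfc function is just a local shorthand introduced for this example.

```python
import unicodedata

precomposed = "r\u00e9sum\u00e9"    # each é is the single code point U+00E9
decomposed = "re\u0301sume\u0301"   # each é is 'e' followed by U+0301


def nfc(text: str) -> str:
    """Local shorthand for Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# The strings look identical when displayed but differ code point by code point.
print(precomposed == decomposed)                        # False

# Case folding alone does not reconcile the two representations.
print(precomposed.casefold() == decomposed.casefold())  # False

# Normalizing both strings to the same form does.
print(nfc(precomposed) == nfc(decomposed))              # True
```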

Recommendations for Case Folding

If your application or specification needs to consider case folding, here are some general recommendations to follow:

  1. Consider Unicode Normalization in addition to case folding. If you mean to find text that is semantically equal, you may need to normalize the text beyond just case folding it. Note that Unicode Normalization does not include case folding: these are separate operations.
  2. Always use the language (locale) when case folding. Some languages have specific case folding idiosyncrasies. In particular, if you do not pass the language or locale to your case folding routine, you may get a default locale which might be Turkish (for example).
    1. Specify US English or "empty" (root) locale if you need consistent (internal, not for presentation) comparisons. If your comparison should be the same regardless of language or locale, always pass the US English or empty (root, C, POSIX, null) locale to your case-folding function. This does not disable caseless comparison or case folding. It merely limits the effects to a well-known set of rules.
    2. Use caseless compare functions if provided. If your application is comparing internal values for equality (as opposed to sorting lists or comparing values linguistically), you should use a consistent caseless compare function. For example, Java's java.lang.String class provides an equalsIgnoreCase function that is more convenient than using toLowerCase(#locale)... and that provides consistent results across languages, although those results are not consistent with the rules for any given language.
  3. For presentation, normalize case in a language sensitive manner. The rules that one language uses for case will not necessarily match those used by another language. For example, the title of the French novel by Marcel Proust, À la recherche du temps perdu, contains only a single, initial capital letter, whereas the English title uses "titlecase": In Search of Lost Time. Code that assumes one language's form of capitalization is appropriate for another language may cause problems.
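
Recommendations 1 and 2 can be combined into a single language-neutral comparison for internal values. The sketch below is one possible approach in Python, not a definitive implementation: caseless_equal is a hypothetical name, and the function applies normalization (NFD) around Python's default, locale-independent str.casefold(), following the pattern the Unicode Standard describes for canonical caseless matching.

```python
import unicodedata


def _nfd(text: str) -> str:
    """Normalize to NFD so equivalent sequences share one representation."""
    return unicodedata.normalize("NFD", text)


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: language-neutral caseless equality.

    Normalizes, case folds, then normalizes again, because case folding
    can produce text that is no longer in normalized form.
    """
    return _nfd(_nfd(a).casefold()) == _nfd(_nfd(b).casefold())


print(caseless_equal("Résumé", "RE\u0301SUME\u0301"))  # True
print(caseless_equal("groß", "GROSS"))                 # True
print(caseless_equal("côte", "cote"))                  # False: accents kept

# Consistent across languages, but not Turkish-aware: a Turkish reader
# would treat 'I' and dotless 'ı' as the same letter in different cases.
print(caseless_equal("I", "\u0131"))                   # False
```

A comparison like this suits identifiers and other internal values. Text shown to users, as recommendation 3 notes, still needs language-sensitive handling.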