Accesskey n skips to in-page navigation. Skip to the content start.

s_gotoW3cHome Internationalization
 

Normalization in HTML and CSS

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, Web project managers, and anyone who is unfamiliar with Unicode normalization and how it can affect the success of HTML and CSS authoring.

Question

What are normalization forms, and why do I need to know about them when creating HTML and CSS content?

Answer

Normalization is something you need to be aware of if you are authoring HTML pages with CSS style sheets in UTF-8 (or any other Unicode encoding), particularly if you are dealing with text in a script that uses accents or other diacritics.

What are normalization forms?

In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word világ. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).

The Unicode Standard allows either of these alternatives, but requires that both be treated as identical. To improve efficiency, an application will usually normalize text before performing searches or comparisons. Normalization, in this case, means converting the text to use all precomposed or all decomposed characters.

There are four normalization forms specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The C stands for (pre-)composed, and the D for decomposed. The K stands for compatibility. To improve interoperability, the W3C recommends the use of NFC normalized text on the Web.

What do I need to know about normalization?

Unfortunately, normalization doesn't always take place before content is compared. A particularly important case is the use of selectors and class names or ids in HTML and CSS. If the word világ is used in precomposed form in the HTML (eg. <span class="világ">), but in decomposed form in the CSS (eg. .világ { font-style: italic; }), then the selector won't match the class name.

What this means is that when producing content you should ensure that selectors and class or id names are character-for-character the same. This is particularly likely to be a issue if the markup and the CSS are being authored or maintained by different people.

The best way to ensure that these match is to use one particular Unicode normalization form for all authored content. As we said above, the W3C recommends NFC.

Most keyboards for European languages output text in NFC already, but this is less likely to be the case if dealing with many non-European languages.

In some cases your editor may allow you to save data in a choice of normalization forms. The picture below shows an option for setting a particular normalization form as the default when opening new files in Dreamweaver (NFC is selected). You are shown a similar choice when saving a document.

Unicode normalization form preferences on a dialog panel, showing NFC selected.

How can I check pages for problems?

You can find out whether an HTML page contains class names and id values that are not normalized according to NFC by using the W3C Internationalization Checker.

If you do have problems, you should find an editor or conversion tool that allows you to specify the normalization form, and use that to re-save your page.

Tell us what you think (English).

Subscribe to an RSS feed.

New resources

Home page news

Twitter (Home page news)

‎@webi18n

Further reading

By: Richard Ishida, W3C.

Valid XHTML 1.0!
Valid CSS!
Encoded in UTF-8!

Content first published 2010-08-10 14:48. Last substantive update 2010-08-10 14:48 GMT. This version 2010-09-09 13:40 GMT

For the history of document changes, search for qa-html-css-normalization in the i18n blog.