Upgrading from language-specific legacy encoding to Unicode encoding

What should I consider when upgrading my web pages from legacy encoding to Unicode encoding?

You have heard that using Unicode is a good idea and that it brings benefits such as standards compatibility, multilingual display on a single page, and pan-organisation applications.

Numerous large organizations are beginning to switch to Unicode. This FAQ lists some of the considerations you need to take into account when upgrading your encoding to Unicode.

Note that if you are using a content management system to generate web pages, you may also need to consider your storage encoding, migration of legacy data, and software support.

Which Unicode encoding for web pages?

Unicode is the Document Character Set for HTML and XML.

Unicode has three main encodings: UTF-8, UTF-16, and UTF-32.

UTF-8 is the Unicode encoding consistently used for web pages.

UTF-16 is often used for the system back-end.
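To make the difference between the three encodings concrete, the following minimal Python sketch counts how many bytes a few sample characters need in each one. The function name `encoded_lengths` and the sample characters are illustrative assumptions, not part of any standard.

```python
# Compare how many bytes each Unicode encoding needs for the same character.
# The "-le" codec variants are used so that no byte order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (hypothetical helper)."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)

# Sample characters chosen for illustration: ASCII, Latin with accent, Chinese, emoji.
for ch in ("A", "é", "中", "😀"):
    print(repr(ch), dict(zip(ENCODINGS, encoded_lengths(ch))))
```

ASCII characters, which make up most HTML mark-up, take a single byte in UTF-8 but two in UTF-16 and four in UTF-32.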

How well is Unicode supported for my end users?

This depends on:

Browser support. Modern browsers support Unicode.

Although many mobile phones support UTF-8, some do not. Additionally, where phones use a legacy encoding, the particular encoding may vary from device to device. Investigation is required if you are targeting a large mobile phone market.

Suitable fonts. Correct script display requires Unicode support at the application or operating system level, and Unicode fonts must be available on the machine.

CSS can help with font family fallbacks in the case where the user does not have a specific font, but another font will display the text readably. Do use CSS generic font family fallbacks such as serif or sans-serif, eg:

.headline {font-family:Verdana,Arial,Helvetica,sans-serif;}

Modern operating systems support Unicode.

Fonts not available in a standard installation can often be downloaded from free sites by users, and you can point to those sites from your pages. It is not desirable to embed fonts in pages because the technology for that is proprietary and browser-specific.

Commonly available Unicode fonts (commercial and open source) are TrueType and the more recent OpenType.

Unicode fonts or ‘font families’ provide a mapping from Unicode codepoints to the graphical representation of characters, ie, glyphs. Unicode fonts usually cover specific scripts. Applications such as browsers usually cover Unicode by using several fonts for different scripts and ranges.

Font display problems:

Rendering software. Multilingual text rendering engines are built into operating systems and browser installations. They are typically needed for 'complex scripts' such as those used for Arabic, Hindi, Urdu and Persian, ie languages which have characters that change appearance based on their context.

What I don’t need to worry about

Page weight / download cost is not really an issue: a large proportion of a web page is HTML mark-up, whose characters remain 1 byte each in UTF-8, so the difference between legacy encoding and Unicode encoding is negligible. In addition, many legacy encodings for complex scripts are already double-byte, eg, Chinese.
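The page weight argument can be checked with a short Python sketch. The mark-up fragment, the sample Chinese text and the helper name `page_sizes` are assumptions made for illustration. GB2312 stands in for a double-byte legacy encoding.

```python
# Compare the size of the same page fragment in a legacy encoding and in UTF-8.
def page_sizes(page, legacy_encoding):
    """Return (legacy_size, utf8_size) in bytes for the given text (hypothetical helper)."""
    return len(page.encode(legacy_encoding)), len(page.encode("utf-8"))

# Illustrative fragment: ASCII mark-up wrapped around 9 Chinese characters.
text = "欢迎访问我们的网站"
page = '<p class="intro"><a href="/news/index.html">{}</a></p>'.format(text)

legacy_size, utf8_size = page_sizes(page, "gb2312")
print("GB2312:", legacy_size, "bytes")
print("UTF-8: ", utf8_size, "bytes")
print("Extra bytes in UTF-8:", utf8_size - legacy_size)
```

Only the Chinese characters grow (from 2 bytes to 3 bytes each). The mark-up is the same size in both encodings, so the more mark-up a page contains, the smaller the relative increase.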

Same page weight as for legacy encodings:

Slightly heavier

Don't forget

Character encoding declaration. Ensure that you include or change the character encoding declaration from the legacy encoding to Unicode.
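As a rough sketch of how the declaration might be updated in bulk, the Python function below rewrites the charset value in a `meta` element. The function name `declare_utf8` and the regular expression are assumptions for illustration. A regular expression is not a full HTML parser, so results on real pages should be reviewed.

```python
import re

# Matches the charset value inside a <meta ...> tag, in either the short form
# (<meta charset="...">) or the older http-equiv Content-Type form.
CHARSET_RE = re.compile(
    r'(<meta\b[^>]*?\bcharset\s*=\s*["\']?)([\w.:-]+)',
    re.IGNORECASE,
)

def declare_utf8(html):
    """Replace the declared charset in meta elements with utf-8 (hypothetical helper)."""
    return CHARSET_RE.sub(r"\g<1>utf-8", html)

print(declare_utf8('<meta charset="iso-8859-1">'))
print(declare_utf8(
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
))
```

Changing the declaration alone does not convert the file. The bytes of the file must also be saved as UTF-8, and any HTTP Content-Type header sent by the server must agree with the declaration.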

File encoding. Ensure that the file itself is saved in the correct encoding. With a Unicode encoding, the source text should be readable and match the text displayed on the web page. With some legacy encodings, by contrast, the source text is not readable because it uses different characters to point to codepoints.
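Converting the file content itself can be sketched in Python as a decode followed by an encode. The helper name `to_utf8` and the sample text are assumptions. The legacy encoding must be known in advance, since it cannot be reliably guessed from the bytes.

```python
def to_utf8(raw, legacy_encoding):
    """Decode bytes stored in a legacy encoding and re-encode them as UTF-8.

    Hypothetical helper. Decoding is strict by default, so bytes that are not
    valid in the legacy encoding raise UnicodeDecodeError instead of being
    silently corrupted.
    """
    return raw.decode(legacy_encoding).encode("utf-8")

# Illustrative data: Russian text as it would be stored in a Windows-1251 file.
legacy_bytes = "Привет".encode("cp1251")
utf8_bytes = to_utf8(legacy_bytes, "cp1251")

print(len(legacy_bytes), "bytes in Windows-1251")
print(len(utf8_bytes), "bytes in UTF-8")
print(utf8_bytes.decode("utf-8"))
```

For whole files, the same two steps apply: read the file in binary mode, convert, and write the result back in binary mode.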

Combining data. Ensure that any file fragments that are included in the web page, eg using technologies such as Apache SSI (server-side includes), are saved with the correct file type/encoding, since they will share the encoding of the parent page. The fragment encodings must match the parent web file encoding, so the fragments and the parent page must be upgraded to Unicode simultaneously.
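One way to catch fragments that were missed during the upgrade is to test whether each one decodes cleanly as UTF-8. The Python sketch below assumes the fragments have already been read as bytes. The helper name `is_valid_utf8` and the fragment names are hypothetical.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode as UTF-8 without errors (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Illustrative fragments: one already converted, one still in ISO-8859-1.
fragments = {
    "header.inc": "Café menu".encode("utf-8"),
    "footer.inc": "Café menu".encode("latin-1"),
}

for name, raw in fragments.items():
    status = "ok" if is_valid_utf8(raw) else "NOT valid UTF-8"
    print(name, status)
```

This check is only a heuristic. A fragment containing nothing but ASCII characters passes whatever encoding it was saved in, which is harmless because ASCII bytes are identical in UTF-8.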

Forms. Server-side applications that deal with data returned from a form must be able to handle Unicode, and may need to be adapted before upgraded pages containing forms are published.
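As an illustration of what handling Unicode form data involves, the Python sketch below decodes a URL-encoded form submission as UTF-8 using the standard library's `urllib.parse.parse_qs`. The field names and values are made up, and a real application would normally rely on its web framework for this step.

```python
from urllib.parse import parse_qs

def parse_form(body):
    """Decode a URL-encoded form body as UTF-8 (hypothetical helper).

    errors="strict" makes invalid byte sequences raise an error instead of
    being replaced, which helps to detect clients that submitted the form in
    a legacy encoding.
    """
    return parse_qs(body, encoding="utf-8", errors="strict")

# Illustrative submission from a page served as UTF-8.
form = parse_form("name=%E6%9D%8E&city=Z%C3%BCrich")
print(form["name"][0])
print(form["city"][0])
```

Browsers generally submit form data in the encoding of the page that contains the form, so a page upgraded to UTF-8 will start sending UTF-8 to a server-side application that may still expect the legacy encoding.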