FAQ: Upgrading from language-specific legacy encoding to Unicode encoding
Question: What should I consider when upgrading my web pages from legacy encoding to Unicode encoding?
You have heard that using Unicode is a good idea, and that it brings benefits such as standards compatibility, multilingual display on a single page, and applications that can be shared across an organisation.
Numerous large organizations are beginning to switch to Unicode. This FAQ will attempt to list some of the considerations you need to take into account to upgrade your encoding to Unicode.
FAQ: Who uses Unicode?
Note that if you are using a content management system to generate web pages, you may also need to consider your storage encoding, migration of legacy data, and software support.
Which Unicode encoding for web pages?
Unicode is the Document Character Set for HTML and XML.
Unicode has three main encodings: UTF-8, UTF-16 and UTF-32.
UTF-8 is the Unicode encoding most widely used for web pages, because it offers:
- Better compatibility with legacy data that uses ASCII: the 128 ASCII codepoints match the first 128 Unicode codepoints, and UTF-8 encodes each of them as the same single byte.
- No byte order problems.
UTF-16 is often used for the system back-end.
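The two points above can be checked with a short Python sketch. It is only an illustration: the sample strings are arbitrary, and the variable names are not taken from any particular tool.

```python
# ASCII-only text produces exactly the same bytes in ASCII and in UTF-8,
# which is why ASCII legacy data needs no conversion.
text = "Hello"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# UTF-16 exists in two byte orders, so the same character can be
# serialised in two different ways.
utf16_le = "é".encode("utf-16-le")   # little-endian
utf16_be = "é".encode("utf-16-be")   # big-endian

# UTF-8 has a single byte sequence per character, so byte order
# never comes into play.
utf8_e_acute = "é".encode("utf-8")

print(same_bytes, utf16_le, utf16_be, utf8_e_acute)
```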
How well is Unicode supported for my end users?
This depends on:
- browser support
- suitable fonts
- rendering software
Browser support. Modern browsers support Unicode:
- Internet Explorer
- Netscape Navigator
Although many mobile phones support UTF-8, some do not. Additionally, among those that use a legacy encoding, the encoding may vary from device to device. Investigation is required if you are targeting a large mobile phone market.
Suitable fonts. Correct script display requires Unicode support at the application or operating system level and availability on the machine of Unicode fonts.
CSS can help with font family fallbacks in the case where the user does not have a specific font, but another font will display the text readably. Do end each font list with a CSS generic font family, such as serif or sans-serif.
Modern operating systems support Unicode:
- Windows NT and its descendants Windows 2000 and Windows XP
- UNIX-like operating systems such as GNU/Linux
- Mac OS X
Fonts not available in a standard installation can often be downloaded from free sites by users, and you can point to those sites from your pages. It is not desirable to embed fonts in pages because the technology for that is proprietary and browser-specific.
Unicode fonts or ‘font families’ provide a mapping from Unicode codepoints to the graphical representation of characters, ie, glyphs. Unicode fonts usually cover specific scripts. Applications such as browsers usually cover Unicode by using several fonts for different scripts and ranges.
Font display problems:
- Legacy code pages (eg ISO-8859-1/windows-1252): an operating system or browser either has a font installed for that encoding or it doesn't, so either the page displays correctly or none of the characters display (they appear as question marks).
- Unicode: the operating system or browser may have fonts for some, but not all, of the codepoints, so some characters on a Unicode page may display correctly whilst others appear as empty rectangles.
Rendering software. Multilingual text rendering engines are built into operating systems and browsers. They are typically needed for 'complex scripts', such as those used to write Arabic, Hindi, Urdu and Persian, ie scripts whose characters change appearance based on their context. Examples include:
- Windows: Uniscribe
- Macintosh: Apple Type Services for Unicode Imaging (ATSUI), which replaced the WorldScript engine for legacy encodings.
- Pango (open source)
- Graphite (open source renderer from SIL)
What I don’t need to worry about
Page weight / download cost is not really an issue. A large proportion of a web page is HTML mark-up, whose characters remain 1 byte each in UTF-8, so the difference between a legacy encoding and UTF-8 is usually negligible. In addition, many legacy encodings, eg those for Chinese, are already double-byte.
Same page weight as for legacy encodings:
- HTML markup
Slightly larger page weight than for legacy encodings:
- Latin-script languages: characters outside the ASCII range (128 codepoints), eg, é (e acute), are represented by one byte in ISO-8859-1 but two bytes in UTF-8, so a small, but acceptable, increase in page size should be expected.
- Other scripts: Arabic and Cyrillic (eg Russian) characters use 2 bytes in UTF-8, and most Chinese characters use 3. Legacy encodings for Chinese already use double bytes for each character.
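The byte counts discussed above can be checked with a small Python sketch. The sample characters and the legacy encodings paired with them (ISO-8859-1, windows-1251, ISO-8859-6, GB2312) are choices made for this illustration only; your own pages may use different legacy encodings.

```python
# Each entry: label -> (sample text, a legacy encoding that can represent it).
# The pairings are assumptions made for this illustration.
samples = {
    "HTML markup": ("<p>", "ascii"),
    "e acute": ("é", "iso-8859-1"),
    "Russian": ("я", "cp1251"),
    "Arabic": ("ب", "iso-8859-6"),
    "Chinese": ("中", "gb2312"),
}

def byte_length(text, encoding):
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

# Map each label to (bytes in the legacy encoding, bytes in UTF-8).
sizes = {
    label: (byte_length(text, legacy), byte_length(text, "utf-8"))
    for label, (text, legacy) in samples.items()
}

for label, (legacy_size, utf8_size) in sizes.items():
    print(f"{label}: legacy={legacy_size} bytes, UTF-8={utf8_size} bytes")
```

Running the same comparison over a whole page, mark-up included, gives a more realistic picture than per-character counts, because the ASCII mark-up does not grow at all.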
Character encoding declaration. Ensure that you include a character encoding declaration, or change the existing one from the legacy encoding to Unicode (see Tutorial: Character sets & encodings in XHTML, HTML and CSS).
- HTTP header content-type, eg, Content-Type: text/html; charset=utf-8
- HTML head, eg, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
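After an upgrade it can be useful to confirm what a Content-Type value actually declares. The following Python sketch uses the standard library's email.message.Message class to parse the value; the helper name declared_charset is hypothetical, and how you obtain the header value from your server is left out.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the lower-cased charset parameter of a Content-Type value,
    or None if no charset is declared. (Hypothetical helper name.)"""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

print(declared_charset("text/html; charset=utf-8"))
print(declared_charset("text/html; charset=ISO-8859-1"))
print(declared_charset("text/html"))
```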
File encoding. Ensure that the file itself is saved in the encoding you declare. With a Unicode encoding, the source text should be readable and match the text shown on the web page, whereas with a legacy encoding the source text may not be readable, because different characters are used to stand in for the intended codepoints.
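If your editor cannot re-save a file as UTF-8, the conversion can be scripted. This is a minimal Python sketch, assuming you already know the file's legacy encoding; the function name convert_to_utf8 is hypothetical, and the sketch does not update any encoding declaration inside the file, which still has to be changed separately.

```python
from pathlib import Path

def convert_to_utf8(path, legacy_encoding):
    """Re-save the file at path as UTF-8, given its current legacy encoding.

    Working on bytes avoids any newline translation. The caller must
    supply the correct legacy encoding; this sketch does not detect it.
    """
    file_path = Path(path)
    text = file_path.read_bytes().decode(legacy_encoding)
    file_path.write_bytes(text.encode("utf-8"))
```

For example, convert_to_utf8("page.html", "iso-8859-1") would rewrite a hypothetical page.html in place, so keep a backup of the original while you are testing.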
Combining data. Ensure that any file fragments that are included in the web page, eg using technologies such as Apache SSI (server-side includes), are saved with the correct encoding. Because included fragments share the encoding of the parent page, their encoding must match that of the parent file, and the fragments must be upgraded to Unicode at the same time as the parent.
Forms. Server-side applications that process data returned from a form must be able to handle Unicode, and may need to be adapted before upgraded pages containing forms are published.
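One way to spot fragments that were missed during the upgrade is to test whether their bytes decode as UTF-8. The Python sketch below is a rough check only, and the function name is_valid_utf8 is hypothetical. A fragment containing nothing but ASCII will pass whichever encoding it was saved in, which is harmless because those bytes are identical in UTF-8.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8, else False."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

legacy_fragment = "Prénom".encode("iso-8859-1")   # not yet upgraded
upgraded_fragment = "Prénom".encode("utf-8")      # already upgraded

print(is_valid_utf8(legacy_fragment), is_valid_utf8(upgraded_fragment))
```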
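The effect of a mismatch on the server side can be seen with Python's standard urllib.parse.parse_qs function. The form field name and the submitted value below are made up for illustration; the point is only that the same percent-encoded bytes give different text depending on the encoding the application assumes.

```python
from urllib.parse import parse_qs

# A form on a UTF-8 page submits "René"; the é arrives as the two
# UTF-8 bytes C3 A9, percent-encoded.
form_body = "name=Ren%C3%A9"

# An application that decodes the data as UTF-8 recovers the name.
decoded_as_utf8 = parse_qs(form_body, encoding="utf-8")

# An application still assuming ISO-8859-1 turns each byte into a
# separate character and stores garbled text.
decoded_as_latin1 = parse_qs(form_body, encoding="iso-8859-1")

print(decoded_as_utf8["name"][0], decoded_as_latin1["name"][0])
```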
- FAQ: Who uses Unicode?
- Document Character Set for HTML and XML
- Settings to change to resolve display problems in Unicode
- Information about TrueType font
- Information about OpenType font
- Unicode fonts and specific scripts
- Complex scripts
- Unicode Consortium
- Tutorial: Character sets & encodings in XHTML, HTML and CSS
- RFC: UTF-8, a transformation format of ISO 10646
- Unicode & multilingual web browsers
- Unicode & HTML