Converting2

Status: Initial draft, i.e. please focus on technical content rather than wordsmithing at this stage.

See the [I18n Core home page].

Author: Andrew Cunningham

Upgrading from language-specific legacy encoding to Unicode encoding

You have heard that using Unicode is a good idea and that it offers benefits such as standards compatibility, multilingual display on a single page, and organisation-wide applications.

Numerous large organizations have switched to Unicode. This document will point you to resources that will assist your migration to Unicode.

A migration to Unicode involves more than just converting your HTML or XHTML templates to an appropriate Unicode encoding. You will need to plan your migration. Internal data sources will need to be converted to Unicode, and external data sources may need to be transcoded before integration into your site. You also need to know the state of Unicode support in the software components you rely on.

Detailed information is available in the article on Unicode migration.
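
The transcoding step mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming the source encoding is already known; the function name `transcode_to_utf8` and the sample data are hypothetical, and a real migration would also need to handle files whose encoding is unknown or mixed.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # Decoding is strict by default: bytes that are invalid in the source
    # encoding raise UnicodeDecodeError instead of silently corrupting text.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample: the word "café" stored in ISO-8859-1 (Latin-1).
legacy = b"caf\xe9"
converted = transcode_to_utf8(legacy, "iso-8859-1")
print(converted)  # b'caf\xc3\xa9'
```

Failing loudly on invalid input is a deliberate choice here: during a migration it is usually better to find mislabelled data early than to store replacement characters.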


How well is Unicode supported for my end users?

This depends on:

  • browser support
  • suitable fonts
  • rendering software

Modern browsers support Unicode:

  • Internet Explorer
  • Firefox
  • Opera
  • Safari

Although many mobile phones support UTF-8, some do not. Those that rely on a legacy encoding may also differ from device to device in which encoding they use. If you are targeting a large mobile phone market, investigate the devices concerned.

What do I need to do?

Convert X/HTML, XML and CSS files to UTF-8

Unicode has three main encodings: UTF-8, UTF-16, UTF-32. UTF-8 is the Unicode encoding consistently used for web pages. It provides:

  • Better compatibility with legacy ASCII data, because the 128 ASCII characters have the same code points and the same single-byte representation in UTF-8.
  • No byte order problems.
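
Both properties can be checked directly. The following Python sketch is illustrative only, and the sample strings are arbitrary: it compares the ASCII and UTF-8 bytes of the same text, then shows how UTF-16 produces different byte sequences depending on byte order.

```python
text = "Unicode"

# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True

# UTF-16 has big-endian and little-endian serialisations, so the same
# character yields different byte sequences depending on byte order.
utf16_be = "A".encode("utf-16-be")
utf16_le = "A".encode("utf-16-le")
print(utf16_be)  # b'\x00A'
print(utf16_le)  # b'A\x00'
```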

It is necessary to:

  1. Convert HTML and XHTML templates and XML files to UTF-8, and review CSS and JavaScript files.
  2. Convert data to an appropriate Unicode encoding.
  3. Declare the encoding in HTML, XHTML, XML and CSS files.
  4. Ensure that your server declares the correct encoding; check the charset parameter of the Content-Type HTTP response header.
  5. Test your web site.
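
As a rough aid for the server check in the list above, the encoding a server declares can be read from the charset parameter of its Content-Type header. The sketch below uses only the Python standard library and assumes the header value has already been obtained as a string; the helper name `charset_from_content_type` is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    # email.message.Message parses MIME-style headers, which use the same
    # "type/subtype; parameter=value" syntax as the HTTP Content-Type header.
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

A result of None means the server sends no charset at all, so the browser falls back on in-document declarations or its own defaults.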

You will need to specify the encoding of your documents. There is a tutorial on Character sets & encodings in XHTML, HTML and CSS. Basic principles are:

  • For HTML and XHTML served as text/html: always use a <meta> element.
  • For XHTML served as text/html: where practical, use an XML declaration with an encoding attribute.
  • For XML files and XHTML served as XML: always use an XML declaration with an encoding attribute.
  • For CSS style sheets: use the @charset rule.
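
To make these declarations concrete, the Python sketch below looks for each kind in a file's raw bytes. The regular expressions are simplified assumptions for illustration and do not implement the full HTML, XML or CSS detection rules; the function name `declared_encoding` and the sample inputs are hypothetical.

```python
import re

# Simplified patterns for the three kinds of in-document declaration.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)
XML_DECL = re.compile(rb'^<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9_.:-]+)["\']')
CSS_CHARSET = re.compile(rb'^@charset\s+"([A-Za-z0-9_.:-]+)";')


def declared_encoding(data: bytes):
    """Return the lower-cased encoding name declared in the bytes, or None."""
    for pattern in (XML_DECL, META_CHARSET, CSS_CHARSET):
        match = pattern.search(data)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


print(declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?><root/>'))     # utf-8
print(declared_encoding(b'<html><head><meta charset="utf-8"></head></html>'))  # utf-8
print(declared_encoding(b'@charset "utf-8";\nbody { margin: 0 }'))             # utf-8
print(declared_encoding(b"<p>no declaration</p>"))                             # None
```

A check like this can be run over converted templates to find files that were transcoded but still lack a declaration, or still declare the old encoding.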

Convert scripts and database tables to Unicode

Unicode offers three encoding forms: UTF-8, UTF-16, and UTF-32. The software and databases you use may require specific Unicode encodings. Although UTF-8 is used as the encoding for web pages, UTF-16 is often used in the back-end.
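
The practical difference between the three encoding forms is how many bytes each character occupies. The following Python sketch, using arbitrary sample characters and a hypothetical helper named `encoded_sizes`, counts the bytes needed per character in each form.

```python
samples = [
    ("a", "Latin letter a, U+0061"),
    ("\u00e9", "e with acute, U+00E9"),
    ("\u4e2d", "CJK ideograph, U+4E2D"),
    ("\U0001F600", "emoji, U+1F600"),
]


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding form."""
    # The -le variants are used so that no byte order mark is added to the count.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch, label in samples:
    print(label, encoded_sizes(ch))
```

UTF-8 uses one to four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four, so column sizes and buffer lengths in back-end components may need reviewing when the encoding changes.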

Language tagging

Marking up the primary language of a document, and any change of language within it, is good internationalization practice. It can also be critical for culturally appropriate rendering of CJK text, because the same Unicode code point may call for different glyph shapes in Chinese, Japanese and Korean.
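
One way to audit existing templates for language tagging is to list every lang attribute they contain. The sketch below uses Python's standard html.parser module; the class name `LangCollector` and the sample markup are hypothetical, and the sketch only reports the lang attribute, not xml:lang.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect (tag, lang value) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        for name, value in attrs:
            if name == "lang":
                self.langs.append((tag, value))


page = (
    '<html lang="en"><body><p>The word '
    '<span lang="ja">\u65e5\u672c\u8a9e</span> means Japanese.</p></body></html>'
)
collector = LangCollector()
collector.feed(page)
print(collector.langs)  # [('html', 'en'), ('span', 'ja')]
```

An empty result for a page, or a result with no entry for the html element, suggests the primary language has not been declared.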