
Migrating to Unicode

Intended audience: Web developers, programmers, site administrators, and others wishing to migrate a Web site or Web-based content from a legacy (non-Unicode) character encoding to Unicode.

Updated 2008-04-10 10:19

This article provides guidelines for the migration of software and data to Unicode. It covers planning the migration, and design and implementation of Unicode-enabled software. A basic understanding of Unicode and the principles of character encoding is assumed. Some sources for information about these include:

Why migrate to Unicode?

There are a number of reasons for adopting Unicode:

Note that simply changing the character encoding of your pages to Unicode will not eliminate all character encoding problems. In fact, during the migration there is a significantly increased risk of such bugs, because existing data must be converted to Unicode, and the current encoding is not always known. This document provides tips on how to minimize this risk, and how to provide mechanisms to correct character conversion issues.

Planning your migration

To scope the migration to Unicode, you need to understand the use of character encodings in your current setup and decide on the internal and external use of character encodings for the Unicode-based design. You also need to know the state of Unicode support in software components you rely on, and where needed, the migration plans for these components. This enables you to plan the upgrade of your software to be based on Unicode, and the conversion of existing data to Unicode encodings.

A project to migrate to Unicode may also be a good time to improve internationalization in general. In particular, you should consider whether you can use the multilingual capabilities of Unicode to break down unnecessary barriers between different audiences, cultures, or languages. Especially for sites or applications that enable communication between users and thus host or transmit user-generated content, it may make sense to have a single worldwide site with shared multilingual content, despite having several localized user interfaces.

Understanding the current use of character encodings

As a starting point, you need to thoroughly understand how character encodings are used in your current software. Identify the components of the software and the data containers: front end, back end, storage, APIs, web interfaces, and so on, and clarify their use of encodings:

The last question may be surprising, but it is particularly important. Lack of correct information about the character encoding used for text that comes in from outside the site (such as content feeds or user input) or that is already in your data collections is a common problem, and needs particular attention. (Actually, you need to pay attention to such things even if you're not converting to Unicode.) There are a variety of ways this lack of correct information can come about:

To deal with such situations, character encoding detection is commonly used. Encoding detection attempts to determine the encoding used in a byte sequence based on characteristics of the byte sequence itself. In most cases it's a statistical process that needs long input byte sequences to work well, although you may be able to improve its accuracy by using other information available to your application. Because of the high error rate, it's often necessary to provide ways for humans to discover and correct errors. This requires keeping the original byte sequence available for later reconversion. Examples of encoding detection libraries include:
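
As a minimal sketch of the fallback idea (not a statistical detector; the function name and the windows-1252 default are illustrative assumptions): try a strict UTF-8 decode first, fall back to the site's historical legacy encoding, and keep the original bytes for later reconversion.

```python
def detect_and_decode(raw: bytes, legacy_default: str = "windows-1252"):
    """Return (text, encoding). Keep `raw` around so the text can be
    reconverted later if a human corrects the encoding guess."""
    try:
        # Cleanly decoding as UTF-8 is a strong signal: legacy-encoded
        # non-ASCII text rarely happens to be valid UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the site's historical default encoding.
        return raw.decode(legacy_default), legacy_default
```

Real detection libraries use statistical models over much longer inputs; this two-way fallback only works when the set of candidate encodings is known in advance.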

Checking the foundations

Software often depends on other software for its implementation:

You need to check whether the software you depend on supports Unicode, or at least doesn't put obstacles in the way of adopting it. It will commonly be necessary to upgrade to newer versions of underlying platforms, and in a few cases to migrate from obsolete platforms to newer ones.

Deciding on character encoding use for internal use

Unicode offers three encoding forms: UTF-8, UTF-16, and UTF-32. For transportation over the network or for storage in files UTF-8 usually works the best because it is ASCII-compatible, while the ASCII-look-alike bytes contained in UTF-16 and UTF-32 text are a problem for some network devices or file processing tools. For in-memory processing, all three encoding forms can be useful, and the best choice often depends on the programming platforms and libraries you use: Java, JavaScript, ICU, and most Windows APIs are based on UTF-16, while Unix systems tend to prefer UTF-8. Storage size is rarely a factor in deciding between UTF-8 and UTF-16 because either one can have a better size profile, depending on the mix of markup and European or Asian languages. UTF-32 is inefficient for storage and therefore rarely used for that purpose, but it is very convenient for processing, and some libraries, such as Java and ICU, provide string accessors and processing API in terms of UTF-32 code points. Conversion between the three encoding forms is fast and safe, so it's quite feasible and common to use different encoding forms in different components of large software systems.
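
The size trade-offs can be seen directly by encoding a mixed-script sample in all three forms (a Python illustration; byte counts shown in the comments):

```python
# One character each from ASCII, Latin-1, Greek, Han, and a
# supplementary-plane character (the musical G clef, U+1D11E).
sample = "AéΩ直𝄞"

utf8_len = len(sample.encode("utf-8"))       # 1 + 2 + 2 + 3 + 4 = 12 bytes
utf16_len = len(sample.encode("utf-16-le"))  # 2 + 2 + 2 + 2 + 4 = 12 bytes
utf32_len = len(sample.encode("utf-32-le"))  # 4 bytes per code point = 20 bytes
```

Note how UTF-8 wins for ASCII-heavy text, UTF-16 for Asian-script text, and UTF-32 is uniformly the largest.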

Storage of text whose character encoding is not known with certainty is an exception to the Unicode-only rule. Such text often has to be interpreted using character encoding detection, which is not a reliable process. You should therefore keep the original bytes around (along with the detected character encoding) so that the text can be reconverted if a human corrects the encoding selection.

Deciding on character encoding use for external interfaces

For communicating with the world outside your application, UTF-8 should be used wherever possible. However, there are situations where you can't control the encoding, or need to communicate with systems that don't support UTF-8. Here are recommendations for common cases:

Generally, incoming text should be converted to a Unicode encoding as soon as possible, and outgoing text, if it has to be sent in a non-Unicode encoding, converted from Unicode to that other encoding as late as possible. However, if the encoding of incoming text cannot be determined with certainty, then the original text must be stored along with information about the likely encoding. This enables corrective action if it turns out that the encoding was wrong.

Creating a road map

For very simple sites or applications it may be possible to change the entire software to be based on Unicode, convert all data to a Unicode encoding, and switch over from the pre-Unicode version to the Unicode version in one instant. But many sites or applications offer external interfaces, have large bodies of code, and have accumulated huge data sets, so their conversion is a big project with multiple dependencies that needs to be carefully planned. Here's a breakdown into likely sub-projects:

Some of these sub-projects can be executed in parallel or in a different order, depending on the specific situation of your product. For example, migration of the implementation of your product may be held up by dependencies on other software components that haven't sufficiently progressed in their migration yet. On the other hand, SQL databases can be migrated to Unicode much earlier because the client component of the database software insulates clients from the encoding used in the database and performs character encoding conversion when necessary. Migrating databases early on has benefits: it simplifies testing, because the database can be tested independently of the software using it (while testing higher-level software typically requires a database), and it may allow you to merge multiple separate databases using legacy encodings into a single multilingual database.

Designing for Unicode

Character encoding specifications

Byte sequences can only be correctly interpreted as text if the character encoding is known. Many applications are written such that they just move around byte sequences, without naming the character encoding. As discussed above, this has always caused problems. But it happened to work in many cases in which users all speak the same language or are willing to adapt to some content being incorrectly rendered on the page. During the transition to Unicode, however, each language will be handled in at least two encodings, the legacy encoding for that language and UTF-8, so specifying the encoding for each byte sequence will be critical in order to avoid an avalanche of data corruption bugs.

Character encodings can be specified in a variety of ways:

Character Encoding Names

There is a standard for naming character encodings on the internet, RFC 2978, and an associated IANA charset registry. However, actual use is often different. Many encodings come in different variants or have siblings supporting extended character sets, and different software often uses different names for the same encoding or the same name for different encodings. For example, the name ISO-8859-1 is often used to describe data that actually uses the encoding windows-1252. This latter encoding (Microsoft Windows code page 1252) is very similar to ISO 8859-1 but assigns graphic characters to the range of bytes between 0x80 and 0x9F. Many Web applications (such as browsers, search engines, etc.) treat content bearing the ISO 8859-1 label as using the windows-1252 encoding instead, since, for all practical purposes, windows-1252 is a "superset" of ISO 8859-1. Other applications, such as encoding converters (like iconv or ICU) are pretty literal, and you must specify the right encoding name in order to get the proper results.
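
The practical difference between the two labels is easy to demonstrate: bytes in the 0x80-0x9F range are graphic characters under windows-1252 but C1 control characters under a literal reading of ISO 8859-1.

```python
raw = b"\x93quote\x94 \x80"  # bytes from the disputed 0x80-0x9F range

# windows-1252 assigns curly quotes and the euro sign to these bytes...
assert raw.decode("windows-1252") == "\u201cquote\u201d \u20ac"
# ...while strict ISO 8859-1 maps the same bytes to C1 control characters.
assert raw.decode("iso-8859-1") == "\x93quote\x94 \x80"
```

This is why literal converters such as iconv must be given the name windows-1252 (or cp1252) for such data, even when the content was labeled ISO-8859-1.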

Character Encoding Determination

Whenever a byte sequence is interpreted as text and processed, its character encoding must be known. In many cases determining the character encoding is so trivial that it's not even thought about - for example, when processing a string in a programming language that specifies that strings are encoded in UTF-16. However, in other cases, no clear specification of the character encoding is available, or the text comes from a source that may not be fully trusted to provide a correct specification. In such cases, a more complicated process is necessary to determine the character encoding and to enable later correction of mistakes made:

Character Encoding Selection and Declaration

When sending text, the appropriate character encoding needs to be selected based on the data format and the recipient. The section Deciding on Character Encoding Use for External Interfaces discusses encoding use based on data formats. In most cases, a Unicode encoding is recommended. However, there are two major exceptions:

No matter which encoding is used, the character encoding really must be unambiguously specified using one of the mechanisms described in the Character Encoding Specifications section.

Character Encoding Conversion

Whenever text is expected to be in one character encoding in one place and in a different character encoding in the next, encoding conversion is necessary. Commonly used libraries for character encoding conversion include ICU and iconv; some platforms, such as Java and Perl, provide their own conversion libraries.

When using the libraries, it is important to use the right encoding names for the specific library. See the section Character Encoding Names above for more information.

There are some specific conversion issues that may affect particular products:

Normalization

Some characters have more than one way of being represented in Unicode. Unicode defines several ways of eliminating these differences when they do not matter to text processing. For more information on Normalization, see CharMod-Norm.

Unicode does not prescribe when to use a specific Unicode normalization form. However, a number of processes work better if text is normalized, in particular processes involving text comparison such as collation, search, and regular expression processing. Some libraries performing these processes offer normalization as part of the process; otherwise, you should ensure that text is normalized before using these processes. Generally, normalization form C (NFC) is recommended for web applications. However, some processes, such as internationalized domain names, use other normalization forms.
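
The effect is easy to see with a character that has both a composed and a decomposed representation (a Python illustration using the standard library's unicodedata module):

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point (NFC form)
decomposed = "e\u0301"   # 'e' followed by a combining acute accent (NFD form)

# The two render identically but compare unequal as code point sequences...
assert composed != decomposed
# ...until both are normalized to the same form (NFC is recommended for the web).
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```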

Some languages require normalization before processing, since different input methods may generate different sequences of Unicode code points. Vietnamese is a prime example: the Vietnamese keyboard layout in Windows 2000 and later produces a different character sequence than most third-party Vietnamese input software does. Similar issues arise in a number of African languages, Yoruba being a well-known example.

Text Size Issues

Storing text as Unicode often takes more space than storing it in legacy encodings. The exact amount of expansion depends on the language and particular text involved. Expansions for some common encodings might be as much as:

Source Encoding           Languages            UTF-8    UTF-16

ASCII                     English, Malay, ...  0%       100%
ISO-8859-1                Western European     10%      100%
ISO-8859-7, plain text    Greek                90%      100%
ISO-8859-7, 50% markup    Greek                45%      100%
TIS-620, plain text       Thai                 190%     100%
TIS-620, 50% markup       Thai                 95%      100%
EUC-KR, plain text        Korean               50%      0%
EUC-KR, 50% markup        Korean               25%      50%

At a macro level, this doesn't really matter much. Network bandwidth and storage nowadays are dominated by videos, images, and sound files, while text only consumes a fraction. There may be an impact on storage systems that store only text. If text size is really a concern, it can be reduced using compression.

At the micro level, however, the increased storage size has a number of implications:

Using libraries

To work with Unicode, it is often advantageous to use software libraries that are dedicated to Unicode support. Older libraries may support Unicode less well, or not at all.

Language determination and declaration

While Unicode enables multilingual applications and documents, many processes require knowledge of the actual language in use. Such processes range from simple case folding to searching and spell checking.

Unicode-based APIs should therefore enable the specification of the language(s) used wherever such knowledge may be needed, and the language of user-generated content should be recorded where possible. Where the language cannot be captured at the source, a language detection library may be useful.

To help other applications, the language of web content, where known, should be declared using the HTTP Content-Language header or the HTML/XML lang attributes.

Font issues

Web sites using Unicode need to be more careful about specifying fonts than web sites using legacy encodings. Many languages have unique or specific writing traditions, even though they share a script with other languages. In other cases, font support can be a barrier because the fonts necessary to display specific scripts are not installed on most systems.

For example, the Chinese and Japanese writing systems share a large number of characters but have different typographic traditions, so Chinese fonts are generally not acceptable for Japanese text and vice versa. Here is the same character displayed using a Chinese and a Japanese font (along with the HTML code used to generate the screen capture):

Picture of the same Unicode ideographic character, but with two different glyph representations.

<span style="font-size:3em;font-family:sans-serif;">
<span lang="zh-Hans-CN" style="font-family: simsun, hei, sans-serif;">直</span>
<span lang="ja" style="font-family: 'ms gothic', osaka;">直</span>
</span>

When legacy encodings are used, browsers often guess the language from the encoding and pick an appropriate font.

Since Unicode supports both Chinese and Japanese, this trick doesn't work for Unicode-encoded pages, and the result can be an inappropriate font or even an ugly mix of fonts being used to render the content.

One solution is to keep track of the language in use, and communicate both the language and the preferred fonts for that language to the browser. For monolingual pages, a language-specific style sheet is a simple and effective approach. For multilingual pages, use the lang attribute on HTML tags to identify the language; a few browsers use this information as guidance in selecting the right font. For precise control over the font, you can also use classes to identify the language and class selectors in the style sheet to set the font. The CSS 2.1 language pseudo-class selector, which selects directly based on language attributes, isn't supported by Internet Explorer and so is of limited usefulness. See Test results: Language-dependent styling.

Migrating Data

Converting the data associated with a product will in many cases be the biggest challenge in migrating the product to Unicode. For example, some applications own or access a number of databases, some of which are managed by database engines such as Oracle or MySQL. Others use custom file formats and access mechanisms. These databases, regardless of type, need to be migrated to support Unicode.

Migration of the data to Unicode is also a good time to consider consolidating databases that were previously separate because of different character encodings. Using a single database worldwide, or just a few for the main regions, may simplify deployment and maintenance and may enable content sharing between different markets; Unicode is ideal for this because it can represent text in all languages. When consolidating, however, keep in mind that other restrictions on content sharing may remain, such as language availability, licensing conditions, and legal or cultural restrictions on publishing material related to politics, religion, sex, and violence.

Strategies for conversion of the data will vary based on a number of factors:

Because of variations in these factors, there's no simple recipe that can be followed in converting the databases of a product. The following is a discussion of common considerations; however, it will generally be necessary to create a tailored conversion plan for each product. Such a plan will likely have several phases for analysis, conversion with checking of the conversion results, and recovery if something goes wrong.

Dealing with text size issues

As mentioned in Text Size Issues (above), converting text to Unicode generally results in expanded storage requirements, and you need to carefully consider whether to measure text lengths in bytes, characters, or UTF-16 code units. To reduce the impact of increased field sizes, it may make sense to switch CHAR fields in SQL databases to VARCHAR, thus allowing the database to allocate just as much space as needed.

On text measurement, some databases don't give you a choice. For example, MySQL always measures in terms of Unicode BMP characters, resulting in 3 bytes per character. Others, such as Oracle, let you choose between character or byte semantics. Other storage systems that impose size limits are likely to measure in bytes.
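
The three units of measurement genuinely differ for the same string, as this Python sketch shows (Python's len counts code points; the other counts are derived from the encoded forms):

```python
s = "I\u2665\U0001d11e"  # ASCII 'I', BMP '♥', supplementary '𝄞'

code_points = len(s)                           # 3 code points
utf16_units = len(s.encode("utf-16-le")) // 2  # 4 units: '𝄞' is a surrogate pair
utf8_bytes = len(s.encode("utf-8"))            # 1 + 3 + 4 = 8 bytes
```

A length limit of 3 would thus pass, truncate, or reject this string depending on which unit the storage system uses.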

During the migration, which involves encoding conversion, be careful to avoid truncation. In some cases, unfortunately, you may not be able to do so because of external constraints, such as Oracle's limit of 30 bytes for schema object names in data dictionaries (use of ASCII characters for schema names helps avoid this issue). In such cases, at least make sure to truncate at a character boundary.
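
Truncating at a character boundary is straightforward for UTF-8, because continuation bytes are self-identifying (their top two bits are 10). A sketch, assuming the input is valid UTF-8 and the limit is in bytes:

```python
def truncate_utf8(raw: bytes, limit: int) -> bytes:
    """Truncate UTF-8 bytes to at most `limit` bytes, never splitting a character."""
    if len(raw) <= limit:
        return raw
    cut = raw[:limit]
    # Back up over the continuation bytes (0b10xxxxxx) of a split character.
    while cut and (cut[-1] & 0xC0) == 0x80:
        cut = cut[:-1]
    # If we are now sitting on the lead byte of that split character, drop it too.
    if cut and cut[-1] >= 0xC0:
        cut = cut[:-1]
    return cut
```

The result always decodes cleanly, at the cost of possibly being a few bytes shorter than the limit.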

Note also that translation itself can expand text; see Text size in translation.

Identifying ASCII data

It is worthwhile identifying data sets (files, database tables, database columns) that are entirely in ASCII. If the desired Unicode encoding is UTF-8, no conversion is necessary for such data sets, because ASCII byte sequences are identical to the corresponding UTF-8 byte sequences. In addition, indices over ASCII text fields remain valid for the corresponding UTF-8 or UTF-16 text fields, unless they are based on language-sensitive sort orders. However, you have to be strict in identifying ASCII data sets. The term "ASCII" is often mistakenly used for things that aren't ASCII, such as plain text (in any encoding) or text in the ISO 8859-1 or Windows-1252 encodings. Also, a number of character encodings have been designed to fit into 7-bit byte sequences while representing completely different character sets from ASCII.

To verify that the data set is indeed in ASCII, check for the following:
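
Whatever the specific checks, the core test (every byte below 0x80, and no ISO-2022-style escape sequences, which also fit within 7 bits) can be sketched as follows; the function name is illustrative:

```python
def looks_like_ascii(raw: bytes) -> bool:
    """True if every byte is 7-bit AND there are no ESC bytes that could
    introduce an ISO-2022-style shift to a non-ASCII character set."""
    return raw.isascii() and b"\x1b" not in raw
```

Note that a pure "all bytes < 0x80" check alone would wrongly accept ISO-2022 data, which is why the escape-byte test matters.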

Dealing with uncertainty

As mentioned earlier, it sometimes occurs that databases contain text whose encoding isn't known. Character encoding detection can be used to get an idea of the encoding, but this process is not reliable. To deal with the uncertainty, a number of additional steps may be necessary:

For simplicity, the following sections assume that the encoding can be determined with certainty and conversion therefore is a one-time event. Where this is not the case, strategies need to be adjusted.

Making sense of Numeric Character References

Databases holding user-generated content often contain numeric character references (NCRs) for non-ASCII characters that users have entered, such as "&#x0152;" (Œ) or "&#x20AC;" (€). Many browsers generate the NCRs when users enter text into form fields that cannot be expressed in the form's character encoding. NCRs work fine if the text is subsequently redisplayed in HTML. They do not work, however, for other processes because they don't match the text they represent in searching, they get sorted in the wrong place, or they're ignored by case conversion. Migration to Unicode is therefore also a good time to convert NCRs to the corresponding Unicode characters. You'll need to be careful however to avoid conversions that change the meaning of the text (as might a conversion of "&amp;" to "&") or conversion to text that would have been filtered out for security reasons.
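
A minimal NCR expansion can be sketched as follows; it deliberately matches only numeric references, leaving named entities like "&amp;" untouched (a production version should also validate the code point and re-apply any security filtering):

```python
import re

_NCR = re.compile(r"&#(x[0-9a-fA-F]+|[0-9]+);")

def expand_ncrs(text: str) -> str:
    """Replace numeric character references only; named entities are left alone."""
    def repl(match):
        num = match.group(1)
        # Hexadecimal form "&#xNNNN;" vs decimal form "&#NNNN;".
        cp = int(num[1:], 16) if num[0] in "xX" else int(num)
        return chr(cp)
    return _NCR.sub(repl, text)
```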

Using the BOM

During the migration from legacy encodings to Unicode, it's common to use legacy encodings and Unicode in parallel, and you need to be able to distinguish between them. In the general case, this requires character encoding specifications. However, if you need to distinguish between just one specific legacy encoding (such as the site's old default encoding) and UTF-8, you might use the Unicode byte order mark (BOM) as a prefix to identify UTF-8 strings. This is particularly handy if there is no provision for a character encoding specification, for example in plain text files or in cookies. The BOM in UTF-8 is the byte sequence 0xEF 0xBB 0xBF, which is very unlikely to be meaningful in any legacy encoding.

A reader for data that identifies its encoding in this way reads the first three bytes to determine the encoding. If the bytes match the BOM, the three bytes are stripped off and the remaining content returned as UTF-8. If they don't, the entire content is converted from the legacy encoding to UTF-8. This stripping, however, isn't automatic and it interferes with some platforms or languages. For example, a PHP file that starts with a BOM won't be interpreted properly by the PHP processor. So this trick is best confined to well-known parts of your site or code.
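
Such a reader can be sketched in a few lines (the function name and the windows-1252 default are illustrative assumptions):

```python
BOM = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF

def read_tagged(raw: bytes, legacy_encoding: str = "windows-1252") -> str:
    """A BOM prefix marks already-converted UTF-8; anything else is legacy."""
    if raw.startswith(BOM):
        return raw[len(BOM):].decode("utf-8")
    return raw.decode(legacy_encoding)
```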

Converting plain text files

Plain text files that use a single character encoding are easy to convert. For example, the iconv tool is available on most Unix/Linux systems. On systems that don't have it, a convenient approach is to install a Java Development Kit and use its native2ascii tool:

native2ascii -encoding sourceencoding sourcefile | native2ascii -reverse -encoding targetencoding > targetfile

For small numbers of files, editors can also be used: TextPad on Windows, TextEdit on Mac, or jEdit on any platform are just a few editors that can convert files. Note that some editors, such as Notepad, like to prefix Unicode files with a Unicode byte-order mark (BOM), which in the case of UTF-8 files is unnecessary and may cause problems with software reading the files.
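
The same conversion is also easy to script; for example, a Python sketch (function and parameter names are illustrative) that streams line by line so large files aren't loaded into memory, and uses strict error handling so undecodable bytes surface as errors instead of being silently corrupted:

```python
def convert_file(src, dst, source_encoding, target_encoding="utf-8"):
    """Re-encode a plain text file that uses a single known character encoding."""
    with open(src, "r", encoding=source_encoding) as f_in, \
         open(dst, "w", encoding=target_encoding, newline="") as f_out:
        for line in f_in:
            f_out.write(line)
```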

Converting structured files

Structured files in this context means any files, other than SQL databases, that have components that might have different encodings or that have length limitations. Examples are: log files, where different entries may use different encodings; email messages, where different headers and MIME body components may use different encodings, and the headers have length limitations; and cookies, which are often treated as having multiple fields. For such files, every component has to be converted separately, and length limitations have to be dealt with separately for each component.

Converting SQL databases

An SQL database really consists of two components: a server component, which actually manages the data, and a client component, which interfaces with other software (such as a PHP or Java runtime) and communicates with the server component. The character encoding that the client uses to communicate with the server can be set separately from the character encodings used by the server; the server will convert if necessary.

Depending on the size of a database and its uptime requirements, various strategies for conversion are possible:

The SQL language and documentation have the unfortunate habit of using the term "character set" for character encodings, ignoring the fact that UTF-8 and UTF-16 (and even GB18030) are different encodings of the same character set.

Oracle specifics

Oracle has general Unicode support starting with version 8, but support for supplementary characters is only available starting with version 9r2, and support for Unicode 3.2 only starting with version 10. Also, the use of the NCHAR and NVARCHAR data types before version 9 is somewhat difficult. Oracle provides comprehensive Globalization Support Guides for versions 9r1, 9r2, 10r1, and 10r2. The chapters on Character Set Migration and Character Set Scanner are particularly relevant.

The character encoding selected for an Oracle database is set for the entire database, including data, schema, and queries, with one exception: the NCHAR and NVARCHAR data types always use Unicode. Different Unicode encodings are offered for the database as a whole and for the NCHAR and NVARCHAR data types. For the database, there is correct UTF-8 under the name AL32UTF8, and a variant UTF8 that encodes supplementary characters as two 3-byte sequences. For databases migrating to Unicode, you should use AL32UTF8 (databases that already use UTF8 can in most cases continue to do so: the difference between these encodings may affect collation and indexing within the database, but in general it doesn't matter much, since the client interface converts UTF8 to correct UTF-8). For the NCHAR and NVARCHAR data types, UTF-16 is available under the name AL16UTF16, along with the variant UTF8 encoding. The semantics of length specifications for the CHAR, VARCHAR2, and LONG data types can be set using the NLS_LENGTH_SEMANTICS parameter, with byte semantics as the default, while the NCHAR and NVARCHAR data types always use character semantics.

For correct conversion between the encoding(s) used within the database and the client's encoding, it is essential to define the NLS_LANG environment variable on the client side. This variable describes the language, territory, and encoding used by the client OS. Oracle has numerous other settings to specify locale-sensitive behavior in the database; these can generally be set separately from the encoding, as long as the encoding can represent the characters of the selected locale. Unicode supports all locales.

Oracle provides built-in support for several conversion strategies. The Character Set Scanner tool helps in identifying possible conversion and truncation problems in the pre-conversion analysis. The Export and Import utilities help in implementing a dump and reload strategy. Adding Unicode columns is easy because the NCHAR and NVARCHAR data types support Unicode independent of the database encoding. Converting in place with an encoding tag is possible if the database itself doesn't interpret the text; the ALTER DATABASE CHARACTER SET statement can be used to inform the database of the actual encoding once conversion has completed.

There are reports that the NCHAR data types are not supported in the PHP Oracle Call Interface.

MySQL specifics

To get Unicode support in MySQL databases, you'll need to use MySQL 4.1 or higher. For information on upgrading to this version and on possible compatibility issues, see Upgrading Character Sets from MySQL 4.0. For detailed information on character encoding support in MySQL, see the Character Set Support chapter of the MySQL documentation. The character encoding for the database content can be set separately at the server, database, table, or column level. Where the encoding isn't set explicitly, it's inherited from the next higher level.

MySQL's default encoding is latin1, which MySQL actually treats as the windows-1252 (cp1252) repertoire rather than strict ISO 8859-1. The supported Unicode encodings are called utf8 and ucs2; utf8 is usually the recommended choice. Both utf8 and ucs2 are limited to the characters of the Unicode Basic Multilingual Plane (BMP), so there is no support for supplementary characters in MySQL. As a result, utf8 isn't a fully compliant implementation of UTF-8 (although for most purposes it is fine). The NCHAR and NVARCHAR data types always use utf8.

Length specifications for character data types are interpreted to be in Unicode BMP characters, so a specification CHAR(5) CHARACTER SET utf8 will reserve 15 bytes. Metadata, such as user names, is always stored in UTF-8, so non-Latin names can be used. The character encoding for the client connection can be set separately for client, connection, and results, but to avoid confusion, it's best to set them all together using SET NAMES 'utf8'. The ucs2 encoding is not supported for the client connection, so there's no good reason to use this encoding for the database content either.

Collations are related to character encodings, so they should always be set at the same time as the encoding. If utf8 is used without specifying a collation, the default collation utf8_general_ci is used. This is a legacy collation algorithm that's not good for any particular language. The collation utf8_unicode_ci is a better default, since it implements the Unicode Collation Algorithm (UCA) and works for many languages that are not specifically supported by a named collation. You can also select one of the language-named UTF-8 collations to get language-specific collation "tailoring" based on UCA. See the list of collations for Unicode Character Sets. MySQL supports the CONVERT function, which allows the results of a query to be converted from one encoding to another. MySQL also supports in-place conversion from one encoding to another using the ALTER statement: ALTER TABLE table CONVERT TO CHARACTER SET utf8 COLLATE collation;.

In some cases, the encoding of a column may be incorrectly declared in the schema - for example, UTF-8 data may have been stored in a MySQL database under the encoding name latin1 before MySQL really supported UTF-8, or Japanese data may have been labeled sjis when it was actually using the Windows version of Shift-JIS, which MySQL calls cp932 (see The cp932 Character Set for more information on this case). In such cases, a column can be relabeled without conversion by changing its type to the binary equivalent (BINARY, VARBINARY, BLOB), then back to characters (CHAR, VARCHAR, TEXT) with the correct encoding name, e.g., for a TEXT column: ALTER TABLE table CHANGE column column BLOB; ALTER TABLE table CHANGE column column TEXT CHARACTER SET utf8 COLLATE collation;. You can and should change all columns of one table together in order to minimize the overhead of rebuilding the table.

Note: The PHP client for MySQL by default specifies latin1 as the connection encoding for each new connection, so it is necessary to insert a statement SET NAMES 'utf8' for each new connection.

Converting file names

Several server operating systems (for example, FreeBSD, Red Hat) store file names as simple byte sequences whose interpretation is up to higher level processes. Server processes may interpret the byte sequences according to the character encoding of the locale they run in, or just pass them on to client processes. The actual encoding must therefore be determined by evaluating how the name was created, which might be through a web page in the default encoding for the particular site or user. If that's not conclusive, character encoding detection may also be used.

If the encoding of a file name can be determined with certainty, it can be converted to UTF-8, and a Byte Order Mark can be used to mark it as converted. If the encoding is uncertain, it may be necessary to create a database parallel to the file system to record the detected encoding and possibly the UTF-8 version, so that the original file name can be kept around for later correction of the encoding.
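At the byte level, such a BOM-marked conversion can be sketched as follows (the function name is illustrative; this operates on the raw name bytes, not on the file system itself):

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark, used here as a "converted" tag

def convert_name(name_bytes: bytes, detected_encoding: str) -> bytes:
    """Re-encode a legacy file name to UTF-8, BOM-prefixed to mark it as converted."""
    if name_bytes.startswith(BOM):
        return name_bytes  # already converted; leave untouched
    return BOM + name_bytes.decode(detected_encoding).encode("utf-8")
```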

Testing with Unicode

Testing Unicode support with ASCII text is useless. Make sure that you test handling of user data with text in a variety of the languages you will support:

Testing with these languages will require your computer to be configured to support them. Learn To Type Japanese and Other Languages has information on how to do this for all common operating systems.

To show that text in the user interface of your application is handled correctly, pseudo-localization is a useful testing strategy. Pseudo-translation tools can automatically replace ASCII characters in user interface messages with equivalent fullwidth Latin characters from the Unicode range U+FF01 to U+FF5E (English -> Ｅｎｇｌｉｓｈ) or with variant letters with diacritic marks from the complete Latin range (English -> Ëñgłíšh).
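
The fullwidth variant is trivial to implement, because the fullwidth forms sit at a fixed offset (0xFEE0) from printable ASCII; a sketch:

```python
def pseudo_localize(s: str) -> str:
    """Map printable ASCII (0x21-0x7E) to fullwidth forms (U+FF01-U+FF5E)."""
    return "".join(
        chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c
        for c in s
    )
```

Because the output is still trivially readable, testers can spot any message that comes back in plain ASCII (not passed through the message pipeline) or as mojibake (mis-converted somewhere along the way).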

Issues that need particular attention when testing Unicode support include:

Some guidelines for testing Unicode support and internationalization are available at: International Testing Basics


Further reading

By: Norbert Lindenberg, Yahoo!; Editor: Addison Phillips, Yahoo!.


Content first published 2008-04-11 14:23. Last substantive update 2008-04-10 10:19 GMT. This version 2011-05-03 19:47 GMT

For the history of document changes, search for article-unicode-migration in the i18n blog.