i18n/l10n: HTML - base character set

Internationalization / Localization

This page is no longer maintained and may be inaccurate. For more up-to-date information, see the Internationalization Activity home page.

Unicode / ISO/IEC 10646

Unicode and ISO/IEC 10646 jointly define the Universal Character Set (UCS). The UCS is a coded character set that assigns a unique number to each of (currently) about 50,000 of the world's characters. Its repertoire is a superset of all widely used standard character repertoires, including those of ASCII, ISO-8859-1 (Latin-1), and ISO-2022-JP. All W3C specifications have used Unicode since late 1996.
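As a rough illustration of what "assigns unique numbers" means in practice, the following Python sketch prints the UCS number of a few characters and checks that ASCII and ISO-8859-1 byte values coincide with the corresponding UCS numbers. The sample characters, the descriptions attached to them, and the helper name code_point_label are assumptions made for this example; they do not come from any specification.

```python
# Illustrative sketch: every character has one UCS number (an integer).
# The sample characters are written as escapes so this file stays pure ASCII.
samples = {
    "A": "LATIN CAPITAL LETTER A (in ASCII)",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE (in Latin-1)",
    "\u65e5": "CJK ideograph (in the ISO-2022-JP repertoire)",
}


def code_point_label(ch: str) -> str:
    """Format one character's UCS number in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


labels = [code_point_label(ch) for ch in samples]

# ASCII and ISO-8859-1 are subsets of the UCS: each byte value decodes to
# the UCS character that carries the same number.
ascii_is_subset = all(ord(bytes([b]).decode("ascii")) == b for b in range(128))
latin1_is_subset = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))

for ch, label in zip(samples, labels):
    print(label, samples[ch])
```

The subset checks rely only on Python's built-in "ascii" and "latin-1" codecs. A repertoire such as ISO-2022-JP is also covered by the UCS, but its byte values do not map onto UCS numbers one to one, so it is left out of the check.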

The IETF recommends in RFC 2277 that all (new) Internet protocols and formats that deal with text use the UCS, and in particular its UTF-8 encoding (more precisely, a Character Encoding Scheme).
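To make the role of UTF-8 concrete, here is a minimal Python sketch that serializes a short string to UTF-8 bytes and back. The sample string and the variable names are chosen for illustration only; the sketch shows the general behaviour of the encoding, not anything mandated by RFC 2277 itself.

```python
# Illustrative sketch: UTF-8 turns a sequence of UCS characters into bytes.
text = "D\u00fcrst"  # "D", U+00FC (u with diaeresis), "r", "s", "t"

utf8_bytes = text.encode("utf-8")

# ASCII characters keep their single-byte values; U+00FC needs two bytes.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

# Decoding the bytes as UTF-8 recovers the original characters.
round_trip = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(" "))
print(bytes_per_char)
```

Because ASCII characters are encoded as the same single bytes they have in ASCII, existing ASCII-only text is already valid UTF-8.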

Unicode and ISO/IEC 10646 are identical code point by code point and are developed in close synchronization. The difference between the two is that Unicode adds some rules about how the characters are to be used.

The Unicode Standard is defined by the Unicode Consortium and is available as a book: The Unicode Standard, Version 3.0, 2000, ISBN 0-201-61633-5. Unicode conferences are held every six months, and W3C is a regular sponsor. ISO/IEC 10646 is developed by ISO/IEC JTC1/SC2/WG2; ISO/IEC 10646-1:2000 is the newest version.

Martin Dürst, i18n coordinator
Last updated $Date: 2008/05/07 17:53:19 $