i18n: HTML - base character set

Internationalization

This page is no longer maintained. For more up-to-date information, see the Internationalization Activity home page.

Base character set

The base character set, or document character set, of HTML 4.0 and XML is ISO/IEC 10646 (also known as Unicode, or the Universal Character Set, UCS). This does not mean that every HTML or XML document has to be encoded in Unicode; even for documents that are, there are several encodings to choose from, such as UTF-8 and UTF-16. It means that the logical model describing how HTML and XML are processed is defined in terms of the UCS. The most important consequence is that numeric character references (&#dddd; and &#xhhhh;) are interpreted as Unicode code points, whatever encoding the document itself uses.
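As a rough illustration of that consequence, the sketch below stores the same markup in two different encodings and expands its numeric character references after decoding. It uses Python's standard html.unescape, which follows HTML5 rules rather than HTML 4.0, so it is only an approximation of the model described here; the helper name decode_document and the sample text are assumptions made for this example.

```python
import html


def decode_document(raw: bytes, encoding: str) -> str:
    """Decode the bytes with the document's encoding, then expand references."""
    text = raw.decode(encoding)
    # Numeric references are resolved as Unicode code points,
    # independently of the encoding the bytes arrived in.
    return html.unescape(text)


# The markup itself is plain ASCII, so it can be stored in either encoding.
source = "caf&#233; costs 3 &#x20AC;"
latin1_bytes = source.encode("iso-8859-1")
utf16_bytes = source.encode("utf-16")

from_latin1 = decode_document(latin1_bytes, "iso-8859-1")
from_utf16 = decode_document(utf16_bytes, "utf-16")

# Both documents yield the same characters.
print(from_latin1 == from_utf16)

# The euro sign has no byte in ISO 8859-1, so in a document with that
# encoding a numeric character reference is the only way to write it.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```

The reference &#x20AC; resolves to U+20AC in both cases, even though ISO 8859-1 cannot represent that character directly.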

HTML 2.0 specified that all characters in an HTML document are to be interpreted relative to ISO 8859-1 (also known as ISO Latin 1), but also announced that all future versions of HTML would use a superset of it, namely Unicode (ISO 10646), which makes a vast number of the world's characters available.
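The superset relationship can be checked with a small sketch, assuming Python's "iso-8859-1" codec as a stand-in for the character set (the codec also maps the control-code positions, which the standard itself leaves to other specifications). The variable names are illustrative only.

```python
# Every possible byte value, decoded as ISO 8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# Each byte value maps to the Unicode code point with the same number.
code_points = [ord(ch) for ch in decoded]
same_numbering = code_points == list(range(256))
print(same_numbering)
```

Because the numbering coincides, a reference such as &#233; denotes the same character under the HTML 2.0 rule and under the Unicode-based rule.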

The discussions about the right way to use Unicode on the Internet (RFC 2130, April 1997, and RFC 2070, January 1997) had not yet concluded when HTML 3.2 came out (January 1997), so the inclusion of Unicode in HTML had to wait for HTML 4.0.

Martin Dürst, i18n coordinator
Last updated 2008/05/07 13:05:41