
Internationalization

Document character set

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, XSLT developers, Web project managers, and anyone who wants to understand what the Document Character Set is, and how that relates to encodings used in a document or page.

Question

What is the 'Document Character Set' for XML and HTML, and how does it relate to the encodings I use for my documents?

Answer

Note: For simplicity, and in line with common practice, when we refer to Unicode in this FAQ we are referring to the character set defined by both Unicode and ISO/IEC 10646.

The document character set, or base character set, of XML and HTML (from version 4.0) is the Universal Character Set (UCS) defined by both the ISO/IEC 10646 and Unicode standards, which are code-for-code identical.

This means that the logical model of how XML and HTML are processed is described in terms of the Unicode character set.

It does not mean that all HTML and XML documents have to be encoded as Unicode, but it does mean that these documents can only contain characters defined by Unicode. Note that character sets and character encodings are different things: the full Unicode repertoire, for example, can be encoded in more than one way, such as UTF-8, UTF-16 and UTF-32. Any character encoding can be used for your document as long as it is properly declared and the characters it represents are a subset of the Unicode repertoire. (It would be extremely unusual to find one that wasn't a subset.)
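To illustrate the difference between a character set and a character encoding, here is a minimal Python sketch. It takes one Unicode character and encodes it three ways; the code point stays the same while the bytes differ. The names `CHAR`, `encode_all` and `latin1_ok` are made up for this example.

```python
# Illustrative sketch: one Unicode character, several encodings.
CHAR = "\u01F5"  # LATIN SMALL LETTER G WITH ACUTE

def encode_all(ch):
    """Return the bytes for ch in three Unicode encodings, as hex strings."""
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: ch.encode(name).hex() for name in names}

encoded = encode_all(CHAR)
for name, hex_bytes in encoded.items():
    # Same code point every time; only the byte representation changes.
    print(f"U+{ord(CHAR):04X} in {name}: {hex_bytes}")

# ISO 8859-1 represents only a subset of the Unicode repertoire
# (U+0000 to U+00FF), so this particular character cannot be encoded in it.
try:
    CHAR.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print("Encodable in ISO 8859-1:", latin1_ok)
```

A document in ISO 8859-1 is still a document whose characters belong to Unicode; it simply cannot represent this character directly in its bytes.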

On the other hand, it is a good idea to use a Unicode encoding wherever possible, since it simplifies many aspects of Web internationalization and is supported widely by HTML user agents, and by all XML processors.

An important consequence of the document character set is that values of numeric character references (such as &#x01F5; and &#501; for LATIN SMALL LETTER G WITH ACUTE) are interpreted as Unicode characters, no matter what encoding you use for your document. This is a common source of error among those who are not clear about the distinction.
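The following sketch shows this behaviour using Python's standard `html.unescape` function as a stand-in for a user agent's handling of character references (an assumption made for illustration; real browsers and XML processors have their own parsers). The same markup is stored in three different encodings, and the hexadecimal reference `&#x01F5;` and the decimal reference `&#501;` resolve to U+01F5 every time. The variable names are hypothetical.

```python
import html

# Markup containing a hexadecimal and a decimal numeric character reference.
markup = "<p>&#x01F5; and &#501;</p>"

results = {}
for encoding in ("iso-8859-1", "utf-8", "utf-16"):
    raw = markup.encode(encoding)   # the bytes as stored or transmitted
    text = raw.decode(encoding)     # decoded using the declared encoding
    # References are resolved against Unicode, not the document's encoding.
    results[encoding] = html.unescape(text)

for encoding, resolved in results.items():
    print(encoding, [f"U+{ord(c):04X}" for c in resolved if ord(c) > 127])
```

Note that the ISO 8859-1 version of the document can contain the reference even though that encoding has no byte for the character itself.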

By the way

In practice, not all Unicode characters can be used everywhere in XML and HTML. For example, certain characters are excluded from things like element tag names, and certain control characters are excluded from content. Note, however, that XML 1.1 allows the use of many more characters for such things as element tag names than XML 1.0.
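As a rough illustration of these restrictions, the sketch below feeds a few small documents to Python's built-in XML parser, which implements XML 1.0 only, so it says nothing about the wider rules of XML 1.1. The helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML 1.0 parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "plain content": "<p>ok</p>",
    "control character U+0001 in content": "<p>\x01</p>",
    "reference to U+0001": "<p>&#1;</p>",
    "element name starting with a digit": "<1p/>",
}

for label, xml_text in samples.items():
    print(f"{label}: {is_well_formed(xml_text)}")
```

Only the first sample is accepted: the control character is rejected whether it appears literally or as a numeric character reference, and a digit is not allowed as the first character of an element name.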

Historical information

HTML 2.0 defined that all characters in an HTML document are to be interpreted relative to ISO 8859-1 (also known as ISO Latin 1), but also announced that all future versions of HTML would use a superset of that, viz. Unicode (or ISO 10646), which means that a vast number of the world's characters are available.

The discussions about the right way to use Unicode on the Internet (RFC 2130, April 1997, and RFC 2070, Jan 1997) were not yet finished when HTML 3.2 came out (Jan 1997), so inclusion of Unicode into HTML had to wait for HTML 4.0 (Dec 1997).

By: Martin Dürst & Richard Ishida, W3C.

Content first published 2003-10-09. Last substantive update 2004-06-28 09:02 GMT. This version 2008-06-09 17:07 GMT

For the history of document changes, search for qa-doc-charset in the i18n blog.