Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, XSLT developers, Web project managers, and anyone who wants to understand what the Document Character Set is, and how that relates to encodings used in a document or page.
What is the 'Document Character Set' for XML and HTML, and how does it relate to the encodings I use for my documents?
The document character set, or base character set, of XML and HTML (from version 4.0) is the Universal Character Set (UCS), which is defined by both the ISO/IEC 10646 and Unicode standards; the two are code-for-code identical.
This means that the logical model for processing XML and HTML is defined in terms of the Unicode character set.
It does not mean that all HTML and XML documents have to be encoded as Unicode, but it does mean that these documents can only contain characters defined by Unicode. Note that character sets and character encodings are different things: for example, the full Unicode repertoire can be encoded in more than one way, e.g. UTF-8, UTF-16 and UTF-32. Any character encoding can be used for your document as long as it is properly declared and the characters it represents are a subset of the Unicode repertoire. (It would be extremely unusual to find one that wasn't a subset.)
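The difference between a character set and a character encoding can be illustrated with a short Python sketch. It is only an illustration, not part of any specification: the codec names are those of Python's standard library, and the variable names are chosen for this example. One Unicode character is serialized with three Unicode encodings, and a legacy encoding is shown to cover only a subset of the repertoire.

```python
# One character from the Unicode repertoire: U+01F5 LATIN SMALL LETTER G WITH ACUTE.
ch = "\u01f5"

# The same code point produces different byte sequences in different encodings.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
# utf-8      c7 b5
# utf-16-be  01 f5
# utf-32-be  00 00 01 f5

# Decoding each byte sequence with its own encoding gives back the same character.
decoded = {data.decode(name) for name, data in encoded.items()}

# A legacy encoding such as ISO 8859-1 covers only a subset of Unicode:
# U+00E9 (e with acute) is representable, U+01F5 is not.
legacy_ok = "\u00e9".encode("iso-8859-1")
try:
    ch.encode("iso-8859-1")
    legacy_failed = False
except UnicodeEncodeError:
    legacy_failed = True
```

The bytes differ from encoding to encoding, but the character they stand for, and its position in the document character set, stays the same.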
On the other hand, it is a good idea to use a Unicode encoding wherever possible, since it simplifies many aspects of Web internationalization and is supported widely by HTML user agents, and by all XML processors.
An important consequence of the document character set is that numeric character references (such as &#501; and &#x1F5;, both of which represent LATIN SMALL LETTER G WITH ACUTE) are always interpreted as Unicode code points, no matter what encoding you use for your document. This is a common source of error among those who are not clear about the distinction between character sets and encodings.
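As a rough illustration of this point, the following Python sketch resolves numeric character references with the standard library's html.unescape function, which follows HTML5 rules. The ISO 8859-5 scenario is an assumed example, not one taken from this article: in that encoding the byte 0xD0 (decimal 208) represents a Cyrillic letter, yet the reference &#208; still resolves to the Unicode character with code point 208.

```python
import html

# Decimal and hexadecimal references to the same Unicode code point (U+01F5).
from_decimal = html.unescape("&#501;")
from_hex = html.unescape("&#x1F5;")
print(from_decimal == from_hex == "\u01f5")  # True

# In ISO 8859-5, the byte 0xD0 (decimal 208) is CYRILLIC SMALL LETTER A (U+0430).
raw_byte = bytes([0xD0]).decode("iso-8859-5")

# The reference &#208; is NOT that byte: it always means Unicode code point 208,
# which is U+00D0 LATIN CAPITAL LETTER ETH, whatever the document's encoding.
from_ref = html.unescape("&#208;")

# To refer to the Cyrillic letter, use its Unicode code point (0x430 = 1072).
cyrillic_ref = html.unescape("&#1072;")
```

Here raw_byte and cyrillic_ref are the same character, while from_ref is a different one, which is the mistake an author makes when treating the reference number as a byte value in the page's own encoding.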
In practice, not all Unicode characters can be used everywhere in XML and HTML. For example, certain characters are excluded from things like element tag names, and certain control characters are excluded from content. Note, however, that XML 1.1 allows the use of many more characters for such things as element tag names than XML 1.0.
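These restrictions can be observed with any conforming XML parser. The sketch below uses Python's xml.etree.ElementTree, which parses XML 1.0; the helper name is_well_formed is hypothetical and exists only for this example. It checks a non-ASCII element name, a control character in content, and an element name that starts with a digit.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the XML 1.0 parser accepts the document, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A non-ASCII letter such as U+01F5 is allowed in an element name.
name_ok = is_well_formed("<\u01f5>ok</\u01f5>")

# The control character U+0001 is excluded from XML 1.0 content.
control_ok = is_well_formed("<a>\x01</a>")

# A digit cannot start an element name, although it may appear later in one.
digit_start_ok = is_well_formed("<1a/>")
digit_later_ok = is_well_formed("<a1/>")
```

Under these assumptions name_ok and digit_later_ok are True, while control_ok and digit_start_ok are False.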
HTML 2.0 defined that all characters in an HTML document are to be interpreted relative to ISO 8859-1 (also known as ISO Latin 1), but also announced that all future versions of HTML will use a superset of that, viz. Unicode (or ISO 10646), which means that a vast number of the world's characters are available.
The discussions about the right way to use Unicode on the Internet (RFC 2130, April 1997, and RFC 2070, Jan 1997) had not yet concluded when HTML 3.2 came out (Jan 1997), so the inclusion of Unicode in HTML had to wait for HTML 4.0 (Dec 1997).
Content first published 2003-10-09. Last substantive update 2004-06-28 09:02 GMT. This version 2008-06-09 17:07 GMT
For the history of document changes, search for qa-doc-charset in the i18n blog.
Copyright © 2003-2008 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.