Document character set

Question

What is the 'Document Character Set' for XML and HTML, and how does it relate to the encodings I use for my documents?

Answer

For simplicity, and in line with common practice, when we refer to Unicode in this FAQ we are referring to the character set defined by both Unicode and ISO/IEC 10646.

The document character set, or base character set, of XML and HTML (from version 4.0) is the Universal Character Set (UCS) defined by both the ISO/IEC 10646 and Unicode standards, which are code-for-code identical.

This means that the logical model for how XML and HTML are processed is defined in terms of the Unicode character set.

It does not mean that all HTML and XML documents have to be encoded as Unicode, but it does mean that these documents can only contain characters defined by Unicode. Note that character sets and character encodings are different things: for example, the full Unicode repertoire can be encoded in more than one way, e.g. UTF-8, UTF-16 and UTF-32. Any character encoding can be used for your document as long as it is properly declared and the characters it represents are a subset of the Unicode repertoire. (It would be extremely unusual to find one that wasn't a subset.)
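As a rough illustration of the difference between a character set and a character encoding, the following Python sketch encodes a single Unicode character in several ways. The character chosen and the list of encodings are arbitrary picks for the example, not something prescribed by XML or HTML.

```python
# One character from the Unicode character set:
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (an arbitrary example).
ch = "\u00e9"

# Several encodings that can all represent this character.
# The "-be" variants are used so that no byte order mark is emitted.
encodings = ["utf-8", "utf-16-be", "utf-32-be", "iso-8859-1"]

# Same character, different byte sequences.
encoded = {name: ch.encode(name) for name in encodings}
for name, data in encoded.items():
    print(f"{name:11} {data.hex()}")

# Decoding each byte sequence with the matching (declared) encoding
# gives back the same Unicode character every time.
decoded = {name: data.decode(name) for name, data in encoded.items()}
print(all(value == ch for value in decoded.values()))
```

The byte sequences differ from one encoding to the next, but each one decodes to the same Unicode code point, which is why the declaration of the encoding matters.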

On the other hand, it is a good idea to use a Unicode encoding wherever possible, since it simplifies many aspects of Web internationalization and is supported widely by HTML user agents, and by all XML processors.

An important consequence of the document character set is that values of numeric character references (such as &#x1F5; or &#501; for LATIN SMALL LETTER G WITH ACUTE) are interpreted as Unicode code points – no matter what encoding you use for your document. This is a common source of error among those who are not clear about the distinction.
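The point about numeric character references can be sketched in Python using the standard library's html.unescape function. The markup fragment and the choice of windows-1251 as the contrasting encoding are assumptions made for this example only.

```python
import html

# Hypothetical fragment of a document stored in ISO 8859-1, an encoding
# that has no byte for LATIN SMALL LETTER G WITH ACUTE.
raw = b"<p>&#x1F5; and &#501;</p>"
text = raw.decode("iso-8859-1")

# Both references resolve to Unicode code point U+01F5,
# regardless of the encoding the bytes were stored in.
resolved = html.unescape(text)
print(resolved == "<p>\u01f5 and \u01f5</p>")

# The common mistake: treating the number in a reference as a byte
# value in the document's own encoding. In windows-1251, byte 0xE9 is
# a Cyrillic letter, but the reference &#xE9; still means U+00E9.
byte_meaning = b"\xe9".decode("cp1251")
ncr_meaning = html.unescape("&#xE9;")
print(f"U+{ord(byte_meaning):04X} vs U+{ord(ncr_meaning):04X}")
```

The number in a reference is always looked up in Unicode, so the same reference yields the same character whether the document is saved as ISO 8859-1, windows-1251 or UTF-8.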

By the way

In practice, not all Unicode characters can be used everywhere in XML and HTML. For example, certain characters are excluded from things like element tag names, and certain control characters are excluded from content. Note, however, that XML 1.1 allows the use of many more characters for such things as element tag names than XML 1.0.

Historical information

HTML 2.0 specified that all characters in an HTML document are to be interpreted relative to ISO 8859-1 (also known as ISO Latin 1), but also announced that all future versions of HTML would use a superset of that, viz. Unicode (or ISO 10646), which means that a vast number of the world's characters are available.

The discussions about the right way to use Unicode on the Internet (RFC 2130, April 1997, and RFC 2070, Jan 1997) had not yet finished when HTML 3.2 came out (Jan 1997), so inclusion of Unicode into HTML had to wait for HTML 4.0 (Dec 1997).