6 HTML Document Character Set

Contents

  1. The Document Character Set
  2. Character entities

6.1 The Document Character Set

Human languages make use of a large number of text characters and human beings have invented a wide variety of systems for representing these characters in a computer. Unless proper precautions are taken, differing character representations may not be understood by user agents in all parts of the world.

To promote interoperability, SGML requires that each application (including HTML) define its document character set. A document character set is a set of abstract characters (such as the Cyrillic letter "I", the Chinese character meaning "water", etc.) and a corresponding set of integer references to those characters. SGML considers a document to be a sequence of references in the document character set.
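As an informal illustration of "abstract characters plus integer references", the sketch below uses Python's built-in `ord` and `chr`, which operate on Unicode code points. The helper name `code_points` is hypothetical and not part of any specification.

```python
from typing import List

# The two abstract characters mentioned above.
cyrillic_i = "\u0418"  # Cyrillic capital letter "I"
water = "\u6c34"       # Chinese character meaning "water"


def code_points(text: str) -> List[int]:
    """Return the integer reference (code point) of each character in text."""
    return [ord(ch) for ch in text]


refs = code_points(cyrillic_i + water)
print(refs)  # [1048, 27700]

# A document, in the SGML sense, is such a sequence of references;
# chr() maps each integer back to its abstract character.
round_trip = "".join(chr(n) for n in refs)
```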

The document character set for HTML is the Universal Character Set (UCS) of [ISO10646]. This set is character-by-character equivalent to Unicode 2.0 ([UNICODE]). Both of these standards are updated from time to time with new characters and the amendments should be consulted at the respective Web sites. In the current specification, references to ISO/IEC-10646 or Unicode imply the same document character set. However, the current document also refers to the Unicode specification for other issues such as the bidirectional text algorithm.

Authors are not required to write HTML documents as a series of integer references to the document character set. Instead, they may input characters in a chosen character encoding, which encodes some subset of the document character set. Authors are not required to enter a document's text in a character encoding (such as UCS-4) that covers the entire document character set; doing so would be inconvenient and would make the document larger. The choice of character encoding often depends on the characters that can be easily input via a keyboard or on the way files are stored on secondary storage devices. Any convenient encoding that covers most of the characters the author will employ can be used, provided it is correctly labeled (as explained below). Occasional characters that fall outside this encoding may still be entered in the form of numeric character references, which always refer to the document character set, not the encoding.
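One way to picture this is a minimal sketch that assumes an author has chosen ISO-8859-1 and that Python's standard `xmlcharrefreplace` error handler stands in for the author's tooling. A character outside the encoding is written as a numeric character reference instead.

```python
# "a" with ring is covered by ISO-8859-1; the Cyrillic capital "I" is not.
text = "\u00e5 and \u0418"

# Characters the encoding cannot represent are replaced by decimal
# numeric character references, which refer to the document character set.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)  # b'\xe5 and &#1048;'
```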

Conforming user agents must correctly map to Unicode all characters in any character encodings ("charsets") that they recognize (or they must behave as if they did). Names for character encodings are generally case insensitive.
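The mapping requirement can be sketched with Python's codec registry, used here only as a stand-in for a user agent's charset tables. The function name `to_unicode` is hypothetical.

```python
import codecs


def to_unicode(raw: bytes, charset: str) -> str:
    """Decode raw bytes labeled with a charset name into Unicode text."""
    codec = codecs.lookup(charset)  # raises LookupError for unknown names
    return raw.decode(codec.name)


# The lookup ignores case, so both spellings resolve to the same codec.
same_codec = codecs.lookup("EUC-JP").name == codecs.lookup("euc-jp").name

# Byte 0xB8 in ISO-8859-5 is the Cyrillic capital letter "I" (U+0418).
cyrillic_i = to_unicode(b"\xb8", "ISO-8859-5")
```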

Note. A list of recommended character encodings for various scripts and languages will be provided in a separate document.

Character encodings such as ISO-8859-1 (commonly referred to as "Latin-1" since it encodes most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), and EUC-JP (another Japanese encoding) save bandwidth for uncompressed text by representing only slices of the document character set. When compressed, however, there is little advantage over full encodings of Unicode such as UTF-16.
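A rough sense of the uncompressed size difference can be had from the sketch below. The sample word is an arbitrary choice, and `utf-16-be` is used so that no byte order mark is counted.

```python
# Six Cyrillic letters ("Privet", a greeting), chosen only as sample text.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"

# Byte counts for a single-byte encoding versus UTF-16.
sizes = {name: len(text.encode(name)) for name in ("iso-8859-5", "utf-16-be")}
print(sizes)  # {'iso-8859-5': 6, 'utf-16-be': 12}
```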

How does a user agent know which character encoding has been used to encode a given document?

In many cases, before a Web server sends an HTML document over the Web, it tries to figure out the character encoding (by a variety of techniques such as examining the first few bytes of the file, checking against a database of known files and encodings, etc.). The server transmits the document and the name of the character encoding to the receiving user agent by way of the charset parameter of the HTTP "Content-Type" field. For example, the following HTTP header announces that the character encoding is "EUC-JP".

Content-Type: text/html; charset=EUC-JP

The value of the "charset" parameter must be the name of a "charset" as defined in [RFC2045].
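On the receiving side, the charset parameter can be pulled out of the field value. The sketch below relies on Python's standard `email.message.Message` class for the parameter parsing; the wrapper name `charset_from_content_type` is hypothetical.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    header = Message()
    header["Content-Type"] = value
    return header.get_content_charset()


announced = charset_from_content_type("text/html; charset=EUC-JP")
print(announced)  # euc-jp
```

A field without the parameter, such as "text/html" alone, yields `None`, which corresponds to the case discussed next.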

Unfortunately, not all servers send information about the character encoding (even when the character encoding is different from the widely used ISO-8859-1 encoding). HTML therefore gives authors a way to tell user agents which character encoding has been used, by specifying it explicitly in the document header with the META element. For example, to specify that the character encoding of the current document is "EUC-JP", include the following META declaration:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration may be used only when the character encoding is organized such that ASCII characters stand for themselves, at least until the META element is parsed. In this case, conforming user agents must correctly interpret the META element.
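The ASCII requirement is what makes a pre-scan possible: the start of the document can be read as ASCII before the real encoding is known. The following is a minimal sketch built on Python's standard `html.parser`, not a description of how any particular user agent works. The names `MetaCharsetFinder` and `sniff_meta_charset`, and the 1024-byte limit, are assumptions made for illustration.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first META Content-Type declaration."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        content = attributes.get("content")
        if not content:
            return
        header = Message()
        header["Content-Type"] = content
        self.charset = header.get_content_charset()


def sniff_meta_charset(raw: bytes, limit: int = 1024) -> Optional[str]:
    """Scan the first `limit` bytes, read as ASCII, for a META charset."""
    finder = MetaCharsetFinder()
    # Non-ASCII bytes become replacement characters; the markup stays readable.
    finder.feed(raw[:limit].decode("ascii", errors="replace"))
    return finder.charset


page = b'<HTML><HEAD><META http-equiv="Content-Type" content="text/html; charset=EUC-JP"></HEAD>'
found = sniff_meta_charset(page)
```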

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. Explicit user action to override erroneous behavior.
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  4. The "charset" attribute set for the A and LINK elements.
  5. User agent heuristics and user settings. For example, many user agents use a heuristic to distinguish between the various encodings used for Japanese text. Also, user agents typically have a user-definable local default character encoding which they apply in the absence of other indicators.
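The priority order above can be sketched as a simple first-match selection. The function name, its parameter names, and the "iso-8859-1" default are all hypothetical. The heuristics of step 5 are not modeled; only the local default is.

```python
from typing import Optional


def choose_encoding(
    user_override: Optional[str] = None,   # priority 1
    http_charset: Optional[str] = None,    # priority 2
    meta_charset: Optional[str] = None,    # priority 3
    link_charset: Optional[str] = None,    # priority 4
    local_default: str = "iso-8859-1",     # priority 5 (assumed default)
) -> str:
    """Return the first available charset name, highest priority first."""
    for candidate in (user_override, http_charset, meta_charset, link_charset):
        if candidate:
            return candidate.lower()
    return local_default


# The HTTP parameter wins over a conflicting META declaration.
chosen = choose_encoding(http_charset="EUC-JP", meta_charset="Shift_JIS")
```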

Note: some HTTP/1.1 [RFC 2068, Section 3.7.1] servers may be using the absence of a charset parameter in a Content-Type field incorrectly to mean "this is iso-8859-1." Recipients wishing to defeat this behavior may assume the existence of a Content-Type field with "charset=iso-8859-1" and ignore steps 3 to 5 above.

In all cases, the value of the "charset" attribute or parameter must be the name of a "charset" as defined in [RFC2045].

If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.

Note: Modern web servers can be configured with information about which document is using which character encoding. Web masters should use these facilities but should take pains to configure the server properly.

6.2 Character entities

Your hardware and software configuration probably won't allow you to enter all Unicode characters through simple input mechanisms, so SGML offers character encoding-independent mechanisms for specifying any character from the document character set.

Numeric character references specify the integer reference of a Unicode character. A numeric character reference with the syntax &#D; refers to Unicode decimal character number D. A numeric character reference with the syntax &#xH; refers to Unicode hexadecimal character number H. The hexadecimal representation is a new SGML convention and is particularly useful since character standards generally use hexadecimal representations.

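Here are some examples: &#229; (in decimal) and &#xE5; (in hexadecimal) both refer to the letter "a" with a ring above it; &#1048; (in decimal) refers to the Cyrillic capital letter "I"; and &#x6C34; (in hexadecimal) refers to the Chinese character meaning "water".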
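Numeric character references can be checked mechanically. The sketch below uses `html.unescape` from Python's standard library; note that it follows HTML5 parsing rules, which agree with the syntax described here for well-formed references. The sample references are illustrative choices.

```python
import html

# Decimal and hexadecimal references paired with the character they denote.
samples = {
    "&#229;": "\u00e5",    # "a" with ring above, decimal
    "&#xE5;": "\u00e5",    # the same character, hexadecimal
    "&#1048;": "\u0418",   # Cyrillic capital letter "I"
    "&#x6C34;": "\u6c34",  # Chinese character meaning "water"
}

decoded = {ref: html.unescape(ref) for ref in samples}
all_match = decoded == samples
```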

To give authors a more intuitive way to refer to characters in the document character set, HTML offers a set of named character entities. Named character references replace integer references with symbolic names. The named entity &aring; refers to the same Unicode character as &#229;. There is no named entity for the Cyrillic capital letter "I". The full list of named character entities recognized in HTML 4.0 is included in this specification.
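The relationship between named and numeric references can be inspected with Python's `html.entities.name2codepoint` table, which the Python documentation describes as covering the HTML 4 entity names. It is used here as an approximation of this specification's list, not as the list itself.

```python
import html
from html.entities import name2codepoint

# The named and numeric forms resolve to the same character.
same_char = html.unescape("&aring;") == html.unescape("&#229;")

# The name maps to the same integer reference used in the numeric form.
aring_number = name2codepoint["aring"]

# No HTML 4 entity name maps to the Cyrillic capital letter "I" (U+0418).
cyrillic_i_has_name = 0x0418 in name2codepoint.values()
```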

Four named character entities deserve special mention since they are frequently used to escape special characters. For text appearing as part of the content of an element, you should escape "<" (ASCII decimal 60) as &lt; to avoid possible confusion with the beginning of a tag (start tag open delimiter). The ampersand character "&" (ASCII decimal 38) should be escaped as &amp; to avoid confusion with the beginning of an entity reference (entity reference open delimiter).

You should also escape ampersand within attribute values since entity references are allowed within CDATA attribute values. In addition, you should escape ">" (ASCII decimal 62) as &gt; to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when coming across this character in quoted attribute values.

Rather than worry about rules for quoting attribute values, it's often easier to encode any instance of " as &quot; and to always use " for quoting attribute values. Many people find it simpler to always escape these four characters in element content and attribute values: "<" as &lt;, ">" as &gt;, "&" as &amp;, and the double quote mark as &quot;.
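The "always escape" approach is easy to automate. The sketch below uses `html.escape` from Python's standard library; the wrapper name `escape_markup` is hypothetical. With `quote=True`, `html.escape` also rewrites the apostrophe as a numeric reference, which goes slightly beyond the four characters discussed here.

```python
import html


def escape_markup(text: str) -> str:
    """Escape &, <, > and " so text is safe in element content and attribute values."""
    return html.escape(text, quote=True)


escaped = escape_markup('a < b & "c" > d')
print(escaped)  # a &lt; b &amp; &quot;c&quot; &gt; d
```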

Names of named character entities are case-sensitive. Thus, &Aring; refers to a different character (upper case A, ring) than &aring; (lower case a, ring).
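Case sensitivity can be confirmed by resolving both spellings and asking for the Unicode character names. This sketch uses the standard `html` and `unicodedata` modules.

```python
import html
import unicodedata

upper = html.unescape("&Aring;")
lower = html.unescape("&aring;")

# The two entity names resolve to distinct characters.
upper_name = unicodedata.name(upper)
lower_name = unicodedata.name(lower)
```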

Note: In SGML, it is possible to eliminate the final ";" after a numeric or named character reference in some cases (e.g., at a line break or directly before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.