HTML Document Representation

In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.

The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.

Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.

5.1 The Document Character Set

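To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of a repertoire, which is a set of abstract characters (such as the Latin letter "A", the Cyrillic letter "I", and the Chinese character meaning "water"), and code positions, which are a set of integer references to characters in the repertoire.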

An SGML document (including HTML) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
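As a small illustration of code positions, the following Python sketch looks up the positions of 'A', 'B', and 'C'. It assumes Python's built-in ord() and chr(), which work on Unicode code positions; these coincide with ASCII for the first 128 characters.

```python
# Map each character to its code position and back again.
positions = [ord(ch) for ch in "ABC"]

for ch, pos in zip("ABC", positions):
    print(ch, pos)

# chr() is the inverse mapping: code position -> character.
round_trip = "".join(chr(pos) for pos in positions)
```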

The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.

This set is character-by-character equivalent to Unicode 2.0 ([UNICODE]). Both of these standards are updated from time to time with new characters and the amendments should be consulted at the respective Web sites. In the current specification, references to ISO/IEC-10646 or Unicode imply the same document character set. However, the HTML specification also refers to the Unicode specification for other issues such as the bidirectional text algorithm.

The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.

5.2 Character encodings

What this specification calls a "character encoding" is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name -- "charset" -- and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.
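To make the byte-to-character conversion concrete, here is a minimal Python sketch. It assumes Python's built-in codecs for ISO-8859-1 and ISO-8859-5, and shows that the same byte yields different characters depending on which character encoding the receiver applies.

```python
# One byte as it might arrive from a server.
raw = b"\xe5"

# Interpreted as ISO-8859-1, the byte is LATIN SMALL LETTER A WITH RING ABOVE.
as_latin1 = raw.decode("iso-8859-1")

# Interpreted as ISO-8859-5, the same byte is CYRILLIC SMALL LETTER HA.
as_cyrillic = raw.decode("iso-8859-5")

print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_cyrillic):04X}")
```

This is why a wrong "charset" label produces wrong characters rather than an error: every byte still maps to something.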

A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the entire character set (such as UCS-4).
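The difference between a variable-width and a fixed-width encoding can be sketched as follows. The example assumes Python's "utf-8" codec and uses its "utf-32-be" codec as a stand-in for UCS-4, which is equivalent for the characters shown.

```python
# Latin "A", a-ring, Cyrillic capital "I", Chinese character for water.
samples = ["A", "\u00e5", "\u0418", "\u6c34"]

# UTF-8 uses a different number of bytes for different characters.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# A four-byte fixed-width form spends four bytes on every character.
ucs4_lengths = [len(ch.encode("utf-32-be")) for ch in samples]

for ch, n8, n32 in zip(samples, utf8_lengths, ucs4_lengths):
    print(f"U+{ord(ch):04X}: {n8} byte(s) in UTF-8, {n32} in UCS-4")
```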

5.2.1 Choosing an encoding

The choice of character encoding mostly depends on the tools (i.e., text editors) available when authoring the document, and on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by numeric character references. These always refer to the document character set, not the character encoding.
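A minimal sketch of this fallback, assuming Python's standard "xmlcharrefreplace" error handler: characters that the chosen encoding cannot represent are written as decimal numeric character references, which refer to the document character set and so survive the round trip.

```python
import html

text = "Chinese character for water: \u6c34"

# ISO-8859-1 cannot represent U+6C34, so it is emitted as &#27700;.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# A receiver decodes the bytes, then resolves the character reference.
restored = html.unescape(encoded.decode("iso-8859-1"))
```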

While authoring tools may encode HTML documents in the character encoding of their choice, servers and proxies may change this character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2068], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.

Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
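Case-insensitive matching of encoding names can be illustrated with Python's codecs.lookup(), which normalizes a name before resolving it. This is only one implementation's behavior, shown here as an example of the rule.

```python
import codecs

names = ["SHIFT_JIS", "Shift_JIS", "shift_jis"]

# Each spelling should resolve to the same codec.
canonical = {codecs.lookup(name).name for name in names}
print(canonical)
```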

This specification does not mandate which character encodings a user agent must support.

Notes on specific encodings

When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
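The byte order check described above might be sketched as follows. The function name decode_utf16 is hypothetical, and the choice to fall back to big-endian when no BOM is present is an assumption based on the network byte order recommendation.

```python
def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to detect byte order."""
    if data[:2] == b"\xfe\xff":
        # FEFF seen as-is: the text is already big-endian.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # FFFE seen: the bytes are reversed, so read as little-endian.
        return data[2:].decode("utf-16-le")
    # No BOM: assume network byte order (big-endian).
    return data.decode("utf-16-be")
```

For example, both b"\xfe\xff\x00A" and b"\xff\xfeA\x00" decode to the single character "A".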

The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used. For information about ISO 8859-8 and the bidirectional algorithm, please consult the section on bidirectionality and character encoding.

5.2.2 Specifying the character encoding

How does a user agent know which character encoding has been used? The most straightforward way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2068]). For example, the following HTTP header announces that the character encoding is EUC-JP:

Content-Type: text/html; charset=EUC-JP
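On the receiving side, extracting the "charset" parameter from a "Content-Type" value could look like the sketch below. It assumes Python's standard email.message.Message class for header parsing; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lower-cased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


announced = charset_from_content_type("text/html; charset=EUC-JP")
missing = charset_from_content_type("text/html")
```

The result is lower-cased, which is harmless because names for character encodings are case-insensitive.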

How does a server determine which character encoding applies for a document it serves? Some servers examine the first few bytes of the document, or check against a database of known files and encodings. Many modern servers give Web masters more control over charset configuration than older servers do. Web masters should use these mechanisms to send out a "charset" parameter whenever possible, but should take care not to identify a document with the wrong "charset" parameter value.

The HTTP protocol ([RFC2068]) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.

To address server or configuration limitations, HTML document headers may include explicit information about the document's character encoding; the META element can be used to provide user agents with this information.

For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used when the character encoding is organized such that ASCII characters stand for themselves at least until the META element is parsed. META declarations should appear as early as possible in the HEAD element.
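Because ASCII characters stand for themselves up to the META element, a user agent can scan the leading bytes of a document before it knows the encoding. The following is a rough sketch of that idea only, not a conforming parser; the regular expression, the function name sniff_meta_charset, and the 1024-byte limit are all assumptions made for illustration.

```python
import re

# Matches content="...charset=NAME" inside a META start tag (ASCII only).
META_CHARSET = re.compile(
    rb"<meta[^>]+content\s*=\s*[\"'][^\"']*charset\s*=\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(data: bytes, limit: int = 1024):
    """Look for a charset in a META declaration near the start of data."""
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii") if match else None


head = (
    b"<HTML><HEAD>"
    b'<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
    b"</HEAD>"
)
found = sniff_meta_charset(head)
```

Scanning only a bounded prefix is one reason to place META declarations as early as possible in the HEAD element.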

For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. With these, a document provider can greatly improve the chances that, when the user retrieves a resource, the user agent will handle it correctly.

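To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.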
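The lookup order can be sketched as a small function. The order assumed here (HTTP "charset" parameter, then META declaration, then a charset attribute on an element) follows the order in which this section introduces the three mechanisms; the function and parameter names are hypothetical.

```python
from typing import Optional


def pick_encoding(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    element_charset: Optional[str],
) -> Optional[str]:
    """Return the first available charset, highest priority first."""
    for candidate in (http_charset, meta_charset, element_charset):
        if candidate:
            return candidate
    # Nothing declared: the caller may fall back to heuristics or user settings.
    return None
```

Returning None rather than a fixed default reflects the rule that user agents must not assume any default value for the "charset" parameter.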

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable local default character encoding which they apply in the absence of other indicators.

User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.

Conforming user agents must correctly map to Unicode all characters in any character encodings that they recognize (or they must behave as if they did).

This specification does not mandate which character encodings a user agent must accept and understand.

Note. If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.

5.3 Character references

A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input document characters directly, SGML character references may be used. Character references are a character encoding-independent mechanism for entering any character from the document character set.

Numeric character references specify the integer reference of a character in the document character set. A numeric character reference with the syntax "&#D;", where D is a decimal number, refers to the Unicode decimal character number D. A numeric character reference with the syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the Unicode hexadecimal character number H. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention is particularly useful since character standards generally use hexadecimal representations. Hexadecimal numbers in numeric character references are case-insensitive.

&#229; (in decimal) represents the letter "a" with a small circle above it (used, for example, in Norwegian).
&#xE5; (in hexadecimal) represents the same character.
&#Xe5; (in hexadecimal) represents the same character as well.
&#1048; (in decimal) represents the Cyrillic capital letter "I".
&#x6C34; (in hexadecimal) represents the Chinese character for water.
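These references can be checked with a short Python sketch. It assumes the standard html.unescape() function, which follows HTML5 parsing rules rather than this specification, but agrees with it for the numeric references shown.

```python
import html

refs = ["&#229;", "&#xE5;", "&#Xe5;", "&#1048;", "&#x6C34;"]

# Resolve each numeric character reference to the character it denotes.
decoded = [html.unescape(ref) for ref in refs]

for ref, ch in zip(refs, decoded):
    print(ref, f"-> U+{ord(ch):04X}")
```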

In order to give authors a more intuitive way to refer to characters in the document character set, HTML offers a set of character entity references. Character entity references replace integer references with symbolic names. The character entity reference &aring; refers to the same Unicode character as &#229;. There is no character entity reference for the Cyrillic capital letter "I".

Character entity references are case-sensitive. Thus, &Aring; refers to a different character (upper case A, ring) than &aring; (lower case a, ring).
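The relationship between entity names and code positions can be inspected in Python. This sketch assumes the standard html.entities tables name2codepoint and codepoint2name, which are built from the HTML 4 entity definitions.

```python
from html.entities import codepoint2name, name2codepoint

# Entity names are case-sensitive: "aring" and "Aring" are distinct entries.
lower = name2codepoint["aring"]   # code position of a-ring
upper = name2codepoint["Aring"]   # code position of A-ring

# The Cyrillic capital letter "I" (decimal 1048) has no entity name.
cyrillic_has_name = 1048 in codepoint2name

print(lower, upper, cyrillic_has_name)
```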

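Four character entity references deserve special mention since they are frequently used to escape special characters:

"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.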

For text appearing as part of the content of an element, authors should escape "<" (ASCII decimal 60) as &lt; to avoid possible confusion with the beginning of a tag (start tag open delimiter). The ampersand character "&" (ASCII decimal 38) should be escaped as &amp; to avoid confusion with the beginning of an entity reference (entity reference open delimiter).

Authors should also escape ampersand within attribute values since entity references are allowed within CDATA attribute values. In addition, authors should escape ">" (ASCII decimal 62) as &gt; to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when coming across this character in quoted attribute values.

Some authors use the character entity reference &quot; to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
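A minimal sketch of these escaping rules, assuming Python's standard html.escape() function: with quote=False it escapes "<", ">", and "&" for element content, and with quote=True it also escapes quote marks for use inside attribute values.

```python
import html

# Escaping for text that appears as element content.
content = html.escape("if a < b && c > d", quote=False)

# Escaping for text placed inside a double-quoted attribute value.
attribute = html.escape('say "hi" & wave', quote=True)

print(content)
print(attribute)
```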

Character entities within comments have no special meaning; they are comment data only.

Note. HTML provides other ways to present character data, in particular inline images.

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or directly before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

5.4 Undisplayable characters

A user agent may not be able to render all characters in a document meaningfully, for instance, because the user agent lacks a suitable font, a character has a value that may not be expressed in the user agent's internal character encoding, etc.

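Because there are many different things that may be done in such cases, this document does not prescribe any specific behavior. Depending on the implementation, undisplayable characters may also be handled by the underlying display system and not the application itself. In the absence of more sophisticated behavior, for example tailored to the needs of a particular script or language, we recommend the following behavior for user agents:

1. Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources.
2. If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form, since this is the form used in character set standards.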
