The HTML 4.0 document character set, in the SGML sense, is the Universal Character Set (UCS) of [ISO10646]. This is code-by-code identical with the [UNICODE] standard.
HTML documents can be transmitted in a variety of encodings, as described in the section "HTML Document Character Set" near the beginning of this specification. Characters that the chosen encoding cannot represent directly need to be written as numeric character references or entity references. Such references are unnecessary with a direct encoding of Unicode such as UTF-8 or UTF-16. After compression, the resulting file sizes are close to those for character encodings such as ISO-8859-1 and EUC-JP.
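As a minimal sketch of this fallback, the following Python fragment encodes text in a given encoding and writes any character that the encoding cannot represent as a decimal numeric character reference. It relies on Python's standard `xmlcharrefreplace` codec error handler; the function name `encode_with_references` is illustrative only and is not defined by this specification.

```python
def encode_with_references(text: str, encoding: str = "iso-8859-1") -> bytes:
    """Encode text, writing unrepresentable characters as &#N; references."""
    # 'xmlcharrefreplace' is a standard Python codec error handler. It
    # substitutes a decimal numeric character reference for each character
    # that the target encoding cannot represent.
    return text.encode(encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # U+00E9 exists in ISO-8859-1; U+03B1 (Greek alpha) does not, so only
    # the latter is written as a reference.
    print(encode_with_references("caf\u00e9 \u03b1"))
    # With UTF-8 every character is representable and no reference appears.
    print(encode_with_references("caf\u00e9 \u03b1", "utf-8"))
```

The sketch only handles characters missing from the encoding; it does not escape markup-significant characters such as `&` or `<`.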
When HTML text is transmitted directly in UTF-16 (charset="UTF-16"), text data should be transmitted in big-endian byte order (high order byte first) in accordance with ISO 10646 Section 6.3 and Unicode 2.0, clause C3, page 3-1 (see [UNICODE]).
Furthermore, to maximize the chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO WIDTH NO-BREAK SPACE character (hexadecimal FEFF), which, when byte-reversed, becomes hexadecimal FFFE, a value guaranteed never to be assigned to a character. Thus, a user agent receiving FFFE as the first two octets of a text would know that the bytes have to be reversed for the remainder of the text.
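The byte-order check can be sketched as follows. This is an illustration under stated assumptions, not a normative algorithm: the function names `encode_utf16` and `decode_utf16` are hypothetical, and treating unmarked input as big-endian is an assumption drawn from the transmission recommendation above.

```python
import codecs


def encode_utf16(text: str) -> bytes:
    """Encode as big-endian UTF-16 with a leading U+FEFF signature."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 octets, using a leading FEFF or FFFE to pick byte order."""
    if data[:2] == codecs.BOM_UTF16_BE:
        # Octets FE FF: the text is already in big-endian order.
        return data[2:].decode("utf-16-be")
    if data[:2] == codecs.BOM_UTF16_LE:
        # Octets FF FE: a byte-reversed FEFF, so the rest is byte-reversed too.
        return data[2:].decode("utf-16-le")
    # No signature: assume big-endian, the recommended transmission order.
    return data.decode("utf-16-be")


if __name__ == "__main__":
    octets = encode_utf16("Hi")
    print(octets.hex())           # signature followed by big-endian code units
    print(decode_utf16(octets))   # round-trips to the original text
```

The signature is stripped before decoding so that it does not appear as a character in the decoded text.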
The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used.
Note that ISO Registration Number 177 strictly speaking refers to the original state of ISO 10646 in 1993, while in this specification, we always refer to the most up-to-date form of ISO 10646. Changes since 1993 have been the addition of characters and a one-time operation reallocating a large number of codepoints for Korean Hangul (Amendment 5).
<!SGML "ISO 8879:1986"
--
SGML Declaration for HyperText Markup Language version 4.0
With support for Unicode UCS-4 and increased limits
for tag and literal lengths etc.
--
CHARSET
BASESET "ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with
implementation level 3//ESC 2/5 2/15 4/6"
DESCSET  0    9           UNUSED
         9    2           9
         11   2           UNUSED
         13   1           13
         14   18          UNUSED
         32   95          32
         127  1           UNUSED
         128  32          UNUSED
         160  2147483486  160
--
In ISO 10646, the positions with hexadecimal
values 0000D800 - 0000DFFF, used in the UTF-16
encoding of UCS-4, are reserved, as well as the last
two code values in each plane of UCS-4, i.e. all
values of the hexadecimal form xxxxFFFE or xxxxFFFF.
These code values or the corresponding numeric
character references must not be included when
generating a new HTML document, and they should be
ignored if encountered when processing an HTML
document.
--
CAPACITY SGMLREF
TOTALCAP 150000
GRPCAP 150000
ENTCAP 150000
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
BASESET "ISO 646IRV:1991//CHARSET
International Reference Version
(IRV)//ESC 2/8 4/2"
DESCSET 0 128 0
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR ".-" -- ?include "~/_" for URLs? --
UCNMCHAR ".-"
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
ATTSPLEN 65536 -- These are the largest values --
LITLEN 65536 -- permitted in the declaration --
NAMELEN 65536 -- Avoid fixed limits in actual --
PILEN 65536 -- implementations of HTML UA's --
TAGLVL 100
TAGLEN 65536
GRPGTCNT 150
GRPCNT 64
FEATURES
MINIMIZE
DATATAG NO
OMITTAG YES
RANK NO
SHORTTAG YES
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL YES
>
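The CHARSET section of the declaration can be read as a table of permitted code values. The following sketch transcribes the non-UNUSED rows of the DESCSET and adds the reserved values named in the accompanying comment (D800 to DFFF, and values of the form xxxxFFFE or xxxxFFFF). The names `DESCSET_USED` and `is_permitted` are illustrative and do not appear in this specification.

```python
# (start, count) rows of the DESCSET that map code values to themselves.
# Every other row in the declaration is marked UNUSED.
DESCSET_USED = [
    (9, 2),             # TAB and LINE FEED
    (13, 1),            # CARRIAGE RETURN
    (32, 95),           # printable ASCII
    (160, 2147483486),  # everything from 160 up to hexadecimal 7FFFFFFD
]


def is_permitted(code: int) -> bool:
    """Return True if a code value may appear in an HTML document."""
    in_descset = any(start <= code < start + count
                     for start, count in DESCSET_USED)
    # Positions D800-DFFF are reserved for the UTF-16 encoding.
    is_surrogate = 0xD800 <= code <= 0xDFFF
    # The last two code values of each plane are reserved.
    is_plane_end = (code & 0xFFFF) in (0xFFFE, 0xFFFF)
    return in_descset and not is_surrogate and not is_plane_end


if __name__ == "__main__":
    for value in (0x09, 0x0B, 0x41, 0x7F, 0xA0, 0xD800, 0xFFFE, 0x10000):
        print(f"{value:#x}: {is_permitted(value)}")
```

A generator could use such a check before emitting a character or a numeric character reference, and a user agent could use it to decide which values to ignore.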