21 SGML Declaration

Contents

  1. The Document Character Set
    1. Data transfer
  2. The SGML Declaration

21.1 The Document Character Set

The HTML 4.0 document character set, in the SGML sense, is the Universal Character Set (UCS) of [ISO10646]. This is code-by-code identical with the [UNICODE] standard.

21.1.1 Data transfer

HTML documents can be transmitted in a variety of encodings as described in the section "HTML Document Character Set" near the beginning of this specification. Characters outside the range of the encoding need to be represented as entity references. This is unnecessary with a more direct encoding of Unicode such as UTF-8 or UTF-16. After compression the resultant file sizes are close to that for character encodings such as ISO-8859-1 and EUC-JP.

When HTML text is transmitted directly in UTF-16 (charset="UTF-16"), text data should be transmitted in big-endian byte order (high order byte first) in accordance with ISO 10646 Section 6.3 and Unicode 2.0, clause C3, page 3-1 (see [UNICODE]).

Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF) which, when byte-reversed becomes number FFFE, a character guaranteed to be never assigned. Thus, a user-agent receiving an FFFE as the first octets of a text would know that bytes have to be reversed for the remainder of the text.

The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1), should not be used.

Note that ISO Registration Number 177 strictly speaking refers to the original state of ISO 10646 in 1993, while in this specification, we always refer to the most up-to-date form of ISO 10646. Changes since 1993 have been the addition of characters and a one-time operation reallocating a large number of codepoints for Korean Hangul (Amendment 5).

21.2 The SGML Declaration

<!SGML  "ISO 8879:1986"
   --
        SGML Declaration for HyperText Markup Language version 4.0

        With support for Unicode UCS-4 and increased limits
        for tag and literal lengths etc.
   --

   CHARSET
         BASESET  "ISO Registration Number 177//CHARSET
                   ISO/IEC 10646-1:1993 UCS-4 with
                   implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET  0   9     UNUSED
                  9   2     9
                  11  2     UNUSED
                  13  1     13
                  14  18    UNUSED
                  32  95    32
                  127 1     UNUSED
                  128 32    UNUSED
                  160 2147483486 160
  --
        In ISO 10646, the positions with hexadecimal
        values 0000D800 - 0000DFFF, used in the UTF-16
        encoding of UCS-4, are reserved, as well as the last
        two code values in each plane of UCS-4, i.e. all
        values of the hexadecimal form xxxxFFFE or xxxxFFFF.
        These code values or the corresponding numeric
        character references must not be included when
        generating a new HTML document, and they should be
        ignored if encountered when processing a HTML
        document.
  --

CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
                ENTCAP          150000

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
           17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-"    -- ?include "~/_" for URLs? --
                  UCNMCHAR ".-"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   64

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  YES
    RANK     NO
    SHORTTAG YES
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
>