[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

On 22 May 2008, at 12:40, Ian Hickson wrote:

> would you say that what the spec says now is what browsers 
> implement? What should we change?

The current table seems to cover the mappings between different common
compatible 8-bit encodings as implemented in IE7, yes.  The table at
<http://coq.no/character-tables/mime/en> gives a bit more detail,
most of which is better kept outside HTML5 itself. However, the following
observations can be made:

1.  Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
    IE7, on the other hand, simply ignores the high bit (as it does for
    a few other 7-bit encodings, by the way).  Perhaps this
    alias could be dropped from the other browsers.

2.  Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1 (as per HTML5),
    whereas Safari seems to do the same for text/plain; charset=ISO-8859-11
    instead [Version 3.1.2 (5525.20.1)].  Bug?

3.  For certain character sets, different browsers map to different, but visually
    similar Unicode characters.  Sometimes, one mapping is old/outdated,
    but this is not always the case.

4.  Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite inconsistently;
    different browsers do different things for the same encoding, and the same
    browser gives analogous encodings different treatment.

    (For the early ISO-8859-* encodings, the IANA registry points to RFC 1345,
    which effectively maps 0x7F--0x9F to U+007F--U+009F, but does not really
    seem to regard this feature as an essential part of the character set:

        the charset is often coded with both
        graphical and control character sets.  If the coded character set is
        a 96-character set, it is tabled with the relevant GL set (normally
        ISO-IR-6) and with ISO 6429 as C0 and C1

    As for the Windows-* encodings, Microsoft documentation treats bytes
    in this range as unassigned unless they are mapped to graphical characters,
    whereas Microsoft products return the underlying byte value in this case.)

5.  IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. The former
    is probably more reasonable (assuming that letters are more important than
    line-drawing characters), but neither is actually correct given that the encodings
    are, strictly speaking, incompatible.  This issue will of course look a bit different
    if it can be shown that documents containing the letter Ў/ў (only in KOI8-RU)
    are frequently mislabelled as KOI8-U.
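
To make observations 1 and 4 concrete, here is a minimal Python sketch of the
decoding policies described above.  It is an illustration under stated
assumptions, not a description of any browser's actual code: the function
names are hypothetical, and Python's own "cp1252" and "latin-1" codecs stand
in for the Windows-1252 and ISO-8859-1 tables.

```python
# Illustrative sketch only; all function names here are hypothetical.

def decode_byte_windows_1252_lenient(byte: int) -> str:
    """Decode one byte as Windows-1252.

    Bytes that the Windows-1252 table leaves unassigned (0x81, 0x8D, 0x8F,
    0x90, 0x9D in Python's codec) fall back to the code point equal to the
    byte value, mirroring the "return the underlying byte value" behaviour
    mentioned in observation 4.
    """
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return chr(byte)


def decode_us_ascii_as_windows_1252(data: bytes) -> str:
    """Observation 1, first policy: treat US-ASCII as Windows-1252."""
    return "".join(decode_byte_windows_1252_lenient(b) for b in data)


def decode_us_ascii_ignoring_high_bit(data: bytes) -> str:
    """Observation 1, second policy: clear bit 7, then read as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")


def decode_iso_8859_1(data: bytes) -> str:
    """Observation 4: 0x7F--0x9F map straight to U+007F--U+009F."""
    return data.decode("latin-1")


sample = b"caf\xe9"  # 0xE9 is e-acute in Windows-1252 and ISO-8859-1
print(decode_us_ascii_as_windows_1252(sample))    # café
print(decode_us_ascii_ignoring_high_bit(sample))  # cafi (0xE9 & 0x7F == 0x69)

# The same C1 byte yields different characters under each treatment.
print(hex(ord(decode_byte_windows_1252_lenient(0x80))))  # 0x20ac (euro sign)
print(hex(ord(decode_iso_8859_1(b"\x80"))))              # 0x80 (C1 control)
print(hex(ord(decode_byte_windows_1252_lenient(0x81))))  # 0x81 (fallback)
```

The per-byte fallback is a deliberate simplification: it keeps the example
short and makes the treatment of unassigned bytes explicit, at the cost of
speed.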

> Do you have input on the EUC-JP issue?

Not yet, but you can expect some input on CJK encodings at some point in
the future.

-- 
Øistein E. Andersen

Received on Tuesday, 29 July 2008 15:55:25 UTC