Fwd: Re: HTML5 review comments

[resent to get it into tracker]
i18n-ISSUE-77: HTTP and defaulting to UTF-16LE

Date: Thu, 21 Jul 2011 18:41:45 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
To: Richard Ishida <ishida@w3.org>
CC: public-i18n-core@w3.org <public-i18n-core@w3.org>

Hello Richard,

On 2011/07/20 23:55, Richard Ishida wrote:

> 8.2.2.2 Character encodings
> http://www.w3.org/TR/html5/parsing.html#character-encodings-0
>
> "When a user agent is to use the UTF-16 encoding but no BOM has been
> found, user agents must default to UTF-16LE."
>
> If the HTTP header declares the file to be UTF-16BE, which I believe it
> can, and in which case a BOM should *not* be used, then I think that
> this would not be true.

This strictly depends on what "the UTF-16 encoding" means in the
sentence you cite. If it means "the encoding labeled as 'UTF-16'", then
this doesn't include encodings labeled UTF-16BE, and therefore there is
no problem. If "the UTF-16 encoding" means "any encoding that works like
UTF-16, independent of the label and other details", then you are right.

My impression from reading "8.2.2.2 Character encodings" is that it's
talking about the encoding labeled "UTF-16", but it might be helpful to
check and/or clarify.
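
To make the first reading concrete, here is a minimal Python sketch. It assumes that only the literal label 'UTF-16' triggers BOM sniffing and the little-endian default, while 'UTF-16BE' and 'UTF-16LE' fix the byte order themselves. The function name decode_by_label is hypothetical and is not taken from the HTML5 draft.

```python
import codecs


def decode_by_label(data: bytes, label: str) -> str:
    """Decode bytes according to a charset label (illustrative sketch only)."""
    name = label.strip().lower()
    if name == "utf-16be":
        # The label fixes the byte order; no BOM is expected.
        return data.decode("utf-16-be")
    if name == "utf-16le":
        return data.decode("utf-16-le")
    if name == "utf-16":
        # Plain 'UTF-16': a BOM decides the byte order if one is present.
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        # No BOM: default to little-endian, as the quoted sentence requires.
        return data.decode("utf-16-le")
    raise ValueError(f"label not handled in this sketch: {label!r}")
```

Under this reading, BOM-less big-endian bytes sent with the label 'UTF-16BE' decode correctly, while the same bytes sent with the label 'UTF-16' are read as little-endian and come out wrong.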

UTF-16 is a very special case (UTF-32 has similar issues, but is much
less important in practice, in particular across the network), because
it's easy to confuse UTF-16, the general encoding method for Unicode
that uses 16-bit code units, with 'UTF-16', the character encoding
(charset) label. (Also, in implementations, it's sometimes important to
be able to set "BOM/no BOM", "LE/BE", and the actual label separately,
which is difficult if a converter or output routine only takes a
'charset' label as a parameter.)
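
As an illustration of that last point, the following Python sketch takes byte order and BOM as two separate parameters instead of a single charset label. The name encode_utf16 and its parameters are hypothetical, chosen only for this example.

```python
import codecs


def encode_utf16(text: str, *, big_endian: bool, with_bom: bool) -> bytes:
    """Encode text as UTF-16 with independent byte-order and BOM choices."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    bom = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    # The endian-specific codecs never write a BOM, so prepend one on request.
    body = text.encode(codec)
    return (bom + body) if with_bom else body
```

By contrast, a routine that accepts only a label cannot express all four combinations. In Python, for example, text.encode("utf-16") always writes a BOM and uses the platform's native byte order, so "big-endian with BOM" has to be assembled by hand as above.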

> If the HTTP header declares the file to be
> UTF-16, then there must be a BOM, so I assume that this is a recovery
> mechanism if someone does declare UTF-16 in HTTP but omits the BOM. I'd
> think that some kind of error message would be in order though.

You want an error message like "missing BOM on UTF-16 page"? That's good
for a validator, but not for a browser.
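
A validator-side check along those lines could look roughly like the Python sketch below. The function name and its return convention are assumptions made for illustration; a browser would skip the message and simply apply its recovery default.

```python
import codecs
from typing import Optional


def missing_bom_message(data: bytes, label: str) -> Optional[str]:
    """Return a validator diagnostic, or None if there is nothing to report."""
    if label.strip().lower() != "utf-16":
        # Only the plain 'UTF-16' label relies on a BOM to signal byte order.
        return None
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return None
    return "missing BOM on UTF-16 page"
```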


...

Regards,    Martin.

Received on Monday, 25 July 2011 07:28:35 UTC