some review of HTML 5 charset details w.r.t. W3C Character Model

forwarded with permission...

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
gpg D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E

Forwarded message 1

  • From: Martin Duerst <duerst@it.aoyama.ac.jp>
  • Date: Wed, 31 Oct 2007 19:10:47 +0900
  • Subject: RE: HTML 5 defaults to Windows-1252, where charmod requires UTF-8/UTF-16
  • To: Dan Connolly <connolly@w3.org>, Richard Ishida <ishida@w3.org>, public-i18n-core@w3.org
  • Cc: "'www-archive'" <www-archive@w3.org>, "'Chris Wilson'" <Chris.Wilson@microsoft.com>
  • Message-Id: <6.0.0.20.2.20071031181638.099346e0@localhost>
Hello Dan,

Please forward/cc as appropriate.

At 02:51 07/10/30, Dan Connolly wrote:
>
>
>On Mon, 2007-10-29 at 17:42 +0000, Richard Ishida wrote:
>> Hi Dan,
>> 
>> Please send questions like this to public-i18n-core list, so that the i18n
>> WG can reply.
>
>OK. done.
>
>> It's not clear to me from a quick look that there's a conflict.  CharMod
>> says that you must define one or both of UTF-8 and UTF-16 as *a default*,
>> and HTML5 is defining minimum set of encodings that must be supported,
>> rather than a default (as I read it).  CharMod doesn't proscribe recognition
>> of other encodings.

I agree with this. I think the subject of this mail is misleading:
as far as I can tell from this mail and the HTML5 spec, HTML5 does
not default to Windows-1252 (i.e. it does not require that pages
with no label at all be interpreted as Windows-1252).

>> I think the appropriate charmod criterion for the html5 text in section
>> 8.2.2.2 is http://www.w3.org/TR/charmod/#C026 "If the unique encoding
>> approach is not chosen, specifications MUST designate at least one of the
>> UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings
>> and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms
>> (encoding forms that MUST be supported by implementations of the
>> specification)." - which I think section 8.2.2.2 of html5 supports.
>> 
>> From my reading, the 'defaults to win1252' bit comes only if the user
>> specifies that a page is in ISO latin1 - ie. Assume that people don't know
>> the difference between those two.

Yes, but this is where there is a serious problem with CharMod conformance.
See more below.

>> It's not a general default.  I don't see
>> where html5 specifies what to default to if the encoding is completely
>> unknown.
>
>I suppose it's in 8.2.2.1. Determining the character encoding:
>
>"Otherwise, return an implementation-defined or user-specified default
>character encoding, with the confidence tentative. Due to its use in
>legacy content, windows-1252 is recommended as a default in
>predominantly Western demographics. In non-legacy environments, the more
>comprehensive UTF-8 encoding is recommended instead. Since these
>encodings can in many cases be distinguished by inspection, a user agent
>may heuristically decide which to use as a default."

I think that's borderline, but *much* better than the completely
useless "iso-8859-1" default in the current HTTP spec (hopefully
to be fixed in the current update). It correctly states that
windows-1252 is dependent on the region, and wouldn't be appropriate
for Asia or Eastern Europe, for example.
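
The "distinguished by inspection" idea in the quoted text can be sketched
roughly as follows. This is only an illustration in Python, not something
the HTML5 draft specifies; the function name guess_default and its
regional_fallback parameter are made up for the example.

```python
def guess_default(data: bytes, regional_fallback: str = "windows-1252") -> str:
    """Pick a tentative default encoding for unlabeled bytes.

    Assumption for illustration: bytes that decode cleanly as UTF-8 are
    treated as UTF-8; anything else gets the regional legacy default.
    """
    try:
        data.decode("utf-8")  # strict decoding; raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return regional_fallback


utf8_page = "café".encode("utf-8")
legacy_page = "café".encode("windows-1252")
```

The fallback is a parameter because the legacy default depends on the
region: a user agent in Japan or Eastern Europe would pass something
other than windows-1252.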

Some specific comments on the text while we are at it:

In general, the priority list in 8.2.2.1 is correct.

In point 3., it's not completely clear whether the encoding
returned is e.g. "UTF-16BE BOM" or "UTF-16BE". Probably the
best thing editorially is to move the word BOM from the description
column of the table to the text prior to the table.
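
To illustrate the reading where the returned encoding is "UTF-16BE" rather
than "UTF-16BE BOM", here is a minimal Python sketch. It assumes the table
covers the UTF-8, UTF-16BE and UTF-16LE byte order marks; sniff_bom is a
hypothetical name.

```python
import codecs

# Byte order marks paired with the encoding name to return.
# Assumption: the name is returned without the word "BOM".
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```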

In point 7, I find the repeated mention of heuristic algorithms
unnecessary, since they are already mentioned in point 6.
(I'm really interested in what document [UNIVCHADET] is going to point to.)

What I find missing/unclear is that the user can override the
page encoding manually. What is mentioned is a user-specified
default, which makes sense (e.g. "well, I'm mostly viewing Chinese
pages, so I set my default to GB2312"). However, what we also need
is the possibility for a user to override the encoding of a specific
page (without changing the default). This is necessary because some
pages are still mislabeled. When such an override is present,
it should come before what's currently number 1.
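
The proposed ordering can be sketched as below. This is a simplified
Python illustration, not the algorithm from the draft: the function and
parameter names are hypothetical, the heuristic step is left out, and the
confidence values for each step are an assumption.

```python
from typing import Optional, Tuple


def choose_encoding(
    page_override: Optional[str],        # hypothetical: user's choice for this one page
    transport_charset: Optional[str],    # e.g. charset from the HTTP Content-Type header
    bom_encoding: Optional[str],         # encoding implied by a BOM, if any
    meta_charset: Optional[str],         # encoding declared inside the document
    user_default: str = "windows-1252",  # user-specified or implementation default
) -> Tuple[str, str]:
    """Return (encoding, confidence), checking the sources in priority order."""
    if page_override:
        # The per-page override wins over everything, including the label.
        return page_override, "certain"
    if transport_charset:
        return transport_charset, "certain"
    if bom_encoding:
        return bom_encoding, "certain"
    if meta_charset:
        return meta_charset, "tentative"
    # Nothing else known: fall back to the default.
    return user_default, "tentative"
```

With this ordering, a page mislabeled as ISO-8859-1 can be read as GB2312
for one visit without touching the user's default.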

In 8.2.2.2, I find it unnecessary that encodings such as UTF-7 are
explicitly forbidden. I agree that these are virtually useless. However,
I don't think implementing them would cause any harm, and I don't think
they should be dignified by even mentioning them. The best thing is to just
forget them. Also, there are other charsets registered at IANA that
I'd never implement, and would tell everybody not to implement, but putting
them in a spec seems counterproductive. If you really need to give
detailed implementation advice, a list of well-used charsets would
be much more productive.

In 8.2.2.4, I don't understand the reason for or purpose of point 1,
which reads "If the new encoding is UTF-16, change it to UTF-8.".
I suspect some misunderstanding.


>>   According to charmod, this is when you should choose utf-8 or
>> utf-16.  (There may be something about that later in html5.)
>> 
>> Does that make sense?
>
>I suppose so; I'm happy with any conclusion that says I don't
>need to do more work. ;-)

Well, now let's get back to CharMod, and to the place where I think
you need to do more work. HTML5 currently says "treat data labeled
iso-8859-1 as windows-1252". This conflicts with C025 of CharMod
(http://www.w3.org/TR/charmod/#C025):

C025  [I]  [C]  An IANA-registered charset name MUST NOT be used to label text data in a character encoding other than the one identified in the IANA registration of that name.

and also C030 (http://www.w3.org/TR/charmod/#C030):

C030  [I]  When an IANA-registered charset name is recognized, receiving software MUST interpret the received data according to the encoding associated with the name in the IANA registry.

So the following sentence:
"When a user agent would otherwise use the ISO-8859-1 encoding, it must instead use the Windows-1252 encoding."
from HTML5 is clearly not conforming to CharMod. Please note that the
above items (C025 and C030) are marked as applying only to implementations
([I]) and content ([C]), not to specifications ([S]), but I think the main
reason for this is that we never even imagined that a spec would say
"you must treat FOO as BAR".
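
To make the difference concrete: the two encodings agree except for the
bytes 0x80-0x9F, which are C1 control characters in iso-8859-1 and mostly
printable characters (curly quotes, the euro sign, etc.) in windows-1252.
A small Python illustration, with sample bytes made up for the example:

```python
# "Hi" wrapped in bytes 0x93 and 0x94, as a windows-1252 authoring tool
# would write curly double quotes.
data = bytes([0x93, 0x48, 0x69, 0x94])

# Interpreting the data according to the registered meaning of the label:
as_labeled = data.decode("iso-8859-1")      # U+0093 'H' 'i' U+0094 (C1 controls)

# Interpreting the same data the way the HTML5 sentence requires:
as_practiced = data.decode("windows-1252")  # U+201C 'H' 'i' U+201D (curly quotes)

# Outside 0x80-0x9F the two encodings give the same result.
same = bytes([0xE9]).decode("iso-8859-1") == bytes([0xE9]).decode("windows-1252")
```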

I don't disagree with 'widely deployed', but I think one main reason
for this is that it took ages to get windows-1252 registered.
I think there are other ways to deal with this issue than a MUST.
One thing that I guess you could do is to just describe current
practice.


This brings me to another point: The whole HTML5 spec seems to be written
with implementers, and implementers only, in mind. This is great to help
get browser behavior aligned, but it creates an enormous problem: The
majority of potential users of the spec, namely creators of content, and
of tools creating content, are completely left out. As an example,
trying to reverse-engineer how to indicate the character encoding
inside an HTML5 document from point 4 in 8.2.2.1 is completely impossible
for content creators, webmasters, and the like.


Regards,   Martin.

>> Cheers,
>> RI
>> 
>> ============
>> Richard Ishida
>> Internationalization Lead
>> W3C (World Wide Web Consortium)
>>  
>> http://www.w3.org/International/
>> http://rishida.net/blog/
>> http://rishida.net/
>> 
>>  
>> 
>> 
>> > -----Original Message-----
>> > From: Dan Connolly [mailto:connolly@w3.org] 
>> > Sent: 29 October 2007 17:22
>> > To: Richard Ishida
>> > Cc: www-archive; Chris Wilson
>> > Subject: HTML 5 defaults to Windows-1252, where charmod 
>> > requires UTF-8/UTF-16
>> > 
>> > Richard,
>> > 
>> > These conflict:
>> > 
>> > "C027   [S]  Specifications that require a default encoding 
>> > MUST define
>> > either UTF-8 or UTF-16 as the default, or both if they define 
>> > suitable means of distinguishing them."
>> >  -- http://www.w3.org/TR/charmod/#C027
>> > 
>> > "User agents must at a minimum support the UTF-8 and 
>> > Windows-1252 encodings, but may support more." -- 8.2.2.2. 
>> > Character encoding requirements http://www.w3.org/html/wg/html5/ 
>> > 
>> > I don't think that aspect of the HTML 5 spec is going to 
>> > change; it's already ubiquitously deployed:
>> > 
>> >  "Many web browsers treat the MIME charset ISO-8859-1 as 
>> > Windows-1252 "
>> > -- http://en.wikipedia.org/wiki/Windows-1252 
>> > 
>> > Any suggestions on what to do about the conflict? It's not 
>> > clear to me why C027 is a MUST. Which WG(s) should we be talking to?
>> > 
>> > p.s. note the cc to www-archive; i.e. feel free to 
>> > copy/cite/forward anywhere.
>
>-- 
>Dan Connolly, W3C http://www.w3.org/People/Connolly/
>gpg D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Wednesday, 31 October 2007 22:25:16 UTC