Re: Encoding document approach [I18N-ACTION-117]

I'm somewhat concerned about positioning the Encoding document [1] as a standard. I think it would be very helpful to describe the issues around encodings in web content and provide recommendations for handling the more commonly used encodings and use cases. The existing document has a lot of useful information in that direction. However, I don't think it's feasible to create a standard that completely prescribes the handling of all legacy encodings on the web - the swamp is just way too big.

Here are the main issues I see:

1) The document seems to be based solely on observing the behavior of browsers. There are other user agents that access web content, such as search engines or (HTML) email processors. These operate under different constraints than browsers, including the lack of a user who could override incorrect encoding labels by selecting a different encoding. They're also more difficult to experiment with.

2) The document assumes a strict mapping from labels to encodings, and doesn't say where labels come from. This may cause readers to assume that labels are directly taken from the documents or transmission protocols. In reality, many documents on the web, and even more so in emails, are mislabeled, and so some user agents use encoding detection algorithms that interpret labels as just one of several hints. (As noted above, browsers let the user override the encoding).
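To illustrate the label-as-hint idea, here is a minimal, hypothetical Python sketch (not from the original note; the function name and candidate list are my own, and real detectors in browsers and mail clients also use byte-frequency statistics, not just trial decoding):

```python
def guess_encoding(data, label=None):
    """Treat a transmitted encoding label as one hint among several.

    Try the label first, but fall back to other common candidates
    if the bytes do not actually decode under it.
    """
    candidates = []
    if label:
        candidates.append(label)
    # Common fallbacks; a real detector would use far richer heuristics.
    candidates += ["utf-8", "windows-1252"]
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except (UnicodeDecodeError, LookupError):
            continue
    # windows-1252 maps every byte, so use it as a last resort.
    return "windows-1252"
```

For example, bytes that are labeled "utf-8" but are actually windows-1252 fail the first trial decode, so the detector moves on rather than trusting the label.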

3) The document assumes that encodings are labeled with encoding names. In reality, some web sites rely on font encodings, sometimes with site-specific fonts, and so technologies such as the Padma extension [2] interpret font names as encoding identifiers.

4) I doubt that the owners of user agents would accept the requirement "User agents must not support any other encodings or labels", which would make it impossible for them to interpret content that happens to be encoded in a different form.

5) Similarly, I doubt that all owners of content will suddenly comply with the requirement "New content and formats must exclusively use the utf-8 encoding", and so user agents will not be able to rely on it. This should probably be aligned with HTML5 section 4.2.5.5, "Specifying the document's character encoding".

6) The document generally uses the Windows extension for encodings that have been extended. For some encodings, especially Japanese and Chinese encodings, there are multiple incompatible extensions, so assuming the Windows extension may cause mojibake. Also, where a web application labels its pages with the name of a standard encoding (such as iso-8859-1), it may not be prepared to handle characters from the corresponding Windows encoding (here windows-1252).
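As a concrete illustration of that last point (my example, not part of the original argument): windows-1252 assigns printable characters to the 0x80-0x9F range, which iso-8859-1 reserves for C1 control codes, so the same bytes decode differently depending on which of the two an application actually implements.

```python
# Curly quotes as produced by, e.g., Windows word processors.
data = b"\x93smart quotes\x94"

# windows-1252: 0x93/0x94 are the typographic quote characters.
as_windows = data.decode("windows-1252")
assert as_windows == "\u201csmart quotes\u201d"

# iso-8859-1: the same bytes become invisible C1 control codes,
# which an application expecting iso-8859-1 may mishandle.
as_latin1 = data.decode("iso-8859-1")
assert as_latin1 == "\x93smart quotes\x94"
```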

Regards,
Norbert

[1] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
[2] http://padma.mozdev.org/


On Apr 22, 2012, at 13:40, Phillips, Addison wrote:

> All,
> 
> Following is a draft addressing named action item. Any comments?
> 
> Addison
> 
> ====
> 
> All,
> 
> In our last teleconference I was tasked [1] with sending a note discussing the approach that the encoding document [2] is taking and the Internationalization Core WG's "temperature" on this work.
> 
> The working group is generally supportive of the idea of documenting the handling of legacy character encodings and appreciates the work completed so far. Indeed, the WG would be happy to host publication and advancement of the document, provided our charter is amended to allow us to publish on the REC track.
> 
> Some concerns have been expressed about the approach the document takes. There are two main ones that we think should be addressed in the short term:
> 
> 1. The document describes various character encoding schemes without placing what we feel is the correct emphasis on migrating from legacy encodings to Unicode. More attention should be paid to this and to leveraging CharMod [3].
> 
> The above is fairly minor. A more impactful concern is:
> 
> 2. The document proceeds from observations of how character encodings *appear* to be handled in various browsers/user-agents. Implementers may find this documentation useful, but several important user-agents are thought to be implemented in ways that are divergent from this document. We think that more direct information about character encoding conversion from implementers should be sought to form the description of various encoders/decoders.
> 
> We hope to see advancement of this document in the near future.
> 
> Yours/etc.
> 
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N WG)
> 
> Internationalization is not a feature.
> It is an architecture.
> 

Received on Tuesday, 24 April 2012 19:27:59 UTC