Re: Encoding document approach [I18N-ACTION-117]

On Tue, 24 Apr 2012 21:27:22 +0200, Norbert Lindenberg  
<w3@norbertlindenberg.com> wrote:
> I'm somewhat concerned about positioning the Encoding document [1] as a  
> standard. I think it would be very helpful to describe the issues around  
> encodings in web content and provide recommendations for handling the  
> more commonly used encodings and use cases. The existing document has a  
> lot of useful information in that direction. However, I don't think it's  
> feasible to create a standard that completely prescribes the handling of  
> all legacy encodings on the web - the swamp is just way too big.

I am a bit more ambitious: I think it is feasible.


> 1) The document seems to be based solely on observing the behavior of  
> browsers. There are other user agents that access web content, such as  
> search engines or (HTML) email processors. These operate under different  
> constraints than browsers, including the lack of a user who could  
> override incorrect encoding labels by selecting a different encoding.  
> They're also more difficult to experiment with.

Indeed, so a proper standard should help them even more. I think this is  
largely analogous to other problems we have solved, such as HTML parsing.  
By defining how the majority of HTML consumers actually consume HTML (and  
getting them closer to each other in the process), the whole ecosystem  
benefits.


> 2) The document assumes a strict mapping from labels to encodings, and  
> doesn't say where labels come from. This may cause readers to assume  
> that labels are directly taken from the documents or transmission  
> protocols. In reality, many documents on the web, and even more so in  
> emails, are mislabeled, and so some user agents use encoding detection  
> algorithms that interpret labels as just one of several hints. (As noted  
> above, browsers let the user override the encoding).

This is actually not true. Browsers only resort to encoding sniffing when  
a label is not recognized (i.e. when the "get an encoding" algorithm  
returns failure). This is defined in detail in HTML; text/plain and HTML  
are the only places where sniffing is required. Browsers do not resort to  
encoding sniffing for external scripts, style sheets, or XML.
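
To illustrate when sniffing kicks in (a rough Python sketch, not the  
spec's actual algorithm; the label table and fallbacks here are  
deliberately simplified assumptions):

  # Only an unrecognized label leads to sniffing, and sniffing is limited
  # to HTML and text/plain.
  LABELS = {
      "utf-8": "utf-8", "utf8": "utf-8",
      "iso-8859-1": "windows-1252", "latin1": "windows-1252",
      "shift_jis": "shift_jis",
  }

  def get_an_encoding(label):
      # Returns the encoding for a label, or None on failure.
      return LABELS.get(label.strip().lower()) if label else None

  def decide_encoding(label, content_type, sniff):
      enc = get_an_encoding(label)
      if enc is not None:
          return enc                  # recognized label: never sniff
      if content_type in ("text/html", "text/plain"):
          return sniff()              # the only place sniffing happens
      return "utf-8"                  # scripts/CSS/XML: no sniffing
                                      # (real fallback rules differ per format)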


> 3) The document assumes that encodings are labeled with encoding names.  
> In reality, some web sites rely on font encodings, sometimes with  
> site-specific fonts, and so technologies such as the Padma extension [2]  
> interpret font names as encoding identifiers.

I don't think this influences the architecture that is in place. You can  
use PUA/custom fonts and violate all kinds of standards, just like you can  
use display:none on your root element, but that does not mean encoders or  
decoders work any differently.


> 4) I doubt that the owners of user agents would accept the requirement  
> "User agents must not support any other encodings or labels", which  
> would make it impossible for them to interpret content that happens to  
> be encoded in a different form.

If there is evidence that supporting an additional encoding is beneficial,  
I am sure the other user agents would be happy to support it too. The goal  
here is to foster interoperability and most definitely not to step away  
from hard problems before we are even confronted with them.


> 5) Similarly, I doubt that all owners of content will suddenly comply  
> with the requirement "New content and formats must exclusively use the  
> utf-8 encoding", and so user agents will not be able to rely on it. This  
> should probably be aligned with HTML5 section 4.2.5.5, "Specifying the  
> document's character encoding".

There is a difference between what content must do and what user agents  
must do, as user agents have to deal with the errors that content  
contains. The requirement that content use utf-8 exists because many APIs  
and new formats only work if you are using utf-8. If you use anything but  
utf-8 you are going to have a bad time. (Web Workers will not work,  
WebSocket will not work, XMLHttpRequest.send() will not work, application  
manifests will break, URL query parameters will be strangely encoded, form  
submission will be ambiguous as to whether &#...; was entered by the user  
or stands for a code point the encoding could not represent, etc.)
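
To make the form submission point concrete (a small Python illustration;  
the xmlcharrefreplace error handler mimics what browsers do with  
characters the page encoding cannot represent):

  page_encoding = "windows-1252"

  typed_rupee = "\u20b9"        # U+20B9 INDIAN RUPEE SIGN, not in windows-1252
  typed_literal = "&#8377;"     # the user literally typed these characters

  print(typed_rupee.encode(page_encoding, "xmlcharrefreplace"))  # b'&#8377;'
  print(typed_literal.encode(page_encoding))                     # b'&#8377;'
  # Both submissions reach the server as the same bytes; with utf-8 the
  # ambiguity cannot arise.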

A bug has been filed on HTML to have it make use of the Encoding Standard.  
If HTML cannot live with the requirements in it and its editors do not  
find the above argument persuasive enough, the requirements will be  
reevaluated.


> 6) The document generally uses the Windows extension for encodings that  
> have been extended. For some encodings, especially Japanese and Chinese  
> encodings, there are multiple incompatible extensions, so assuming the  
> Windows extension may cause mojibake.

All browsers (including those on Mac) use the Windows extensions, as  
Microsoft has been the dominant force in that area of the world for quite  
some time. Apart from big5, most browsers are pretty close to each other.  
They usually differ in a few code points, PUA exposure, the labels that  
are supported, and error handling details.
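
For example, the Japanese case (shown here with Python's codecs, where  
shift_jis is the strict JIS X 0208 mapping and cp932 the Windows  
extension; the exact code points are just illustrative):

  print(b"\x81\xca".decode("shift_jis"))  # ¬  U+00AC NOT SIGN
  print(b"\x81\xca".decode("cp932"))      # ￢ U+FFE2 FULLWIDTH NOT SIGN
  print(b"\x87\x40".decode("cp932"))      # ① NEC extension, Windows only
  # b"\x87\x40".decode("shift_jis") raises UnicodeDecodeError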


> Also, where a web application labels its pages with the name of a  
> standard encoding (such as iso-8859-1), it may not be prepared to handle  
> characters from the corresponding Windows encoding (here windows-1252).

As I mentioned before, such content *relies* on handling characters from  
the corresponding Windows encoding.
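
E.g. a page labeled iso-8859-1 that contains windows-1252 bytes such as  
0x80 or 0x93/0x94 only displays as intended when the label is decoded as  
windows-1252 (Python illustration):

  data = b"price: \x80 5, \x93quoted\x94"

  print(data.decode("iso-8859-1"))    # 0x80-0x9F become invisible C1 controls
  print(data.decode("windows-1252"))  # price: € 5, “quoted”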


> On Apr 22, 2012, at 13:40 , Phillips, Addison wrote:
>> 1. The document describes various character encoding schemes without  
>> placing, we feel, the correct emphasis on migrating from legacy  
>> encodings to Unicode. More attention should be paid to this and to  
>> leveraging CharMod [3].

I have tried to put emphasis on this by marking everything but utf-8 as  
legacy. Suggestions are more than welcome, however.


>> 2. The document proceeds from observations of how character encodings  
>> *appear* to be handled in various browsers/user-agents. Implementers  
>> may find this documentation useful, but several important user-agents  
>> are thought to be implemented in ways that are divergent from this  
>> document. We think that more direct information about character  
>> encoding conversion from implementers should be sought to form the  
>> description of various encoders/decoders.

The work is based to a large extent on reverse engineering and, where data  
is available, on picking the best alternative. (Though so far I have  
avoided specifying anything that maps to the PUA.)

Getting more feedback from implementors would be great of course. I myself  
have a pretty good channel with Opera, and engineers from Mozilla have  
been filing bugs as well as making changes to Gecko. I have attempted to  
reach out to Chromium's expert, but have had no luck reaching him. Shawn  
Steele from Microsoft said he did not have resources to look at the  
standard and I have not approached Apple thus far.

Kind regards,


-- 
Anne van Kesteren
http://annevankesteren.nl/
