This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 26618 - Fix mappings of legacy single byte encoding test failures for consistency with browsers
Status: RESOLVED WORKSFORME
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Richard Ishida
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-08-20 17:45 UTC by Addison Phillips
Modified: 2014-11-12 14:17 UTC
CC List: 11 users

Description Addison Phillips 2014-08-20 17:45:40 UTC
This is related to [I18N-ISSUE-371] (http://www.w3.org/International/track/issues/371)

The various tests linked from this page:

http://www.w3.org/International/tests/repository/encoding/indexes/results-indexes.en.php

... show errors and discrepancies in various legacy single byte encodings, mainly in the handling of unassigned code units. A cursory investigation suggests that the variations in these encodings stem mainly from differences in whether browsers pass unmapped bytes through, generate the U+FFFD replacement character, or generate a PUA code point. Please examine what the majority of browsers are currently doing combined with what makes sense and adapt the mapping tables appropriately.
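
The three behaviours described above can be told apart mechanically. The following is only a rough sketch, not taken from the tests: the function name `classify_unmapped_output` and the category labels are hypothetical, and it assumes that "pass through" means the byte value is reused unchanged as the code point.

```python
def classify_unmapped_output(byte: int, code_point: int) -> str:
    """Label what a browser produced for a byte with no mapping.

    The labels are illustrative names, not terms from the Encoding spec.
    """
    if code_point == 0xFFFD:
        return "replacement"   # U+FFFD REPLACEMENT CHARACTER
    if 0xE000 <= code_point <= 0xF8FF:
        return "private-use"   # Basic Multilingual Plane Private Use Area
    if code_point == byte:
        return "pass-through"  # byte value reused as the code point
    return "other"             # some unrelated character was produced


print(classify_unmapped_output(0x81, 0x81))
```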
Comment 1 Anne 2014-08-26 13:07:07 UTC
Per http://lists.w3.org/Archives/Public/www-international/2014JulSep/0189.html no changes are required. Seems there might have been a mistake in earlier editions of the tests.
Comment 2 Martin Dürst 2014-08-29 10:47:23 UTC
I'm absolutely okay if somebody points out actual mistakes in earlier editions of my tests, but as long as we have unexplained discrepancies between different versions of tests (see http://lists.w3.org/Archives/Public/www-international/2014JulSep/0198.html), it's premature to close this bug. I have therefore reopened it (I hope this is temporary).
Comment 3 Anne 2014-08-31 15:36:59 UTC
Are those tests publicly available now? It might be better to file a separate bug with links to your tests, as comment 0 does not point to them.

(I have similar results to Richard btw in http://dump.testsuite.org/encoding/single-byte-test.html which is why I thought your tests might be buggy.)
Comment 4 Richard Ishida 2014-08-31 15:49:02 UTC
My tests so far tested for the characters actually listed in the index files. Where there are gaps in the index files, Martin's tests checked for FFFD, which makes them tests for the decoding algorithm. I'm planning to change my tests and results to also check for FFFD where there is no correspondence listed in the index file.
Comment 5 Richard Ishida 2014-09-01 12:23:39 UTC
My tests and results have been updated to check what happens if there is no line for a pointer in the index file. According to the single-byte decoding algorithm, this should produce U+FFFD. See the updated results at
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
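
For reference, the behaviour being tested can be sketched roughly as follows. This is a minimal illustration and not the spec text: `decode_single_byte` and `SAMPLE_INDEX` are hypothetical names, the index is a two-entry fragment made up for the example rather than a real table, and a missing pointer is turned directly into U+FFFD as a decoder in replacement mode would do.

```python
REPLACEMENT = "\ufffd"

# Hypothetical index fragment: pointer -> code point. A real index file
# has up to 128 entries; pointers with no line are simply absent here.
SAMPLE_INDEX = {0x00: 0x20AC, 0x21: 0x0385}


def decode_single_byte(data: bytes, index: dict) -> str:
    """Decode bytes with a single-byte index, emitting U+FFFD for gaps."""
    out = []
    for byte in data:
        if byte < 0x80:
            # ASCII bytes map to the code point of the same value.
            out.append(chr(byte))
            continue
        # The pointer is the byte's offset into the upper half.
        code_point = index.get(byte - 0x80)
        out.append(REPLACEMENT if code_point is None else chr(code_point))
    return "".join(out)


decoded = decode_single_byte(b"A\x80\xa1\xaa", SAMPLE_INDEX)
print([hex(ord(ch)) for ch in decoded])
```

In this fragment, byte 0xAA has no entry, so it is the one that comes out as U+FFFD.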

I have tried to indicate, where the pass is only partial, how many errors were due to U+FFFD not being served, vs. how many were due to unexpected characters being served that are not those in the tables. I did that in the summary. For details, open the test in the relevant browser (by clicking on the link to the left of the row). See for example
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#iso-8859-6
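
The split between the two kinds of failure can be expressed as a small tally. This is a sketch only, not the code behind the results page: `tally_failures` and the bucket names are hypothetical, and it assumes the expected table already carries U+FFFD for every pointer without a line in the index file.

```python
from collections import Counter

REPLACEMENT = 0xFFFD


def tally_failures(expected: dict, actual: dict) -> Counter:
    """Count passes and the two failure kinds, keyed by pointer."""
    counts = Counter()
    for pointer, want in expected.items():
        got = actual.get(pointer)
        if got == want:
            counts["pass"] += 1
        elif want == REPLACEMENT:
            # The index has a gap here, but U+FFFD was not served.
            counts["missing_fffd"] += 1
        else:
            # A character was served that is not the one in the table.
            counts["unexpected_character"] += 1
    return counts


# Made-up values, for illustration only.
expected = {0: 0x20AC, 1: 0xFFFD, 2: 0x201A}
actual = {0: 0x20AC, 1: 0x0081, 2: 0x201B}
print(dict(tally_failures(expected, actual)))
```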

The main differences are for windows-1253 and windows-874 in Chrome/Safari/Opera, but 6 more IE boxes also turned orange.
Comment 6 Anne 2014-09-02 13:45:33 UTC
Per http://lists.w3.org/Archives/Public/www-international/2014JulSep/0286.html I believe no changes to be required. However, getting WebKit / Chromium engineers to comment on windows-1253 and windows-874 would be great.
Comment 7 Anne 2014-09-02 13:47:29 UTC
Also, perhaps Travis can give input from Microsoft's side since they deviate the most for single-byte encodings?

(Even if this bug is closed once you get to it, please do comment, we can always revisit given better data.)
Comment 8 Alexey Proskuryakov 2014-09-02 16:49:19 UTC
WebKit just uses ICU, so I would advise contacting the ICU project if there is a desire to have a custom variation of these encodings for the Web.

ICU is the place where encodings live :)
Comment 9 Jungshik Shin 2014-09-02 19:13:30 UTC
I already commented in the mail thread. 

If we test encoding as well, there'd be a lot more discrepancy between ICU and the encoding spec. For instance, ICU's single-byte tables for windows-12xx and windows-874 map the full-width ASCII block (U+FFxx) to the corresponding position in [0x20 - 0x7E].

Anyway, there's an ICU bug to add tables to match the encoding spec. 
See http://www.icu-project.org/trac/ticket/10303

http://www.icu-project.org/trac/ticket/11231 deals with the windows-874-specific issue of mapping (encode-only) box drawing and a bunch of other characters to [0x80, 0xFF], which are used for Thai characters.
Comment 10 Simon Montagu 2014-09-02 19:42:24 UTC
(In reply to Jungshik Shin from comment #9)
> If we test encoding as well, there'd be a lot more discrepancy between ICU
> and the encoding spec. For instance, ICU's single-byte tables for
> windows-12xx and windows-874 map the full-width ASCII block (U+FFxx) to the
> corresponding position 
> in [0x20 - 0x7E].  

I think that it would not be a bad thing if the Encoding spec explicitly forbade mapping any codepoint in the ASCII block to any codepoint outside it, in either direction. My gut reaction is that doing that with web content is a security bug waiting to happen.
Comment 11 Anne 2014-09-03 13:10:36 UTC
Jungshik, it's unclear to me whether those ICU encoder extensions are actually interoperable. E.g. I know for a fact Opera before Chromium does not have them. I believe I tested other browsers as well, but I can't find my test right now.

Simon, hopefully that falls out of the respective algorithms. No need to make a redundant requirement. Also note that it can never be true for utf-16le/utf-16be, and replacement's decoder.
Comment 12 Jungshik Shin 2014-09-05 21:27:02 UTC
Anne, I'm not arguing for changing the encoding spec so that U+FF01-U+FF5E is converted to 0x21 - 0x7E in windows-12xx (encoding-only mapping). I just added an observation that there are discrepancies other than those found in the test result mentioned in comment 0, because that test suite only tested for decoding.
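
To make the encoding-only mapping concrete, here is a rough sketch of an encoder with and without it. This is an illustration under stated assumptions, not ICU's or any browser's implementation: `encode_single_byte` and the `fullwidth_fallback` flag are hypothetical, and unencodable code points raise `ValueError` here rather than going through the spec's error handling.

```python
def encode_single_byte(text: str, index: dict,
                       fullwidth_fallback: bool = False) -> bytes:
    """Encode text with a single-byte index (pointer -> code point).

    With fullwidth_fallback=True, U+FF01..U+FF5E are additionally
    folded to 0x21..0x7E, mimicking the encoding-only mapping.
    """
    # Invert the index: code point -> byte in the upper half.
    reverse = {cp: pointer + 0x80 for pointer, cp in index.items()}
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif cp in reverse:
            out.append(reverse[cp])
        elif fullwidth_fallback and 0xFF01 <= cp <= 0xFF5E:
            # Full-width forms sit 0xFEE0 above their ASCII counterparts.
            out.append(cp - 0xFEE0)
        else:
            raise ValueError(f"U+{cp:04X} is not encodable")
    return bytes(out)


# Full-width "<" and ">" around ASCII text.
print(encode_single_byte("\uff1cscript\uff1e", {}, fullwidth_fallback=True))
```

With the fallback enabled, the full-width angle brackets come out as plain ASCII `<` and `>`, which is the kind of non-ASCII to ASCII folding the security concern in comment 10 is about. With it disabled, the same input is rejected.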

I'm not sure about the security implication of encoding the full-width ASCII to the ASCII range, though. 

Anyway, perhaps Blink will get rid of that encoding-only mapping for U+FF01 - U+FF5E from windows-12xx.  Then, Blink will be aligned with the spec.
Comment 13 Anne 2014-09-07 07:15:44 UTC
Reassigning to Richard so he can make sure the test suite covers encoders as well.
Comment 14 Jungshik Shin 2014-09-08 20:56:40 UTC
FYI, the chromium bug was filed to get rid of the encoding-only mapping (of U+FF01 - U+FF5E to 0x21 to 0x7E) as well as the discrepancy in windows-874 and windows-1253; http://crbug.com/412053
Comment 15 Anne 2014-11-08 10:05:13 UTC
Richard, any progress on this? https://github.com/w3c/web-platform-tests/pull/1367 demonstrates how you can test an encoder from JavaScript. It shouldn't be that hard to extrapolate something for more extensive testing, although it's a bit cumbersome since the quirks of URL parsing have an impact as well.
Comment 16 Anne 2014-11-12 14:17:07 UTC
I submitted single-byte decoder tests to web-platform-tests: https://github.com/w3c/web-platform-tests/pull/1384

Apart from the document.characterSet API, Chrome and Firefox pass all tests. For the document.characterSet API there are some casing differences. I'm hoping we can make it consistent with TextEncoder.prototype.encoding, but if not I'm happy to get a new bug report.

I have not yet written encoder tests. (The interaction with either <form> or URL makes those trickier.)

Given that comment 0 seems addressed I'm going to close this.