This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Section 3.6.2 of Unicode Technical Report 36 says that conversion must use replacements or cause an error even for "unrecognized or 'empty' state-change sequences". But this does not happen in the current encoding algorithms. For example:

In the "hz-gb-2312" algorithm:
- 0x7E 0x7B 0x7E 0x7D 0x20 results in U+0020, rather than a decoder error followed by U+0020 (since I presume that the empty shift sequence is illegal).
- Similarly, 0x7E 0x7D 0x7E 0x7B causes no decoder error for being an empty shift sequence.

In the "iso-2022-jp" algorithm:
- 0x1B 0x24 0x40 0x1B 0x28 0x42 0x20 (and other sequences like it) results in U+0020, rather than a decoder error followed by U+0020 (since I presume that the empty shift sequence is illegal).

In the "iso-2022-kr" algorithm:
- The byte sequence 0x0E 0x0E 0x0E ... results in no characters, rather than one or more decoder errors (at least for reaching the end of the stream with no characters).
- The byte sequence 0x0F 0x0F 0x0F ... results in no characters, rather than one or more decoder errors (at least for reaching the end of the stream with no characters).
- 0x0E 0x0F 0x20 results in U+0020, rather than a decoder error followed by U+0020.

All of the cases above contain empty shift sequences that are not currently treated as decoder errors.

Should the encoding algorithms be changed to emit a decoder error if there are no characters between shift sequences in "iso-2022-jp" and "iso-2022-kr"? Or are the algorithms like this for compatibility?

Another issue is how to deal with unrecognized ISO 2022 escape sequences; I feel that the current encoding algorithms don't handle them well enough.
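To make "empty shift sequence" concrete for the byte sequences above, here is a minimal Python sketch that only reports where one state-change sequence directly follows another with no bytes in between. It is not part of any encoding algorithm: the function name find_empty_shifts and the escape table are hypothetical, the table covers only a subset of iso-2022-jp designations, and the scan is naive (it does not track double-byte modes, so it is not suitable for hz-gb-2312, where 0x7E can also appear inside a two-byte character).

```python
from typing import List, Sequence

# Assumed subset of iso-2022-jp escape sequences, for illustration only.
ISO2022_JP_ESCAPES = (
    b"\x1b(B",  # ASCII
    b"\x1b(J",  # JIS X 0201 Roman
    b"\x1b$@",  # JIS X 0208-1978
    b"\x1b$B",  # JIS X 0208-1983
)


def find_empty_shifts(data: bytes,
                      escapes: Sequence[bytes] = ISO2022_JP_ESCAPES) -> List[int]:
    """Return offsets of state-change sequences that directly follow another one."""
    offsets = []
    i = 0
    prev_was_escape = False
    while i < len(data):
        # Find a known state-change sequence starting at offset i, if any.
        match = next((e for e in escapes if data.startswith(e, i)), None)
        if match is not None:
            if prev_was_escape:
                offsets.append(i)  # nothing was encoded since the previous shift
            prev_was_escape = True
            i += len(match)
        else:
            prev_was_escape = False
            i += 1
    return offsets


# The iso-2022-jp example from this report: ESC $ @ followed at once by ESC ( B.
print(find_empty_shifts(bytes([0x1B, 0x24, 0x40, 0x1B, 0x28, 0x42, 0x20])))  # [3]
```

Passing (b"\x0e", b"\x0f") as the second argument applies the same check to the iso-2022-kr shift bytes. A shift sequence at the very end of the input is not flagged by this sketch.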
* We do not want to change algorithms except where that leads to further convergence.
* Convergence with Unicode Technical Reports is a non-goal. Unicode Technical Reports can be updated if the reality is different.
* All sequences in iso-2022-* are handled as far as I can tell. What's the problem?
I will test these sequences with different browsers and report back.
I've made a test page at this address: http://upokecenter.com/projects/iso2022.htm and tested it with Safari 5.1.7, Internet Explorer 10, Opera 12, Google Chrome 26, and Firefox 19. The results included the following:
- Safari and Chrome showed the same results, which is expected since both use the same browser engine, WebKit.
- Firefox and WebKit emit U+FFFD when they reach a shift sequence immediately after another shift sequence; IE and Opera do not.
- No browser showed a decoder error if a shift sequence occurs at the very end of the string, so this case should probably be ignored.
- Browsers differed in how they handle unrecognized escape sequences. In ASCII mode, Opera and WebKit emit U+FFFD to replace the first bytes of the sequences, while IE and Firefox emit the 0x1B escape character and the rest of the sequence as ASCII.

I will collect all the test results on another page and report them here.
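The two ways of handling an unrecognized escape sequence in ASCII mode can be modeled with a toy decoder. This is only a sketch and does not reproduce any browser exactly: the function decode_ascii_mode and the policy names are hypothetical, the only escape it recognizes is ESC ( B, double-byte modes are deliberately not implemented, and "replace the first bytes" is assumed here to mean replacing just the 0x1B byte.

```python
def decode_ascii_mode(data: bytes, policy: str) -> str:
    """Toy ASCII-mode decoder; policy is "replace" or "passthrough"."""
    if policy not in ("replace", "passthrough"):
        raise ValueError("unknown policy: " + policy)
    out = []
    i = 0
    while i < len(data):
        if data.startswith(b"\x1b(B", i):
            i += 3  # recognized: designates ASCII, produces no output
            continue
        byte = data[i]
        if byte == 0x1B:
            # Unrecognized escape: either replace the ESC byte (assumption)
            # or emit it as-is; the following bytes are then read as ASCII.
            out.append("\ufffd" if policy == "replace" else "\x1b")
        elif byte < 0x80:
            out.append(chr(byte))
        else:
            out.append("\ufffd")  # not valid in ASCII mode
        i += 1
    return "".join(out)


sample = b"\x1b(Zab"  # ESC ( Z is used here as an unrecognized sequence
print(repr(decode_ascii_mode(sample, "replace")))      # '\ufffd(Zab'
print(repr(decode_ascii_mode(sample, "passthrough")))  # '\x1b(Zab'
```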
Here's the test report: http://upokecenter.dreamhosters.com/articles/2013/04/differences-in-the-iso-2022-jp-encoding-between-browsers/
*** This bug has been marked as a duplicate of bug 27256 ***