This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Per Jungshik, Chromium and WebKit follow ICU, which follows RFC 1468 (https://tools.ietf.org/html/rfc1468), whereas Gecko has its own, quite different implementation. I forget what Opera did, but that no longer seems relevant. Internet Explorer is quite special as well, allowing nested shift_jis. Given this, I'm inclined to align the decoder with the aforementioned RFC, while of course keeping our error and EOF handling. Note that by fixing bug 26885 we already took one step in this direction.
*** Bug 21556 has been marked as a duplicate of this bug. ***
*** Bug 16685 has been marked as a duplicate of this bug. ***
*** Bug 26886 has been marked as a duplicate of this bug. ***
It seems we do not want to follow the RFC exactly. I found numerous mismatches between browsers and the RFC:

* Starting with an ESC sequence is not an error in browsers.
* EOF in two-byte mode is not an error in browsers.
* EOF after an ESC sequence is not an error in browsers.

Here is an outline of how I plan to rewrite this:

* Add a Roman state.
* Turn SI, SO, and an invalid ESC sequence (replacing only the ESC) into U+FFFD.
* An invalid ESC sequence means switching to the ASCII state.
* An ESC sequence after an ESC sequence is treated as an invalid ESC sequence (triggers ASCII).

This is also based in part on great research from a duplicate bug: http://upokecenter.dreamhosters.com/articles/2013/04/differences-in-the-iso-2022-jp-encoding-between-browsers/
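The outline above can be sketched as a small state machine. This is only an illustration of one literal reading of that outline, not the algorithm that was eventually committed. The names `decode`, `decode_single`, `decode_pair`, and `ESCAPES` are hypothetical. Two-byte characters are looked up through Python's `euc_jp` codec (JIS X 0208 bytes with the high bit set) as a stand-in for a real JIS X 0208 index, so individual mappings may differ from what browsers use. Treating a lone lead byte at EOF and control bytes in two-byte mode as errors is an assumption of this sketch.

```python
ESC, REPLACEMENT = 0x1B, "\ufffd"

# Escape sequences from RFC 1468, keyed by the two bytes that follow ESC.
ESCAPES = {
    b"(B": "ascii",
    b"(J": "roman",
    b"$@": "two-byte",
    b"$B": "two-byte",
}


def decode_single(byte, state):
    """Decode one byte in the ASCII or Roman state."""
    if byte in (0x0E, 0x0F) or byte > 0x7F:
        return REPLACEMENT  # SO, SI, and non-ASCII bytes become U+FFFD
    if state == "roman" and byte == 0x5C:
        return "\u00a5"  # YEN SIGN replaces backslash
    if state == "roman" and byte == 0x7E:
        return "\u203e"  # OVERLINE replaces tilde
    return chr(byte)


def decode_pair(lead, trail):
    """Decode a JIS X 0208 pair by mapping it onto EUC-JP (assumption)."""
    try:
        return bytes([lead | 0x80, trail | 0x80]).decode("euc_jp")
    except UnicodeDecodeError:
        return REPLACEMENT


def decode(data):
    out, state, i = [], "ascii", 0
    after_escape = False  # True right after a valid escape sequence
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            target = ESCAPES.get(bytes(data[i + 1:i + 3]))
            if target is None or after_escape:
                # Invalid or back-to-back escape: replace only the ESC,
                # fall back to ASCII, and reinterpret the bytes that follow.
                out.append(REPLACEMENT)
                state, after_escape = "ascii", False
                i += 1
            else:
                state, after_escape = target, True
                i += 3
            continue
        after_escape = False
        if state == "two-byte":
            trail = data[i + 1] if i + 1 < len(data) else None
            if 0x21 <= byte <= 0x7E and trail is not None and 0x21 <= trail <= 0x7E:
                out.append(decode_pair(byte, trail))
                i += 2
            else:
                # Bad lead or trail: emit U+FFFD without consuming the trail,
                # so an ESC in trail position is still recognized.
                out.append(REPLACEMENT)
                i += 1
        else:
            out.append(decode_single(byte, state))
            i += 1
    # Reaching EOF in any state, including right after an escape, is not an error.
    return "".join(out)
```

For example, under these assumptions `decode(b"\x1b(Xa")` replaces only the ESC and yields U+FFFD followed by `(Xa`, while `decode(b"a\x1b$B")` yields just `a` because ending the input right after an escape sequence is not treated as an error.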
Also, despite not being documented in that RFC, "ESC ( I" for Katakana is supported in Chromium and Gecko, so we keep that (Wikipedia suggests it's from iso-2022-jp-3).
Notes on reverse engineering error handling:

* An erroneous escape sequence results in the ASCII state in Firefox. Chrome replaces the first byte (the ESC) with U+FFFD and reinterprets what follows. (There's an edge case: if the input is 0x1B 0x24, Chrome replaces both bytes with U+FFFD.)
* A repeated escape sequence results in U+FFFD in Chrome, as well as a switch to the newly indicated state. Firefox just switches.
* Firefox does not seem to support the ASCII escape sequence in the ASCII state.
* Chrome recognizes escape sequences when looking at trail bytes.

I think I have a model now that roughly aligns with Chrome but corrects some obvious mistakes. I have also written tests, which I'll submit to web-platform-tests. I still need to write it out.
Tests: https://github.com/w3c/web-platform-tests/pull/1367
Commit: https://github.com/whatwg/encoding/commit/19b0ebf0e48c3a607ab7623b5b272642dd59d6e7

Thank you for listening.