27256 – revamp iso-2022-jp decoder/encoder

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Duplicates (3):	16685 21556 26886

Reported:	2014-11-06 10:44 UTC by Anne
Modified:	2016-02-22 08:23 UTC
CC List:	9 users

Description Anne 2014-11-06 10:44:14 UTC

Per Jungshik, Chromium and WebKit follow ICU, which follows

  https://tools.ietf.org/html/rfc1468

whereas Gecko has its own implementation which is quite a bit different. I forgot what Opera did, but it no longer seems relevant. Internet Explorer is quite special as well, allowing nested shift_jis.

Given this, I'm inclined to align the decoder with the aforementioned RFC, while keeping the current error and EOF handling, of course. Note that by fixing bug 26885 we already took one step in this direction.

Comment 1 Anne 2014-11-06 10:44:31 UTC

*** Bug 21556 has been marked as a duplicate of this bug. ***

Comment 2 Anne 2014-11-06 10:44:37 UTC

*** Bug 16685 has been marked as a duplicate of this bug. ***

Comment 3 Anne 2014-11-06 10:44:41 UTC

*** Bug 26886 has been marked as a duplicate of this bug. ***

Comment 4 Anne 2014-11-06 17:40:17 UTC

It seems we do not want to follow the RFC exactly. I found numerous mismatches between browsers and the RFC:

* Starting with an ESC sequence is not an error in browsers.
* EOF in two-byte mode is not an error in browsers.
* EOF after ESC sequence is not an error in browsers.

Here is an outline of how I plan to rewrite this:

* Add Roman state
* Turn SI / SO / invalid ESC sequence (only replace ESC) into U+FFFD
* Invalid ESC sequence means switch to ASCII state
* ESC sequence after ESC sequence is invalid ESC sequence (triggers ASCII)

This is also based in part on great research from a duplicate bug: http://upokecenter.dreamhosters.com/articles/2013/04/differences-in-the-iso-2022-jp-encoding-between-browsers/
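
The outline above can be sketched roughly as follows. This is only an illustration of the four bullet points as written in this comment, not the algorithm that eventually landed in the Encoding Standard. The names (decode_sketch, ESCAPES, ROMAN_OVERRIDES) are made up for the example. Decoding two-byte characters by setting the high bits and going through Python's euc_jp codec is a shortcut of the sketch, and the handling of a dangling lead byte and of out-of-range bytes in the Katakana state is an assumption.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F
REPLACEMENT = "\ufffd"

# Bytes following ESC, and the decoder state each sequence selects.
ESCAPES = {
    b"(B": "ascii",
    b"(J": "roman",
    b"(I": "katakana",
    b"$@": "jis0208",
    b"$B": "jis0208",
}

# The Roman state is ASCII except for these two bytes.
ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_sketch(data: bytes) -> str:
    out = []
    state = "ascii"
    after_escape = False  # True only directly after a valid escape sequence
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            target = ESCAPES.get(data[i + 1:i + 3])
            if target is None or after_escape:
                # Invalid or back-to-back escape: replace only the ESC byte,
                # switch to ASCII, and reinterpret the bytes that follow.
                out.append(REPLACEMENT)
                state = "ascii"
                after_escape = False
                i += 1
            else:
                state = target
                after_escape = True
                i += 3
            continue
        after_escape = False
        if byte in (SO, SI) or byte > 0x7F:
            out.append(REPLACEMENT)
            i += 1
        elif state == "ascii":
            out.append(chr(byte))
            i += 1
        elif state == "roman":
            out.append(ROMAN_OVERRIDES.get(byte, chr(byte)))
            i += 1
        elif state == "katakana":
            # Assumption: bytes outside 0x21..0x5F are errors in this state.
            in_range = 0x21 <= byte <= 0x5F
            out.append(chr(0xFF61 + byte - 0x21) if in_range else REPLACEMENT)
            i += 1
        else:  # jis0208: two bytes per character
            pair = data[i:i + 2]
            if len(pair) == 2 and all(0x21 <= b <= 0x7E for b in pair):
                # Shortcut: EUC-JP is JIS X 0208 with the high bits set.
                euc = bytes(b | 0x80 for b in pair)
                out.append(euc.decode("euc_jp", errors="replace"))
                i += 2
            else:
                # Assumption: a bad or dangling lead byte becomes U+FFFD.
                out.append(REPLACEMENT)
                i += 1
    return "".join(out)
```

With this sketch, input that starts with an escape sequence, or that ends right after one or while still in two-byte mode, produces no error, matching the browser behavior listed above.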

Comment 5 Anne 2014-11-06 21:52:51 UTC

Also, despite not being documented in that RFC, "ESC ( I" for Katakana is supported in Chromium and Gecko, so we keep that (Wikipedia suggests it's from iso-2022-jp-3).

Comment 6 Anne 2014-11-07 14:31:16 UTC

Notes on reverse engineering error handling:

* An erroneous escape sequence results in ASCII state in Firefox. Chrome replaces the first byte (the ESC) with U+FFFD and reinterprets what follows. (There's an edge case where if the input is 0x1B 0x24, Chrome replaces both with U+FFFD.)
* A repeated escape sequence results in U+FFFD in Chrome, as well as switching to the newly indicated state. Firefox just switches.
* Firefox does not seem to support the ASCII escape sequence in the ASCII state.
* Chrome recognizes escape sequences when looking at trail bytes.
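
To make the second note concrete, here is a toy model of the two behaviors for a repeated escape sequence. It only accepts input made of valid escape sequences and says nothing about how either browser is actually implemented. The function name, the policy labels, and the reading of "repeated" as "directly following another escape sequence" are assumptions of the sketch.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"

# Bytes following ESC, and the decoder state each sequence selects.
ESCAPES = {
    b"(B": "ascii",
    b"(J": "roman",
    b"(I": "katakana",
    b"$@": "jis0208",
    b"$B": "jis0208",
}


def run_escapes(data: bytes, policy: str) -> Tuple[str, str]:
    """Return (output, final_state) for input made only of valid escapes."""
    if policy not in ("chrome-like", "firefox-like"):
        raise ValueError(policy)
    output, state = "", "ascii"
    for n, start in enumerate(range(0, len(data), 3)):
        chunk = data[start:start + 3]
        if chunk[:1] != b"\x1b" or chunk[1:] not in ESCAPES:
            raise ValueError("this sketch only accepts valid escape sequences")
        if n > 0 and policy == "chrome-like":
            # A repeated escape is flagged with U+FFFD ...
            output += REPLACEMENT
        # ... but under both policies the new state still takes effect.
        state = ESCAPES[chunk[1:]]
    return output, state
```

For the input ESC ( J followed by ESC $ B, the chrome-like policy yields one U+FFFD and the firefox-like policy yields nothing; both end in the two-byte state.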

I think I have a model now that roughly aligns with Chrome, but corrects some obvious mistakes. I have also written tests which I'll submit to web-platform-tests.

Still need to write it out.

Comment 7 Anne 2014-11-08 09:53:02 UTC

Tests: https://github.com/w3c/web-platform-tests/pull/1367

Commit: https://github.com/whatwg/encoding/commit/19b0ebf0e48c3a607ab7623b5b272642dd59d6e7

Thank you for listening.