16685 – iso-2022-jp decoder feedback

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: iso-2022-jp decoder feedback

Status:	RESOLVED DUPLICATE of bug 27256

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC Windows 3.1

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

URL:
Whiteboard:
Keywords:

Duplicates (1):	21055
Depends on:
Blocks:	26886

Reported:	2012-04-10 16:25 UTC by Anne
Modified:	2014-11-06 10:44 UTC
CC List:	5 users

See Also:

Attachments

Description Anne 2012-04-10 16:25:56 UTC

From Øistein:

-- Escape start 2, Escape middle 6, Escape final 2:  Am I right in thinking that no ESC character will be emitted if the escape sequence is incomplete or unrecognised?  Should it not be?
 
-- Lead state: Are you sure \n (0x0A) should switch to ASCII?  I am not sure about this, but the HZ algorithm seems to say something similar, which is probably wrong, hence my suspicion.
 
-- Trail state: Is it correct that ESC should not be recognised as the start of an escape sequence?  (It probably is, I just do not know and mention it whilst I remember.)
 
Perhaps you can get away with not supporting SO/SI and 8-bit characters as ways of encoding katakana.
 
The escape sequence ESC ( H should perhaps be supported.

Comment 1 pub-w3 2012-05-06 15:13:32 UTC

Tests for various encoding errors:

<p>Space (' ') between two-byte sequences:  \x1b$@VP VPVP  VPVP VP\x1b(J
<p>Newline ('\\n') between two-byte sequences:  \x1b$@VP\nVPVP\n\nVPVP\nVP\x1b(J
<p>Space (' ') inside two-byte sequences:  \x1b$@VPV PVP VPVP\x1b(J
<p>Newline ('\\n') inside two-byte sequences:  \x1b$@VPV\nPVP\nVPVP\x1b(J

<p>Aligned escape sequence: \x1b$@VPVP\x1b(IVP\x1b(JVP
<p>Misaligned escape sequence: \x1b$@VPV\x1b(IVP\x1b(JVP

<p>Incomplete escape sequence:  VP\x1BWQ\x1B\$XR\x1B\$@VP\x1b(JVP

<p>Single 8-bit katakana in 2-byte mode:  \xa6\xa7\x1b$@VP\xa6\xa7VP\xa6VP\x1b(JVP\x1b(JVP
<p>2 shift-in katakana in 2-byte mode:  VP\x0E&'\x0FVP\x1b$@VP\x0E&'\x0FVPVP\x1b(J
<p>3 shift-in katakana in 2-byte mode:  VP\x0E&'(\x0FVP\x1b$@VP\x0E&'(\x0FVPVP\x1b(J

<p>Misaligned shift-in katakana in 2-byte mode:  VPV\x0E&\x0FVP\x1b$@VPV\x0E&\x0FVP\x1b(J
<p>Aligned shift-out in 2-byte mode:  \x1b$@VP\x0FVP\x1b(J
<p>Misaligned shift-out in 2-byte mode:  \x1b$@VPV\x0FVP\x1b(J


Testing in IE (IE6 & IE9), Safari, Firefox and Opera gives the following results:

Firefox switches to ASCII whenever whitespace appears in 2-byte mode.  Opera switches to ASCII when it sees '\n', but not ' '.  Safari switches when it encounters '\n' between 2-byte sequences, but not inside one.  IE does not switch at all.  Only Safari allows ' ' between two-byte sequences (IE does the same for ISO-2022-KR, and it would be nice to avoid this difference between KR and JP).  The resulting variation between browsers is further compounded by IE's Shift-JIS-inspired handling of undefined bytes.  Removing special handling of '\n' would make the draft more in line with IE and Safari.
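
To make the '\n' difference concrete, here is a minimal sketch (not the draft's algorithm and not any browser's code) that only groups bytes into pairs and singles without mapping them to characters. The function name tokenize, the newline_resets flag and the mode labels are made up for this illustration, and only three of the escape sequences from the tests are recognised.

```python
# Escape sequences taken from the tests above; the mode labels are invented here.
ESCAPES = {b"\x1b$@": "two-byte", b"\x1b(J": "one-byte", b"\x1b(I": "one-byte"}

def tokenize(data: bytes, newline_resets: bool):
    """Group bytes into ('pair', bb) and ('single', b) tokens.

    newline_resets=True models a decoder that drops back to one-byte mode
    when it sees 0x0A where a lead byte is expected; False models one that
    passes the newline through and stays in two-byte mode.
    """
    mode = "one-byte"
    tokens = []
    i = 0
    while i < len(data):
        if data[i:i + 3] in ESCAPES:
            # Escapes are only looked for where a lead byte could start.
            mode = ESCAPES[data[i:i + 3]]
            i += 3
        elif mode == "two-byte" and data[i] != 0x0A and i + 1 < len(data):
            tokens.append(("pair", data[i:i + 2]))
            i += 2
        else:
            if data[i] == 0x0A and newline_resets:
                mode = "one-byte"
            tokens.append(("single", data[i:i + 1]))
            i += 1
    return tokens

between = b"\x1b$@VP\nVPVP\n\nVPVP\nVP\x1b(J"  # newline between pairs
inside = b"\x1b$@VPV\nPVP\nVPVP\x1b(J"         # newline in trail position

for policy in (True, False):
    pairs = [t for t in tokenize(between, policy) if t[0] == "pair"]
    print("newline_resets =", policy, "->", len(pairs), "pairs")
```

In this sketch the "between" input keeps all six pairs only when the newline does not reset the mode, while the "inside" input is grouped identically under both policies because the newline is consumed as a trail byte.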

Opera and Safari do recognise misaligned escape sequences (i.e., ones starting with an ESC byte in the trail state), but Firefox and IE do not.

As for ISO-2022-KR, IE and Firefox convert incomplete/unrecognised escape sequences to characters byte for byte with no (other) indication of error.  This seems reasonable.
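
The byte-for-byte treatment described above can be sketched roughly as follows. This is an illustration under assumptions, not IE's or Firefox's implementation: the name split_escapes and the KNOWN set are hypothetical, and the sketch ignores two-byte mode entirely so that it only shows what happens to an ESC byte that does not start a recognised sequence.

```python
# An assumed set of recognised three-byte escape sequences for this sketch.
KNOWN = {b"\x1b(B", b"\x1b(J", b"\x1b(I", b"\x1b$@", b"\x1b$B"}

def split_escapes(data: bytes):
    """Return ('escape', seq) for recognised sequences, ('byte', b) otherwise.

    An ESC that does not begin a recognised sequence is passed through as an
    ordinary byte, and the bytes after it are examined again one at a time.
    """
    tokens = []
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in KNOWN:
            tokens.append(("escape", seq))
            i += 3
        else:
            tokens.append(("byte", data[i:i + 1]))
            i += 1
    return tokens

# The "incomplete escape sequence" test input from comment 1.
sample = b"VP\x1bWQ\x1b$XR\x1b$@VP\x1b(JVP"
for kind, value in split_escapes(sample):
    print(kind, value)
```

Nothing is dropped in this approach: joining the token values gives back the input, and the two stray ESC bytes show up as plain bytes rather than as errors.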

No browsers allow 8-bit katakana in two-byte mode, so the potential alignment problem created by an odd number of 8-bit katakana in two-byte mode can probably be ignored.

On the other hand, IE allows shift-in/shift-out to be used to encode katakana in two-byte mode.  If this is not taken into account, an odd number of katakana will put the decoder out of sync.  Safari seems to handle this somehow; Opera misaligns; Firefox ends up in the ASCII state.
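
The alignment problem can be shown with a small sketch that only groups the bytes of a two-byte-mode segment, without decoding them. The name pair_up and the honor_shift flag are hypothetical; with honor_shift=False the bytes 0x0E and 0x0F are treated like any other byte, which is an assumption about a decoder that ignores them, not a description of a particular browser.

```python
SO, SI = 0x0E, 0x0F  # the bytes that bracket the katakana in the tests above

def pair_up(segment: bytes, honor_shift: bool):
    """Group the bytes that follow ESC $ @ into pairs.

    With honor_shift=True, bytes between 0x0E and 0x0F are taken one at a
    time as katakana; with honor_shift=False every byte is paired blindly.
    """
    tokens = []
    shifted = False
    i = 0
    while i < len(segment):
        b = segment[i]
        if honor_shift and b in (SO, SI):
            shifted = (b == SO)
            i += 1
        elif shifted:
            tokens.append(("kana", segment[i:i + 1]))
            i += 1
        else:
            chunk = segment[i:i + 2]
            # A lone byte at the end has no trail byte to pair with.
            tokens.append(("pair" if len(chunk) == 2 else "stray", chunk))
            i += 2
    return tokens

two = b"VP\x0e&'\x0fVPVP"     # 2 katakana: 4 extra bytes, pairing stays aligned
three = b"VP\x0e&'(\x0fVPVP"  # 3 katakana: 5 extra bytes, pairing slips

print(pair_up(three, honor_shift=True))
print(pair_up(three, honor_shift=False))
```

With three katakana and no shift handling, every pair after the first is off by one byte and a stray byte is left over; with two katakana the later VP pairs still line up, although the shifted bytes themselves are misread as pairs.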

IE does not interpret misaligned shift-out or shift-in characters.  A superfluous shift-out in two-byte mode is ignored by IE, results in a U+FFFD in Safari but no other problems, misalignment in Opera, and ASCII mode in Firefox.

Not tested:  the effect of other control characters and 8-bit characters.

Comment 2 Jungshik Shin 2013-08-28 18:13:33 UTC

ICU (as used by Chrome) matches RFC 1468 ( http://www.ietf.org/rfc/rfc1468.txt ) when it comes to handling ESC, SI and SO in the middle of a single byte sequence. The current spec (and Firefox behavior) is different from that.

Comment 3 Anne 2013-09-05 14:42:16 UTC

*** Bug 21055 has been marked as a duplicate of this bug. ***

Comment 4 Anne 2013-12-04 17:33:45 UTC

iso-2022-jp-* demos in http://dump.testsuite.org/encoding/ show that in IE ASCII state is basically "shift_jis state".

Comment 5 Jungshik Shin 2014-04-30 23:34:17 UTC

(In reply to Anne from comment #4)
> iso-2022-jp-* demos in http://dump.testsuite.org/encoding/ show that in IE
> ASCII state is basically "shift_jis state".

Ick... To me, that's going too far. Mislabelling SJIS pages as ISO-2022-JP is not what I'd be generous about. I'd rather break those pages and encourage them to switch to UTF-8 or correct their label (to Shift_JIS).

Comment 6 Anne 2014-11-06 10:44:37 UTC


*** This bug has been marked as a duplicate of bug 27256 ***