<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>27256</bug_id>
          
          <creation_ts>2014-11-06 10:44:14 +0000</creation_ts>
          <short_desc>revamp iso-2022-jp decoder/encoder</short_desc>
          <delta_ts>2016-02-22 08:23:43 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Anne">annevk</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>budryan23</cc>
    
    <cc>e.mojarro11</cc>
    
    <cc>hsivonen</cc>
    
    <cc>jshin</cc>
    
    <cc>mike</cc>
    
    <cc>poccil</cc>
    
    <cc>smontagu</cc>
    
    <cc>travil</cc>
    
    <cc>www-international</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>114591</commentid>
    <comment_count>0</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-06 10:44:14 +0000</bug_when>
    <thetext>Per Jungshik, Chromium and WebKit follow ICU, which follows

  https://tools.ietf.org/html/rfc1468

whereas Gecko has its own implementation which is quite a bit different. I forgot what Opera did, but it no longer seems relevant. Internet Explorer is quite special as well, allowing nested shift_jis.

Given this, I&apos;m inclined to align the decoder with the aforementioned RFC, though keeping error and EOF handling of course. Note that by fixing bug 26885 we already took one step in this direction.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114593</commentid>
    <comment_count>1</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-06 10:44:31 +0000</bug_when>
    <thetext>*** Bug 21556 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114595</commentid>
    <comment_count>2</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-06 10:44:37 +0000</bug_when>
    <thetext>*** Bug 16685 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114597</commentid>
    <comment_count>3</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-06 10:44:41 +0000</bug_when>
    <thetext>*** Bug 26886 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114612</commentid>
    <comment_count>4</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-06 17:40:17 +0000</bug_when>
    <thetext>It seems we do not want to follow the RFC exactly. I found numerous mismatches between browsers and the RFC:

* Starting with an ESC sequence is not an error in browsers.
* EOF in two-byte mode is not an error in browsers.
* EOF after an ESC sequence is not an error in browsers.

Here is an outline of how I plan to rewrite this:

* Add a Roman state
* Turn SI / SO / an invalid ESC sequence (replacing only the ESC) into U+FFFD
* An invalid ESC sequence means switching to the ASCII state
* An ESC sequence right after an ESC sequence is an invalid ESC sequence (triggers ASCII)
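
For illustration, a minimal Python sketch of one reading of the outline above, covering only the escape-sequence and SI / SO handling. The names (State, ESCAPES, scan) are made up for this sketch, the escape table is limited to the four sequences from RFC 1468, and treating a truncated ESC sequence at EOF as invalid is an assumption. It is not the algorithm from the specification and it does not map bytes to characters.

```python
from enum import Enum, auto

ESC = 0x1B
SHIFT_OUT = 0x0E  # SO
SHIFT_IN = 0x0F   # SI
REPLACEMENT = 0xFFFD


class State(Enum):
    ASCII = auto()
    ROMAN = auto()
    TWO_BYTE = auto()


# The two bytes following ESC, mapped to the state they select (RFC 1468).
ESCAPES = {
    (0x28, 0x42): State.ASCII,     # ESC ( B
    (0x28, 0x4A): State.ROMAN,     # ESC ( J
    (0x24, 0x40): State.TWO_BYTE,  # ESC $ @
    (0x24, 0x42): State.TWO_BYTE,  # ESC $ B
}


def scan(data):
    # Returns (state, value) pairs. The value is either a raw byte that
    # still has to be decoded in that state, or REPLACEMENT (U+FFFD).
    state = State.ASCII
    after_escape = False  # True directly after a valid ESC sequence
    out = []
    i = 0
    while i != len(data):
        byte = data[i]
        if byte == ESC:
            target = ESCAPES.get(tuple(data[i + 1:i + 3]))
            if target is None or after_escape:
                # Invalid ESC sequence: replace only the ESC, switch to
                # ASCII, and reinterpret the bytes that follow.
                state = State.ASCII
                after_escape = False
                out.append((state, REPLACEMENT))
                i += 1
            else:
                state = target
                after_escape = True
                i += 3
            continue
        after_escape = False
        if byte in (SHIFT_OUT, SHIFT_IN):
            out.append((state, REPLACEMENT))
        else:
            out.append((state, byte))
        i += 1
    return out
```

For example, scan(bytes([0x1B, 0x28, 0x58, 0x41])) yields U+FFFD for the ESC followed by the three remaining bytes in the ASCII state, and input that ends right after a valid ESC sequence yields nothing and no error.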

This is also based in part on great research from a duplicate bug: http://upokecenter.dreamhosters.com/articles/2013/04/differences-in-the-iso-2022-jp-encoding-between-browsers/</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114620</commentid>
    <comment_count>5</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-06 21:52:51 +0000</bug_when>
    <thetext>Also, despite not being documented in that RFC, &quot;ESC ( I&quot; for Katakana is supported in Chromium and Gecko, so we keep that (Wikipedia suggests it&apos;s from iso-2022-jp-3).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114656</commentid>
    <comment_count>6</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-07 14:31:16 +0000</bug_when>
    <thetext>Notes on reverse engineering error handling:

* An erroneous escape sequence results in ASCII state in Firefox. Chrome replaces the first byte (the ESC) with U+FFFD and reinterprets what follows. (There&apos;s an edge case where if the input is 0x1B 0x24, Chrome replaces both with U+FFFD.)
* A repeated escape sequence results in U+FFFD in Chrome, as well as switching to the newly indicated state. Firefox just switches.
* Firefox does not seem to support the ASCII escape sequence in the ASCII state.
* Chrome recognizes escape sequences when looking at trail bytes.

I think I have a model now that roughly aligns with Chrome, but corrects some obvious mistakes. I have also written tests which I&apos;ll submit to web-platform-tests.

Still need to write it out.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114712</commentid>
    <comment_count>7</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-08 09:53:02 +0000</bug_when>
    <thetext>Tests: https://github.com/w3c/web-platform-tests/pull/1367

Commit: https://github.com/whatwg/encoding/commit/19b0ebf0e48c3a607ab7623b5b272642dd59d6e7

Thank you for listening.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>