<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>28141</bug_id>
          
          <creation_ts>2015-03-04 23:33:24 +0000</creation_ts>
          <short_desc>treatment of invalid 2-byte sequence is different in different CJK encodings</short_desc>
          <delta_ts>2015-08-19 12:51:12 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Jungshik Shin">jshin</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>jsbell</cc>
    
    <cc>mike</cc>
    
    <cc>philipj</cc>
    
    <cc>www-international</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>118330</commentid>
    <comment_count>0</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2015-03-04 23:33:24 +0000</bug_when>
    <thetext>Per bug 16691 comment 15, I&apos;m tightening Blink&apos;s encoding tables for CJK encodings so that unmappable 2-byte sequences are handled safely.

The current spec has the following provisions after looking up |pointer|.

* EUC-KR decoder
   If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.

* Big5 decoder
   If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.

* Shift_JIS decoder
   If pointer is null, prepend byte to stream.

* EUC-JP decoder
   If byte is not in the range 0xA1 to 0xFE, prepend byte to stream.

* GB18030 decoder
   If pointer is null, prepend byte to stream.

For now, let&apos;s put aside EUC-JP and GB18030. 

I don&apos;t see a reason to make the Shift_JIS decoder behave differently from the EUC-KR and Big5 decoders. One possible reason may be that each byte in [0xA1, 0xDF] is a character by itself.

In Shift_JIS, &quot;0xFC 0xE0&quot; [1] is turned into U+FFFD, but the second byte (0xE0) becomes the lead byte of what follows.

In EUC-KR, &quot;0xFE 0xE0&quot; is turned into U+FFFD and the next lead byte is taken from the byte *after* 0xE0.
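
To make the difference concrete, here is a rough Python sketch of the two policies. It is only a toy model: the lead-byte range, the lookup table passed in, and the names decode, is_lead and prepend_only_ascii are made up for illustration, and none of this is the real Shift_JIS or EUC-KR index or the spec algorithm.

```python
# Toy model of the two "unmappable trail byte" policies discussed above.
# The lead-byte range and the lookup table are illustrative assumptions,
# not the real Shift_JIS or EUC-KR indexes.

REPLACEMENT = "\uFFFD"


def is_lead(byte):
    # Hypothetical lead-byte range for this toy two-byte encoding.
    return byte in range(0x81, 0xFF)


def decode(data, table, prepend_only_ascii):
    """Decode bytes with a (lead, trail) lookup table.

    prepend_only_ascii=False models "if pointer is null, prepend byte".
    prepend_only_ascii=True models "if pointer is null and byte is in
    the range 0x00 to 0x7F, prepend byte".
    """
    out = []
    i = 0
    while i != len(data):
        byte = data[i]
        i += 1
        if not is_lead(byte):
            # Single byte: ASCII maps to itself, anything else is an error.
            out.append(chr(byte) if byte in range(0x80) else REPLACEMENT)
            continue
        if i == len(data):
            # Lead byte at end of input: truncated sequence.
            out.append(REPLACEMENT)
            break
        trail = data[i]
        i += 1
        code_point = table.get((byte, trail))
        if code_point is not None:
            out.append(code_point)
            continue
        out.append(REPLACEMENT)
        # "Prepend byte to stream": step back so the trail byte is read again.
        if not prepend_only_ascii or trail in range(0x80):
            i -= 1
    return "".join(out)
```

With an empty table, decode(bytes([0xFC, 0xE0, 0x41]), {}, False) re-reads 0xE0 as a lead byte and gives U+FFFD U+FFFD followed by A, while the same call with True swallows 0xE0 and gives U+FFFD followed by A. In both modes an ASCII trail byte such as 0x22 survives as its own character.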

Shouldn&apos;t we change the part of the Shift_JIS decoder quoted above to the following?

  If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>118332</commentid>
    <comment_count>1</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2015-03-05 00:06:26 +0000</bug_when>
    <thetext>The current EUC-JP spec makes sense, so there&apos;s no need to change it.

I haven&apos;t taken a look at GB18030 yet.

Anyway, so far SJIS is the only one that we have to consider changing.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>118403</commentid>
    <comment_count>2</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2015-03-06 19:18:51 +0000</bug_when>
    <thetext>Another piece of information: 

I was tightening Chromium&apos;s Big5 table and found that it has a lot of &quot;holes&quot; among the trail bytes in the ASCII range. Below is what I found (all in hexadecimal).

lead: trail byte holes in the ASCII range 
87: 76
89: 42 44 45 4A 4B
8A: 42 63 75
8B: 54
8D: 41
9B: 61
9F: 4E
A0: 54 57 5A 62 72

They&apos;re all in [a-zA-Z]. So, arguably, the XSS risk is lower than it would be for punctuation-like characters in the ASCII range.
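
As a quick sanity check of that claim, the list above can be transcribed into Python and tested. The names HOLES, hole_chars and all_ascii_letters are only illustrative; the byte values are copied from the list above.

```python
# Trail-byte holes per lead byte, transcribed from the list above (hex).
HOLES = {
    0x87: [0x76],
    0x89: [0x42, 0x44, 0x45, 0x4A, 0x4B],
    0x8A: [0x42, 0x63, 0x75],
    0x8B: [0x54],
    0x8D: [0x41],
    0x9B: [0x61],
    0x9F: [0x4E],
    0xA0: [0x54, 0x57, 0x5A, 0x62, 0x72],
}


def hole_chars(holes):
    # Show each hole as the ASCII character it would be on its own.
    return {lead: "".join(chr(t) for t in trails) for lead, trails in holes.items()}


def all_ascii_letters(holes):
    # True if every listed trail byte is in [a-zA-Z].
    return all(
        chr(t).isascii() and chr(t).isalpha()
        for trails in holes.values()
        for t in trails
    )
```

For the data above, all_ascii_letters(HOLES) is True, and hole_chars(HOLES)[0xA0] is the string TWZbr.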

In the case of EUC-KR (windows-949), trail bytes in the ASCII range are limited to [a-zA-Z]. So, without the &apos;adding back to the stream&apos; clause, we&apos;d only eat up [a-zA-Z].

Unless we&apos;re sure that [a-zA-Z] is harmless when eaten up, we should keep the &apos;adding back to the stream if the trail is in [0, 7F]&apos; clause (in the case of ICU, the overall memory/perf impact of keeping the current spec is perhaps somewhere between neutral and a small net loss; I haven&apos;t compared yet).

Anyway, it occurred to me that we might think about this, too.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>118576</commentid>
    <comment_count>3</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2015-03-13 03:32:15 +0000</bug_when>
    <thetext>What do existing implementations do for SJIS?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>118668</commentid>
    <comment_count>4</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2015-03-18 21:13:31 +0000</bug_when>
    <thetext>ICU treats an &apos;illegal&apos; byte sequence differently from a byte sequence &apos;unassigned&apos; to a Unicode character. 

For instance, in EUC-KR (windows-949), &lt;FE A1&gt; is a valid byte sequence, but is not assigned any character. So, the sequence as a whole is turned to U+FFFD. 

Without tightening the valid trail byte range for EUC-KR [1], &lt;FE 41&gt; is a valid byte sequence and is converted to U+FFFD (exactly the same treatment as &lt;FE A1&gt;).

OTOH, &lt;FE 22&gt; has an illegal trail byte (because 0x22 is outside the trail byte range for EUC-KR/Windows-949) and is turned to &lt;U+FFFD, U+0022&gt;  


The same is true of Shift_JIS. Because [80-FC] is the valid trail byte range, &lt;EB 9F&gt; is turned into a single U+FFFD (there&apos;s no mapped character at this position), rather than U+FFFD being emitted and 0x9F being added back to the stream.
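
The illegal vs. unassigned distinction can be sketched roughly as follows. This is not ICU code or its API: the trail ranges are the untightened windows-949 ranges as I understand them (41-5A, 61-7A, 81-FE), MAPPED is a one-entry stand-in for the real table, and classify and convert_pair are hypothetical names.

```python
REPLACEMENT = "\uFFFD"

# Assumed (untightened) windows-949 trail byte ranges: 41-5A, 61-7A, 81-FE.
TRAIL_RANGES = [range(0x41, 0x5B), range(0x61, 0x7B), range(0x81, 0xFF)]

# Stand-in mapping with a single entry; the real table is much larger.
MAPPED = {(0xB0, 0xA1): "\uAC00"}


def classify(lead, trail):
    # "illegal": the trail byte is outside every valid trail range.
    if not any(trail in r for r in TRAIL_RANGES):
        return "illegal"
    # "assigned": structurally valid and present in the mapping table.
    if (lead, trail) in MAPPED:
        return "assigned"
    # "unassigned": structurally valid, but no character is mapped there.
    return "unassigned"


def convert_pair(lead, trail):
    """Return (output text, number of input bytes consumed)."""
    kind = classify(lead, trail)
    if kind == "assigned":
        return MAPPED[(lead, trail)], 2
    if kind == "unassigned":
        # The whole sequence becomes one U+FFFD.
        return REPLACEMENT, 2
    # Illegal trail: only the lead is consumed, the trail is decoded next.
    return REPLACEMENT, 1
```

Under these assumptions, convert_pair(0xFE, 0xA1) and convert_pair(0xFE, 0x41) both consume two bytes and yield one U+FFFD, whereas convert_pair(0xFE, 0x22) consumes only the lead byte, so 0x22 is decoded afterwards as U+0022.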



[1] Blink is just tightening up the valid trail byte range so that &apos;x41&apos; will not be valid any more if lead is C8 or higher.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>118683</commentid>
    <comment_count>5</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2015-03-19 14:07:08 +0000</bug_when>
    <thetext>Hmm, OK. If there&apos;s a spec change you want to implement (or have already implemented) that&apos;s likely to be Web compatible and closer to what ICU already does, that probably won&apos;t be controversial. Concretely, is it only the SJIS bit that should be changed in the spec?

(Anne has the final say of course, I&apos;m just trying to move things along.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>122661</commentid>
    <comment_count>6</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2015-08-19 12:51:12 +0000</bug_when>
    <thetext>https://github.com/whatwg/encoding/issues/5 changed big5 to check the code point rather than the pointer.

shift_jis had that problem too, but indeed, we should eat the trail byte for shift_jis if it is not an ASCII byte.

euc-kr seems wrong too based on that.

gb18030 too.

So I fixed shift_jis, euc-kr, and gb18030.

https://github.com/whatwg/encoding/commit/640bf69847a17fd98df027fd6cd5ae384ac82dab</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>