<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>21057</bug_id>
          
          <creation_ts>2013-02-20 14:19:27 +0000</creation_ts>
          <short_desc>Introduce additional labels for the replacement encoding</short_desc>
          <delta_ts>2015-08-21 07:39:13 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>MOVED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard>blocked on implementer research</status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Henri Sivonen">hsivonen</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>jsbell</cc>
    
    <cc>jshin</cc>
    
    <cc>mike</cc>
    
    <cc>VYV03354</cc>
    
    <cc>www-international</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>83390</commentid>
    <comment_count>0</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2013-02-20 14:19:27 +0000</bug_when>
    <thetext>Problem statement:

1) The Encoding Standard removes the ISO-2022-CN encoding. This will make sites that rely on that encoding being supported vulnerable to XSS the way Yahoo search was vulnerable in Chrome when Chrome removed ISO-2022-KR. See https://code.google.com/p/chromium/issues/detail?id=15701

2) There exist ASCII-incompatible encodings in the world outside the Encoding Standard, and support for those encodings might be exposed in server-side libraries. Sites that are naïve enough to allow the user to specify the output encoding that the site uses and then pass the user-supplied encoding name to a server-side library without whitelisting ASCII-compatible encodings are vulnerable to EBCDIC attacks: an attacker can request that the site use an EBCDIC-based encoding, and the site responds with EBCDIC, which isn&apos;t recognized by non-IE browsers, so those browsers fall back on an ASCII-compatible encoding, resulting in the EBCDIC bytes being interpreted in a dangerous way. See http://zaynar.co.uk/docs/charset-encoding-xss.html for a reference to an actual search engine that was vulnerable to this attack.
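
For illustration only, the server-side mistake in 2) can be avoided with an allow-list. The following is a minimal Python sketch, not taken from any real site: the function name choose_output_encoding, the allow-list contents, and the utf-8 default are all assumptions made for this example. It relies only on the standard codecs.lookup call to canonicalize the requested name.

```python
import codecs

# Hypothetical allow-list (an assumption for this sketch), keyed by the
# canonical names that the Python codec registry reports.
ASCII_COMPATIBLE = frozenset({"utf-8", "iso8859-1", "cp1252"})


def choose_output_encoding(requested, default="utf-8"):
    """Return an allow-listed encoding name, or the default otherwise."""
    try:
        # codecs.lookup raises LookupError for names it does not know.
        canonical = codecs.lookup(requested).name
    except LookupError:
        return default
    if canonical in ASCII_COMPATIBLE:
        return canonical
    # Known to the library but not allow-listed (for example EBCDIC): refuse.
    return default
```

With this shape, a request for an EBCDIC code page such as cp037 is answered in the default encoding instead of being passed through to the library.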

Proposed solution:
Define a replacement encoding that decodes all possible byte values to the REPLACEMENT CHARACTER. Make the known labels for ASCII-incompatible encodings that exist but aren&apos;t part of the Encoding Standard aliases for the replacement encoding.
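
A minimal Python sketch of the decoder as literally described above (one REPLACEMENT CHARACTER per input byte). The function name replacement_decode is hypothetical, and this is only an illustration of the proposal as worded here, not a statement of what any specification or browser ends up doing.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def replacement_decode(data):
    """Decode every input byte, whatever its value, to U+FFFD."""
    # No byte is ever interpreted as ASCII, so no markup can be smuggled in.
    return REPLACEMENT_CHARACTER * len(data)
```

The point of the design is that the output never depends on the byte values, only on how many bytes there are.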

Additional info:
This solution would pave the way for safe removal of ISO-2022-KR and hz-gb-2312 from the set of encodings supported by the Encoding Standard.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83391</commentid>
    <comment_count>1</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-02-20 14:42:51 +0000</bug_when>
    <thetext>We should be conservative with this list I suppose as sites might rely on a fallback encoding being in play.

We should probably include these:

* iso-2022-cn
* iso-2022-cn-ext

Less sure about:

* EBCDIC labels
* utf-7
* utf-32</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83392</commentid>
    <comment_count>2</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2013-02-20 15:01:01 +0000</bug_when>
    <thetext>Hmm. Might want to allow 0x20 to decode as U+0020 to avoid accidentally DoSing layout.

(In reply to comment #1)
&gt; Less sure about:
&gt; 
&gt; * EBCDIC labels

To the extent IE currently recognizes these, at least in theory:
 * Relying on falling back to ASCII doesn&apos;t work today in IE.
 * IE would become less XSS-resilient if it dropped knowledge of those labels without aliasing them to a replacement encoding.

&gt; * utf-7
&gt; * utf-32

These might plausibly be relying on fallback currently.

Others to consider:
 * CESU-8
 * BOCU-1
 * SCSU</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83393</commentid>
    <comment_count>3</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-02-20 15:12:13 +0000</bug_when>
    <thetext>We could emit a single U+FFFD and terminate I think. Pretend as if all bytes were consumed.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83421</commentid>
    <comment_count>4</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2013-02-20 19:20:09 +0000</bug_when>
    <thetext>Chrome implements a &quot;fake&quot; ISO-2022-CN decoder which always emits U+FFFD for all double-byte characters.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83422</commentid>
    <comment_count>5</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2013-02-20 19:26:08 +0000</bug_when>
    <thetext>But I don&apos;t think the ISO-2022-CN problem is really exploitable in the real world.
Gecko&apos;s ISO-2022-CN decoder has had an exploitable bug for a long time. I even described it in a public bug, but nobody cared.
https://bugzilla.mozilla.org/show_bug.cgi?id=470523
So Gecko has completely ignored the ISO-2022-CN label since Firefox 19.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83543</commentid>
    <comment_count>6</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-02-22 10:55:57 +0000</bug_when>
    <thetext>We also have to decide what to do for TextDecoder. And the encoder story for &lt;form accept-charset&gt;, script injecting a link into an iso-2022-kr &lt;iframe&gt;, and maybe more.

I think the encoder story can be utf-8. Supporting it in TextDecoder does not seem problematic. TextEncoder is already prohibited.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83582</commentid>
    <comment_count>7</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2013-02-22 19:11:52 +0000</bug_when>
    <thetext>I think we should also remove them from TextDecoder for consistency. If people really need to decode those encodings, they can implement the decoder using gbk/euc-kr decoders.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83583</commentid>
    <comment_count>8</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2013-02-22 19:14:25 +0000</bug_when>
    <thetext>(In reply to comment #2)
&gt; Others to consider:
&gt;  * CESU-8
&gt;  * BOCU-1
&gt;  * SCSU

I don&apos;t think we need to consider encodings that no browser has ever supported. If by any chance some pages relied on those encodings, they are already vulnerable.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83585</commentid>
    <comment_count>9</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-02-22 19:34:42 +0000</bug_when>
    <thetext>What is the rationale for that? They might already be vulnerable, but would it not be better if they were less vulnerable going forward?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83586</commentid>
    <comment_count>10</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2013-02-22 19:37:54 +0000</bug_when>
    <thetext>If the vulnerable page is actually present in the real world at all.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83587</commentid>
    <comment_count>11</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-02-22 19:41:52 +0000</bug_when>
    <thetext>As a trial balloon: https://github.com/whatwg/encoding/commit/8329a2e768caea6908d600debd3cc8a6dc59c3c3 (I.e. not final, but gives us a thing to discuss.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83614</commentid>
    <comment_count>12</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-02-23 08:03:18 +0000</bug_when>
    <thetext>So going forward everything under EBCDIC in http://wiki.whatwg.org/wiki/Web_Encodings#Encodings_3 should be added. Let me know if you disagree.

Then once implementations remove iso-2022-kr I will add that one too.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83646</commentid>
    <comment_count>13</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2013-02-25 11:43:53 +0000</bug_when>
    <thetext>(In reply to comment #8)
&gt; (In reply to comment #2)
&gt; &gt; Others to consider:
&gt; &gt;  * CESU-8
&gt; &gt;  * BOCU-1
&gt; &gt;  * SCSU
&gt; 
&gt; I don&apos;t think we need to consider about encodings no browsers have ever been
&gt; supported. If by any chance some pages relied on those encodings, they are
&gt; already vulnerable.

The threat scenario is that the server accepts an encoding name from a query string and passes it to a server-side library that implements encodings that browsers don&apos;t support.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96961</commentid>
    <comment_count>14</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-12-02 14:17:38 +0000</bug_when>
    <thetext>Per http://mxr.mozilla.org/mozilla-central/source/dom/encoding/labelsencodings.properties Gecko seems to match the specification. Do we want to add any of the other ones or should I resolve this as FIXED?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>97109</commentid>
    <comment_count>15</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2013-12-04 10:15:55 +0000</bug_when>
    <thetext>(In reply to Anne from comment #14)
&gt; Do we want to add any of
&gt; the other ones or should I resolve this as FIXED?

I think we should add
 * BOCU-1
 * SCSU
 * Known EBCDIC labels.
...as labels of the replacement encoding in order to mitigate the attack described in http://zaynar.co.uk/docs/charset-encoding-xss.html . If Google Translate works for http://masatokinugawa.l0.cm/2013/06/accounts.google.com-utf-32-xss.html , it appears that Google, who really should know better, allowed the output encoding to be controlled by the request URL.

UTF-7 is not on the list, because it&apos;s not dangerous to interpret UTF-7 as ASCII and there&apos;s some value in seeing the ASCII decoding of UTF-7 for Latin-script text.

UTF-32 is not on the list, because the BOM taking precedence and the little-endian UTF-32 sniffing as UTF-16LE would make aliasing to replacement a mere placebo. Furthermore, interpreting UTF-32 as non-UTF-32 doesn&apos;t appear to be dangerous when U+0000 is not discarded before tokenization, which it isn&apos;t in HTML.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>97131</commentid>
    <comment_count>16</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-12-04 16:05:36 +0000</bug_when>
    <thetext>These are the known EBCDIC ones that IE supports per the WHATWG table including BOCU-1 and SCSU labels:

 * bocu-1
 * ccsid00924
 * ccsid01140
 * ccsid01141
 * ccsid01142
 * ccsid01143
 * ccsid01144
 * ccsid01145
 * ccsid01146
 * ccsid01147
 * ccsid01148
 * ccsid01149
 * cp00924
 * cp01140
 * cp01141
 * cp01142
 * cp01143
 * cp01144
 * cp01145
 * cp01146
 * cp01147
 * cp01148
 * cp01149
 * cp037
 * cp1025
 * cp1026
 * cp273
 * cp278
 * cp280
 * cp284
 * cp285
 * cp290
 * cp297
 * cp420
 * cp423
 * cp424
 * cp500
 * cp870
 * cp871
 * cp875
 * cp880
 * cp905
 * cp930
 * cp933
 * cp935
 * cp937
 * cp939
 * csbocu-1
 * csbocu1
 * csibm037
 * csibm1026
 * csibm273
 * csibm277
 * csibm278
 * csibm280
 * csibm284
 * csibm285
 * csibm290
 * csibm297
 * csibm420
 * csibm423
 * csibm424
 * csibm500
 * csibm870
 * csibm871
 * csibm880
 * csibm905
 * csibmthai
 * csscsu
 * ebcdic-cp-ar1
 * ebcdic-cp-be
 * ebcdic-cp-ca
 * ebcdic-cp-ch
 * ebcdic-cp-dk
 * ebcdic-cp-es
 * ebcdic-cp-fi
 * ebcdic-cp-fr
 * ebcdic-cp-gb
 * ebcdic-cp-gr
 * ebcdic-cp-he
 * ebcdic-cp-is
 * ebcdic-cp-it
 * ebcdic-cp-nl
 * ebcdic-cp-no
 * ebcdic-cp-roece
 * ebcdic-cp-se
 * ebcdic-cp-tr
 * ebcdic-cp-us
 * ebcdic-cp-wt
 * ebcdic-cp-yu
 * ebcdic-cyrillic
 * ebcdic-de-273+euro
 * ebcdic-dk-277+euro
 * ebcdic-es-284+euro
 * ebcdic-fi-278+euro
 * ebcdic-fr-297+euro
 * ebcdic-gb-285+euro
 * ebcdic-international-500+euro
 * ebcdic-is-871+euro
 * ebcdic-it-280+euro
 * ebcdic-jp-kana
 * ebcdic-latin9--euro
 * ebcdic-no-277+euro
 * ebcdic-se-278+euro
 * ebcdic-us-37+euro
 * ibm-thai
 * ibm00924
 * ibm01047
 * ibm01140
 * ibm01141
 * ibm01142
 * ibm01143
 * ibm01144
 * ibm01145
 * ibm01146
 * ibm01147
 * ibm01148
 * ibm01149
 * ibm037
 * ibm1026
 * ibm273
 * ibm277
 * ibm278
 * ibm280
 * ibm284
 * ibm285
 * ibm290
 * ibm297
 * ibm420
 * ibm423
 * ibm424
 * ibm500
 * ibm870
 * ibm871
 * ibm880
 * ibm905
 * scsu
 * x-cp21027
 * x-ebcdic-japaneseanduscanada
 * x-ebcdic-koreanextended
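
As a rough Python sketch of how such labels could be aliased: the set below holds only a handful of the labels from the list above, and the names REPLACEMENT_LABELS and resolve_label are hypothetical. The normalization (trim ASCII whitespace, compare case-insensitively) is an assumption about how label matching would work.

```python
# A small subset of the labels listed above, for illustration only.
REPLACEMENT_LABELS = frozenset({
    "bocu-1", "scsu", "cp037", "ibm500", "ebcdic-cp-us",
})


def resolve_label(label):
    """Return "replacement" for an aliased label, or None if unknown here."""
    # Trim ASCII whitespace and lowercase before the lookup.
    normalized = label.strip(" \t\n\f\r").lower()
    if normalized in REPLACEMENT_LABELS:
        return "replacement"
    return None
```

A label missing from the set returns None here, which stands in for whatever handling of unrecognized labels the caller already has.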

Of course on the server ICU might be used and which labels we want to ban from that is unclear to me. ICU supports a lot of labels, including weird ones like &quot;ISO_2022,locale=ko,version=0&quot;.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114479</commentid>
    <comment_count>17</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-04 13:54:01 +0000</bug_when>
    <thetext>Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in comment 16 to the replacement encoding? (With the risk that this might break pages that depend on fallback to the default encoding.)

If there is no active interest in getting this into browsers, I&apos;m not sure if we should keep this open.

(Note that we have introduced a replacement encoding and disabled iso-2022-kr and hz-gb-2312 successfully, so those parts of comment 0 are addressed.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114486</commentid>
    <comment_count>18</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2014-11-04 14:32:32 +0000</bug_when>
    <thetext>(In reply to Anne from comment #17)
&gt; Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in
&gt; comment 16 to the replacement encoding? (With the risk that this might break
&gt; pages that depend on fallback to the default encoding.)

I&apos;m still interested in this, because problem #2 from comment 0 hasn&apos;t been addressed yet. (Granted, it&apos;s a problem of insufficient clue on the part of a Web developer, but we do sometimes try to save people from themselves.)

I&apos;m not going to have time to research the problem of this potentially breaking pages that expect fallback in the foreseeable future, though.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>114498</commentid>
    <comment_count>19</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2014-11-04 17:13:40 +0000</bug_when>
    <thetext>There might also be sites that instead rely on a later encoding declaration with a different label being picked up. e.g.

Content-Type: unknown
...
&lt;meta charset=known&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>122694</commentid>
    <comment_count>20</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2015-08-21 07:39:13 +0000</bug_when>
    <thetext>Closing this in favor of https://github.com/whatwg/encoding/issues/8 since I&apos;d like to stop using Bugzilla.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>