<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>20886</bug_id>
          
          <creation_ts>2013-02-06 16:25:11 +0000</creation_ts>
          <short_desc>Strengthen definition of &quot;ASCII-compatible character encoding&quot; to match §4.2.5.5</short_desc>
          <delta_ts>2013-05-31 20:13:05 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>HTML</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>NEEDSINFO</resolution>
          
          
          <bug_file_loc>http://www.whatwg.org/specs/web-apps/current-work/#encoding-terminology</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>annevk</cc>
    
    <cc>ian</cc>
    
    <cc>mike</cc>
    
    <cc>VYV03354</cc>
    
    <cc>zackw</cc>
          
          <qa_contact>contributor</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>82627</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2013-02-06 16:25:11 +0000</bug_when>
    <thetext>Specification: http://www.whatwg.org/specs/web-apps/current-work/
Multipage: http://www.whatwg.org/C#encoding-terminology
Complete: http://www.whatwg.org/c#encoding-terminology

Comment:
I think the definition of &quot;ASCII-compatible character encoding&quot; should be
strengthened to match the language at §4.2.5.5 about potentially dangerous
encodings.

Posted from: 108.17.82.100
User agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 Iceweasel/18.0.1</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>82628</commentid>
    <comment_count>1</comment_count>
    <who name="Zack Weinberg">zackw</who>
    <bug_when>2013-02-06 16:35:50 +0000</bug_when>
    <thetext>Elaboration: &quot;ASCII-compatible character encoding&quot; as defined in §2.1.6 is weak enough to permit encodings (such as HZ-GB-2312) that are specifically called out in §4.2.5.5 as unsafe due to risk of misinterpretation.  I would like to suggest changing the definition to read:

    An _ASCII-compatible character encoding_ is a single-byte or variable-length encoding in which the bytes 0x09, 0x0A, 0x0C, 0x0D, and 0x20 -- 0x7E always encode Unicode characters U+0009, U+000A, U+000C, U+000D, and U+0020 -- U+007E respectively.
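
As a rough illustration only (not proposed spec text), the single-byte half of this definition can be sketched in Python, using Python's built-in codecs as stand-ins for a UA's decoders. The helper names are made up for this sketch, and the check is necessary but not sufficient: it looks at each byte in isolation, so it cannot see a stateful encoding in which an earlier escape sequence changes how later bytes are read.

```python
# Bytes the proposed definition pins down: tab, LF, FF, CR, and 0x20 through 0x7E.
ASCII_BYTES = [0x09, 0x0A, 0x0C, 0x0D] + list(range(0x20, 0x7F))


def decodes_as_ascii(data, encoding):
    """Return True if `data` decodes to the same text as it would in ASCII."""
    try:
        return data.decode(encoding) == data.decode("ascii")
    except UnicodeDecodeError:
        return False


def passes_single_byte_check(encoding):
    """Check each pinned byte on its own (hypothetical helper).

    This is a necessary condition only; it says nothing about what the same
    bytes mean in the middle of a longer sequence.
    """
    return all(decodes_as_ascii(bytes([b]), encoding) for b in ASCII_BYTES)


# cp037 is one of Python's EBCDIC code pages.
for name in ("utf-8", "iso-8859-1", "utf-16", "cp037"):
    print(name, passes_single_byte_check(name))
```

The "always" in the definition needs a test with context as well. For example, decodes_as_ascii(b"~{AA~}", "hz") returns False with Python's hz codec, because the bytes after ~{ are no longer read as ASCII.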

If this is done, the note immediately after the definition should also be changed to read:

    Note: This includes UTF-8 and most of the single-byte character encodings (such as all variants of ISO 8859) still in wide use.  It excludes UTF-16 as well as some variable-length encodings (such as Shift_JIS, HZ-GB-2312, and variants of ISO-2022) in which it is possible for bytes in the 0x20--0x7E range to be part of longer sequences that are unrelated to their interpretation as ASCII.  It also excludes various obsolete encodings such as UTF-7, GSM03.38, and EBCDIC.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>82652</commentid>
    <comment_count>2</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2013-02-06 22:02:49 +0000</bug_when>
    <thetext>Can you construct a file with those encodings that lets user-generated content execute script because of tricking the UA into using the wrong encoding?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>82659</commentid>
    <comment_count>3</comment_count>
    <who name="Zack Weinberg">zackw</who>
    <bug_when>2013-02-06 22:55:57 +0000</bug_when>
    <thetext>(In reply to comment #2)
&gt; Can you construct a file with those encodings that lets user-generated
&gt; content execute script because of tricking the UA into using the wrong
&gt; encoding?

It is possible to construct an HZ-GB-2312-encoded HTML file that executes script if the UA does *not* support that encoding:

  &lt;!doctype html&gt;
  &lt;html&gt;&lt;head&gt;
    &lt;meta charset=&quot;hz-gb-2312&quot;&gt;
    &lt;title&gt;script executed only if hz-gb-2312 unsupported&lt;/title&gt;
  &lt;/head&gt;&lt;body&gt;
  ~{&lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;~}
  &lt;/body&gt;&lt;/html&gt;

(The text between ~{ and ~} must not contain any spaces, because each byte of a double-byte GB2312 character must have a value between 0x21 and 0x7E inclusive.  Bytes 0x78 through 0x7E cannot appear at the first, third, fifth, etc. byte positions between ~{ and ~}, but that seems like only a minor hindrance to an exploit, even though { and } fall in that range.)
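
To make the two readings concrete, here is a small Python sketch that decodes the body line of the example twice: once as ASCII (standing in for a UA that does not apply HZ-GB-2312) and once with Python's own "hz" codec (standing in for a UA that does). Both stand-ins are assumptions, and a real UA's decoder may handle errors differently. The less-than byte is written as \x3c.

```python
# The body line from the example page; \x3c is the less-than byte.
payload = b'~{\x3cscript>alert("gotcha")\x3c/script>~}'

# A UA that does not support HZ-GB-2312 and reads the bytes as ASCII
# sees the script element verbatim.
as_ascii = payload.decode("ascii")

# A decoder that supports HZ treats everything between ~{ and ~} as
# double-byte GB2312. errors="replace" keeps going past byte pairs that
# have no GB2312 mapping instead of raising.
as_hz = payload.decode("hz", errors="replace")

print("script" in as_ascii)
print("script" in as_hz)
```

With Python's codec, the word "script" should survive only in the ASCII reading; in the HZ reading the same bytes are consumed as double-byte characters or replacement characters.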

Something similar is probably possible in Big5.  I have not, however, been able to find an exploit that goes the other way, i.e. script is executed only if the UA *does* support HZ-GB-2312 or Big5.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>82729</commentid>
    <comment_count>4</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2013-02-08 02:29:10 +0000</bug_when>
    <thetext>In theory the set of encodings that are supported is now fixed, so I&apos;m not particularly worried about attacks that rely on one vendor supporting one set of encodings and another not; the solution is just &quot;supporting the standard set of encodings&quot;. My concern is more with cases like a page that thinks it&apos;s in one encoding (e.g. ASCII or HZ-GB-2312) but attacker-inserted text is able to get interpreted as another encoding. For example, in your case, if there was a way to trick the browser into not parsing the page as HZ-GB-2312 (other than the browser not supporting it) then that would be a concern.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83283</commentid>
    <comment_count>5</comment_count>
    <who name="Zack Weinberg">zackw</who>
    <bug_when>2013-02-18 18:23:38 +0000</bug_when>
    <thetext>I just thought of a social engineering attack: Write a page which declares itself as HZ-GB-2312 (for instance).  Invite visitors to copy and paste what appears to be a string of nonsense hanzi onto a victim site (which is encoded in UTF-8 and allows user comments).  This is a bit of a stretch, since I think the clipboard will probably get recoded on paste (but I&apos;m not sure of it) or on upload (but I&apos;m not sure of that either), and even if it doesn&apos;t, a site that&apos;s not *expecting* HZ-GB-2312 ought to notice the smuggled &lt;script&gt; tag in its regular old anti-lulz filters.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83314</commentid>
    <comment_count>6</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-02-19 10:57:28 +0000</bug_when>
    <thetext>I don&apos;t understand the attack in comment 5. Are you assuming the browser would not store the clipboard data as text (encoded as UTF-16 most likely) but rather as the raw bytes or some such?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>83326</commentid>
    <comment_count>7</comment_count>
    <who name="Zack Weinberg">zackw</who>
    <bug_when>2013-02-19 14:33:05 +0000</bug_when>
    <thetext>Yes, I was imagining that the raw bytes would be transferred.  (I honestly have no idea how clipboards work nowadays.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>86057</commentid>
    <comment_count>8</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2013-04-12 18:56:51 +0000</bug_when>
    <thetext>Clipboards copy the text as Unicode characters, so I don&apos;t think the attack in comment 5 would work.

Any other attacks?</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>