<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>21275</bug_id>
          
          <creation_ts>2013-03-13 20:58:59 +0000</creation_ts>
          <short_desc>[Imports]: Force utf-8</short_desc>
          <delta_ts>2014-04-14 18:29:18 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebAppsWG</product>
          <component>HISTORICAL - Component Model</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          <blocked>20683</blocked>
          <everconfirmed>1</everconfirmed>
          <reporter name="Dimitri Glazkov">dglazkov</reporter>
          <assigned_to name="Dimitri Glazkov">dglazkov</assigned_to>
          <cc>adamk</cc>
    
    <cc>annevk</cc>
    
    <cc>gphemsley</cc>
    
    <cc>ian</cc>
    
    <cc>morrita</cc>
          
          <qa_contact>public-webapps-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>84366</commentid>
    <comment_count>0</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-03-13 20:58:59 +0000</bug_when>
    <thetext>From annevk: Let&apos;s override all the encoding magic the HTML layer might bring with
it and simply always decode these resources using utf-8, just as we do
with workers and such.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>87695</commentid>
    <comment_count>1</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-05-14 23:03:39 +0000</bug_when>
    <thetext>Anne, can you help with good language here? I am not very well-versed in encodings. Or as you might say, my sniffing is not up to snuff.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>87781</commentid>
    <comment_count>2</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-05-16 16:28:20 +0000</bug_when>
    <thetext>Basically, I think you want to &quot;utf-8 decode&quot; /byte stream/ and then parse the result of that operation using the HTML parser. XMLHttpRequest does something similar.

http://encoding.spec.whatwg.org/#utf-8-decode defines utf-8 decode.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>90957</commentid>
    <comment_count>3</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-07-18 22:44:18 +0000</bug_when>
    <thetext>From IRC:

3:42 PM &lt;annevk&gt; dglazkov: so now you cannot have CORS-cross-origin resources so that part of the spec doesn&apos;t make sense anymore
3:43 PM &lt;annevk&gt; dglazkov: you also need to say what happens if fetching failed</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>90958</commentid>
    <comment_count>4</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-07-18 22:47:46 +0000</bug_when>
    <thetext>3:44 PM &lt;annevk&gt; I guess it&apos;s mostly fine otherwise, although I wonder if it shouldn&apos;t use a crossorigin attribute on &lt;link&gt; like most other things we have</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>90960</commentid>
    <comment_count>5</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-07-18 22:48:27 +0000</bug_when>
    <thetext>Sorry for spam. Wrong bug.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>90994</commentid>
    <comment_count>6</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-07-19 17:16:32 +0000</bug_when>
    <thetext>https://dvcs.w3.org/hg/webcomponents/rev/121eefa9215e</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>91000</commentid>
    <comment_count>7</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-07-19 17:48:45 +0000</bug_when>
    <thetext>You convert the stream into a code point stream. You cannot use that new stream as a byte stream. Language similar to http://xhr.spec.whatwg.org/#document-response-entity-body is probably enough: &quot;Decode byte stream response entity body using fallback encoding charset and then let document be a document that represents the result of that, parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled.&quot;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>91002</commentid>
    <comment_count>8</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-07-19 18:28:24 +0000</bug_when>
    <thetext>Spoke with Hixie, he&apos;s going to give us a better hook.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>91007</commentid>
    <comment_count>9</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-07-19 20:12:25 +0000</bug_when>
    <thetext>(In reply to comment #8)
&gt; Spoke with Hixie, he&apos;s going to give us a better hook.

WDYT? https://dvcs.w3.org/hg/webcomponents/rev/975f62535f11</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>91008</commentid>
    <comment_count>10</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-07-19 20:46:52 +0000</bug_when>
    <thetext>Seems fine to me. I ended up not mentioning the input byte stream explicitly in XMLHttpRequest. Either way works I think.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>91010</commentid>
    <comment_count>11</comment_count>
    <who name="Dimitri Glazkov">dglazkov</who>
    <bug_when>2013-07-19 20:53:05 +0000</bug_when>
    <thetext>#winning</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>101279</commentid>
    <comment_count>12</comment_count>
    <who name="Adam Klein">adamk</who>
    <bug_when>2014-02-24 22:30:36 +0000</bug_when>
    <thetext>In trying to implement this in Blink (see https://codereview.chromium.org/178883002/), I&apos;m somewhat puzzled as to what&apos;s intended by the spec. It seems to be saying that the response is always interpreted as UTF-8. Per HTML, that overrides &lt;meta charset&gt;. Is that intentional? What about HTTP Content-Type headers? Is overriding those intentional? It seems strange to me to choose a different encoding than the one the document (or the document&apos;s server) claims the document is encoded in.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>101786</commentid>
    <comment_count>13</comment_count>
    <who name="Adam Klein">adamk</who>
    <bug_when>2014-03-03 20:35:33 +0000</bug_when>
    <thetext>Ping? I&apos;m interested particularly in the opinions of morrita and annevk (who I think suggested this in the first place).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>101787</commentid>
    <comment_count>14</comment_count>
    <who name="Morrita Hajime">morrita</who>
    <bug_when>2014-03-03 20:55:12 +0000</bug_when>
    <thetext>(In reply to Adam Klein from comment #13)
&gt; Ping? I&apos;m interested particularly in the opinions of morrita and annevk (who
&gt; I think suggested this in the first place).

I&apos;m inclined not to special-case either utf-8 or the quirks mode thing
as long as imports are documents, not fragments. There are many knobs that affect text encoding detection, and it seems hard to state this special case well.

If we switch to fragments, we can define how the transport layer of imports works separately from that of documents.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102316</commentid>
    <comment_count>15</comment_count>
    <who name="Adam Klein">adamk</who>
    <bug_when>2014-03-13 20:44:18 +0000</bug_when>
    <thetext>(In reply to Morrita Hajime from comment #14)
&gt; (In reply to Adam Klein from comment #13)
&gt; &gt; Ping? I&apos;m interested particularly in the opinions of morrita and annevk (who
&gt; &gt; I think suggested this in the first place).
&gt; 
&gt; I&apos;m inclined to do no special casing for both utf-8 and quirks mode thing
&gt; as long as imports are documents not fragments. There are many knobs to
&gt; affect text encoding detection and it seems hard to state this special case
&gt; well.
&gt; 
&gt; If we switch to fragments, we can define how the transport layer of imports
&gt; work separately from one of the document.

Seems like the spec should be updated to remove this utf-8 special casing for now, at least.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102326</commentid>
    <comment_count>16</comment_count>
    <who name="Morrita Hajime">morrita</who>
    <bug_when>2014-03-13 22:14:54 +0000</bug_when>
    <thetext>https://github.com/w3c/webcomponents/commit/77d82322655f528017322696470d487c2d09e722</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102433</commentid>
    <comment_count>17</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-16 14:29:37 +0000</bug_when>
    <thetext>No we should always use utf-8. Just like workers are always utf-8, new types of HTML resources should always be utf-8. &lt;meta&gt; and text/html&apos;s charset parameter are simply irrelevant.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102466</commentid>
    <comment_count>18</comment_count>
    <who name="Morrita Hajime">morrita</who>
    <bug_when>2014-03-17 16:51:02 +0000</bug_when>
    <thetext>(In reply to Anne from comment #17)
&gt; No we should always use utf-8. Just like workers are always utf-8, new types
&gt; of HTML resources should always be utf-8. &lt;meta&gt; and text/html&apos;s charset
&gt; parameter are simply irrelevant.

The point here is that it is not clear whether HTML imports are a new type of resource or just HTML. If we change the type of imports to DocumentFragment, they would definitely be a new type of resource. But if it is Document, I think it&apos;s just HTML as we have always had.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102470</commentid>
    <comment_count>19</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-17 17:02:33 +0000</bug_when>
    <thetext>No it&apos;s not. All kinds of things are different around script execution and such. The HTML we have is something you load in a browsing context. This is something else. Similarly HTML loaded through XMLHttpRequest has restrictions on encodings (though not enough).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102471</commentid>
    <comment_count>20</comment_count>
    <who name="Morrita Hajime">morrita</who>
    <bug_when>2014-03-17 17:13:37 +0000</bug_when>
    <thetext>(In reply to Anne from comment #19)
&gt; XMLHttpRequest has restrictions on encodings (though not enough).
Ah OK, this is good news to me. Probably I should just revert the change.

Do you think the integration point used in [1] is the correct one?

[1] https://github.com/w3c/webcomponents/commit/77d82322655f528017322696470d487c2d09e722</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102473</commentid>
    <comment_count>21</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-17 17:56:16 +0000</bug_when>
    <thetext>Yes, that was perfect.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102477</commentid>
    <comment_count>22</comment_count>
    <who name="Morrita Hajime">morrita</who>
    <bug_when>2014-03-17 18:14:29 +0000</bug_when>
    <thetext>(In reply to Anne from comment #21)
&gt; Yes, that was perfect.
OK, I read the XHR spec and apparently it uses the same one, so using it would result in consistent behavior.

And it looks like XHR refers to the Content-Type header [1]. I think it&apos;s a good pattern to follow: there should be a way to pick non-UTF-8 charsets, although I agree that the preferable default should be UTF-8 and we shouldn&apos;t use agent-specific encoding detection algorithms.


[1] http://xhr.spec.whatwg.org/#final-charset</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102481</commentid>
    <comment_count>23</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-17 18:25:23 +0000</bug_when>
    <thetext>No, XMLHttpRequest does that due to legacy. There should be no way to pick anything outside utf-8. That is what we have for workers, WebVTT, and anything new. Allowing other encodings is a security hazard best avoided.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102483</commentid>
    <comment_count>24</comment_count>
    <who name="Morrita Hajime">morrita</who>
    <bug_when>2014-03-17 19:28:01 +0000</bug_when>
    <thetext>(In reply to Anne from comment #23)
&gt; No, XMLHttpRequest does that due to legacy. There should be no way to pick
&gt; anything outside utf-8. That is what we have for workers, WebVTT, and
&gt; anything new. Allowing other encodings is a security hazard best avoided.

On the other hand, there is a risk that a bad script could make the UA load existing non-UTF-8 HTML as UTF-8, which seems bad as well. We should probably stop with an error if the UA sees a non-UTF-8 charset in the Content-Type header.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102499</commentid>
    <comment_count>25</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-17 20:40:50 +0000</bug_when>
    <thetext>No that is fine. There is no risk in decoding using utf-8. The risk is with other encodings.

We might want to not enable sniffing though and require a text/html MIME type. Did you guys discuss that with Adam Barth?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102502</commentid>
    <comment_count>26</comment_count>
    <who name="Adam Klein">adamk</who>
    <bug_when>2014-03-17 20:51:40 +0000</bug_when>
    <thetext>(In reply to Anne from comment #25)
&gt; No that is fine. There is no risk in decoding using utf-8. The risk is with
&gt; other encodings.
&gt; 
&gt; We might want to not enable sniffing though and require a text/html MIME
&gt; type. Did you guys discuss that with Adam Barth?

Adam Barth commented on a code review that it seemed &quot;strange&quot; to him to force utf-8:

https://codereview.chromium.org/178883002/#msg2</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102539</commentid>
    <comment_count>27</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-18 09:42:06 +0000</bug_when>
    <thetext>He said that in the context of a patch that did not follow the suggestion to always use utf-8.

I don&apos;t really understand what the problem is here. There are numerous new contexts where we force utf-8; why are imports special?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102555</commentid>
    <comment_count>28</comment_count>
    <who name="Gordon P. Hemsley">gphemsley</who>
    <bug_when>2014-03-18 13:28:07 +0000</bug_when>
    <thetext>(In reply to Anne from comment #19)
&gt; No it&apos;s not. All kinds of things are different around script execution and
&gt; such. The HTML we have is something you load in a browsing context. This is
&gt; something else. Similarly HTML loaded through XMLHttpRequest has
&gt; restrictions on encodings (though not enough).

I agree with Anne&apos;s assessment that this would be a new &quot;import&quot; context which would require its own entry in the Context-specific sniffing section of mimesniff (akin to the style context, among others):

http://mimesniff.spec.whatwg.org/#context-specific-sniffing

As such, we can enforce any stricter restrictions on it that we want, and I think that would include requiring utf-8 and an appropriate MIME type. (I assume this would be specified through the Content-Type header, but Anne may have other ideas.)

One concern of mine regarding the MIME type is XHTML: These imports seem to be available to XHTML documents as well as HTML documents, so I would think that the imported documents would also be allowed to be XHTML documents. This would mean, in addition to allowing &quot;text/html&quot;, we would also need to allow an(y) XML type.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103829</commentid>
    <comment_count>29</comment_count>
    <who name="Morrita Hajime">morrita</who>
    <bug_when>2014-04-14 18:29:18 +0000</bug_when>
    <thetext>Discussed this at the last F2F: http://www.w3.org/2014/04/11-webapps-minutes.html

There was no strong objection. There were some concerns about inconsistency, like the ones I had, though. My feeling is that this is a small detail that has no perfect answer but that we can live with.

It seems that developers love the UTF-8 default because it allows them to keep imports concise,
omitting the whole &lt;head&gt; and making &lt;body&gt; implicit.

I think this is a valid use case or demand, so I restored the original change to do it:

https://github.com/w3c/webcomponents/commit/d4594a9e558f526da358e24a6808b55861c906ed</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>