21275 2013-03-13 20:58:59 +0000 [Imports]: Force utf-8 2014-04-14 18:29:18 +0000 1 1 1 Unclassified WebAppsWG HISTORICAL - Component Model unspecified PC All RESOLVED FIXED P2 normal --- 20683 1 dglazkov dglazkov adamk annevk gphemsley ian morrita public-webapps-bugzilla oldest_to_newest

84366 0 dglazkov 2013-03-13 20:58:59 +0000
From annevk: Let's override all the encoding magic the HTML layer might bring with it and simply always decode these resources using utf-8, just as we do with workers and such.

87695 1 dglazkov 2013-05-14 23:03:39 +0000
Anne, can you help with good language here? I am not very well-versed in encodings. Or as you might say, my sniffing is not up to snuff.

87781 2 annevk 2013-05-16 16:28:20 +0000
Basically. I think you want "utf-8 decode" /byte stream/ and then parse the result of that operation using the HTML parser. XMLHttpRequest does something similar. http://encoding.spec.whatwg.org/#utf-8-decode defines utf-8 decode.

90957 3 dglazkov 2013-07-18 22:44:18 +0000
From IRC:
3:42 PM <annevk> dglazkov: so now you cannot have CORS-cross-origin resources so that part of the spec doesn't make sense anymore
3:43 PM <annevk> dglazkov: you also need to say what happens if fetching failed

90958 4 dglazkov 2013-07-18 22:47:46 +0000
3:44 PM <annevk> I guess it's mostly fine otherwise, although I wonder if it shouldn't use a crossorigin attribute on <link> like most other things we have

90960 5 dglazkov 2013-07-18 22:48:27 +0000
Sorry for spam. Wrong bug.

90994 6 dglazkov 2013-07-19 17:16:32 +0000
https://dvcs.w3.org/hg/webcomponents/rev/121eefa9215e

91000 7 annevk 2013-07-19 17:48:45 +0000
You convert the stream into a code point stream. You cannot use that new stream as a byte stream.
Language similar to http://xhr.spec.whatwg.org/#document-response-entity-body is probably enough: "Decode byte stream response entity body using fallback encoding charset and then let document be a document that represents the result of that, parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled."

91002 8 dglazkov 2013-07-19 18:28:24 +0000
Spoke with Hixie, he's going to give us a better hook.

91007 9 dglazkov 2013-07-19 20:12:25 +0000
(In reply to comment #8)
> Spoke with Hixie, he's going to give us a better hook.

WDYT? https://dvcs.w3.org/hg/webcomponents/rev/975f62535f11

91008 10 annevk 2013-07-19 20:46:52 +0000
Seems fine to me. I ended up not mentioning the input byte stream explicitly in XMLHttpRequest. Either way works I think.

91010 11 dglazkov 2013-07-19 20:53:05 +0000
#winning

101279 12 adamk 2014-02-24 22:30:36 +0000
In trying to implement this in Blink (see https://codereview.chromium.org/178883002/), I'm somewhat puzzled as to what's intended by the spec. It seems to be saying that the response is always interpreted as UTF-8. Per HTML, that overrides <meta charset>. Is that intentional? What about HTTP Content-Type headers? Is overriding those intentional? It seems strange to me to choose a different encoding than the document (or document's server) claim the document is encoded in.

101786 13 adamk 2014-03-03 20:35:33 +0000
Ping? I'm interested particularly in the opinions of morrita and annevk (who I think suggested this in the first place).

101787 14 morrita 2014-03-03 20:55:12 +0000
(In reply to Adam Klein from comment #13)
> Ping? I'm interested particularly in the opinions of morrita and annevk (who
> I think suggested this in the first place).

I'm inclined to do no special casing for both utf-8 and quirks mode thing as long as imports are documents not fragments. There are many knobs to affect text encoding detection and it seems hard to state this special case well.
If we switch to fragments, we can define how the transport layer of imports works separately from that of the document.

102316 15 adamk 2014-03-13 20:44:18 +0000
(In reply to Morrita Hajime from comment #14)
> (In reply to Adam Klein from comment #13)
> > Ping? I'm interested particularly in the opinions of morrita and annevk (who
> > I think suggested this in the first place).
>
> I'm inclined to do no special casing for both utf-8 and quirks mode thing
> as long as imports are documents not fragments. There are many knobs to
> affect text encoding detection and it seems hard to state this special case
> well.
>
> If we switch to fragments, we can define how the transport layer of imports
> works separately from that of the document.

Seems like the spec should be updated to remove this utf-8 special casing for now, at least.

102326 16 morrita 2014-03-13 22:14:54 +0000
https://github.com/w3c/webcomponents/commit/77d82322655f528017322696470d487c2d09e722

102433 17 annevk 2014-03-16 14:29:37 +0000
No, we should always use utf-8. Just like workers are always utf-8, new types of HTML resources should always be utf-8. <meta> and text/html's charset parameter are simply irrelevant.

102466 18 morrita 2014-03-17 16:51:02 +0000
(In reply to Anne from comment #17)
> No we should always use utf-8. Just like workers are always utf-8, new types
> of HTML resources should always be utf-8. <meta> and text/html's charset
> parameter are simply irrelevant.

The point here is that it is not clear whether HTML imports are a new type of resource or just HTML. If we turn the type of imports into DocumentFragments, it would definitely be a new type of resource. But if it is a Document, I think it's just HTML as we have had.

102470 19 annevk 2014-03-17 17:02:33 +0000
No it's not. All kinds of things are different around script execution and such. The HTML we have is something you load in a browsing context. This is something else.
Similarly HTML loaded through XMLHttpRequest has restrictions on encodings (though not enough).

102471 20 morrita 2014-03-17 17:13:37 +0000
(In reply to Anne from comment #19)
> XMLHttpRequest has restrictions on encodings (though not enough).

Ah OK, this is good news to me. Probably I should just revert the change. Do you think the integration point used in [1] is the correct one?

[1] https://github.com/w3c/webcomponents/commit/77d82322655f528017322696470d487c2d09e722

102473 21 annevk 2014-03-17 17:56:16 +0000
Yes, that was perfect.

102477 22 morrita 2014-03-17 18:14:29 +0000
(In reply to Anne from comment #21)
> Yes, that was perfect.

OK, I read the XHR spec and apparently it uses the same one, so using it would result in consistent behavior. And it looks like XHR refers to the Content-Type header [1]. I think it's a good pattern to follow: there should be a way to pick non-UTF-8 charsets, although I agree that the preferable default should be UTF-8 and we shouldn't use agent-specific encoding detection algorithms.

[1] http://xhr.spec.whatwg.org/#final-charset

102481 23 annevk 2014-03-17 18:25:23 +0000
No, XMLHttpRequest does that due to legacy. There should be no way to pick anything outside utf-8. That is what we have for workers, WebVTT, and anything new. Allowing other encodings is a security hazard best avoided.

102483 24 morrita 2014-03-17 19:28:01 +0000
(In reply to Anne from comment #23)
> No, XMLHttpRequest does that due to legacy. There should be no way to pick
> anything outside utf-8. That is what we have for workers, WebVTT, and
> anything new. Allowing other encodings is a security hazard best avoided.

On the other hand, there is a risk that a bad script could let the UA load existing non-UTF-8 HTML using UTF-8, which seems bad as well. Probably we should stop with an error if the UA sees non-UTF charset in Content-Type header.

102499 25 annevk 2014-03-17 20:40:50 +0000
No that is fine. There is no risk in decoding using utf-8. The risk is with other encodings.
We might want to not enable sniffing though and require a text/html MIME type. Did you guys discuss that with Adam Barth?

102502 26 adamk 2014-03-17 20:51:40 +0000
(In reply to Anne from comment #25)
> No that is fine. There is no risk in decoding using utf-8. The risk is with
> other encodings.
>
> We might want to not enable sniffing though and require a text/html MIME
> type. Did you guys discuss that with Adam Barth?

Adam Barth commented on a code review that it seemed "strange" to him to force utf-8: https://codereview.chromium.org/178883002/#msg2

102539 27 annevk 2014-03-18 09:42:06 +0000
He said that in the context of a patch that did not follow the suggestion to always use utf-8. I don't really understand what the problem is here. There are numerous new contexts where we force utf-8, why are imports special?

102555 28 gphemsley 2014-03-18 13:28:07 +0000
(In reply to Anne from comment #19)
> No it's not. All kinds of things are different around script execution and
> such. The HTML we have is something you load in a browsing context. This is
> something else. Similarly HTML loaded through XMLHttpRequest has
> restrictions on encodings (though not enough).

I agree with Anne's assessment that this would be a new "import" context which would require its own entry in the Context-specific sniffing section of mimesniff (akin to the style context, among others): http://mimesniff.spec.whatwg.org/#context-specific-sniffing

As such, we can enforce any stricter restrictions on it that we want, and I think that would include requiring utf-8 and an appropriate MIME type. (I assume this would be specified through the Content-Type header, but Anne may have other ideas.)

One concern of mine regarding the MIME type is XHTML: These imports seem to be available to XHTML documents as well as HTML documents, so I would think that the imported documents would also be allowed to be XHTML documents.
This would mean, in addition to allowing "text/html", we would also need to allow an(y) XML type.

103829 29 morrita 2014-04-14 18:29:18 +0000
Discussed this at the last F2F: http://www.w3.org/2014/04/11-webapps-minutes.html

There was no strong objection. There were some concerns about inconsistency, as I had, though. My feeling is that this is a small detail that has no perfect answer but that we can live with. It seems that developers love default UTF-8 because it allows keeping imports concise, omitting the whole <head> and making <body> implicit. I think this is a valid use case or demand. So I restored the original change to do it: https://github.com/w3c/webcomponents/commit/d4594a9e558f526da358e24a6808b55861c906ed
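
For readers following the thread, the behavior that was finally adopted (always "utf-8 decode" the import's byte stream, regardless of <meta charset> or the Content-Type charset parameter) can be sketched roughly as below. This is only an illustration in Python, not spec text or the Blink patch: the names `utf8_decode` and `decode_import` are hypothetical, and Python's "replace" error handler is assumed to be a close enough stand-in for the replacement-character behavior of the Encoding Standard's decoder.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def utf8_decode(byte_stream: bytes) -> str:
    """Rough approximation of "utf-8 decode": drop a leading BOM if present,
    then decode, turning malformed bytes into U+FFFD instead of failing."""
    if byte_stream.startswith(UTF8_BOM):
        byte_stream = byte_stream[len(UTF8_BOM):]
    return byte_stream.decode("utf-8", errors="replace")


def decode_import(body: bytes, content_type: str = "") -> str:
    """Hypothetical helper: decode an import's response body.

    content_type is accepted only to show that any charset parameter in it
    is deliberately ignored. A <meta charset> inside the body is likewise
    irrelevant, because decoding happens before the markup is parsed.
    """
    return utf8_decode(body)


if __name__ == "__main__":
    body = "<meta charset=windows-1252><p>héllo</p>".encode("utf-8")
    # The declared windows-1252 charset has no effect on the result.
    print(decode_import(body, "text/html; charset=windows-1252"))
```

Under these assumptions, a legacy-encoded resource is not rejected; its non-UTF-8 bytes simply come out as U+FFFD, which matches the point made in comment 25 that decoding as utf-8 carries no risk.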