Bug 21275 - [Imports]: Force utf-8
Summary: [Imports]: Force utf-8
Status: RESOLVED FIXED
Alias: None
Product: WebAppsWG
Classification: Unclassified
Component: HISTORICAL - Component Model
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Dimitri Glazkov
QA Contact: public-webapps-bugzilla
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 20683
Reported: 2013-03-13 20:58 UTC by Dimitri Glazkov
Modified: 2014-04-14 18:29 UTC

Description Dimitri Glazkov 2013-03-13 20:58:59 UTC
From annevk: Let's override all the encoding magic the HTML layer might bring with
it and simply always decode these resources using utf-8, just as we do
with workers and such.
Comment 1 Dimitri Glazkov 2013-05-14 23:03:39 UTC
Anne, can you help with good language here? I am not very well-versed in encodings. Or as you might say, my sniffing is not up to snuff.
Comment 2 Anne 2013-05-16 16:28:20 UTC
Basically, I think you want to "utf-8 decode" the /byte stream/ and then parse the result of that operation using the HTML parser. XMLHttpRequest does something similar.

http://encoding.spec.whatwg.org/#utf-8-decode defines utf-8 decode.
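
As an illustration of the two steps described in this comment, here is a minimal Python sketch. It is an approximation under stated assumptions, not the spec text: `utf8_decode`, `TagCollector`, and `parse_import` are hypothetical names, Python's `errors="replace"` handling stands in for the Encoding Standard's error handling, and the standard library's `html.parser` stands in for a real HTML parser.

```python
import codecs
from html.parser import HTMLParser


def utf8_decode(byte_stream: bytes) -> str:
    """Approximate "utf-8 decode": drop a leading UTF-8 BOM, then decode,
    replacing malformed sequences with U+FFFD instead of raising."""
    if byte_stream.startswith(codecs.BOM_UTF8):
        byte_stream = byte_stream[len(codecs.BOM_UTF8):]
    return byte_stream.decode("utf-8", errors="replace")


class TagCollector(HTMLParser):
    """Stand-in for an HTML parser: records start tags only."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_import(byte_stream: bytes):
    # Step 1: bytes -> text, always with utf-8.
    text = utf8_decode(byte_stream)
    # Step 2: hand the decoded text (not the bytes) to the parser.
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return text, collector.tags
```

The point of the split is that the parser only ever sees decoded text, so none of its own encoding detection gets a chance to run.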
Comment 3 Dimitri Glazkov 2013-07-18 22:44:18 UTC
From IRC:

3:42 PM <annevk> dglazkov: so now you cannot have CORS-cross-origin resources so that part of the spec doesn't make sense anymore
3:43 PM <annevk> dglazkov: you also need to say what happens if fetching failed
Comment 4 Dimitri Glazkov 2013-07-18 22:47:46 UTC
3:44 PM <annevk> I guess it's mostly fine otherwise, although I wonder if it shouldn't use a crossorigin attribute on <link> like most other things we have
Comment 5 Dimitri Glazkov 2013-07-18 22:48:27 UTC
Sorry for spam. Wrong bug.
Comment 6 Dimitri Glazkov 2013-07-19 17:16:32 UTC
https://dvcs.w3.org/hg/webcomponents/rev/121eefa9215e
Comment 7 Anne 2013-07-19 17:48:45 UTC
You convert the stream into a code point stream. You cannot use that new stream as a byte stream. Language similar to http://xhr.spec.whatwg.org/#document-response-entity-body is probably enough: "Decode byte stream response entity body using fallback encoding charset and then let document be a document that represents the result of that, parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled."
Comment 8 Dimitri Glazkov 2013-07-19 18:28:24 UTC
Spoke with Hixie, he's going to give us a better hook.
Comment 9 Dimitri Glazkov 2013-07-19 20:12:25 UTC
(In reply to comment #8)
> Spoke with Hixie, he's going to give us a better hook.

WDYT? https://dvcs.w3.org/hg/webcomponents/rev/975f62535f11
Comment 10 Anne 2013-07-19 20:46:52 UTC
Seems fine to me. I ended up not mentioning the input byte stream explicitly in XMLHttpRequest. Either way works I think.
Comment 11 Dimitri Glazkov 2013-07-19 20:53:05 UTC
#winning
Comment 12 Adam Klein 2014-02-24 22:30:36 UTC
In trying to implement this in Blink (see https://codereview.chromium.org/178883002/), I'm somewhat puzzled as to what's intended by the spec. It seems to be saying that the response is always interpreted as UTF-8. Per HTML, that overrides <meta charset>. Is that intentional? What about HTTP Content-Type headers? Is overriding those intentional? It seems strange to me to choose a different encoding than the one the document (or the document's server) claims the document is encoded in.
Comment 13 Adam Klein 2014-03-03 20:35:33 UTC
Ping? I'm interested particularly in the opinions of morrita and annevk (who I think suggested this in the first place).
Comment 14 Morrita Hajime 2014-03-03 20:55:12 UTC
(In reply to Adam Klein from comment #13)
> Ping? I'm interested particularly in the opinions of morrita and annevk (who
> I think suggested this in the first place).

I'm inclined not to special-case either utf-8 or quirks mode
as long as imports are documents, not fragments. There are many knobs that affect text encoding detection, and it seems hard to state this special case well.

If we switch to fragments, we can define how the transport layer of imports works separately from that of the document.
Comment 15 Adam Klein 2014-03-13 20:44:18 UTC
(In reply to Morrita Hajime from comment #14)
> (In reply to Adam Klein from comment #13)
> > Ping? I'm interested particularly in the opinions of morrita and annevk (who
> > I think suggested this in the first place).
> 
> I'm inclined not to special-case either utf-8 or quirks mode
> as long as imports are documents, not fragments. There are many knobs that
> affect text encoding detection, and it seems hard to state this special case
> well.
> 
> If we switch to fragments, we can define how the transport layer of imports
> works separately from that of the document.

Seems like the spec should be updated to remove this utf-8 special casing for now, at least.
Comment 17 Anne 2014-03-16 14:29:37 UTC
No we should always use utf-8. Just like workers are always utf-8, new types of HTML resources should always be utf-8. <meta> and text/html's charset parameter are simply irrelevant.
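
To make the consequence concrete, the following Python sketch shows what "always utf-8" means for a resource that declares another encoding. The markup and the `decode_import` helper are hypothetical, and `errors="replace"` is used as an approximation of the Encoding Standard's replacement behavior.

```python
# Hypothetical import whose author declared windows-1252 via <meta>.
markup = '<meta charset="windows-1252"><p>caf\u00e9</p>'

# The same markup as served in two encodings.
legacy_bytes = markup.encode("windows-1252")  # "é" becomes the single byte 0xE9
utf8_bytes = markup.encode("utf-8")           # "é" becomes the bytes 0xC3 0xA9


def decode_import(body: bytes) -> str:
    """Always decode as utf-8; the declared charset is never consulted."""
    return body.decode("utf-8", errors="replace")


legacy_text = decode_import(legacy_bytes)
utf8_text = decode_import(utf8_bytes)

# The utf-8 resource round-trips; the legacy one gets U+FFFD where "é" was,
# even though its <meta> asked for windows-1252.
print(utf8_text == markup)
print("\ufffd" in legacy_text)
```

A charset parameter on the Content-Type header would be treated the same way under this proposal: it is simply not an input to decoding.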
Comment 18 Morrita Hajime 2014-03-17 16:51:02 UTC
(In reply to Anne from comment #17)
> No we should always use utf-8. Just like workers are always utf-8, new types
> of HTML resources should always be utf-8. <meta> and text/html's charset
> parameter are simply irrelevant.

The point here is that it is not clear whether HTML imports are a new type of resource or just HTML. If we change the type of imports to DocumentFragment, they would definitely be a new type of resource. But if an import is a Document, I think it's just HTML as we have always had it.
Comment 19 Anne 2014-03-17 17:02:33 UTC
No it's not. All kinds of things are different around script execution and such. The HTML we have is something you load in a browsing context. This is something else. Similarly HTML loaded through XMLHttpRequest has restrictions on encodings (though not enough).
Comment 20 Morrita Hajime 2014-03-17 17:13:37 UTC
(In reply to Anne from comment #19)
> XMLHttpRequest has restrictions on encodings (though not enough).
Ah OK, this is good news to me. Probably I should just revert the change.

Do you think the integration point used in [1] is the correct one?

[1] https://github.com/w3c/webcomponents/commit/77d82322655f528017322696470d487c2d09e722
Comment 21 Anne 2014-03-17 17:56:16 UTC
Yes, that was perfect.
Comment 22 Morrita Hajime 2014-03-17 18:14:29 UTC
(In reply to Anne from comment #21)
> Yes, that was perfect.
OK, I read the XHR spec and apparently it uses the same one, so using it would result in consistent behavior.

And it looks like XHR refers to the Content-Type header [1]. I think it's a good pattern to follow: there should be a way to pick non-UTF-8 charsets, although I agree that the preferable default should be UTF-8 and we shouldn't use agent-specific encoding detection algorithms.


[1] http://xhr.spec.whatwg.org/#final-charset
Comment 23 Anne 2014-03-17 18:25:23 UTC
No, XMLHttpRequest does that due to legacy. There should be no way to pick anything outside utf-8. That is what we have for workers, WebVTT, and anything new. Allowing other encodings is a security hazard best avoided.
Comment 24 Morrita Hajime 2014-03-17 19:28:01 UTC
(In reply to Anne from comment #23)
> No, XMLHttpRequest does that due to legacy. There should be no way to pick
> anything outside utf-8. That is what we have for workers, WebVTT, and
> anything new. Allowing other encodings is a security hazard best avoided.

On the other hand, there is a risk that a bad script could make the UA load existing non-UTF-8 HTML as UTF-8, which seems bad as well. Probably we should stop with an error if the UA sees a non-UTF-8 charset in the Content-Type header.
Comment 25 Anne 2014-03-17 20:40:50 UTC
No that is fine. There is no risk in decoding using utf-8. The risk is with other encodings.

We might want to not enable sniffing though and require a text/html MIME type. Did you guys discuss that with Adam Barth?
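
A rough Python sketch of the check suggested here, requiring a text/html MIME type without sniffing. The function names are hypothetical, the parameter handling is deliberately simplistic, and rejecting a missing header is an assumption about what "no sniffing" would imply rather than something this thread specifies.

```python
from typing import Optional


def mime_essence(content_type: str) -> str:
    """Return "type/subtype" in lowercase, dropping parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def may_load_as_import(content_type: Optional[str]) -> bool:
    """Accept a response only if it is explicitly labeled text/html."""
    if content_type is None:
        # Assumption: with sniffing disabled, an unlabeled response is rejected
        # rather than inspected to guess its type.
        return False
    return mime_essence(content_type) == "text/html"


# The charset parameter does not affect the decision; decoding is utf-8 regardless.
print(may_load_as_import("text/html; charset=shift_jis"))
print(may_load_as_import("text/plain"))
```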
Comment 26 Adam Klein 2014-03-17 20:51:40 UTC
(In reply to Anne from comment #25)
> No that is fine. There is no risk in decoding using utf-8. The risk is with
> other encodings.
> 
> We might want to not enable sniffing though and require a text/html MIME
> type. Did you guys discuss that with Adam Barth?

Adam Barth commented on a code review that it seemed "strange" to him to force utf-8:

https://codereview.chromium.org/178883002/#msg2
Comment 27 Anne 2014-03-18 09:42:06 UTC
He said that in the context of a patch that did not follow the suggestion to always use utf-8.

I don't really understand what the problem is here. There are numerous new contexts where we force utf-8, why are imports special?
Comment 28 Gordon P. Hemsley 2014-03-18 13:28:07 UTC
(In reply to Anne from comment #19)
> No it's not. All kinds of things are different around script execution and
> such. The HTML we have is something you load in a browsing context. This is
> something else. Similarly HTML loaded through XMLHttpRequest has
> restrictions on encodings (though not enough).

I agree with Anne's assessment that this would be a new "import" context which would require its own entry in the Context-specific sniffing section of mimesniff (akin to the style context, among others):

http://mimesniff.spec.whatwg.org/#context-specific-sniffing

As such, we can enforce any stricter restrictions on it that we want, and I think that would include requiring utf-8 and an appropriate MIME type. (I assume this would be specified through the Content-Type header, but Anne may have other ideas.)

One concern of mine regarding the MIME type is XHTML: These imports seem to be available to XHTML documents as well as HTML documents, so I would think that the imported documents would also be allowed to be XHTML documents. This would mean, in addition to allowing "text/html", we would also need to allow an(y) XML type.
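
If XML types were allowed as this comment suggests, the type check might look like the Python sketch below. The function names are hypothetical; the XML test follows the common definition of an XML MIME type (subtype ending in "+xml", or text/xml, or application/xml), and whether imports would accept such types at all is left open in this thread.

```python
def is_xml_mime_type(essence: str) -> bool:
    """True for text/xml, application/xml, and any type whose subtype ends in +xml."""
    return essence in ("text/xml", "application/xml") or essence.endswith("+xml")


def acceptable_import_type(content_type: str) -> bool:
    """Hypothetical policy: allow text/html plus any XML MIME type."""
    essence = content_type.split(";", 1)[0].strip().lower()
    return essence == "text/html" or is_xml_mime_type(essence)


print(acceptable_import_type("application/xhtml+xml"))
print(acceptable_import_type("text/plain"))
```

Note that "any XML type" is broad: under this rule a type such as image/svg+xml would also pass, which may or may not be desirable.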
Comment 29 Morrita Hajime 2014-04-14 18:29:18 UTC
Discussed this at the last F2F: http://www.w3.org/2014/04/11-webapps-minutes.html

There was no strong objection. There were some concerns about inconsistency, as I had, though. My feeling is that this is a small detail that has no perfect answer but that we can live with.

It seems that developers love default UTF-8 because it keeps imports concise,
letting them omit the whole <head> and leave <body> implicit.

I think this is a valid use case or demand. So I restored the original change to do it:

https://github.com/w3c/webcomponents/commit/d4594a9e558f526da358e24a6808b55861c906ed