Bugzilla – Bug 21275
[Imports]: Force utf-8
Last modified: 2014-03-03 20:55:12 UTC
From annevk: Let's override all the encoding magic the HTML layer might bring with
it and simply always decode these resources using utf-8, just as we do
with workers and such.
Anne, can you help with good language here? I am not very well-versed in encodings. Or as you might say, my sniffing is not up to snuff.
Basically, I think you want to "utf-8 decode" the /byte stream/ and then parse the result of that operation using the HTML parser. XMLHttpRequest does something similar.
http://encoding.spec.whatwg.org/#utf-8-decode defines utf-8 decode.
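A rough illustration of the "utf-8 decode, then parse" order described above, in Python. This is only a sketch: the names utf8_decode, TagCollector and parse_import are made up for this example, Python's html.parser stands in for a real HTML parser, and Python's replacement of malformed bytes is assumed (not verified) to be close enough to the Encoding Standard for illustration.

```python
from html.parser import HTMLParser

UTF8_BOM = b"\xef\xbb\xbf"


def utf8_decode(byte_stream: bytes) -> str:
    """Decode bytes as UTF-8 regardless of any declared encoding.

    A leading UTF-8 BOM is dropped and malformed sequences become U+FFFD.
    """
    if byte_stream.startswith(UTF8_BOM):
        byte_stream = byte_stream[len(UTF8_BOM):]
    return byte_stream.decode("utf-8", errors="replace")


class TagCollector(HTMLParser):
    """Stand-in parser that only records start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_import(byte_stream: bytes):
    # Step 1: bytes -> code points. Step 2: hand the code points to the parser.
    text = utf8_decode(byte_stream)
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return text, parser.tags
```

The point of the two-step shape is that the parser never sees bytes, so none of its own encoding sniffing can kick in.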
3:42 PM <annevk> dglazkov: so now you cannot have CORS-cross-origin resources so that part of the spec doesn't make sense anymore
3:43 PM <annevk> dglazkov: you also need to say what happens if fetching failed
3:44 PM <annevk> I guess it's mostly fine otherwise, although I wonder if it shouldn't use a crossorigin attribute on <link> like most other things we have
Sorry for spam. Wrong bug.
You convert the stream into a code point stream. You cannot use that new stream as a byte stream. Language similar to http://xhr.spec.whatwg.org/#document-response-entity-body is probably enough: "Decode byte stream response entity body using fallback encoding charset and then let document be a document that represents the result of that, parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled."
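For contrast, a sketch of decoding with a fallback encoding, as the quoted XMLHttpRequest language does. It assumes, per my reading of the Encoding Standard's generic "decode" algorithm, that a byte order mark takes precedence over the fallback. The function name decode_with_fallback is hypothetical.

```python
import codecs

# BOMs checked before falling back to the caller-supplied encoding.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_with_fallback(byte_stream: bytes, fallback: str) -> str:
    """Turn a byte stream into a string of code points.

    A BOM, if present, selects the encoding and is stripped; otherwise the
    fallback encoding (e.g. the charset from the response) is used.
    """
    encoding = fallback
    for bom, name in _BOMS:
        if byte_stream.startswith(bom):
            encoding = name
            byte_stream = byte_stream[len(bom):]
            break
    return byte_stream.decode(encoding, errors="replace")
```

Either way, the output is a code point stream, and that is what gets handed to the HTML parser.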
Spoke with Hixie, he's going to give us a better hook.
(In reply to comment #8)
> Spoke with Hixie, he's going to give us a better hook.
Seems fine to me. I ended up not mentioning the input byte stream explicitly in XMLHttpRequest. Either way works I think.
In trying to implement this in Blink (see https://codereview.chromium.org/178883002/), I'm somewhat puzzled as to what's intended by the spec. It seems to be saying that the response is always interpreted as UTF-8. Per HTML, that overrides <meta charset>. Is that intentional? What about HTTP Content-Type headers? Is overriding those intentional? It seems strange to me to choose a different encoding than the one the document (or the document's server) claims the document is encoded in.
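To make the concern concrete, here is a small Python sketch of what forcing UTF-8 does to a resource that declares (and really uses) another encoding. The sample markup and the helper names decode_forced_utf8 and decode_declared are invented for illustration; this is not Blink's code path.

```python
def decode_forced_utf8(body: bytes) -> str:
    # Always UTF-8: <meta charset> and Content-Type are never consulted.
    return body.decode("utf-8", errors="replace")


def decode_declared(body: bytes, declared: str) -> str:
    # Honor the encoding the document (or its server) claims.
    return body.decode(declared, errors="replace")


# A document that is genuinely windows-1252 and says so.
body = '<meta charset="windows-1252"><p>caf\u00e9</p>'.encode("windows-1252")

forced = decode_forced_utf8(body)
declared = decode_declared(body, "windows-1252")
```

Here declared keeps the "é" intact, while forced ends up with U+FFFD in its place, since the lone 0xE9 byte is not valid UTF-8.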
Ping? I'm interested particularly in the opinions of morrita and annevk (who I think suggested this in the first place).
(In reply to Adam Klein from comment #13)
> Ping? I'm interested particularly in the opinions of morrita and annevk (who
> I think suggested this in the first place).
I'm inclined to do no special casing for either utf-8 or the quirks mode thing
as long as imports are documents, not fragments. There are many knobs that affect text encoding detection, and it seems hard to state this special case well.
If we switch to fragments, we can define how the transport layer of imports works separately from that of the document.