This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding paragraph 8 says: "The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User-agents are therefore encouraged to search for this common encoding." But Gecko will never follow this. See Gecko bug 815551 for details. Therefore the paragraph will only confuse readers. Luckily, "Note:" is not a normative part of the spec, so we can just remove it.
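The "highly detectable bit pattern" the quoted note refers to is the fixed shape of UTF-8 multi-byte sequences: a lead byte in a restricted range followed by continuation bytes of the form 0b10xxxxxx. A minimal sketch of such a check (a hypothetical helper, not the spec's algorithm — it skips some finer points like overlong 3-byte forms and surrogate ranges):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if every byte >= 0x80 participates in a
    plausibly well-formed UTF-8 multi-byte sequence.
    Simplified sketch; real validators also reject overlong
    encodings and surrogate code points."""
    i = 0
    saw_high = False
    while i < len(data):
        b = data[i]
        if b < 0x80:            # ASCII byte, always consistent with UTF-8
            i += 1
            continue
        saw_high = True
        # Expected number of continuation bytes, from the lead byte.
        if 0xC2 <= b <= 0xDF:
            n = 1
        elif 0xE0 <= b <= 0xEF:
            n = 2
        elif 0xF0 <= b <= 0xF4:
            n = 3
        else:                   # 0x80-0xC1, 0xF5-0xFF: never a valid lead
            return False
        # Every continuation byte must match 0b10xxxxxx.
        for j in range(1, n + 1):
            if i + j >= len(data) or data[i + j] & 0xC0 != 0x80:
                return False
        i += n + 1
    # Pure ASCII matches the pattern but is no evidence either way;
    # only claim "very likely UTF-8" if a high byte was actually seen.
    return saw_high
```

For example, `"héllo".encode("utf-8")` passes this check, while the same text encoded as windows-1252 (`b"h\xe9llo"`) fails it, which is the asymmetry the spec note is appealing to.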
I don't understand the arguments in that bug. If you're parsing a file and you find that the page has high bits set and is valid UTF-8, but you're parsing it as Win1252, it's almost certainly being decoded wrongly. So why not at least tell the user? If you're doing any autodetection at all, then detecting UTF-8 seems like the obvious choice. If you're not doing any autodetection at all, then the whole step is irrelevant. I could change the phrasing of the note to be more conditional and recommend against any sniffing at all, if you think that would make sense.
We want to make the web less dependent on sniffing. And where we think sniffing might still be necessary, we try to scope it to the TLD of the site rather than the locale of the user (as discussed on the WHATWG list). The specification should gradually align with this, since in the end making this whole thing deterministic would be great. So, e.g., we found that sniffing for UTF-8 is not something we need, but sniffing for Japanese might be necessary.
This step isn't locale-specific. I agree that making things based on the TLD and not the locale is an improvement, but I don't think that's relevant here.
The other problem with allowing sniffing is that if one UA starts doing it, others will have to follow if content starts depending on it. You start with the premise that there might be such content. I am hoping we can prevent such content from growing by not sniffing and encouraging people to define their encoding. We need to get out of the non-determinism.
The argument in comment 4 is about removing step 8 entirely. That's a different bug. This bug is about the non-normative note in step 8 (for which the solution might be the last paragraph of comment 1?).
Fair. We want to remove sniffing entirely, but for now we think that for Japanese we might not be able to get away with that; same for Russian/Ukrainian. Those are the last cases we might need to define sniffing for, and hopefully we can eventually define an algorithm for it rather than the current setup. So yes, the last paragraph of comment 1 sounds good.
I'm all for that. Please do file a bug if you think we can get there! In the meantime, I'll see about updating the note...
So: 1) UTF-8 is reliably detectable when you have the whole stream to look at. 2) With http[s], you don't have the whole stream before you are expected to display something. 3) With http[s], you don't have the whole stream before the encoding is expected to inherit into subresources. My conclusion is that the encouragement to detect UTF-8 should be scoped to being a suggestion for the case where you are reading normal files (not endless streams that happen to have a file path) from file: URLs. I think it's harmful to have it in the spec as a general suggestion in the absence of a detailed explanation of how to reconcile the sniffing with incremental parsing and encoding inheritance into subresources.
That seems reasonable.
Please reopen if the new text isn't good enough.
Checked in as WHATWG revision r8722. Check-in comment: Adjust notes on encoding detection http://html5.org/tools/web-apps-tracker?from=8721&to=8722