This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 25534 - HTML spec should not encourage auto-detecting UTF-8
Summary: HTML spec should not encourage auto-detecting UTF-8
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-02 11:31 UTC by Masatoshi Kimura
Modified: 2014-08-27 23:12 UTC
CC List: 5 users

See Also:



Description Masatoshi Kimura 2014-05-02 11:31:11 UTC
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding paragraph 8 says:

 "The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User-agents are therefore encouraged to search for this common encoding."

But Gecko will never follow this. See Gecko bug 815551 for details. Therefore the paragraph will only confuse readers. Luckily, "Note:" is not a normative part of the spec, so we can just remove it.
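For illustration, the "highly detectable bit pattern" the quoted note refers to is UTF-8's lead/continuation-byte structure. A minimal sketch of such a check in Python (simplified, and not an algorithm taken from the spec or from any browser) might look like this:

def looks_like_utf8(data):
    """Return True if data contains non-ASCII bytes and every such byte
    fits UTF-8's lead/continuation pattern. Simplified illustrative check
    (it ignores overlong/surrogate corner cases), not the spec's algorithm."""
    i, saw_non_ascii = 0, False
    while i < len(data):
        b = data[i]
        if b < 0x80:                     # plain ASCII, tells us nothing
            i += 1
            continue
        saw_non_ascii = True
        if 0xC2 <= b <= 0xDF:            # lead byte of a 2-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:          # lead byte of a 3-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF4:          # lead byte of a 4-byte sequence
            n = 3
        else:
            return False                 # stray continuation or invalid lead
        for j in range(1, n + 1):        # continuation bytes are 10xxxxxx
            if i + j >= len(data) or not 0x80 <= data[i + j] <= 0xBF:
                return False
        i += n + 1
    return saw_non_ascii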
Comment 1 Ian 'Hixie' Hickson 2014-05-02 18:33:13 UTC
I don't understand the arguments in that bug. If you're parsing a page as Win1252 and you find that it has high bits set and is valid UTF-8, it's almost certainly being decoded wrongly. So why not at least tell the user?

If you're doing any autodetection at all, then detecting UTF-8 seems like the obvious choice.

If you're not doing any autodetection at all, then the whole step is irrelevant.

I could change the phrasing of the note to be more conditional and recommend against any sniffing at all, if you think that'd make sense.
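
The "tell the user" check described above could, as a rough sketch, look like the following (the function name and the plain printed warning are made up for illustration; nothing here is from the spec or from any UA):

def warn_if_probably_utf8(data, declared_encoding):
    """If a document labelled with a legacy encoding contains non-ASCII
    bytes that also form valid UTF-8, it is very likely mislabelled.
    Illustrative sketch only."""
    if declared_encoding.lower().replace("-", "") == "utf8":
        return
    has_high_bytes = any(b > 0x7F for b in data)
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    if has_high_bytes and valid_utf8:
        print("warning: document labelled %s decodes cleanly as UTF-8; "
              "the label is probably wrong" % declared_encoding)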
Comment 2 Anne 2014-05-06 14:27:40 UTC
We want to make the web less dependent on sniffing. Where we think sniffing might still be necessary, we try to scope it to the TLD of the site rather than the locale of the user (as discussed on the WHATWG list).

The specification should slowly align with this, since in the end making this whole thing deterministic would be great. For example, we found that sniffing for UTF-8 is not something we need, but sniffing for Japanese might be necessary.
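
A toy sketch of what TLD-scoped fallback could look like; the mapping below is made up for illustration and is not a table from any spec or browser:

# Hypothetical TLD-to-fallback mapping, for illustration only.
FALLBACK_BY_TLD = {
    "jp": "shift_jis",       # legacy Japanese content
    "ru": "windows-1251",    # legacy Russian content
    "ua": "windows-1251",    # legacy Ukrainian content
}

def fallback_encoding(host, default="windows-1252"):
    """Guess a fallback from the site's TLD instead of the user's locale."""
    tld = host.rsplit(".", 1)[-1].lower()
    return FALLBACK_BY_TLD.get(tld, default)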
Comment 3 Ian 'Hixie' Hickson 2014-06-05 21:52:04 UTC
This step isn't locale-specific. I agree that making things based on the TLD and not the locale is an improvement, but I don't think that's relevant here.
Comment 4 Anne 2014-06-06 08:12:40 UTC
The other problem with allowing sniffing is that if one UA starts doing it, others will have to follow if content starts depending on it.

You start with the premise that there might be such content. I am hoping we can prevent such content from growing by not sniffing and encouraging people to define their encoding.

We need to get out of the non-determinism.
Comment 5 Ian 'Hixie' Hickson 2014-06-06 20:11:20 UTC
The argument in comment 4 is about removing step 8 entirely. That's a different bug. This bug is about the non-normative note in step 8 (for which the solution might be the last paragraph of comment 1?).
Comment 6 Anne 2014-06-07 12:56:11 UTC
Fair. We want to remove sniffing entirely, but for now we think that for Japanese we might not be able to get away with that; same for Russian/Ukrainian. Those are the last cases we might need to define sniffing for, and hopefully we can eventually define an algorithm for that rather than the current setup.

So yes, last paragraph of comment 1 sounds good.
Comment 7 Ian 'Hixie' Hickson 2014-06-09 16:51:50 UTC
I'm all for that. Please do file a bug if you think we can get there!

In the meantime, I'll see about updating the note...
Comment 8 Henri Sivonen 2014-06-10 08:32:43 UTC
So:
 1) UTF-8 is reliably detectable when you have the whole stream to look at.
 2) With http[s], you don't have the whole stream before you are expected to display something.
 3) With http[s], you don't have the whole stream before the encoding is expected to inherit into subresources.

My conclusion is that the encouragement to detect UTF-8 should be scoped to being a suggestion for the case where you are reading normal files (not endless streams that have a file path) from file: URLs. I think it's harmful to have it in the spec as a general suggestion in the absence of a detailed explanation of how to reconcile the sniffing with incremental parsing and encoding inheritance into subresources.
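
A sketch of how that file:-only scoping could work in practice; the function and its fallback choice are illustrative, not something defined by the spec:

def encoding_for_local_file(path, fallback="windows-1252"):
    """For a file: URL the whole byte stream is available up front, so a
    full UTF-8 validity check is cheap and avoids the incremental-parsing
    and subresource-inheritance problems above. Illustrative only."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback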
Comment 9 Ian 'Hixie' Hickson 2014-06-10 16:46:11 UTC
That seems reasonable.
Comment 10 Ian 'Hixie' Hickson 2014-08-27 23:12:30 UTC
Please reopen if the new text isn't good enough.
Comment 11 contributor 2014-08-27 23:12:46 UTC
Checked in as WHATWG revision r8722.
Check-in comment: Adjust notes on encoding detection
http://html5.org/tools/web-apps-tracker?from=8721&to=8722