Bugzilla – Bug 18397
Encoding Sniffing Algorithm: Clarify what "information on the likely encoding" covers
Last modified: 2013-06-11 11:41:07 UTC
Please clarify what the step "information on the likely encoding" covers.
For instance, does it cover the XML encoding declaration? Why? Why not?
In 2012, Chrome, Safari and Opera 12 still read the XML encoding declaration when the HTML encoding declaration is lacking.
In October 2009, Ian Hickson wrote: "So in the absence of more compelling reasons to add this, I'd rather get Opera and WebKit to remove the support for this, than add more"
However, it seems to me that the step "information on the likely encoding" would cover their asses. After all, the presence of <?xml version="1.0" encoding="UTF-8" ?> increases the chance that the encoding is UTF-8. Maybe the algorithm could be specific about what is allowed and what is not allowed in this step?
The spec should therefore offer more data on what this step of the sniffing algorithm refers to. Also see my blog post for more data.
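To make the question concrete, here is a minimal sketch of what treating the XML encoding declaration as a hint might look like. The function name `xml_declaration_hint` and the regular expression are illustrative assumptions; they are not defined by the spec and do not describe what any of the browsers above actually do.

```python
import re
from typing import Optional

# Hypothetical pattern: an XML declaration at the very start of the byte stream.
# It only handles ASCII-compatible encodings and is not a full XML parser.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declaration_hint(prefix: bytes) -> Optional[str]:
    """Return the encoding label named in a leading XML declaration, if any."""
    match = _XML_DECL.match(prefix)
    if match is None:
        return None
    # The label is restricted to ASCII by the pattern, so this decode is safe.
    return match.group(1).decode("ascii").lower()
```

Whether a user agent is allowed to feed such a hint into the "information on the likely encoding" step is exactly what this bug asks the spec to state.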
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:
Change Description: none
For context, this is about http://www.w3.org/html/wg/drafts/html/master/syntax.html#determining-the-character-encoding
The point of this clause is *precisely* to be open ended. The rest of the algorithm provides good foundations for interoperability in the vast majority of cases. But then you have situations in which something extra might be required. For instance, as mentioned in the spec, you may have manually overridden the encoding in a previous visit to this page. Or the browser may be calling out to a third-party service that has some smart heuristics about encodings that it can use to override. Or it may believe that during the full moon pages from the .paris domain switch to being encoded in UTF-9.
The important point is that for decisions that are covered by this clause, the confidence be set to "tentative". This allows the parser to change the encoding to something else if it gets better information.
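The role of the "tentative" confidence can be sketched roughly as follows. This is a simplified illustration, not the spec's algorithm: the names `EncodingDecision`, `apply_likely_encoding` and `on_better_information` are made up for this example, and the real parser has more steps and states than shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EncodingDecision:
    encoding: str
    confidence: str  # "tentative" or "certain"


def apply_likely_encoding(hint: Optional[str]) -> Optional[EncodingDecision]:
    """Open-ended step: whatever the source of the hint, accept it only tentatively."""
    if hint is None:
        return None  # no hint, so later steps of the algorithm decide
    return EncodingDecision(hint, "tentative")


def on_better_information(current: EncodingDecision, declared: str) -> EncodingDecision:
    """A tentative decision may be replaced by better information; a certain one may not."""
    if current.confidence == "tentative" and declared != current.encoding:
        return EncodingDecision(declared, "certain")
    return current
```

The design point is that the source of the hint (a remembered user override, a heuristic service, or anything else) does not matter to the rest of the parser; only the confidence level does.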