Bug 18397 - Encoding Sniffing Algorithm: Clarify what "information on the likely encoding" covers
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: This bug has no owner yet - up for the taking
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/spec/Overview...
Depends on:
Reported: 2012-07-25 13:51 UTC by Leif Halvard Silli
Modified: 2013-06-11 11:41 UTC

Description Leif Halvard Silli 2012-07-25 13:51:25 UTC
Please clarify what the step "information on the likely encoding" covers.

For instance, does it cover the XML encoding declaration? Why? Why not?

In 2012, Chrome, Safari and Opera 12 still read the XML encoding declaration when the HTML encoding declaration is lacking.
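
To illustrate the behaviour in question, here is a rough sketch of a lookup that falls back to the XML encoding declaration only when no HTML encoding declaration is found. It is not taken from any browser's source. The names declared_encoding, META_CHARSET and XML_DECL are made up for this example, and the regular expressions are a deliberate simplification of the real prescan.

```python
import re
from typing import Optional, Tuple

# Hypothetical, simplified patterns. The real HTML prescan is considerably
# more involved than a single regular expression.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)
XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._\-]+)["\']'
)


def declared_encoding(head: bytes) -> Optional[Tuple[str, str]]:
    """Return (encoding, source) for the first declaration found, else None.

    The HTML encoding declaration is tried first. The XML encoding
    declaration is consulted only when the HTML one is lacking.
    """
    m = META_CHARSET.search(head)
    if m:
        return m.group(1).decode("ascii").lower(), "meta"
    m = XML_DECL.match(head)
    if m:
        return m.group(1).decode("ascii").lower(), "xml-declaration"
    return None
```

With this sketch, a page that starts with <?xml version="1.0" encoding="UTF-8" ?> and has no meta element yields ("utf-8", "xml-declaration"), while a page with both declarations yields the meta value.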

In October 2009, Ian Hickson wrote: "So in the absence of more compelling reasons to add this, I'd rather get Opera and WebKit to remove the support for this, than add more" [1]

However, it seems to me that the step "information on the likely encoding" would cover their asses. After all, the presence of <?xml version="1.0" encoding="UTF-8" ?> increases the chance that the encoding is UTF-8. Maybe the algorithm could be specific about what is allowed and what is not allowed in this step?

The spec should therefore offer more data on what this step of the sniffing algorithm refers to. Also see my blog post for more data.[2]

[1] http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-October/023670.html
[2] http://målform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm
Comment 1 Robin Berjon 2013-06-11 11:41:07 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:


Status: Rejected
Change Description: none

For context, this is about http://www.w3.org/html/wg/drafts/html/master/syntax.html#determining-the-character-encoding

The point of this clause is *precisely* to be open ended. The rest of the algorithm provides good foundations for interoperability in the vast majority of cases. But then you have situations in which something extra might be required. For instance, as mentioned in the spec, you may have manually overridden the encoding in a previous visit to this page. Or the browser may be calling out to a third-party service with smart encoding heuristics that it can use to override the result. Or it may believe that during the full moon pages from the .paris domain switch to being encoded in UTF-9.

The important point is that for decisions that are covered by this clause, the confidence be set to "tentative". This allows the parser to change the encoding to something else if it gets better information.
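
As a rough illustration of that last point, here is a minimal sketch of how an open-ended hint step can coexist with a tentative confidence. It is a simplification and not the spec's algorithm: the names Guess, sniff and apply_better_information are hypothetical, the windows-1252 default is only an assumption for the example, and the hint callables stand in for whatever extra information a user agent might have.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Guess:
    encoding: str
    confidence: str  # either "tentative" or "certain"


def sniff(
    transport_charset: Optional[str],
    likely_hints: Iterable[Callable[[], Optional[str]]],
    default: str = "windows-1252",  # assumed default, for illustration only
) -> Guess:
    """Pick an encoding, marking anything from the hint step as tentative."""
    if transport_charset:
        # Explicit transport-level information is treated as certain here.
        return Guess(transport_charset.lower(), "certain")
    for hint in likely_hints:
        # Open-ended step: remembered user override, heuristics service, etc.
        enc = hint()
        if enc:
            return Guess(enc.lower(), "tentative")
    return Guess(default, "tentative")


def apply_better_information(current: Guess, new_encoding: str) -> Guess:
    """Switch encodings only if the current guess is still tentative."""
    if current.confidence == "tentative":
        return Guess(new_encoding.lower(), "certain")
    return current
```

For example, sniff(None, [lambda: "Shift_JIS"]) gives a tentative shift_jis guess, and a later call to apply_better_information with "UTF-8" replaces it, whereas a guess that was already certain is left alone.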