Bug 12897 - In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm)
Status: RESOLVED WONTFIX
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: PC All
Importance: P3 major
Target Milestone: ---
Assigned To: contributor
HTML WG Bugzilla archive list
http://dev.w3.org/html5/spec/parsing#...
Reported: 2011-06-06 20:01 UTC by Leif Halvard Silli
Modified: 2011-12-31 03:34 UTC (History)
Attachments
Polyglot file with BOM, served as 'application/xhtml+x; charset=koi8-r' (3.18 KB, application/xhtml+xml;charset=koi8-r)
2011-06-06 21:04 UTC, Leif Halvard Silli
Polyglot file with BOM served as 'text/html charset=koi8-r' (3.18 KB, text/html; charset=koi8-r)
2011-06-06 21:09 UTC, Leif Halvard Silli

Description Leif Halvard Silli 2011-06-06 20:01:58 UTC
PROPOSAL: 
   Specify IE's and Webkit's handling of the UTF-8 Byte Order Mark as REQUIRED: whenever the document begins with the UTF-8 Byte Order Mark, ignore the encoding info in the HTTP "Content-Type: text/html; charset=[encodingname]" header, and likewise ignore any user action to override the document's encoding.
   Consequently,
      * when there is a UTF-8 BOM, the encoding info provided by HTTP and by the user should be treated as irrelevant
      * the first two steps of the encoding sniffing algorithm must be changed

CURRENT STATUS: 
   The first two steps of the encoding sniffing algorithm give users and the transport layer (HTTP/MIME) the power to override a document's character encoding:

   ]] 1. If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.
      2. If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.[[

   HOWEVER, the reality is that two major user agents operate with an exception to the above rules: whenever the document includes the UTF-8 Byte Order Mark, Internet Explorer and Webkit
    - do *not* allow users to override the encoding
    - do *not* respect the encoding information in the HTTP server's Content-Type header
    - do *not* permit their heuristic character detection features to guess any encoding other than UTF-8
   Consequently, in IE and Webkit it is impossible for the user - as well as for an HTTP server - to cause a document with the UTF-8 Byte Order Mark to be interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.

   In contrast, Firefox and Opera
    - *do* obey the HTTP server's Content-Type header even when there is a UTF-8 BOM
    - *do* allow users to override the encoding even when there is a UTF-8 BOM
    - *do* permit their heuristic character detection features to guess an encoding other than UTF-8 (Opera and Firefox allow their users to tune/fiddle with how their heuristic encoding sniffing works.)
   Consequently, in Firefox and Opera it is *possible* for the user - as well as for an HTTP server - to cause a document with the UTF-8 Byte Order Mark to be interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.
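
   For illustration, the two behaviours described above can be sketched as two orderings of the same checks. This is a simplified sketch, not spec text: the function names are hypothetical, step 1's "optionally" and step 2's "if it is supported" are not modeled, and placing the BOM check right after the transport layer is an assumption consistent with the Firefox/Opera behaviour described here.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

def sniff_spec_order(user_override, http_charset, body):
    """Hypothetical model of the quoted steps: user first, then transport layer."""
    if user_override:
        return user_override
    if http_charset:
        return http_charset
    if body.startswith(UTF8_BOM):
        return "utf-8"
    return None  # later steps (meta prescan, heuristics) are not modeled

def sniff_bom_first(user_override, http_charset, body):
    """Hypothetical model of the proposal: a UTF-8 BOM beats user and HTTP."""
    if body.startswith(UTF8_BOM):
        return "utf-8"
    return sniff_spec_order(user_override, http_charset, body)

body = UTF8_BOM + b"<!DOCTYPE html>"
print(sniff_spec_order(None, "koi8-r", body))  # the Firefox/Opera-like result
print(sniff_bom_first(None, "koi8-r", body))   # the IE/Webkit-like result
```

   With a BOM present and "charset=koi8-r" on the wire, the first ordering yields koi8-r and the second yields utf-8; without a BOM the two orderings agree.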

BENEFITS:
    A. Harmonization with XML 1.0 Appendix F.2, "Priorities in the Presence of External Encoding Information", which recommends that the BOM have higher priority than external encoding information: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info  (Opera/Firefox do not yet implement this XML 1.0 recommendation)
    B. A simple, reliable way to specify the UTF-8 encoding
    C. Firefox/Opera converge with IE/Webkit = browsers become more interoperable
    D. Security: chameleon documents (where the document gets another and risky interpretation when read as a legacy encoding) become more difficult to create
    E. User experience: less "gibberish" and less "mojibake" for users (see http://en.wikipedia.org/wiki/Mojibake)
    F. Same as A): Promotes a polyglot way to specify the encoding: the BOM works in both HTML and XML. (The Polyglot spec already says that the UTF-8 BOM is the most polyglot encoding method.)

Other justifications:
   - Opera Software: "We have introduced the BOM as requirement for each source file when we have written the build tools as a simple way to verify that all files are utf8 encoded". (http://stackoverflow.com/questions/4658985/how-to-keep-the-bom-when-editing-files-in-espresso)

NOTES: 

   (1) Browsers tested as part of this bug report: IE8, Safari, Chrome (which shows above described behavior) as well as Opera and Firefox (which do support this behavior). Other browsers, e.g. KHTML, have not been tested.
   (2) BOM in UTF-16: I have not looked into how BOM in UTF-16 is handled by parsers.
   (3) For the record: All browsers, including Firefox and Opera, *do* already ignore the META charset *element* whenever there is a UTF-8 BOM. This bug report says that they should *also* ignore HTTP.
Comment 1 Leif Halvard Silli 2011-06-06 20:15:50 UTC
A minor data correction:

(In reply to comment #0)
   [ snip ]
> NOTES: 
> 
>    (1) Browsers tested as part of this bug report: IE8, Safari, Chrome (which
> shows above described behavior) as well as Opera and Firefox (which do support
> this behavior). Other browsers, e.g. KHTML, have not been tested.

The last parenthesis should be read "(which do *not* support this behaviour)".
Comment 2 Leif Halvard Silli 2011-06-06 20:46:34 UTC
Quirks-mode:  Additionally, the Opera/Firefox behaviour sends those browsers into Quirks-Mode (because they see some illegal characters before the DOCTYPE), whereas Internet Explorer and Webkit remain in no-quirks mode.
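
To illustrate why the BOM can end up in front of the DOCTYPE: a small Python sketch, using only the standard codecs, of what the three BOM bytes become when a decoder is told the document is in a legacy encoding. The sample document is made up for the example.

```python
import codecs

# Made-up sample: a UTF-8 BOM followed directly by a DOCTYPE.
doc = codecs.BOM_UTF8 + b"<!DOCTYPE html><title>t</title>"

# Decoded as UTF-8 with BOM handling, the DOCTYPE comes first.
as_utf8 = doc.decode("utf-8-sig")

# Decoded as KOI8-R, the three BOM bytes become three ordinary
# characters that sit in front of the DOCTYPE.
as_koi8r = doc.decode("koi8-r")

# The same three bytes read as Windows-1252.
bom_as_cp1252 = doc[:3].decode("cp1252")

print(as_utf8.startswith("<!DOCTYPE"))       # True
print(as_koi8r.startswith("<!DOCTYPE"))      # False
print(as_koi8r[3:].startswith("<!DOCTYPE"))  # True
print(repr(bom_as_cp1252))
```

Those three leading characters are the "illegal characters before the DOCTYPE" referred to above.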
Comment 3 Leif Halvard Silli 2011-06-06 21:04:54 UTC
Created attachment 994 [details]
Polyglot file with BOM, served as 'application/xhtml+x; charset=koi8-r'

Parsers which follow the recommendation in XML 1.0 should respect the BOM and ignore the encoding information inside the Content-Type header.
Comment 4 Leif Halvard Silli 2011-06-06 21:09:57 UTC
Created attachment 995 [details]
Polyglot file with BOM served as 'text/html charset=koi8-r'
Comment 5 Leif Halvard Silli 2011-06-06 23:51:26 UTC
Test files which demonstrate the issues are available here (reading the explanation on that page is recommended): http://malform.no/testing/html5/bom/

Direct link, XML test file: http://malform.no/testing/html5/bom/xml.html
Direct link, HTML test file: http://malform.no/testing/html5/bom/htm.html
Additionally, Opera has some extra bugs (a bug in the bug): http://malform.no/testing/html5/bom/xml_(ISO-8859-1).html

All the pages have a BOM in combination with erroneous encoding info inside the HTTP Content-Type: header.
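
A minimal sketch of how such a conflict could be detected for any response, assuming you already have the Content-Type header value and the raw body bytes in hand. The helper names are hypothetical; the header parsing relies on Python's standard email.message module.

```python
import codecs
from email.message import Message

def http_charset(content_type):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def bom_conflicts_with_http(content_type, body):
    """True if the body starts with a UTF-8 BOM but HTTP names another charset."""
    charset = http_charset(content_type)
    if charset is None or not body.startswith(codecs.BOM_UTF8):
        return False
    return charset not in ("utf-8", "utf8")

body = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
print(bom_conflicts_with_http("text/html; charset=koi8-r", body))
print(bom_conflicts_with_http("text/html; charset=UTF-8", body))
```

The first call reports a conflict, the second does not. Fetching the pages themselves is left out so that the sketch stays self-contained.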
Comment 6 Leif Halvard Silli 2011-06-07 02:49:13 UTC
Related Mozilla bug: https://bugzilla.mozilla.org/show_bug.cgi?id=662458
 Related Opera bug: DSK-338772    AT-the-server      bugs.opera.com
Comment 7 Henri Sivonen 2011-06-07 05:14:24 UTC
(In reply to comment #0)
>     A. Harmonization with XML 1.0  Appendix F.2, "Priorities in the Presence of
> External Encoding Information", which recommends BOM to have higher priority
> than external encoding information:
> http://www.w3.org/TR/xml/#sec-guessing-with-ext-info  (Opera/Firefox do not yet
> implement this XML 1.0 recommendation)

I believe you are misreading the XML 1.0 spec. It says that in the HTTP case, RFC 3023 applies but for anyone specifying a new case, they recommend giving XML itself precedence. However, since the RFC applies in the HTTP case, in the HTTP case, the charset parameter on the HTTP level is authoritative.
Comment 8 Leif Halvard Silli 2011-06-07 11:53:05 UTC
(In reply to comment #7)

> I believe you are misreading the XML 1.0 spec. It says that in the HTTP case,
> RFC 3023 applies but for anyone specifying a new case, they recommend giving
> XML itself precedence. However, since the RFC applies in the HTTP case, in the
> HTTP case, the charset parameter on the HTTP level is authoritative.

(1) It is already great if we agree about the interpretation whenever HTTP is *not* used!

(2) In that regard, HTML5 tends to talk about "the higher protocol" and not specifically about HTTP.

(3) It is in the power of the HTML5 spec to specify how XHTML5 and HTML5 documents should be interpreted. Because: 

  a) the HTML5 effort (including "sister projects") appears to be redefining/refining the HTTP specs as well as HTML itself. 

  b) XML 1.0 defers it: "the preferred method of handling conflict  should be specified as part of the higher-level protocol used to deliver XML"

  c) XML 1.0 defines a recommended rule (which it probably would like to see in HTTP as well): "If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding."

But apart from what XML says, we must also look at interoperability - and the effects of Opera's and Mozilla's reading of the specifications.

  I) In Mozilla's bugzilla there are several reports about how to handle the BOM gibberish letters whenever the BOM is ignored in favor of an external protocol.

  II) Opera has implemented a very strange behaviour where it sometimes eats the BOM gibberish, so that the page does not go into quirks-mode, whereas sometimes it does not eat the BOM gibberish, leading to quirks mode. See my tests: http://malform.no/testing/html5/bom/ 

   Et cetera: Yellow Screen of Death, IE/Webkit, wrong resulting encoding. 

   I don't know if I misread Julian, but I'll also quote a message to Adam in 2009: [*]

]]
   > The algorithm tolerates leading white space, but not leading BOMs.

   Is there a particular reason why the BOM is not tolerated, given 
   <http://www.w3.org/TR/REC-xml/#sec-guessing>?
               [ snipping in Julian's message ]
   Let's ignore "correctly" for a second -- [ snipping ]
]]

[*] http://lists.w3.org/Archives/Public/public-html/2009Nov/0579
Comment 9 Leif Halvard Silli 2011-06-07 14:58:25 UTC
XML 1.0 points to RFC3023. But it is notable that RFC3023 only speaks about the UTF-16 BOM and not about the UTF-8 BOM: http://tools.ietf.org/html/rfc3023#page-15
Comment 10 Leif Halvard Silli 2011-06-09 11:27:00 UTC
More data collected - after discussion on www-international@ and implementation tests:

NOTE: Data is needed for IE9's XML parser. Assumption: behaves as Webkit (because that is how it acts for HTML)

Spec data - XML:
---

* XML 1.0 only says that Content-Type: *can* have priority (depending on what the higher protocol says) over "<?xml version="1.0" encoding="value"?>". Quote: ]] In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error [[ <http://www.w3.org/TR/xml/#charencoding> Thus it depends on the rules of the higher protocol.

Spec data - RFC3023
---

1) RFC3023 'XML Media Types' specifies that HTTP charset parameter does have priority. (Meaning that the xml parser must - legally - ignore the XML encoding declaration.) 

2) But RFC3023 actually only justifies it for 'text/xml', where *transcoding* (leading the doc to have another encoding than the one specified inside the document) and *compatibility with text/plain* are the justifications: <http://tools.ietf.org/html/rfc3023#section-3.1> 

3) For 'application/xml', RFC3023 has no real justification. The only thing it has is: "it is possible for users to configure web servers" and "the HTTP spec says so". http://tools.ietf.org/html/rfc3023#section-3.2

4) Notably, RFC3023 seriously discusses Appendix F: "Autodetection of Character Encodings (Non-Normative)". (http://www.w3.org/TR/xml/#sec-guessing)  Which (once again) under the heading "Priorities in the Presence of External Encoding Information" states:  ]] In the interests of interoperability, however, the following rule is recommended. If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding. [[  <http://www.w3.org/TR/xml/#sec-guessing-with-ext-info>

Implementation data - RFC3023:
---

* Parsers implementing RFC3023 (HTTP has priority over document data): Opera, Firefox, Amaya

** Parsers implementing RFC3023 which *also* emit a 'fatal error' if the HTTP charset and the UTF-8 BOM disagree: Opera, Firefox. (Thus: not Amaya.) Note: per XML 1.0 it is required, *if HTTP and RFC3023 require it (and they do!)*, to ignore the XML encoding declaration in favour of the HTTP charset parameter. But note that it is not permitted, per XML 1.0, to act as if the BOM does not exist, even if the doc is served via HTTP!

* Parsers *not* implementing RFC3023 (thus giving priority to document data instead), and which do not emit fatal errors: Webkit, Xerces C++, XMLMind Editor on Mac (based on Xerces Java), RXP, oXygen

** Parsers *not* implementing RFC3023 which, in case of conflict and without emitting a fatal error, adhere to the BOM and ignore the XML encoding declaration: Webkit, (IE9 must be checked)

** Parsers *not* implementing RFC3023 which, in case of conflict and without emitting a fatal error, adhere to the XML encoding declaration and ignore the BOM: XMLmind Editor for Mac, Xerces C++, oXygen, RXP

   
Implementation data - non-RFC3023 (file protocol):
---

* Parsers emitting fatal error if UTF-8 BOM conflicts with the XML encoding declaration: Opera.

* Parsers *not* emitting fatal error if UTF-8 BOM conflicts with the XML encoding declaration: Webkit, Firefox, oXygen, XMLmind XML editor for mac (based on Xerces Java), Amaya

** Parsers *not* emitting a fatal error if the UTF-8 BOM conflicts with the XML encoding declaration and which give priority to the UTF-8 BOM: Webkit, Firefox, oXygen

** Parsers *not* emitting a fatal error if the UTF-8 BOM conflicts with the XML encoding declaration and which give priority to the XML encoding declaration (and/or to the UTF-8 encoding default, if they completely jump over the UTF-8 BOM): XMLmind XML editor, RXP and (probably) Xerces C++
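
The kind of conflict tested above (UTF-8 BOM versus XML encoding declaration) can be detected with a few lines of Python. This is only a sketch under stated assumptions: the function name is hypothetical, the regular expression is a simplification of the XML declaration grammar, and only ASCII-compatible input is handled.

```python
import codecs
import re

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def bom_vs_declaration(data):
    """Return (has_utf8_bom, declared_encoding_or_None, conflict)."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    rest = data[len(codecs.BOM_UTF8):] if has_bom else data
    match = _DECL.match(rest)
    declared = match.group(1).decode("ascii").lower() if match else None
    conflict = has_bom and declared is not None and declared != "utf-8"
    return has_bom, declared, conflict

sample = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="koi8-r"?><r/>'
print(bom_vs_declaration(sample))
```

What a parser should do once such a conflict is found (fatal error, BOM wins, or declaration wins) is exactly where the implementations listed above differ.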


Implementation data - charset names:
---

* Webkit and some of the editors emit a 'fatal error' if the charset *name* in the XML encoding declaration is *unknown*. This happens even if they (for example Webkit) *otherwise* do not emit a fatal error whenever the UTF-8 BOM conflicts with the XML encoding declaration.
Comment 11 Leif Halvard Silli 2011-06-09 11:49:24 UTC
* Xerces C++ bug: https://issues.apache.org/jira/browse/XERCESC-1967
* RXP bug has been reported.
* XMLmind Editor Bug has been reported
* oXygen bug has been reported
* The XML test suite has a 'Lack of 'fatal error' tests for invalid encodings' http://lists.w3.org/Archives/Public/public-xml-testsuite/2011Jun/0000.html
* Thread on www-international:
  http://lists.w3.org/Archives/Public/www-international/2011AprJun/0079.html
Comment 12 Leif Halvard Silli 2011-06-09 12:30:19 UTC
libxml2 is a rare exception - for HTTP, it behaves as RFC3023 specifies, thus we can count libxml, Opera and Firefox:

$: xmllint http://malform.no/testing/html5/bom/xml.html
Comment 13 Leif Halvard Silli 2011-06-09 12:34:05 UTC
(In reply to comment #12)
> libxml2 is a rare exception - for HTTP, it behaves as RFC3023 specifies, thus
> we can count libxml, Opera and Firefox:
> 
> $: xmllint http://malform.no/testing/html5/bom/xml.html

However, for file:// operations, it does not do what XML 1.0 specifies: it ignores the UTF-8 BOM and obeys the XML encoding declaration.
Comment 14 Leif Halvard Silli 2011-06-09 12:52:46 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > libxml2 is a rare exception - for HTTP, it behaves as RFC3023 specifies, thus
> > we can count libxml, Opera and Firefox:
> > 
> > $: xmllint http://malform.no/testing/html5/bom/xml.html
> 
> However, for file:// operations, it does not do what XML 1.0 specifies: it
> ignores the UTF-8 BOM and obeys the XML encoding declaration.

https://bugzilla.gnome.org/show_bug.cgi?id=652185
Comment 15 Leif Halvard Silli 2011-06-17 18:44:03 UTC
Example site in the wild: http://vertikal.dk

Web site data: 

* Each page typically begins with *several* byte-order marks (sic)
* Many pages mix UTF-8 and Windows-1252 encoding, even in the
   very same *sentence*: http://vertikal.dk/foto/hitlist.htm

Browser handling data:

* Opera and Firefox allow users to override the encoding.  
   However, to *very little* benefit for the user
* Internet Explorer and Webkit do not allow the encoding to be overridden. 
* All browsers appear to be in quirks-mode
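
For anyone wanting to check other pages for the "several byte-order marks" pattern: a small sketch that counts the UTF-8 BOMs at the start of a byte string. The function name is hypothetical and the sample bytes are made up; no data from the site above is reproduced here.

```python
import codecs

def count_leading_boms(data):
    """Count how many UTF-8 BOMs appear back to back at the start of data."""
    bom = codecs.BOM_UTF8
    count = 0
    while data.startswith(bom, count * len(bom)):
        count += 1
    return count

sample = codecs.BOM_UTF8 * 3 + b"<html>"
print(count_leading_boms(sample))

# Python's "utf-8-sig" codec strips only the first BOM, so any extra ones
# survive decoding as U+FEFF characters at the start of the text.
print(repr(sample.decode("utf-8-sig")[:2]))
```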
Comment 16 Michael[tm] Smith 2011-08-04 05:16:46 UTC
mass-move component to LC1
Comment 17 Ian 'Hixie' Hickson 2011-12-02 00:41:24 UTC
Henri, what do you think the spec should say here? (If you think no change is needed, please close the bug. Thanks!)
Comment 18 Henri Sivonen 2011-12-02 08:26:26 UTC
(In reply to comment #17)
> Henri, what do you think the spec should say here? (If you think no change is
> needed, please close the bug. Thanks!)

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the tracker issue; or you may create a tracker issue
yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: The precedence of HTTP encoding declaration over the internal encoding declaration is indeed backwards in the light of Ruby's Postulate (http://intertwingly.net/slides/2004/devcon/69.html). However, the main issue is HTTP vs. meta. The BOM is mainly a theoretical sideshow. Since, for compatibility, performance, etc., we aren't changing the precedence of HTTP and meta, it's not worthwhile to tweak the precedence of the UTF-8 BOM, which in practice is a sideshow (mainly because it makes sense to configure text editors not to emit it in order to make the text editors useful for editing formats that misbehave if the UTF-8 BOM is present). When we aren't changing the precedence of HTTP and meta to give higher precedence to the value that logically has the higher probability of being right, it makes sense to fully retain the current order which is logical in another way: precedence is given to the outermost encoding indicator.
Comment 19 Leif Halvard Silli 2011-12-31 03:34:13 UTC
Comment #0 does not display its content. For the original text, see public-html:
http://lists.w3.org/Archives/Public/public-html/2011Jun/0084.html

Comment #1 was also lost. Hence I paste it in here from my e-mail copy:

2011-06-06 20:15:50 UTC ---
A minor data correction:

(In reply to comment #0)
   [ snip ]
> NOTES: 

>    (1) Browsers tested as part of this bug report: IE8, Safari, Chrome (which
> shows above described behavior) as well as Opera and Firefox (which do support
> this behavior). Other browsers, e.g. KHTML, have not been tested.

The last parenthesis should be read "(which do *not* support this behaviour)".