Bug 17810 - Make BOM trump HTTP
Status: RESOLVED FIXED
Product: WHATWG
Classification: Unclassified
Component: HTML
Priority: P3, Severity: normal
Assigned To: Ian 'Hixie' Hickson
Reported: 2012-07-18 06:53 UTC by contributor
Modified: 2012-09-16 03:56 UTC (History)
12 users

Description contributor 2012-07-18 06:53:06 UTC
This bug was cloned from bug 15359 as part of operation convergence.
Originally filed: 2011-12-29 14:20:00 +0000
Original reporter: Anne <annevk@annevk.nl>

================================================================================
 #0   Anne                                            2011-12-29 14:20:43 +0000 
--------------------------------------------------------------------------------
In Trident and WebKit the octet sequences FF FE, FE FF, and EF BB BF at the start of a resource are considered more important than anything else when it comes to determining the encoding of the resource.

http://kb.dsqq.cn/ is an example of where this creates a problem for non-WebKit/Trident implementations.

I think throughout the platform (HTML, XHR, File API, CSS) we should implement the WebKit/Trident strategy with respect to the BOM.
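[Editor's note] The WebKit/Trident strategy described here can be sketched roughly as follows. This is an illustrative Python sketch, not any browser's actual code; the windows-1252 fallback merely stands in for HTML's legacy default.

```python
def determine_encoding(first_bytes, http_charset=None):
    """BOM sniffing first: the initial bytes of the resource trump
    the HTTP charset parameter, which in turn trumps the default."""
    if first_bytes.startswith(b"\xef\xbb\xbf"):  # EF BB BF
        return "utf-8"
    if first_bytes.startswith(b"\xfe\xff"):      # FE FF
        return "utf-16be"
    if first_bytes.startswith(b"\xff\xfe"):      # FF FE
        return "utf-16le"
    return http_charset or "windows-1252"
```

With this ordering, a resource that starts with FF FE is decoded as UTF-16LE even when the HTTP charset says ISO-8859-1.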

(I created a new bug rather than reopening bug 12897 as that bug has a lot of noise.)
================================================================================
 #1   Ian 'Hixie' Hickson                             2012-02-01 00:56:57 +0000 
--------------------------------------------------------------------------------
Henri, what do you think the spec should say here? (If you think no change is
needed, please close the bug. Thanks!)
================================================================================
 #2   Henri Sivonen                                   2012-02-07 14:05:55 +0000 
--------------------------------------------------------------------------------
I think we should change the spec to require the byte sequences Anne mentions to override HTTP charset.
================================================================================
 #3   Leif Halvard Silli                              2012-02-13 12:19:14 +0000 
--------------------------------------------------------------------------------
(In reply to comment #0)

>more important than anything else

NOTE: For Webkit/Chromium/Trident, "anything else"
      includes the user's attempt to override the encoding.

That is to say: when the BOM defines the encoding, the UA's Text Encoding menu is either unavailable or has no effect (on the current page).

Please make sure that the spec requires the same.
================================================================================
 #4   theimp@iinet.net.au                             2012-07-05 03:12:07 +0000 
--------------------------------------------------------------------------------
(In reply to comment #0 & comment #2)

I personally don't think that a byte order mark (*especially* the so-called UTF-8 Byte Order Mark) should override the HTTP charset as specified. However, I am not interested in formally objecting to this idea, as I will agree that it seems (and history has shown) to be more likely that an author (or server) will incorrectly specify the HTTP charset, than that they will incorrectly add an invalid BOM (this is less true of UTF-8, where the BOM is not actually a BOM).

Note that, for example, CSS uses different mandatory rules to what is proposed here ( http://www.w3.org/TR/CSS2/syndata.html#charset ); this might cause great confusion for authors with an incomplete understanding of these rules.

(In reply to comment #3)

I think that it is an *extremely* bad idea to require that user agents must not be able to handle documents in whatever way their users have configured them to - rightly, wrongly, or otherwise.

I cannot emphasize this enough. Not all user agents are browsers, and even browsers should be configured first and foremost by their users. By all means, make it a primary heuristic; forbid even suggesting a switch and require an explicit action by the user, if need be; but if the user says, "no, do this", the browser should do so unless actual harm can be demonstrated and no challenge is offered.
================================================================================
 #5   Leif Halvard Silli                              2012-07-05 05:57:03 +0000 
--------------------------------------------------------------------------------
(In reply to comment #4)

> (In reply to comment #3)
> 
> I think that it is an *extremely* bad idea to require that user agents must not
> be able to handle documents in whatever way their users have configured them to
> - rightly, wrongly, or otherwise.

Are you dissatisfied with the way XML is specified?

XML does not allow you to override the encoding at will. For instance, when there is no encoding information (BOM, XML declaration, XML encoding declaration, HTTP) associated with an XML file, the file MUST be UTF-8 encoded. XML parsers that are serious about implementing XML thus do not permit you to override the encoding of a UTF-8 encoded XML document, because doing so would be a fatal error.

It is great if HTML gets semi-aligned with XML so that, at least when the BOM is used, it becomes impossible to override the encoding. That way, we decrease the need for users to "fix" encoding configuration problems.
================================================================================
 #6   theimp@iinet.net.au                             2012-07-05 07:44:28 +0000 
--------------------------------------------------------------------------------
The charset determination rules for XML are non-normative, except for the case you mention, where there is no BOM and no (XML) declaration and no higher-level specifier (such as an HTTP header). This bug does not discuss this scenario directly.

Even so, it is perfectly acceptable for a valid XML processor to detect a BOM, ignore it, and pick any encoding it likes, because technically, it's only required to use the BOM for the specific case of picking between UTF-8 and UTF-16, not between one of those and anything else. I could detect a UTF-16 BOM, and decide to nevertheless render it in any encoding I want *except* UTF-8, and likewise the reverse, and it would be fully compliant:

> XML processors MUST be able to use this character [U+FEFF] to differentiate between UTF-8 and UTF-16 encoded documents.

Also:

> In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration [...] containing an encoding declaration

But, there is no special elaboration as to what "external character encoding information" actually means, and it is not clear that "specific instruction from the user" could not qualify. Think of command-line parameters for batch parsers, etc.

So, a document in, really, any encoding, would not automatically be invalid XML in the specific case of a user who said "use this encoding" (again, any encoding).

Even if that is not the case, that does not change the fact that a processor processing a document with, say, a UTF-8 "BOM" *and* a text declaration specifying, say, ISO-8859-1 encoding, has no requirement to obey the BOM over the text declaration. A BOM is required if the document is UTF-16; but other encodings are not forbidden from having that (or any other) BOM, nor does XML require that the BOM be considered authoritative if present (in fact, it explicitly only recommends it).

I believe that this will typically mean that character data will appear at the start of the document, but this is only an error, not a fatal error (and for XML 1.0, maybe not necessarily even that; I don't really remember and will have to check). In fact, I think there is wiggle-room even on this point.

If this character data is then interpreted as such and emitted into the HTML, then this would of course then be an error in HTML, but that is for the HTML spec. to deal with, which it does: the spec. currently says that what would be an initial BOM should be ignored even if it is unrelated to the encoding.

Furthermore, not all representations of HTML5 will be XML-compatible anyway. I very much agree with the goal of aligning HTML5 with XML; but vendors should be left to interpret this however they want if they prefer more robust XML processing over legacy support; it should not be specified here.
================================================================================
 #7   Leif Halvard Silli                              2012-07-05 12:44:24 +0000 
--------------------------------------------------------------------------------
(In reply to comment #6)

To put it politely: You have not read the XML 1.0 spec correctly.

Section '4.3.3 Character Encoding in Entities' is NORMATIVE. [1] Whereas the section you appear to be talking about is Appendix F. [2] Appendix F only contains helpful instructions/tips for how to fulfill the requirements of section 4.3.3.

[1] http://www.w3.org/TR/REC-xml/#charencoding
[2] http://www.w3.org/TR/REC-xml/#sec-guessing

Appendix F.1 first - in the first table - discusses how to sniff the encoding when there is a BOM. This is simple: If one parses an XML document which _contains_ the BOM as something other than UTF-16, UTF-8 or UTF-32 (UCS-4), then the BOM is not a BOM but an illegal character = fatal error.

Then - in the second table - it discusses how to sniff it when there is no BOM. This is also simple: Except for UTF-8, it is impossible, per the rules that XML operates with.

And therefore, in the second table of Appendix F.1, each row (except the last row about UTF-8!) ends roughly the same way: "The encoding declaration must be read to determine" (the encoding). And if there is no encoding declaration, then it is a fatal error, per section 4.3.3. Section 4.3.3 is also clear about the fact that if there is an external (typically HTTP) or internal encoding declaration, and this declaration turns out to be incorrect, then it is a fatal error. Lack of encoding information is also considered as a signal that the document is UTF-8 encoded.

The effect of all this is that, in XML, it should always cause a fatal error if you try to override the encoding. Firefox is probably one of the XML parsers that _best_ reflects XML's encoding rules. So if you are in doubt, I suggest that you do some experiments for yourself.
================================================================================
 #8   Leif Halvard Silli                              2012-07-05 12:59:55 +0000 
--------------------------------------------------------------------------------
(In reply to comment #7)
> (In reply to comment #6)

> Appendix F.1 first - in the first table - discusses how to sniff the encoding
> when there is a BOM. This is simple: If one parses an XML document which
> _contains_ the BOM as something other than UTF-16, UTF-8 or UTF-32 (UCS-4),
> then the BOM is not a BOM but an illegal character = fatal error.

By the way: In that case it is an illegal character per HTML5 as well: A UTF-8 document with a BOM would bring the browser into Quirks-Mode if the browser reads the document as - for example - ISO-8859-1.

So even if we look squarely at HTML5, it makes no sense to permit the encoding to be overridden whenever there is a BOM: To permit the encoding to be overridden when there is a BOM would be like permitting users to shoot themselves in the foot.
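[Editor's note] The illegal-character situation described above is easy to reproduce; a minimal Python illustration: misreading a UTF-8 BOM as ISO-8859-1 yields three visible junk characters at the very start of the page instead of a (stripped) BOM.

```python
# The UTF-8 BOM bytes, read as ISO-8859-1, become the visible
# characters "ï»¿" rather than being recognized and stripped.
bom = b"\xef\xbb\xbf"
print(bom.decode("iso-8859-1"))  # ï»¿
```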

PS: I now consider this subject to be debated to death. I am not going to give any more explanations. And if I give any more replies, then they will be short and to the point.
================================================================================
 #9   theimp@iinet.net.au                             2012-07-05 16:05:00 +0000 
--------------------------------------------------------------------------------
Firstly, as I said, this bug covers only where there is a BOM, not where there is neither a BOM nor an encoding declaration nor a header.

Secondly, the bug suggests ignoring headers when there is a BOM present, but the XML spec. *specifically* says that "external character encoding information" can be used to determine the encoding.

> 4.3.3 In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration

So if I have:

0xFE 0xFF <?xml encoding="ISO-8859-1"?>

It is a fatal error to decode it as UTF-16. Sure, this causes other problems, but not necessarily fatal errors (at least in XML 1.0).
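[Editor's note] The conflict in this example can be shown directly. A Python illustration follows; the declaration is completed with the version attribute, which the shorthand above omits.

```python
data = b'\xfe\xff<?xml version="1.0" encoding="ISO-8859-1"?>'

# Honouring the declaration: the two BOM bytes decode as the
# Latin-1 characters U+00FE U+00FF in front of the declaration.
assert data.decode("iso-8859-1").startswith("\u00fe\u00ff<?xml")

# Honouring the BOM instead: the first code unit is U+FEFF, and the
# next byte pair ("<?") fuses into the single character U+3C3F.
assert data[:2].decode("utf-16-be") == "\ufeff"
assert data[2:4].decode("utf-16-be") == "\u3c3f"
```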

Your arguments over using the UI to configure the charset are not within the original scope of this bug.

Thirdly, as I said, not all HTML will be XML.

Those are alone enough reason not to make the proposed behavior mandatory for all user agents in all cases.

> Firefox is probably one of the XML parsers that _best_ reflects XML's encoding rules. So if you are in doubt, I suggest that you do some experiments for yourself.

Make it a recommendation, I do not care. I am not saying that browsers must allow changes; I'm saying they should not be constrained from allowing them. If they think that there is no value in allowing users to change encodings at will, I don't see that as being a problem. But it should not be part of the spec. If you were going to advocate requiring total compliance with XML in all circumstances, I would be sympathetic; but apparently that is not what is desired for HTML5.

Furthermore:

> By the way: In that case it is an illegal character per HTML5 as well: A UTF-8
> document with a BOM would bring the browser into Quirks-Mode if the
> browser reads the document as - for example - ISO-8859-1.

Yes, it would typically trigger Quirks mode (except in some, perhaps only theoretical, encodings). That's not a fatal error though.

> Section '4.3.3 Character Encoding in Entities' is NORMATIVE.

Yes.

> Whereas the section you talk appear to be talking about, is Appendix F. [2] Appendix F only contains helpful instructions/tips for how to fulfill the requirements of section 4.3.3.

No, I only mention Appendix F once, to say that it is non-normative. Every part of the spec. that I actually quoted is from Section 4.3.3.

The following:

> XML processors MUST be able to use this character [U+FEFF] to differentiate between UTF-8 and UTF-16 encoded documents.

and:

> In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration [...] containing an encoding declaration

Are in Section 4.3.3.

It's really great that you are passionate about XML, but you must be careful only to read what the spec. actually says.

Let's take the example of an ISO-8859-1 document with a UTF-16 BOM. In section 4.3.3:

> It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process.

Not a problem, the browser supports ISO-8859-1.

> It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding.

Not a problem, both 0xFE and 0xFF are legal encoding sequences in ISO-8859-1.

> Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode.

Not relevant.

> Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Not a problem, "higher-level protocol" could include "user configuration", since this term is not defined anywhere. Even if this is not the case, there is still the scenario where there is both a BOM and an encoding declaration - it does not say that the BOM should trump the encoding declaration.

Anything else is at worst an error, not a fatal error.
================================================================================
 #10  Leif Halvard Silli                              2012-07-06 02:30:39 +0000 
--------------------------------------------------------------------------------
(In reply to comment #9)
> Firstly, as I said, this bug covers only where there is a BOM, not where there
> is neither a BOM nor an encoding declaration nor a header.

BOM is a side track. My question was: Do you dislike the "user experience" of XML when it comes to its prohibition of manual encoding overriding? Because XML's user experience is the same regardless of whether there is a BOM or not. If you can live with XML's _general_ prohibition of manual encoding overriding, then I don't see why you can't also live with the same strict rule for the _specific_ subset of HTML when there is a BOM.

> Secondly, the bug suggests ignoring headers when there is a BOM present, but
> the XML spec. *specifically* says that "external character encoding
> information" can be used to determine the encoding.

Correct. It is currently also against the HTTP specs.
 
> So if I have:
> 
> 0xFE 0xFF <?xml encoding="ISO-8859-1"?>

1) This XML declaration is invalid as it lacks the version attribute.
2) There are two characters, 0xFE 0xFF, in front of the declaration. 

Note regarding 2): As I have tried to say before, 4.3.3 specs that:

     "It is a fatal error for a TextDecl to occur other than at the 
      beginning of an external entity."
 
> It is a fatal error to decode it as UTF-16. Sure, this causes other problems,
> but not necessarily fatal errors (at least in XML 1.0).

Wrong. See my 'Note regarding 2)' above.

> Your arguments over using the UI to configure the charset are not within the
> original scope of this bug.

This bug requests the behaviour of IE and Webkit to be standardized. And IE and Webkit do prohibit manual overriding. Meanwhile, the bug requester has since written the 'Encoding' standard, in which he states: [*]

   "For compatibility with deployed content, the byte order mark (also
    known as BOM) is considered more authoritative than anything else."

[*] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#decode-and-encode
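[Editor's note] The quoted rule can be sketched as a decode step in which a BOM both selects the encoding and is stripped. This is a simplified Python sketch; the actual 'Encoding' standard algorithm also handles error modes and label mapping.

```python
def decode(data, fallback):
    """A BOM, if present, is more authoritative than the fallback
    (e.g. the HTTP charset) and is removed from the stream."""
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le")
    return data.decode(fallback)
```

So `decode(b"\xef\xbb\xbfhi", "iso-8859-1")` returns `"hi"`, not `"ï»¿hi"`.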
 
> Thirdly, as I said, not all HTML will be XML.

There is benefit - security wise and practical - in disallowing users
to "disturb" the encoding regardless of whether the page is XML or HTML.

>> By the way: In that case it is an illegal character per HTML5 as well: A UTF-8
>> document with a BOM would bring the browser into Quirks-Mode if the
>> browser reads the document as - for example - ISO-8859-1.
> 
> Yes, it would typically trigger Quirks mode (except in some, perhaps only
> theoretical, encodings). That's not a fatal error though.

So, now you are offering me at least one use case: To allow users to
place the page in quirks-mode. Frankly: I dismiss that use case.

>  but you must be careful only to read what the spec. actually says.

A hilarious comment. But excellent advice.
================================================================================
 #11  theimp@iinet.net.au                             2012-07-06 02:45:36 +0000 
--------------------------------------------------------------------------------
Firstly, sorry that I am not explaining myself well. I will try to be more careful.

> 1) This XML declaration is invalid as it lacks the version attribute.

It was just an example that I didn't spell out properly. Sorry, I should have put a bit more thought into it.

> 2) There are two characters, 0xFE 0xFF, in front of the declaration. 
> Wrong. See my 'Note regarding 2)' above.

Yes, I see my phrasing mistake now, you're right *that it is an error*.

Note, however, that it is *also* a fatal error to treat it as UTF-16 because of the BOM (say, if the parser wants to go on and see what other errors it finds, it should use the encoding specified).

What I meant was, irrespective of whether the documents are well-formed, obeying the BOM is, without question, WRONG in that case.

> For compatibility with deployed content, the byte order mark (also known as BOM) is considered more authoritative than anything else.

Too bad; it's wrong:

http://203.59.75.251/Bug15359

A simple testcase has been done, and [latest release versions of] all major browsers currently fail XML compliance due to this proposed handling of the BOM (some non-browser XML processors get this right, though).

Now, my position is unreservedly that, for compatibility with XML, the BOM must not be specified as overriding all other considerations in all cases. [The proposal for this Bug]

As for overriding the HTTP Content-Type parameter specifically, or user selection generally, my position is unchanged, for the reasons already given.

In particular, the spec. should remain silent on the subject of users configuring their user agent to apply certain encodings to certain documents. How this may impact XML (in terms of whether this would be valid in a particular case) is unrelated to how it should impact (X)HTML5; to whatever degree that it is already specified in the XML spec., leave it at that.

Sometimes, users simply have to debug misdetected/misspecified encodings; the fact that I've just demonstrated a new encoding-related misbehavior is proof of that.
================================================================================
 #12  theimp@iinet.net.au                             2012-07-06 04:27:41 +0000 
--------------------------------------------------------------------------------
> BOM is a side track. My question was: Do you dislike the "user experience" of XML when it comes to its prohibition of manual encoding overriding? Because XML's user experience is the same regardless of whether there is a BOM or not. If you can live with XML's _general_ prohibition of manual encoding overriding, then I don't see why you can't also live with the same strict rule for the _specific_ subset of HTML when there is a BOM.

For strict, valid, well-formed, XML, served with an explicit XML Content-Type, then no, I have no problem with the idea.

My problem is applying those rules to billions of pages that are *not* strict, valid, well-formed, XML, served with an explicit XML Content-Type.

The difference being, that practically no strict, valid, well-formed, XML, served with an explicit XML Content-Type, will have an incorrect BOM plus a contradictory charset indicator of any other kind. The same cannot be said for other web content (if it could, I would probably accept that too).

For this reason, as far as the user-configured option is concerned, consistency is better served by having HTML5 say nothing, and letting the user agent forbid it in the circumstances where they think that appropriate (such as XML).

> Correct. It is currently also against the HTTP specs.

I'm very sorry, but I did a lot of research and I can't find anything that says this. Is this a current spec. or a draft spec.?

See, RFC 2616 says:
"HTTP/1.1 recipients MUST respect the charset label provided by the sender"
"user agents [...] MUST use the charset from the content-type field if they support that charset"

Meanwhile, RFC 3023 keeps saying, over and over:
"[the HTTP Content-Type] charset parameter is authoritative"

I know that no-one here seems to like any specs. other than this one that they're writing, but I just don't see any way that this is not just another "willful violation".

> So, now you are offering me at least one use case: To allow users to place the page in quirks-mode. Frankly: I dismiss that use case.

Better than getting unreadable garbage because someone specified an incorrect BOM/charset combination on some 10-year-old document.
================================================================================
 #13  Leif Halvard Silli                              2012-07-06 04:32:35 +0000 
--------------------------------------------------------------------------------
(In reply to comment #11)

> http://203.59.75.251/Bug15359
> 
> A simple testcase has been done, and [latest release versions of] all major
> browsers currently fail XML compliance due to this proposed handling of the BOM
> (some non-browser XML processors get this right, though).

Their behaviour *would have been* correct, if we changed XML to say this:

]] In the absence of information provided by an external transport protocol
   (e.g. HTTP or MIME) <INS> or a byte order mark</INS>,
   it is a fatal error for an entity including an encoding declaration to
   be presented to the XML processor in an encoding other than that named
   in the declaration, [[

As both HTTP and the BOM are "external" to the markup, such a change would make sense.

> Now, my position is unreservedly that, for compatibility with XML, the BOM must
> not be specified as overriding all other considerations in all cases. [The
> proposal for this Bug]

There are many aspects of "compatibility with XML". The most important
aspect is UTF-8, itself. Problem is: HTML defaults to Windows-1252. XML
defaults to UTF-8. This means that, occasionally, the HTML page can
end up with an encoding - via default or by manual overriding - that
differs from the author's intended encoding.

The second important aspect of compatibility with XML is the fact that it's
impossible to override the encoding of an XML document.

We can have both of these benefits in HTML too, if only one uses the BOM. This benefit, however, comes at the expense of the HTTP charset: The BOM must be allowed to override the HTTP charset. This is a price worth paying. Encodings are an evil. We should try to remove their importance as much as possible.
 
> As for overriding the HTTP Content-Type parameter specifically, or user
> selection generally, my position is unchanged, for the reasons already given.

I don't understand your reasons. You are CONTRA that the BOM overrides the HTTP charset. But you are PRO that the user can override the BOM. I see no benefit in that standpoint. I only see pessimism about the need for users to override encodings.

NOTE: One reason that the BOM should override HTTP is that the BOM is likely to be more correct. (Plus, Webkit and IE already behave like that.) If all browsers implement IE and Webkit's behaviour, the encoding errors should not occur, and thus the user will have no need to override the encoding.
 
> In particular, the spec. should remain silent on the subject of users
> configuring their user agent to apply certain encodings to certain documents.
> How this may impact XML (in terms of whether this would be valid in a
> particular case) is unrelated to how it should impact (X)HTML5; to whatever
> degree that it is already specified in the XML spec., leave it at that.
> 
> Sometimes, users simply have to debug misdetected/misspecified encodings; the
> fact that I've just demonstrated a new encoding-related misbehavior is proof of
> that.

You have documented a discrepancy between what browsers do and what XML
specifies. You have not documented that what the browsers do leads to
any problems. For instance, the test page you created above works just
fine. You have not even expressed any wish to override their encoding. 
So, I'm sorry, but the page you made does not demonstrate what you
claim it to demonstrate.
================================================================================
 #14  Leif Halvard Silli                              2012-07-06 04:44:25 +0000 
--------------------------------------------------------------------------------
(In reply to comment #12)
>> BOM is a side track. My question was: Do you dislike the "user
>> experience" of XML when it comes to its prohibition of manual
>> encoding overriding? [ ... ]
> 
> For strict, valid, well-formed, XML, served with an explicit XML Content-Type,
> then no, I have no problem with the idea.
> 
> My problem is applying those rules to billions of pages that are *not* strict,
> valid, well-formed, XML, served with an explicit XML Content-Type.

If all browsers implement the IE/Webkit behaviour, then there is no
problem. If you know that it is a problem, then you should provide 
evidence thereof - for instance by pointing to a page that gets
broken if this behaviour is implemented.
 
> > Correct. It is currently also against the HTTP specs.
> 
> I'm very sorry, but I did a lot of research and I can't find anything that says
> this. Is this a current spec. or a draft spec.?

It is common sense - and specified - that HTTP's charset parameter should
trump anything inside the document. This goes for HTML and for XML.
_THAT_ is the controversial side of this bug: This bug asks for the BOM to
override the HTTP charset parameter. (The point I have been making, that
the BOM should also 'override manual user overriding', is part of the same
thing - I just wanted to be sure that everyone got that.)
 
> I know that no-one here seems to like any specs. other than this one that
> they're writing, but I just don't see any way that this not just another
> "willful violation".

Exactly. That is what it is.
 
>> So, now you are offering me at least one use case: To allow users to
>> place the page in quirks-mode. Frankly: I dismiss that use case.
> 
> Better than getting unreadable garbage because someone specified an incorrect
> BOM/charset combination on some 10-year-old document.

You are welcome to demonstrate that it is an actual problem.
================================================================================
 #15  Leif Halvard Silli                              2012-07-06 05:29:14 +0000 
--------------------------------------------------------------------------------
(In reply to comment #14)

> > I know that no-one here seems to like any specs. other than this one that
> > they're writing, but I just don't see any way that this not just another
> > "willful violation".
> 
> Exactly. That is what it is.

But note that disallowing users from overriding the encoding breaks no spec.
================================================================================
 #16  theimp@iinet.net.au                             2012-07-06 13:41:01 +0000 
--------------------------------------------------------------------------------
> Their behaviour *would have been* correct, if we changed [something else] to say this:

That basically summarizes every problem with everything, ever.

> The second important aspect of compatibility with XML is the fact that it's impossible to override the encoding of an XML document.

Not always. In fact, never; it is just that you usually cannot do so without also generating a fatal error.

Developers, for example, might want to do this even if it generates an error.

Also, beyond browsers, fatal errors might not be so total. For example, a graphical editor such as Amaya should still be able to load the document, exposing the source text (which the processor may still do after a fatal error). Doing so requires that it detect an encoding, and as that could be wrong in respect of the intended content, the author must be able to override it (especially if the editor means to fix the incorrect BOM!).

> We can have both of these benefits in HTML too, if only one uses the BOM. This benefit, however, comes at the expense of the HTTP charset: The BOM must be allowed to override the HTTP charset. This is a price worth paying. Encodings are an evil. We should try to remove their importance as much as possible.

I am sympathetic to your ends, but not your means.

See also: http://xkcd.com/927/

> I don't understand your reasons. You are CONTRA that the BOM overrides the HTTP
charset. But you are PRO that the user can override the BOM.

I'm PRO that the user can override just about anything. The web and the software exists for the user, not the other way around.

As soon as the user changes any configuration option - including installing an extension - all bets are off, and the spec. should not have to - or try to - account for such scenarios.

> You have documented a discrepancy between what browsers do and what XML specifies.

That should be enough.

> You have not documented that what the browsers do lead to any problems.

It will lead to problems when this happens: some browser - maybe a current one, maybe a new one - begins to obey the spec., which they can do at any time. Then you're right back to different renderings again. The whole point of HTML5 being so meticulous about every detail is because of the past problems where one browser does something wrong, and another browser decides to fix it, and then the whole web is split in half (or worse).

> For instance, the test page you created above, works just fine.

It "works" with current major browsers. Not all user agents are browsers. For example, many validators treat it according to the spec., which means a fatal error.

The following are some examples of online validators that (correctly) determine the example to be invalid:

 http://validator.nu/ and http://validator.w3.org/nu/
 http://validator.aborla.net/

This validator detects the error, but does not consider it fatal:

 http://www.validome.org/

Also, at least some older versions of some browsers do produce an error on this page. Specifically, Epiphany 2.30.6 (still the default install on the latest release of Debian, at this time). And since this particular version is a Gecko-based browser, and Firefox has a significant population who upgrade very slowly, it is possible that this might add up to a lot of browsers. I might try to test this further.

Also, it doesn't "work just fine", because in this case, I, the author, expected something different. This is an example of how browsers violate the XML spec.; "works just fine" would mean that it does what I said, which is what the spec. says to do, because that's what the page was very specifically coded to do, with the explicit intention of causing a fatal error for testing/demonstration purposes. This is the problem with trying to second-guess what you are explicitly told, generally.

> You have not even expressed any wish to override their encoding.

That is a different argument, and not in the original scope of this bug. You were the one to first mention how important compliance with the XML spec. is; now that I have shown that, in fact, it is the action of this bug which is non-compliant, you want to ignore that and argue about user overrides instead.

> So, I'm sorry, but the page you made does not demonstrate what you claim it to demonstrate.

It demonstrates exactly what I claim it to; and no more. I only claim it to demonstrate that this bug, as originally filed, which says that the BOM should be "considered more important than anything else when it comes to determining the encoding of the resource", is incompatible with XML.
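For reference, the IE/WebKit behaviour under debate - a leading BOM trumping any HTTP-supplied charset - can be sketched roughly as follows. This is a hypothetical illustration, not code from any browser or from the spec; the `windows-1252` fallback default is an assumption for the sketch.

```python
def sniff_encoding(data, http_charset=None):
    """Return the encoding to use, letting a leading BOM trump HTTP."""
    # The three BOM octet sequences named in this bug's description.
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    # No BOM: fall back to the HTTP charset, if any was given.
    if http_charset:
        return http_charset
    # Assumed legacy default for this sketch.
    return 'windows-1252'
```

Under this model, a resource served as `Content-Type: text/html; charset=iso-8859-1` but beginning with EF BB BF is nonetheless decoded as UTF-8, which is exactly the XML-incompatibility being argued about here.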

> If all browsers implement the IE/Webkit behaviour, then there is no problem. If you know that it is a problem, then you should provide evidence thereof - for instance by pointing to a page that gets broken if this behaviour is implemented.

I haven't seen much recently, I'll have a look and post what I find.

> You are welcome to demonstrate that it is an actual problem.

Try those validators. Though I expect that won't satisfy you.

> But note that disallowing users from overriding the encoding breaks no spec.

And it would also break no spec. for me to write a browser plugin that lets me configure the character encoding. Your proposal doesn't actually solve anything if you expect perfect, unconditional control over something that a user may want to change, for whatever reason that suits them. All it does is shift it down the line a bit (moving from browser developer -> plugin developer).

It is just beyond the scope of the HTML specification to command the browser to this extent.

Now, since the standards of evidence are so high, I would like for you to demonstrate to me the problem that you claim: where is the proof that users receive documents with BOMs, yet nevertheless cause havoc by manually changing the encoding settings of their browser without knowing exactly what they are doing, or at least understanding that doing so makes only them responsible for "shooting themselves in the foot".

In fact, it seems to me that the conditions in which users would change the encoding manually can only be cases where there is an encoding detection problem, implying that either such documents must exist, or else that it is never a problem that would actually occur. Except of course for idle curiosity. That is, nothing that gives basis for direction in this spec.
================================================================================
 #17  Leif Halvard Silli                              2012-07-06 20:03:09 +0000 
--------------------------------------------------------------------------------
(In reply to comment #16)
> I'm PRO that the user can override just about anything. The web and the
> software exists for the user, not the other way around.

The 1st sentence doesn't always follow from the 2nd. And for some reason you are not PRO the same thing when it comes to XML. I wonder why.
 
> As soon as the user changes any configuration option - including 
> installing an extension - all bets are off, and the spec.  should
> not have to - or try to - account for such scenarios.

A plug-in or config that ignores/extends a spec does ignore/extend that spec.

WRT your test file <http://203.59.75.251/Bug15359>:
>> You have not documented that what the browsers do lead to any problems.
> 
> It will lead to problems when this happens: some browser - maybe a 
> current one, maybe a new one - begins to obey the spec., [...]

Specs, UAs, tools, authors etc need to stretch towards perfect unity. When there are discrepancies, the question is who needs to change.

>> For instance, the test page you created above, works just fine.
> 
> It "works" with current major browsers.

It even works with the highly regarded xmllint (aka libxml2).

> Not all user agents are browsers. For example, many validators
> treat it according to the spec., which means a fatal error.

Validators are a special hybrid of user agents and authoring tools which of course need to be strict.

> Also, at least some older versions of some browsers do produce an 
> error on this page. Specifically, Epiphany 2.30.6 [...] a
> Gecko-based browser

Firefox 1.0 does not have that issue.
 
> Also, it doesn't "work just fine", because in this case, I, the author,
> expected something different.

Something should adjust - the XML spec or the UAs. Making it an 'error' as opposed to a 'fatal error' is my proposal. The result of an 'error' is undefined.

>> You have not even expressed any wish to override their encoding.
> 
> That is a different argument, and not in the original scope of this bug. You
> were the one to first mention how important compliance with the XML 
> spec. is;

Your page contains encoding-related errors, and your argument in favor of user override was to help the user to (temporarily) fix errors. So it perplexes me to hear that this is a "different argument".

But if that is how you think, then I see no point in discussing your page any more - which is why this is my last comment regarding that page.

> where is the proof that users receive documents with BOMs, yet nevertheless cause havoc
> by manually changing the encoding settings of their browser without knowing exactly 
> what they are doing, or at least understanding that doing so makes only them responsible for
> "shooting themselves in the foot".

So many layers of if … As an author, I sometimes want to prevent users from overriding the encoding regardless of how conscious they are about violating my intent.

There is evidence that users do override the encoding, manually. And that this causes problems, especially in forms. To solve that problem, the Ruby community has developed their own "BOM snowman": <http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen> The HTML5 spec likewise says that documents, and especially forms, are not guaranteed to work unless the page is UTF-8 encoded.
================================================================================
 #18  theimp@iinet.net.au                             2012-07-08 23:46:13 +0000 
--------------------------------------------------------------------------------
(In reply to comment #17)
> And for some reason you are not PRO the same thing when it comes to XML. I wonder why.

It is a compromise position. This is a specification, not a religion; compromise is good. Properly formed XHTML served as XHTML to a bug-free parser would almost never have any need to have the encoding changed by the user. The few cases where it would be useful would probably be better handled with specialty software that does not need to be interested in parsing according to the spec. So, I for one would be prepared to compromise.

This talk about the "user experience" is silly. I would postulate the following, offering no evidence whatsoever:

1) In 99.9% of cases, users who see garbage simply close the page. They know nothing about encodings or how to override them, and care to know even less.
2) Of the other 0.1%, in 99.9% of cases the user changes the encoding of the page only to view it, not to submit any form.
3) Of the other 0.1% of 0.1%, in 99.9% of cases the submitted forms contain only AT-standard-keyboard-characters with ASCII codepoints which are identical to the codepoints of the corresponding UTF-8 text.

Agree/disagree?
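The third point above rests on a checkable fact: ASCII-only text produces identical bytes whether it is encoded as UTF-8 or as ISO-8859-1, so such form submissions are unaffected by an encoding override. A quick illustration (hypothetical form data, not from the thread):

```python
# ASCII-only form data yields the same bytes under UTF-8 and ISO-8859-1,
# so overriding the page encoding is harmless for such submissions.
form_value = "name=John+Smith&city=Perth"  # standard-keyboard / ASCII only

utf8_bytes = form_value.encode('utf-8')
latin1_bytes = form_value.encode('iso-8859-1')
assert utf8_bytes == latin1_bytes  # identical on the wire
```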

> A plug-in or config that ignores/extends a spec does ignore/extend that spec.

The point is, you can never get the assurances you seem to argue are essential.

> Specs, UAs, tools, authors etc need to stretch towards perfect unity.

Finally, something we are in total agreement over.

> When there are discrepancies, the question is who needs to change.

Of course. Unfortunately, more and more it seems to be answered with "anyone or everyone, except us!"

> It even works with the highly regarded xmllint (aka libxml2).

Great, more broken software. How is this helpful to your position? I certainly do not regard it highly now that I know it parses incorrectly.

> Validators are a special hybrid of user agents and authoring tools which of course need to be strict.

The whole point of XML is that *all* processors process it the same way! There is nothing special about a validator in that respect.

It is clear that you do not care one bit about the XML spec. - it is an argument for you to use when it suits you, and to discard when it goes against you.

> Something should adjust - the XML spec or the UAs.

Either. But if the user agents are to stay the same (in respect of XML), it should be because the XML spec. changes - not because the HTML5 spec. changes unilaterally.

The internet is bigger than the web, and the web is bigger than HTML, and HTML is bigger than the major browsers. Outright layer violations, marginalization of national laws, breaking of backwards compatibility; you name it, HTML5 has it. I sincerely hope that these attempts to re-architect the internet from the top down are, in time, fondly remembered for their reckless pursuit of perfection that made everything much better in the end, and not poorly remembered for their hubris that set back standards adoption more than the Browser Wars.

> Making it an 'error' as opposed to a 'fatal error' is my proposal. The result of an 'error' is undefined.

That would be VERY BAD. It would mean that there is more than one valid way to parse the same document.

Your previous suggestion was much better:
> <INS> or a byte order mark</INS>

I would even be prepared to support such a suggestion, if there was no indication that the current wording is deliberate/relied upon (which I suspect - note how the current reading is fundamentally the same as the CSS rules, for example. Not likely to be a coincidence).

> Your page contains encoding-related errors, and your argument in favor of user override was to help the user to (temporarily) fix errors. So it perplexes me to hear that this is a "different argument".

My goodness me, that page was meant to demonstrate exactly one point, not every argument I made in this bug. Sheesh.

Okay, here we will do a little thought experiment.

Imagine a page like my example.

Now imagine that, like a large number of web pages, it is fed content from a database. I *know* that the content in the database is proper ISO-8859-1.

Imagine that this text also happens to coincidentally be valid UTF-8 sequences, which is very possible (if this is too hard to imagine, let me know and I'll update the page). Or more likely, that the page is not served as XHTML, which is even easier to imagine.
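Such coincidences are easy to construct: any ISO-8859-1 byte sequence whose non-ASCII bytes happen to form valid UTF-8 will decode cleanly under both encodings, with different results. A small illustration with hypothetical data (not the thread's actual test page):

```python
# 0xC3 0xA9 is the two characters "Ã©" in ISO-8859-1 but the single
# character "é" in UTF-8, so the same bytes decode cleanly - and
# differently - under both encodings.
raw = b'Caf\xc3\xa9'                 # bytes as stored in the database

as_latin1 = raw.decode('iso-8859-1') # 'CafÃ©' - the ISO-8859-1 reading
as_utf8 = raw.decode('utf-8')        # 'Café'  - what a BOM-forced UTF-8 page shows
assert as_latin1 != as_utf8          # same bytes, two valid decodings
```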

Imagine that I get complaints from my users, that the page is full of garbage.

I check the page, and they're right!

I click "View Source", and see the garbage. I don't see the BOM, because it's invisible when the software thinks it's a BOM. I do see my encoding declaration, though, and to me it looks like it's at the top of the page (because I can't see the BOM). Why didn't that work? Or did it, and the problem is somewhere else?

I open the XHTML file in an editor, and it looks the same, and for the same reason. I don't see the database text, of course, just code.

Eventually I notice that the browser says the page is UTF-8. So I try to change it to what I know the content actually is. But it doesn't work! Funny... that used to work. Must be a browser bug. Why else would the menu even be there?

** Note: If your browser lets you change the encoding, you can stop here. Otherwise, continue to the end, then start using WireShark.

In all probability, this theoretical author acquired the XHTML file from somewhere else complete with BOM and just searched for "how do i make a xhtml file iso-8859-1 plz". That, or they had some overreaching editor that decided that it would insert a UTF-8 BOM in front of all ASCII files in the name of promoting UTF-8. Or maybe the editor used very poor terminology and used the word "Unicode" instead of the actual encoding name in the save dialog, making the author - who knows basically nothing about encodings - think that this simply implies that it will work on the web ("all browsers support Unicode!"). Or any number of possible scenarios.

Consider that, having explained what the author observed above, they would probably be getting advice along the lines of "try setting the http-content-type header lol!!1".

Actually, they'd probably be told - it's your database. For as long as they're looking there, they'll never find the problem.

Getting this far is probably already the limit of what the vast majority of authors can manage. How does this author, who just wants their damn blog to work with their old database, debug this problem further?

Exercise: If some records in my database coincidentally don't have any extended characters (ie. are equivalent to ASCII and therefore UTF-8), meaning that some pages work and some don't, how am I (and anyone I ask) *more likely* to think that the problem is with the static page, rather than the database?

Bonus points: Why does the CSS file, from the same source/edited with the same editor, work exactly the way they expect? ie.

0xEF 0xBB 0xBF @charset "ISO-8859-1";
               body::before{content:"CSS treated as encoded in ISO-8859-1!"}
               /* Not by all browsers; just browsers which obey the spec. */

Triple Word Score: What if the BOM was actually added by, say, the webserver (or the FTP server that got it onto the webserver):
http://publib.boulder.ibm.com/infocenter/zos/v1r12/topic/com.ibm.zos.r12.halz001/unicodef.htm
Or a transformative cache? Or a proxy?

> As an author, I sometimes want to prevent users from overriding the encoding regardless of how conscious they are about violating my intent.

As a user, I sometimes want to override the encoding regardless of how conscious the author is about violating my intent.

What makes your argument more valid than mine?

Hint: It's not:
http://www.w3.org/TR/html-design-principles/#priority-of-constituencies
"costs or difficulties to the user should be given more weight than costs to authors"

There are more users than there are authors, and there always will be.

> There is evidence that users do override the encoding, manually.

Of course there is! Because they have to! Why do you think browsers allowed users to change the encoding in the first place?

Where is the evidence that this happens for pages WITH A BOM.

> And that this causes problems, especially in forms.

Not so much problems for the user. As for authors, read on.

> To solve that problem, the Ruby community has developed their own "BOM snowman":

Notes about that article:

1) It makes no mention of such pages containing UTF-8 or UTF-16 BOMs. There is no indication that the proposed resolution of this bug would solve that problem at all, because in all likelihood those pages do not have BOMs.

2) It admits that the problem begins when authors start serving garbage to their users.

3) It cannot be fixed by the proposed resolution of this bug. Everything from legacy clients to active attackers can still send them garbage, and they will still serve it right back out to everyone. But even if you could magically stop all submission of badly-encoded data, that does not change the users' need to change the encoding for all of their pages that have already been polluted. The *real problem* is that the authors didn't sanitize their inputs (accepting whatever garbage they receive without validation), nor did they sanitize their outputs (sending whatever garbage they have without conversion). This kind of fix would just give authors/developers an excuse to be lazy, at the expense of the users.

Note also that authors *still* can't depend upon the BOM to solve ANY problem they have, because again, it might be either added or stripped by editors, filesystem drivers, backup agents, ftp clients or servers, web servers, transformative caches (language translation, archival, mobile speedup), proxies, CDNs, filters, or anything else that might come in between the coder and the browser user.

Consider:

http://schneegans.de/
http://webcache.googleusercontent.com/search?q=cache:kKjAj-u4s6IJ:http://schneegans.de/

Notice how the Google Cache version has stripped the BOM?

If this is just about forms, then how is this for a compromise:

"When the document contains a UTF-8 or UTF-16 BOM, all forms are to be submitted in UTF-8 or UTF-16, respectively, regardless of the current encoding of the page. However, authors must not rely on this behavior."

Would that satisfy you?
================================================================================
 #19  Leif Halvard Silli                              2012-07-09 01:04:03 +0000 
--------------------------------------------------------------------------------
(In reply to comment #18)

Even though we continue to talk past each other - and to disagree - I'm not going to debate this any more in this "forum". We have made ourselves reasonably clear. Unless I see sudden light/darkness, I am instead going to await the Editor's decision before I contact you off-bugzilla about possibly discussing this in some other forum.
================================================================================
Comment 1 Ian 'Hixie' Hickson 2012-09-16 03:47:12 UTC
Points #5 to #19 seem to be about XML so I have ignored them.

Points #0 and #2 are pretty unambiguous about UA requirements, so I have done that.
Comment 2 contributor 2012-09-16 03:56:00 UTC
Checked in as WHATWG revision r7360.
Check-in comment: Make a BOM override HTTP headers.
http://html5.org/tools/web-apps-tracker?from=7359&to=7360