Bug 15359 - Make BOM trump HTTP
Status: RESOLVED FIXED
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assigned To: Silvia Pfeiffer
QA Contact: HTML WG Bugzilla archive list
Reported: 2011-12-29 14:20 UTC by Anne
Modified: 2013-03-18 01:46 UTC (History)
16 users

Description Anne 2011-12-29 14:20:43 UTC
In Trident and WebKit the octet sequences FF FE, FE FF, and EF, BB, BF at the start of a resource are considered more important than anything else when it comes to determining the encoding of the resource.

http://kb.dsqq.cn/ is an example of where this creates a problem for non-WebKit/Trident implementations.

I think throughout the platform (HTML, XHR, File API, CSS) we should implement the WebKit/Trident strategy with respect to the BOM.

(I created a new bug rather than reopening bug 12897 as that bug has a lot of noise.)
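The WebKit/Trident precedence described above can be sketched roughly as follows (a minimal illustration in Python; the helper names are invented, not taken from any browser): a leading BOM wins over the transport-level charset.

```python
# Sketch of "BOM trumps HTTP": if the byte stream starts with a BOM,
# use the encoding it indicates; otherwise fall back to the HTTP
# charset (or a legacy default).
def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    return None, 0

def decode(data, http_charset=None):
    encoding, bom_len = sniff_bom(data)
    if encoding is None:
        # No BOM: honour the transport label, else a legacy default.
        encoding = http_charset or "windows-1252"
    return data[bom_len:].decode(encoding)
```

For example, `decode(b"\xff\xfeh\x00i\x00", "iso-8859-1")` ignores the HTTP label and decodes the payload as UTF-16LE, yielding `"hi"`.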
Comment 1 Ian 'Hixie' Hickson 2012-02-01 00:56:57 UTC
Henri, what do you think the spec should say here? (If you think no change is
needed, please close the bug. Thanks!)
Comment 2 Henri Sivonen 2012-02-07 14:05:55 UTC
I think we should change the spec to require the byte sequences Anne mentions to override HTTP charset.
Comment 3 Leif Halvard Silli 2012-02-13 12:19:14 UTC
(In reply to comment #0)

>more important than anything else

NOTE: For WebKit/Chromium/Trident, "anything else"
      includes the user's attempt to override the encoding.

That is to say: when the BOM defines the encoding, the UA's Text Encoding menu is either unavailable or has no effect (on the current page).

Please make sure that the spec requires the same.
Comment 4 theimp 2012-07-05 03:12:07 UTC
(In reply to comment #0 & comment #2)

I personally don't think that a byte order mark (*especially* the so-called UTF-8 Byte Order Mark) should override the HTTP charset as specified. However, I am not interested in formally objecting to this idea, as I will agree that it seems (and history has shown) to be more likely that an author (or server) will incorrectly specify the HTTP charset, than that they will incorrectly add an invalid BOM (this is less true of UTF-8, where the BOM is not actually a BOM).

Note that, for example, CSS uses different mandatory rules to what is proposed here ( http://www.w3.org/TR/CSS2/syndata.html#charset ); this might cause great confusion for authors with an incomplete understanding of these rules.

(In reply to comment #3)

I think that it is an *extremely* bad idea to require that user agents must not be able to handle documents in whatever way their users have configured them to - rightly, wrongly, or otherwise.

I cannot emphasize this enough. Not all user agents are browsers, and even browsers should be configured first and foremost by their users. By all means, make it a primary heuristic; forbid even suggesting a switch and require an explicit action by the user, if need be; but if the user says, "no, do this", the browser should do so unless actual harm can be demonstrated and no challenge is offered.
Comment 5 Leif Halvard Silli 2012-07-05 05:57:03 UTC
(In reply to comment #4)

> (In reply to comment #3)
> 
> I think that it is an *extremely* bad idea to require that user agents must not
> be able to handle documents in whatever way their users have configured them to
> - rightly, wrongly, or otherwise.

Are you dissatisfied with the way XML is specified?

XML does not allow you to override the encoding at will. For instance, when there is no encoding info (BOM, XML declaration, XML encoding declaration, HTTP) associated with an XML file, the file MUST be UTF-8 encoded. XML parsers that are serious about implementing XML thus do not permit you to override the encoding of a UTF-8 encoded XML document, because doing so would be a fatal error.

It is great if HTML gets semi-aligned with XML so that, at least when the BOM is used, it becomes impossible to override the encoding. That way, we decrease the need for users to "fix" encoding configuration problems.
Comment 6 theimp 2012-07-05 07:44:28 UTC
The charset determination rules for XML are non-normative, except for the case you mention, where there is no BOM, no (XML) declaration, and no higher-level specifier (such as an HTTP header). This bug does not discuss this scenario directly.

Even so, it is perfectly acceptable for a valid XML processor to detect a BOM, ignore it, and pick any encoding it likes, because technically it is only required to use the BOM for the specific case of picking between UTF-8 and UTF-16, not between one of those and anything else. I could detect a UTF-16 BOM and decide to nevertheless render it in any encoding I want *except* UTF-8, and likewise the reverse, and it would be fully compliant:

> XML processors MUST be able to use this character [U+FEFF] to differentiate between UTF-8 and UTF-16 encoded documents.

Also:

> In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration [...] containing an encoding declaration

But there is no special elaboration as to what "external character encoding information" actually means, and it is not clear that "specific instruction from the user" could not qualify. Think of command-line parameters for batch parsers, etc.

So, a document in, really, any encoding, would not automatically be invalid XML in the specific case of a user who said "use this encoding" (again, any encoding).

Even if that is not the case, that does not change the fact that a processor processing a document with, say, a UTF-8 "BOM" *and* a text declaration specifying, say, ISO-8859-1 encoding, has no requirement to obey the BOM over the text declaration. A BOM is required if the document is UTF-16; but other encodings are not forbidden from having that (or any other) BOM, nor does XML require that the BOM be considered authoritative if present (in fact, it explicitly only recommends it).

I believe that this will typically mean that character data will appear at the start of the document, but this is only an error, not a fatal error (and for XML 1.0, maybe not necessarily even that; I don't really remember and will have to check). And in fact, I think there is wiggle-room even on this point.

If this character data is then interpreted as such and emitted into the HTML, then this would of course then be an error in HTML, but that is for the HTML spec. to deal with, which it does: the spec. currently says that what would be an initial BOM should be ignored even if it is unrelated to the encoding.

Furthermore, not all representations of HTML5 will be XML-compatible anyway. I very much agree with the goal of aligning HTML5 with XML; but vendors should be left to interpret this however they want if they prefer more robust XML processing over legacy support; it should not be specified here.
Comment 7 Leif Halvard Silli 2012-07-05 12:44:24 UTC
(In reply to comment #6)

To put it politely: You have not read the XML 1.0 spec correctly.

Section '4.3.3 Character Encoding in Entities' is NORMATIVE. [1] Whereas the section you appear to be talking about is Appendix F. [2] Appendix F only contains helpful instructions/tips for how to fulfill the requirements of section 4.3.3.

[1] http://www.w3.org/TR/REC-xml/#charencoding
[2] http://www.w3.org/TR/REC-xml/#sec-guessing

Appendix F.1 first - in the first table - discusses how to sniff the encoding when there is a BOM. This is simple: if one parses an XML document which _contains_ the BOM as something other than UTF-16, UTF-8 or UTF-32 (UCS-4), then the BOM is not a BOM but an illegal character = fatal error.

Then - in the second table - it discusses how to sniff it when there is no BOM. This is also simple: Except for UTF-8, it is impossible, per the rules that XML operates with.

And therefore, in the second table of Appendix F.1, each row (except the last row, about UTF-8!) ends roughly the same way: "The encoding declaration must be read to determine" (the encoding). And if there is no encoding declaration, then it is a fatal error, per section 4.3.3. Section 4.3.3 is also clear about the fact that if there is an external (typically HTTP) or internal encoding declaration, and this declaration turns out to be incorrect, then it is a fatal error. Lack of encoding information is also considered a signal that the document is UTF-8 encoded.

The effect of all this is that, in XML, it should always cause a fatal error if you try to override the encoding. Firefox is probably one of the XML parsers that _best_ reflects XML's encoding rules. So if you are in doubt, I suggest that you do some experiments for yourself.
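The rule described in this comment can be sketched as follows (a simplified illustration; the function and exception names are invented, not taken from the XML spec or any parser): a leading BOM pins the encoding, and presenting the entity in a contradictory encoding is treated as a fatal error.

```python
# BOM patterns from Appendix F's first table. The 4-byte UTF-32 BOMs
# must be checked before the 2-byte UTF-16 ones they overlap with.
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

class XmlFatalError(Exception):
    pass

def check_bom(data, presented_encoding):
    """Return the BOM's encoding, or None if there is no BOM.

    Raises XmlFatalError if the entity is presented in an encoding
    that contradicts its BOM (the BOM bytes would then be illegal
    characters at the start of the entity)."""
    norm = presented_encoding.lower().replace("_", "-")
    for bom, enc in BOM_TABLE:
        if data.startswith(bom):
            if norm != enc:
                raise XmlFatalError(
                    "BOM indicates %s but entity presented as %s"
                    % (enc, presented_encoding))
            return enc
    return None  # no BOM: rely on declarations / external info
```

So `check_bom(b"\xfe\xff...", "ISO-8859-1")` raises, while the same bytes presented as UTF-16BE are accepted.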
Comment 8 Leif Halvard Silli 2012-07-05 12:59:55 UTC
(In reply to comment #7)
> (In reply to comment #6)

> Appendix F.1 first - in the first table - discusses how to sniff the encoding
> when there is a BOM. This is simple: if one parses an XML document which
> _contains_ the BOM as something other than UTF-16, UTF-8 or UTF-32 (UCS-4),
> then the BOM is not a BOM but an illegal character = fatal error.

By the way: In that case it is an illegal character per HTML5 as well: A UTF-8 document with a BOM would bring the browser into quirks mode if the browser reads the document as - for example - ISO-8859-1.

So even if we look squarely at HTML5, it makes no sense to permit the encoding to be overridden whenever there is a BOM: To permit the encoding to be overridden when there is a BOM would be like permitting users to shoot themselves in the foot.

PS: I now consider this subject to be debated to death. I am not going to give any more explanations. And if I give any more replies, then they will be short and to the point.
Comment 9 theimp 2012-07-05 16:05:00 UTC
Firstly, as I said, this bug covers only where there is a BOM, not where there is neither a BOM nor an encoding declaration nor a header.

Secondly, the bug suggests ignoring headers when there is a BOM present, but the XML spec. *specifically* says that "external character encoding information" can be used to determine the encoding.

> 4.3.3 In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration

So if I have:

0xFE 0xFF <?xml encoding="ISO-8859-1"?>

It is a fatal error to decode it as UTF-16. Sure, this causes other problems, but not necessarily fatal errors (at least in XML 1.0).
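The conflict in this example can be shown directly (a small illustration, not from the thread; the declaration is padded to an even byte length so that a UTF-16 decode is possible at all):

```python
# BOM bytes followed by an ASCII encoding declaration that contradicts them.
data = b"\xfe\xff" + b'<?xml version="1.0" encoding="ISO-8859-1" ?>'

# Honouring the BOM (UTF-16BE) turns the ASCII declaration into garbage:
# each pair of ASCII bytes becomes one unrelated code point.
as_utf16 = data[2:].decode("utf-16-be")

# Honouring the declaration (ISO-8859-1) leaves two stray characters,
# U+00FE and U+00FF, in front of an otherwise intact declaration.
as_latin1 = data.decode("iso-8859-1")
```

Neither reading is clean, which is exactly the disagreement here: one produces an unreadable declaration, the other produces illegal leading characters.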

Your arguments over using the UI to configure the charset are not within the original scope of this bug.

Thirdly, as I said, not all HTML will be XML.

Those are alone enough reason not to make the proposed behavior mandatory for all user agents in all cases.

> Firefox is probably one of the XML parsers that _best_ reflects XML's encoding rules. So if you are in doubt, I suggest that you do some experimentes for yourself.

Make it a recommendation, I do not care. I am not saying that browsers must allow changes; I'm saying they should not be constrained from allowing them. If they think that there is no value in allowing users to change encodings at will, I don't see that as being a problem. But it should not be part of the spec. If you were going to advocate requiring total compliance with XML in all circumstances, I would be sympathetic; but apparently that is not what is desired for HTML5.

Furthermore:

> By the way: In that case it is an illegal character per HTML5 as well: A UTF-8
> document with a BOM would bring the browser into quirks mode if the
> browser reads the document as - for example - ISO-8859-1.

Yes, it would typically trigger Quirks mode (except in some, perhaps only theoretical, encodings). That's not a fatal error though.

> Section '4.3.3 Character Encoding in Entities' is NORMATIVE.

Yes.

> Whereas the section you talk appear to be talking about, is Appendix F. [2] Appendix F only contains helpful instructions/tips for how to fulfill the requirements of section 4.3.3.

No, I only mention Appendix F once, to say that it is non-normative. Every part of the spec. that I actually quoted is from Section 4.3.3.

The following:

> XML processors MUST be able to use this character [U+FEFF] to differentiate between UTF-8 and UTF-16 encoded documents.

and:

> In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration [...] containing an encoding declaration

Are in Section 4.3.3.

It's really great that you are passionate about XML, but you must be careful only to read what the spec. actually says.

Let's take the example of an ISO-8859-1 document with a UTF-16 BOM. In section 4.3.3:

> It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process.

Not a problem, the browser supports ISO-8859-1.

> It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding.

Not a problem, both 0xFE and 0xFF are legal byte values in ISO-8859-1.

> Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode.

Not relevant.

> Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Not a problem, "higher-level protocol" could include "user configuration", since this term is not defined anywhere. Even if this is not the case, there is still the scenario where there is both a BOM and an encoding declaration - it does not say that the BOM should trump the encoding declaration.

Anything else is at worst an error, not a fatal error.
Comment 10 Leif Halvard Silli 2012-07-06 02:30:39 UTC
(In reply to comment #9)
> Firstly, as I said, this bug covers only where there is a BOM, not where there
> is neither a BOM nor an encoding declaration nor a header.

BOM is a side track. My question was: Do you dislike the "user experience" of XML when it comes to its prohibition against manual encoding overriding? Because XML's user experience is the same regardless of whether there is a BOM or not. If you can live with XML's _general_ prohibition against manual encoding overriding, then I don't see why you can't also live with the same strict rule for the _specific_ subset of HTML when there is a BOM.

> Secondly, the bug suggests ignoring headers when there is a BOM present, but
> the XML spec. *specifically* says that "external character encoding
> information" can be used to determine the encoding.

Correct. It is currently also against the HTTP specs.
 
> So if I have:
> 
> 0xFE 0xFF <?xml encoding="ISO-8859-1"?>

1) This XML declaration is invalid as it lacks the version attribute.
2) There are two characters, 0xFE 0xFF, in front of the declaration. 

Note regarding 2): As I have tried to say before, 4.3.3 specs that:

     "It is a fatal error for a TextDecl to occur other than at the 
      beginning of an external entity."
 
> It is a fatal error to decode it as UTF-16. Sure, this causes other problems,
> but not necessarily fatal errors (at least in XML 1.0).

Wrong. See my 'Note regarding 2)' above.

> Your arguments over using the UI to configure the charset are not within the
> original scope of this bug.

This bug requests the behaviour of IE and Webkit to be standardized. And IE and Webkit do prohibit manual overriding. Meanwhile, the bug requester has since written the 'Encoding' standard, in which he states: [*]

   "For compatibility with deployed content, the byte order mark (also
    known as BOM) is considered more authoritative than anything else."

[*] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#decode-and-encode
 
> Thirdly, as I said, not all HTML will be XML.

There is benefit - security wise and practical - in disallowing users
to "disturb" the encoding regardless of whether the page is XML or HTML.

>> By the way: In that case it is an illegal character per HTML5 as well: A UTF-8
>> document with a BOM would bring the browser into quirks mode if the
>> browser reads the document as - for example - ISO-8859-1.
> 
> Yes, it would typically trigger Quirks mode (except in some, perhaps only
> theoretical, encodings). That's not a fatal error though.

So, now you are offering me at least one use case: To allow users to
place the page in quirks-mode. Frankly: I dismiss that use case.

>  but you must be careful only to read what the spec. actually says.

A hilarious comment. But excellent advice.
Comment 11 theimp 2012-07-06 02:45:36 UTC
Firstly, sorry that I am not explaining myself well. I will try to be more careful.

> 1) This XML declaration is invalid as it lacks the version attribute.

It was just an example that I didn't spell out properly. Sorry, I should have put a bit more thought into it.

> 2) There are two characters, 0xFE 0xFF, in front of the declaration. 
> Wrong. See my 'Note regarding 2)' above.

Yes, I see my phrasing mistake now, you're right *that it is an error*.

Not, however, that it is *not* also a fatal error to treat it as UTF-16 because of the BOM (say, if the parser wants to go on and see what other errors it finds, it should use the encoding specified).

What I meant was, irrespective of whether the documents are well-formed, obeying the BOM is, without question, WRONG in that case.

> For compatibility with deployed content, the byte order mark (also known as BOM) is considered more authoritative than anything else.

Too bad; it's wrong:

http://203.59.75.251/Bug15359

A simple testcase has been done, and [latest release versions of] all major browsers currently fail XML compliance due to this proposed handling of the BOM (some non-browser XML processors get this right, though).

Now, my position is unreservedly that, for compatibility with XML, the BOM must not be specified as overriding all other considerations in all cases. [The proposal for this Bug]

As for overriding the HTTP Content-Type parameter specifically, or user selection generally, my position is unchanged, for the reasons already given.

In particular, the spec. should remain silent on the subject of users configuring their user agent to apply certain encodings to certain documents. How this may impact XML (in terms of whether this would be valid in a particular case) is unrelated to how it should impact (X)HTML5; to whatever degree that it is already specified in the XML spec., leave it at that.

Sometimes, users simply have to debug misdetected/misspecified encodings; the fact that I've just demonstrated a new encoding-related misbehavior is proof of that.
Comment 12 theimp 2012-07-06 04:27:41 UTC
> BOM is a side track. My question was: Do you dislike the "user experience" of XML when it comes to its prohibition against manual encoding overriding? Because XML's user experience is the same regardless of whether there is a BOM or not. If you can live with XML's _general_ prohibition against manual encoding overriding, then I don't see why you can't also live with the same strict rule for the _specific_ subset of HTML when there is a BOM.

For strict, valid, well-formed, XML, served with an explicit XML Content-Type, then no, I have no problem with the idea.

My problem is applying those rules to billions of pages that are *not* strict, valid, well-formed, XML, served with an explicit XML Content-Type.

The difference being, that practically no strict, valid, well-formed, XML, served with an explicit XML Content-Type, will have an incorrect BOM plus a contradictory charset indicator of any other kind. The same cannot be said for other web content (if it could, I would probably accept that too).

For this reason, as far as the user-configured option is concerned, consistency is better served by having HTML5 say nothing, and letting the user agent forbid it in the circumstances where they think that appropriate (such as XML).

> Correct. It is currently also against the HTTP specs.

I'm very sorry, but I did a lot of research and I can't find anything that says this. Is this a current spec. or a draft spec.?

See, RFC 2616 says:
"HTTP/1.1 recipients MUST respect the charset label provided by the sender"
"user agents [...] MUST use the charset from the content-type field if they support that charset"

Meanwhile, RFC 3023 keeps saying, over and over:
"[the HTTP Content-Type] charset parameter is authoritative"

I know that no-one here seems to like any specs. other than this one that they're writing, but I just don't see any way that this is not just another "willful violation".

> So, now you are offering me at least one use case: To allow users to place the page in quirks-mode. Frankly: I dismiss that use case.

Better than getting unreadable garbage because someone specified an incorrect BOM/charset combination on some 10-year-old document.
Comment 13 Leif Halvard Silli 2012-07-06 04:32:35 UTC
(In reply to comment #11)

> http://203.59.75.251/Bug15359
> 
> A simple testcase has been done, and [latest release versions of] all major
> browsers currently fail XML compliance due to this proposed handling of the BOM
> (some non-browser XML processors get this right, though).

Their behaviour *would have been* correct, if we changed XML to say this:

]] In the absence of information provided by an external transport protocol
   (e.g. HTTP or MIME) <INS> or a byte order mark</INS>,
   it is a fatal error for an entity including an encoding declaration to
   be presented to the XML processor in an encoding other than that named
   in the declaration, [[

As both HTTP and the BOM are "external" to the markup, such a change would make sense.

> Now, my position is unreservedly that, for compatibility with XML, the BOM must
> not be specified as overriding all other considerations in all cases. [The
> proposal for this Bug]

There are many aspects of "compatibility with XML". The most important
aspect is UTF-8, itself. Problem is: HTML defaults to Windows-1252. XML
defaults to UTF-8. This means that, occasionally, the HTML page can
end up with an encoding - via default or by manual overriding - that differs
from the author's intended encoding.

The second important aspect of compatibility with XML is the fact that it's
impossible to override the encoding of an XML document.

We can have both of these benefits in HTML too, if only one uses the BOM. This benefit, however, comes at the expense of the HTTP charset: the BOM must be allowed to override the HTTP charset. This is a price worth paying. Encodings are an evil. We should try to remove their importance as much as possible.
 
> As for overriding the HTTP Content-Type parameter specifically, or user
> selection generally, my position is unchanged, for the reasons already given.

I don't understand your reasons. You are CONTRA that the BOM overrides the HTTP charset. But you are PRO that the user can override the BOM. I see no benefit in that standpoint. I only see pessimism about the need for users to override encodings.

NOTE: One reason that the BOM should override HTTP is that the BOM is likely to be more correct. (Plus, WebKit and IE already behave like that.) If all browsers implement IE and WebKit's behaviour, the encoding errors should not occur, and thus the user will have no need to override the encoding.
 
> In particular, the spec. should remain silent on the subject of users
> configuring their user agent to apply certain encodings to certain documents.
> How this may impact XML (in terms of whether this would be valid in a
> particular case) is unrelated to how it should impact (X)HTML5; to whatever
> degree that it is already specified in the XML spec., leave it at that.
> 
> Sometimes, users simply have to debug misdetected/misspecified encodings; the
> fact that I've just demonstrated a new encoding-related misbehavior is proof of
> that.

You have documented a discrepancy between what browsers do and what XML
specifies. You have not documented that what the browsers do lead to
any problems. For instance, the test page you created above, works just
fine. You have not even expressed any wish to override their encoding. 
So, I'm sorry, but the page you made does not demonstrate what you
claim it to demonstrate.
Comment 14 Leif Halvard Silli 2012-07-06 04:44:25 UTC
(In reply to comment #12)
>> BOM is a side track. My question was: Do you dislike the "user 
>> experience" of XML when it comes to its prohibition against manual
>> encoding overriding? [ ... ]
> 
> For strict, valid, well-formed, XML, served with an explicit XML Content-Type,
> then no, I have no problem with the idea.
> 
> My problem is applying those rules to billions of pages that are *not* strict,
> valid, well-formed, XML, served with an explicit XML Content-Type.

If all browsers implement the IE/Webkit behaviour, then there is no
problem. If you know that it is a problem, then you should provide 
evidence thereof - for instance by pointing to a page that gets
broken if this behaviour is implemented.
 
> > Correct. It is currently also against the HTTP specs.
> 
> I'm very sorry, but I did a lot of research and I can't find anything that says
> this. Is this a current spec. or a draft spec.?

It is common sense - and specified - that HTTP's charset parameter should
trump anything inside the document. This goes for HTML and for XML.
_THAT_ is the controversial side of this bug: This bug asks the BOM to
override the HTTP charset parameter. (The point I have been making, that
the BOM should also 'override manual user overriding', is part of the same
thing - I just wanted to be sure that everyone got that.)
 
> I know that no-one here seems to like any specs. other than this one that
> they're writing, but I just don't see any way that this not just another
> "willful violation".

Exactly. That is what it is.
 
>> So, now you are offering me at least one use case: To allow users to
>> place the page in quirks-mode. Frankly: I dismiss that use case.
> 
> Better than getting unreadable garbage because someone specified an incorrect
> BOM/charset combination on some 10-year-old document.

You are welcome to demonstrate that it is an actual problem.
Comment 15 Leif Halvard Silli 2012-07-06 05:29:14 UTC
(In reply to comment #14)

> > I know that no-one here seems to like any specs. other than this one that
> > they're writing, but I just don't see any way that this not just another
> > "willful violation".
> 
> Exactly. That is what it is.

But note that disallowing users from overriding the encoding breaks no spec.
Comment 16 theimp 2012-07-06 13:41:01 UTC
> Their behaviour *would have been* correct, if we changed [something else] to say this:

That basically summarizes every problem with everything, ever.

> The second important aspect of compatibility with XML is the fact that it's impossible to override the encoding of an XML document.

Not always. In fact, it is never impossible; it is just that you usually cannot do it without generating a fatal error as well.

Developers, for example, might want to do this even if it generates an error.

Also, beyond browsers, fatal errors might not be so total. For example, a graphical editor such as Amaya should still be able to load the document, exposing the source text (which the processor may still do after a fatal error). Doing so requires that it detect an encoding, and as that could be wrong in respect of the intended content, the author must be able to override it (especially if the editor means to fix the incorrect BOM!).

> We can have both of these benefits in HTML too, if only one uses the BOM. This benefit, however, comes at the expense of the HTTP charset: the BOM must be allowed to override the HTTP charset. This is a price worth paying. Encodings are an evil. We should try to remove their importance as much as possible.

I am sympathetic to your ends, but not your means.

See also: http://xkcd.com/927/

> I don't understand your reasons. You are CONTRA that the BOM overrides the HTTP
> charset. But you are PRO that the user can override the BOM.

I'm PRO that the user can override just about anything. The web and the software exists for the user, not the other way around.

As soon as the user changes any configuration option - including installing an extension - all bets are off, and the spec. should not have to - or try to - account for such scenarios.

> You have documented a discrepancy between what browsers do and what XML specifies.

That should be enough.

> You have not documented that what the browsers do lead to any problems.

It will lead to problems when this happens: some browser - maybe a current one, maybe a new one - begins to obey the spec., which they can do at any time. Then you're right back to different renderings again. The whole point of HTML5 being so meticulous about every detail is because of the past problems where one browser does something wrong, and another browser decides to fix it, and then the whole web is split in half (or worse).

> For instance, the test page you created above, works just fine.

It "works" with current major browsers. Not all user agents are browsers. For example, many validators treat it according to the spec., which means a fatal error.

The following are some examples of online validators that (correctly) determine the example to be invalid:

 http://validator.nu/ and http://validator.w3.org/nu/
 http://validator.aborla.net/

This validator detects the error, but does not consider it fatal:

 http://www.validome.org/

Also, at least some older versions of some browsers do produce an error on this page. Specifically, Epiphany 2.30.6 (still the default install on the latest release of Debian, at this time). And since (this particular version) is a Gecko-based browser, and Firefox has a significant population who upgrade very slowly, it is possible that this might add up to a lot of browsers. I might try to test this further.

Also, it doesn't "work just fine", because in this case, I, the author, expected something different. This is an example of how browsers violate the XML spec.; "works just fine" would be that it does what I said, which is what the spec. says to do, because that's what it was very specifically coded to do, with the explicit intention of causing a fatal error for testing/demonstration purposes. This is the problem with trying to second-guess what you are explicitly told, generally.

> You have not even expressed any wish to override their encoding.

That is a different argument, and not in the original scope of this bug. You were the one to first mention how important compliance with the XML spec. is; now that I have shown that, in fact, it is the action of this bug which is non-compliant, you want to ignore that and argue about user overrides instead.

> So, I'm sorry, but the page you made does not demonstrate what you claim it to demonstrate.

It demonstrates exactly what I claim it to; and no more. I only claim it to demonstrate that this bug, as originally filed, which says that the BOM should be "considered more important than anything else when it comes to determining the encoding of the resource", is incompatible with XML.

> If all browsers implement the IE/Webkit behaviour, then there is no problem. If you know that it is a problem, then you should provide evidence thereof - for instance by pointing to a page that gets broken if this behaviour is implemented.

I haven't seen much recently, I'll have a look and post what I find.

> You are welcome to demonstrate that it is an actual problem.

Try those validators. Though I expect that won't satisfy you.

> But note that to disallow users to override the encoding breaks no spec.

And it would also break no spec. for me to write a browser plugin that lets me configure the character encoding. Your proposal doesn't actually solve anything if you expect perfect, unconditional control over something that a user may want to change, for whatever reason that suits them. All it does is shift it down the line a bit (moving from browser developer -> plugin developer).

It is just beyond the scope of the HTML specification to command the browser to this extent.

Now, since the standards of evidence are so high, I would like for you to demonstrate to me the problem that you claim: where is the proof that users receive documents with BOMs, yet nevertheless cause havoc by manually changing the encoding settings of their browser without knowing exactly what they are doing, or at least understanding that doing so makes only them responsible for "shooting themselves in the foot"?

In fact, it seems to me that the conditions in which users would change the encoding manually can only be cases where there is an encoding detection problem, implying that either such documents must exist, or else that it is never a problem that would actually occur. Except of course for idle curiosity. That is, nothing that gives basis for direction in this spec.
Comment 17 Leif Halvard Silli 2012-07-06 20:03:09 UTC
(In reply to comment #16)
> I'm PRO that the user can override just about anything. The web and the
> software exists for the user, not the other way around.

The 1st sentence doesn't always follow from the 2nd. And for some reason you are not PRO the same thing when it comes to XML. I wonder why.
 
> As soon as the user changes any configuration option - including 
> installing an extension - all bets are off, and the spec.  should
> not have to - or try to - account for such scenarios.

A plug-in or config that ignores/extends a spec does ignore/extend that spec.

WRT your test file <http://203.59.75.251/Bug15359>:
>> You have not documented that what the browsers do lead to any problems.
> 
> It will lead to problems when this happens: some browser - maybe a 
> current one, maybe a new one - begins to obey the spec., [...]

Specs, UAs, tools, authors etc need to stretch towards perfect unity. When there are discrepancies, the question is who needs to change.

>> For instance, the test page you created above, works just fine.
> 
> It "works" with current major browsers.

It even works with the highly regarded xmllint (aka libxml2).

> Not all user agents are browsers. For example, many validators
> treat it according to the spec., which means a fatal error.

Validators are a special hybrid of user agents and authoring tools which of course need to be strict.

> Also, at least some older versions of some browsers do produce an 
> error on this page. Specifically, Epiphany 2.30.6 [...] a
> Gecko-based browser

Firefox 1.0 does not have that issue.
 
> Also, it doesn't "work just fine", because in this case, I, the author,
> expected something different.

Something should adjust - the XML spec or the UAs. Making it an 'error' as opposed to a 'fatal error' is my proposal. The result of an 'error' is undefined.

>> You have not even expressed any wish to override their encoding.
> 
> That is a different argument, and not in the original scope of this bug. You
> were the one to first mention how important compliance with the XML 
> spec. is;

Your page contains encoding-related errors, and your argument in favor of user override was to help the user to (temporarily) fix errors. So it perplexes me to hear that this is a "different argument".

But if that is how you think, then I see no point in discussing your page any more - which is why this is my last comment regarding that page.

> where is the proof that users receive documents with BOMs, yet nevertheless cause havoc
> by manually changing the encoding settings of their browser without knowing exactly 
> what they are doing, or at least understanding that doing so makes only them responsible for
> "shooting themselves in the foot".

So many layers of if … As an author, I sometimes want to prevent users from overriding the encoding, regardless of how conscious they are about violating my intent.

There is evidence that users do override the encoding manually, and that this causes problems, especially in forms. To solve that problem, the Ruby community has developed their own "BOM snowman": <http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen> The HTML5 spec likewise says that documents, and especially forms, are not guaranteed to work unless the page is UTF-8 encoded.
Comment 18 theimp 2012-07-08 23:46:13 UTC
(In reply to comment #17)
> And for some reason you are not PRO the same thing when it comes to XML. I wonder why.

It is a compromise position. This is a specification, not a religion; compromise is good. Properly formed XHTML served as XHTML to a bug-free parser would almost never have any need to have the encoding changed by the user. The few cases where it would be useful would probably be better handled with specialty software that does not need to be interested in parsing according to the spec. So, I for one would be prepared to compromise.

This talk about the "user experience" is silly. I would postulate the following, offering no evidence whatsoever:

1) In 99.9% of cases, users who see garbage simply close the page. They know nothing about encodings or how to override them, and care to know even less.
2) Of the other 0.1%, in 99.9% of cases the user changes the encoding of the page only to view it, not to submit any form.
3) Of the other 0.1% of 0.1%, in 99.9% of cases the submitted forms contain only AT-standard-keyboard-characters with ASCII codepoints which are identical to the codepoints of the corresponding UTF-8 text.

Agree/disagree?

> A plug-in or config that ignores/extends a spec does ignore/extend that spec.

The point is, you can never get the assurances you seem to argue are essential.

> Specs, UAs, tools, authors etc need to stretch towards perfect unity.

Finally, something we are in total agreement over.

> When there are discrepancies, the question is who needs to change.

Of course. Unfortunately, more and more it seems to be answered with "anyone or everyone, except us!"

> It even works with the highly regarded xmllint (aka libxml2).

Great, more broken software. How is this helpful to your position? I certainly do not regard it highly now that I know it parses incorrectly.

> Validators are a special hybrid of user agents and authoring tools which of course need to be strict.

The whole point of XML is that *all* processors process it the same way! There is nothing special about a validator in that respect.

It is clear that you do not care one bit about the XML spec. - it is an argument for you to use when it suits you, and to discard when it goes against you.

> Something should adjust - the XML spec or the UAs.

Either. But if the user agents are to stay the same (in respect of XML), it should be because the XML spec. changes - not because the HTML5 spec. changes unilaterally.

The internet is bigger than the web, and the web is bigger than HTML, and HTML is bigger than the major browsers. Outright layer violations, marginalization of national laws, breaking of backwards compatibility; you name it, HTML5 has it. I sincerely hope that these attempts to re-architect the internet from the top down are, in time, fondly remembered for their reckless pursuit of perfection that made everything much better in the end, and not poorly remembered for their hubris that set back standards adoption more than the Browser Wars.

> Make it an 'error' as opposed to 'fatal error', is my proposal. The result of an 'error' is undefined.

That would be VERY BAD. It would mean that there is more than one valid way to parse the same document.

Your previous suggestion was much better:
> <INS> or a byte order mark</INS>

I would even be prepared to support such a suggestion, if there was no indication that the current wording is deliberate/relied upon (which I suspect - note how the current reading is fundamentally the same as the CSS rules, for example. Not likely to be a coincidence).

> Your page contains encoding related errors, and your argumet in favor of user override was to help the user to (temporarily) fix errors. So it perplexes me to hear that this is a "different argument".

My goodness me, that page was meant to demonstrate exactly one point, not every argument I made in this bug. Sheesh.

Okay, here we will do a little thought experiment.

Imagine a page like my example.

Now imagine that, like a large number of web pages, it is fed content from a database. I *know* that the content in the database is proper ISO-8859-1.

Imagine that this text also happens to coincidentally be valid UTF-8 sequences, which is very possible (if this is too hard to imagine, let me know and I'll update the page). Or more likely, that the page is not served as XHTML, which is even easier to imagine.

Imagine that I get complaints from my users, that the page is full of garbage.

I check the page, and they're right!

I click "View Source", and see the garbage. I don't see the BOM, because it's invisible when the software thinks it's a BOM. I do see my encoding declaration, though, and to me it looks like it's at the top of the page (because I can't see the BOM). Why didn't that work? Or did it, and the problem is somewhere else?

I open the XHTML file in an editor, and it looks the same, and for the same reason. I don't see the database text, of course, just code.

Eventually I notice that the browser says the page is UTF-8. So I try to change it to what I know the content actually is. But it doesn't work! Funny... that used to work. Must be a browser bug. Why else would the menu even be there?

** Note: If your browser lets you change the encoding, you can stop here. Otherwise, continue to the end, then start using WireShark.

In all probability, this theoretical author acquired the XHTML file from somewhere else complete with BOM and just searched for "how do i make a xhtml file iso-8859-1 plz". That, or they had some overreaching editor that decided that it would insert a UTF-8 BOM in front of all ASCII files in the name of promoting UTF-8. Or maybe the editor used very poor terminology and used the word "Unicode" instead of the actual encoding name in the save dialog, making the author - who knows basically nothing about encodings - think that this simply implies that it will work on the web ("all browsers support Unicode!"). Or any number of possible scenarios.

Consider that, having explained what the author observed above, they would probably be getting advice along the lines of "try setting the http-content-type header lol!!1".

Actually, they'd probably be told - it's your database. For as long as they're looking there, they'll never find the problem.

Getting this far is probably already the limit of what the vast majority of authors can manage. How does this author, who just wants their damn blog to work with their old database, debug this problem further?

Exercise: If some records in my database coincidentally don't have any extended characters (ie. are equivalent to ASCII and therefore UTF-8), meaning that some pages work and some don't, how am I (and anyone I ask) *more likely* to think that the problem is with the static page, rather than the database?

Bonus points: Why does the CSS file, from the same source/edited with the same editor, work exactly the way they expect? ie.

0xEF 0xBB 0xBF @charset "ISO-8859-1";
               body::before{content:"CSS treated as encoded in ISO-8859-1!"}
               /* Not by all browsers; just browsers which obey the spec. */

Triple Word Score: What if the BOM was actually added by, say, the webserver (or the FTP server that got it onto the webserver):
http://publib.boulder.ibm.com/infocenter/zos/v1r12/topic/com.ibm.zos.r12.halz001/unicodef.htm
Or a transformative cache? Or a proxy?
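The precedence rule at the heart of this bug is mechanical enough to sketch. Below is a minimal, hypothetical Python sketch of the "BOM trumps HTTP" behavior under discussion; the function and parameter names (`pick_encoding`, `http_charset`) are invented for illustration and come from no spec text, and the `windows-1252` fallback merely stands in for whatever locale-dependent default a browser would use.

```python
# Byte sequences from the bug description: EF BB BF, FF FE, FE FF.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def pick_encoding(body, http_charset=None):
    """Return (encoding, bytes_to_skip), letting a BOM override HTTP."""
    for bom, enc in BOMS:
        if body.startswith(bom):
            return enc, len(bom)   # BOM wins over the Content-Type charset
    if http_charset:
        return http_charset, 0     # fall back to the HTTP-declared charset
    return "windows-1252", 0       # stand-in for a locale-dependent default

enc, skip = pick_encoding(b"\xef\xbb\xbf<p>hi</p>", http_charset="iso-8859-1")
# enc == "utf-8" even though HTTP said ISO-8859-1
```

Note that this is exactly the ordering theimp's scenarios above turn on: once the BOM is consulted first, no HTTP header (and, in the Trident/WebKit behavior, no user override) can reach the document.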

> As an author, I sometimes want to prevent users from overriding the encoding regardless of how conscious they are about violating my intent.

As a user, I sometimes want to override the encoding regardless of how conscious the author is about violating my intent.

What makes your argument more valid than mine?

Hint: It's not:
http://www.w3.org/TR/html-design-principles/#priority-of-constituencies
"costs or difficulties to the user should be given more weight than costs to authors"

There are more users than there are authors, and there always will be.

> There is evidence that users do override the encoding, manually.

Of course there is! Because they have to! Why do you think browsers allowed users to change the encoding in the first place?

Where is the evidence that this happens for pages WITH A BOM?

> And that this causes problems, especially in forms.

Not so much problems for the user. As for authors, read on.

> To solve that problem, the Ruby community has developed their own "BOM snowman":

Notes about that article:

1) It makes no mention of such pages containing UTF-8 or UTF-16 BOMs. There is no indication that the proposed resolution of this bug would solve that problem at all, because in all likelihood those pages do not have BOMs.

2) It admits that the problem begins when authors start serving garbage to their users.

3) It cannot be fixed by the proposed resolution of this bug. Everything from legacy clients to active attackers can still send them garbage, and they will still serve it right back out to everyone. But even if you could magically stop all submission of badly-encoded data, that does not change the users' need to change the encoding for all of their pages that have already been polluted. The *real problem* is that the authors didn't sanitize their inputs (accepting whatever garbage they receive without validation), nor did they sanitize their outputs (sending whatever garbage they have without conversion). This kind of fix would just give authors/developers an excuse to be lazy, at the expense of the users.

Note also that authors *still* can't depend upon the BOM to solve ANY problem they have, because again, it might be either added or stripped by editors, filesystem drivers, backup agents, ftp clients or servers, web servers, transformative caches (language translation, archival, mobile speedup), proxies, CDNs, filters, or anything else that might come in between the coder and the browser user.

Consider:

http://schneegans.de/
http://webcache.googleusercontent.com/search?q=cache:kKjAj-u4s6IJ:http://schneegans.de/

Notice how the Google Cache version has stripped the BOM?

If this is just about forms, then how is this for a compromise:

"When the document contains a UTF-8 or UTF-16 BOM, all forms are to be submitted in UTF-8 or UTF-16, respectively, regardless of the current encoding of the page. However, authors must not rely on this behavior."
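The form-breakage being argued about here is easy to reproduce outside a browser. A minimal sketch (the sample string is an invented example, not from the bug):

```python
# A user types "café" into a form on a page the browser is treating as
# ISO-8859-1; the server stores the raw bytes and later serves them as UTF-8.
submitted = "café".encode("iso-8859-1")          # b'caf\xe9'

# The lone 0xE9 byte is not valid UTF-8, so a strict decode fails outright...
try:
    submitted.decode("utf-8")
except UnicodeDecodeError:
    strict_decode_failed = True

# ...and a lenient decode yields the familiar replacement-character garbage
# that then gets served back out to every subsequent visitor.
lenient = submitted.decode("utf-8", errors="replace")   # 'caf\ufffd'
```

This is the round-trip failure the "snowman" hack tries to prevent, and it happens whether or not the page carries a BOM, which is theimp's point 1) above.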

Would that satisfy you?
Comment 19 Leif Halvard Silli 2012-07-09 01:04:03 UTC
(In reply to comment #18)

Even though we continue to talk past each other - and to disagree - I'm not going to debate this any more in this "forum". We have made ourselves reasonably clear. Unless I see sudden light/darkness, I am instead going to await the Editor's decision before I contact you off-bugzilla about possibly discussing this in some other forum.
Comment 20 contributor 2012-07-18 06:53:14 UTC
This bug was cloned to create bug 17810 as part of operation convergence.
Comment 21 Silvia Pfeiffer 2012-10-06 00:42:57 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this
document:   http://dev.w3.org/html5/decision-policy/decision-policy-v2.html

Status: Accepted
Change Description:
https://github.com/w3c/html/commit/ff2d7ccf03a4d2ef46acc16dbabb83ec55bb332c
Rationale: accepted WHATWG change
Comment 22 i18n IG 2012-11-26 20:11:19 UTC
The initial impetus for the change appears to be that Trident and WebKit supported different behaviour than the spec. If paving the cowpath is the main motivator, we have a problem, since Trident (in IE10) prioritises HTTP over BOM. 

Should we switch back to the previous approach in the light of Trident?

Also, whilst I suspect that this may in fact make life easier for HTML pages for most cases, I wonder how much discussion has taken place about the implications wrt other formats. Afaik the i18n folks were unaware of the change, so we haven't discussed. Has anyone actually discussed Anne's proposal with the CSS and XML people?
Comment 23 i18n IG 2012-11-26 20:16:30 UTC
Btw, see http://w3c-test.org/framework/details/i18n-html5/character-encoding-034 for test results on various platforms. The test asserts the expected result from before the spec was changed, so for the current spec text a pass is a fail and vice versa.
Comment 24 Anne 2012-11-26 20:30:04 UTC
The move from Trident was probably more to comply with the specification than to not break legacy content. The latest CSS drafts have been updated to take the Encoding Standard into account. Dunno about XML, but it should follow suit.
Comment 25 Leif Halvard Silli 2012-11-27 00:57:08 UTC
(In reply to comment #22)
> The initial impetus for the change appears to be that Trident and WebKit
> supported different behaviour than the spec. If paving the cowpath is the
> main motivator, we have a problem, since Trident (in IE10) prioritises HTTP
> over BOM. 
> 
> Should we switch back to the previous approach in the light of Trident?
> 
> Also, whilst I suspect that this may in fact make life easier for HTML pages
> for most cases, I wonder how much discussion has taken place about the
> implications wrt other formats. Afaik the i18n folks were unaware of the
> change, so we haven't discussed. Has anyone actually discussed Anne's
> proposal with the CSS and XML people?

The initial impetus is bug 12897, (https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897) which I filed in June 2011. You will see, if you read that long and - ahem - convoluted bug, that I cared a lot about checking XML parser behavior (see below).

When I worked on bug 12897, I also discussed it over at www-international@. Thus the i18n community was made aware of it long ago.

It is a pity that IE10 stopped being compatible with itself. However, IE has not - yet - started to grow in popularity. So that is less of an argument now, in a way. (But Anne has a point in that web compatibility should matter most.)

(In reply to comment #23)
> Btw, see
> http://w3c-test.org/framework/details/i18n-html5/character-encoding-034 for
> test results on various platforms. The test asserts the expected result from
> before the spec was changed, so for the current spec text a pass is a fail
> and vice versa.

There you only test HTML browser behavior. In bug 12897, I also tested XML parser behavior. Thus it is not true that just Trident and Webkit started this. (See below.) 

(In reply to comment #24)
> The move from Trident was probably more to comply with the specification
> than to not break legacy content. The latest CSS drafts have been updated to
> take the Encoding Standard into account. Dunno about XML, but it should
> follow suit.

XML parsers/editors have, in part, started to follow suit. In comment 10 of bug 12897, I wrote:

]] * Parsers *not* implementing RFC3023 (thus giving priority to document data instead), and which do not emit fatal errors: Webkit, Xerces C++, XMLMind Editor on Mac (based on Xerces Java), RXP, oXygen [[

Thus, the above browsers/parsers adhere to the BOM rather than to HTTP.

In that bug I also noted that Libxml2 was the *only* XML parser I found (apart from Firefox and Opera, according to how they behaved then) which gave priority to HTTP. But in the next comment, I reported that Libxml2, for files stored in a file system (file:// URL), "ignores the UTF-8 BOM. And obeys the XML encoding declaration." Thus, quite the reverse.

In a bold move, I filed bugs against XML parsers to get them to give HTTP priority. And I know that Xerces C++ actually started on it (but I hope they did not finish). I also contacted oXygen and XMLMind, but they were not enthusiastic about adhering to RFC 3023. Rather the contrary, actually.

I also, btw, discovered that the XML working group had pretty much not given a damn about having a test suite for this - I think there was only one relevant BOM test in the entire suite (submitted by John Cowan). Which is one probable reason why the uniformity is so bad.

Anyway, and as a summary: XML parsers' handling of the BOM, especially (perhaps) the UTF-8 BOM, is actually quite messy, and for XML there is a lot left to be desired with regard to unified behavior when it comes to which encoding declaration method has priority.
Comment 26 Leif Halvard Silli 2012-11-27 01:14:38 UTC
(In reply to comment #24)
> The move from Trident was probably more to comply with the specification
> than to not break legacy content. The latest CSS drafts have been updated to
> take the Encoding Standard into account. Dunno about XML, but it should
> follow suit.

It is worth noting that XML 1.0 does not have any normative rules about what to do in case of conflict between HTTP and the BOM (and also none for a conflict between HTTP and the XML encoding declaration). It only has section F.2, which is non-normative (as it is part of the non-normative appendix F). Thus XML parsers are not required to follow what it says: [1]

]]
F.2 Priorities in the Presence of External Encoding Information

The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 3023] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance.
[[

Thus XML, currently, (non-normatively) leaves the question of priority over to HTTP ...

[1] http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
Comment 27 Henri Sivonen 2012-11-27 08:22:02 UTC
(In reply to comment #22)
> The initial impetus for the change appears to be that Trident and WebKit
> supported different behaviour than the spec. If paving the cowpath is the
> main motivator, we have a problem, since Trident (in IE10) prioritises HTTP
> over BOM. 

But Firefox changed to give precedence to the BOM, so we still have quorum (WebKit, Presto, Gecko) in favor of giving the BOM precedence.

> Should we switch back to the previous approach in the light of Trident?

No, we should not flip-flop on this. (I discussed the matter with Travis at TPAC. I’m not sure if the details should be considered Member-confidential, so I don’t write the details here.)

> Has anyone actually discussed Anne's
> proposal with the CSS and XML people?

CSS Level 3 gives the BOM precedence (implemented in WebKit, Presto and Gecko). Test case: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251

For JS, Gecko experienced incompatibility with deployed content when not giving the BOM the precedence in the case of JS, so I fixed Gecko to give the BOM the precedence in the JS case.

Frankly, XML should just follow HTML, CSS and JS behavior for consistency. The worst that can happen is that previously ill-formed docs become well-formed. If someone has a problem with that, I suggest they fight the 5th edition first.

I think this bug should go back into the resolved state without spec changes.
Comment 28 Leif Halvard Silli 2012-11-27 14:15:48 UTC
(In reply to comment #27)
> (In reply to comment #22)

> > Has anyone actually discussed Anne's
> > proposal with the CSS and XML people?
> 
> CSS Level 3 gives the BOM precedence (implemented in WebKit, Presto and
> Gecko). Test case:
> http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251
> 
> For JS, Gecko experienced incompatibility with deployed content when not
> giving the BOM the precedence in the case of JS, so I fixed Gecko to give
> the BOM the precedence in the JS case.
> 
> Frankly, XML should just follow HTML, CSS and JS behavior for consistency.
> The worst that can happen is that previously ill-formed docs become
> well-formed. If someone has a problem with that, I suggest they fight the
> 5th edition first.
> 
> I think this bug should go back into the resolved state without spec changes.

CSS- and JavaScript-files are comparable to external entities in XML, for which XML 1.0 requires the use of the BOM, even for UTF-8 texts, if the replacement text of the external entity is to begin with U+FEFF.

Clearly, the justification for that special rule is a realization of the fact that when a zero width no-break space occurs as the first character of a plain-text file, then, in a markup language biased towards Unicode/UTF-8/UTF-16, a conforming parser *will* interpret that character as a byte order mark.

Thus it isn't necessarily HTML that invented the wheel, here ...
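The point about U+FEFF is easy to verify mechanically: a leading zero width no-break space serializes to exactly the byte sequences a decoder treats as a BOM, so the two are indistinguishable on the wire. A quick check (using Python's codecs purely as a demonstration, not as spec text):

```python
# U+FEFF at the start of a document is byte-identical to a BOM in each
# Unicode encoding form, which is why a decoder cannot tell them apart.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"
assert "\ufeff".encode("utf-16-le") == b"\xff\xfe"

# Python's "utf-8-sig" codec makes the "BOM is metadata, not content"
# reading explicit: the three lead bytes are consumed, not decoded.
assert b"\xef\xbb\xbfhi".decode("utf-8-sig") == "hi"
```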
Comment 29 Travis Leithead [MSFT] 2012-12-10 22:03:10 UTC
> No, we should not flip-flop on this. (I discussed the matter with Travis at
> TPAC. I’m not sure if the details should be considered Member-confidential,
> so I don’t write the details here.)

No, not member confidential. During IE10 development, we switched our behavior to the behavior defined in HTML5 (at the time), for which there was some interop (again, at the time). I would not be in favor of flip-flopping again in the spec. We will do our best to update to the old IE behavior at some point in the future :)
Comment 30 Silvia Pfeiffer 2013-03-18 01:46:41 UTC
Re-closing since there is agreement to follow the reality implemented by the quorum.