This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 12950 - Require Byte-Order Mark (BOM) in UTF-8 encoded pages
Summary: Require Byte-Order Mark (BOM) in UTF-8 encoded pages
Status: RESOLVED WONTFIX
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: All
OS: All
Importance: P3 minor
Target Milestone: ---
Assignee: contributor
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-13 23:40 UTC by Leif Halvard Silli
Modified: 2011-08-04 05:06 UTC
CC List: 7 users

See Also:


Attachments

Description Leif Halvard Silli 2011-06-13 23:40:58 UTC
PROBLEMS:

 * Unnoticed: If a UTF-8 encoded page lacks both the BOM and the META charset element, and that page is served via HTTP without the Content-Type charset parameter, then UAs are likely to default to the legacy encoding of the user's locale - such as Windows-1252. This is a problem in itself, and it can go rather unnoticed by the user/author, e.g. if the language of the author is expressible in ASCII letters (a short sketch follows this list).
 * Failing validator: As long as the page carries neither a BOM nor a META charset element, validators are likely to stamp the page as valid, with little or no warning, even though the Content-Type charset parameter carries an incorrect encoding.
    Example: http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Ftesting%2Fhtml5%2Fbom%2Fhtm_BOM-less
 * User agents to a certain degree treat UTF-8 encoded pages loaded via the file:// protocol differently from pages served via the http:// protocol. They may autodetect the encoding over file://, but be more reluctant to autodetect - or to override the charset of - pages served over HTTP. Example: Chrome. Some of these UAs nevertheless tend to obey the BOM both via HTTP and via file://.
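A minimal sketch of the first problem, assuming nothing beyond Python's built-in codecs (the text is a made-up example): the same UTF-8 bytes decoded as UTF-8 and as Windows-1252 differ only once non-ASCII characters appear, which is why the wrong fallback can stay unnoticed for a long time.

    # The same unlabeled UTF-8 bytes read with the intended encoding and
    # with a locale fallback such as Windows-1252.
    data = "Blåbærsyltetøy".encode("utf-8")        # no BOM, no label
    print(data.decode("utf-8"))                    # Blåbærsyltetøy (intended)
    print(data.decode("windows-1252"))             # BlÃ¥bÃ¦rsyltetÃ¸y (locale fallback)

    # For all-ASCII content the two decodings are identical, so the missing
    # label stays invisible until a non-ASCII character is added.
    ascii_only = "Plain English text".encode("utf-8")
    assert ascii_only.decode("utf-8") == ascii_only.decode("windows-1252")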

PROPOSAL: 

 * The spec should say that authoring tools MUST - or at least SHOULD - insert the BOM in UTF-8 encoded pages.
 * The spec should encourage conformance checkers to recommend the UTF-8 BOM whenever the checker (via the HTTP Content-Type charset parameter, the META charset element, or the validator user's encoding override) determines the encoding to be UTF-8.

JUSTIFICATION:

 1. HTTP's priority:
 The UTF-8 BOM would enable conformance checkers to detect whether the HTTP charset parameter is used incorrectly, because:
 * According to HTML5, the optional charset parameter of the HTTP Content-Type header overrides both the locale default and the META charset element (if there is one).
 * But, unless there is a BOM, conformance checkers cannot programmatically determine whether the page is served with a correct Content-Type charset parameter or not. (Because, although it becomes entirely illegible to a human being, a UTF-8 encoded page that lacks the BOM may - technically - also be parsed in a legacy 8-bit encoding.) [Of course the BOM might also be incorrect, but typically it is correct.]
 * In contrast, for a UTF-8 encoded page that does have a BOM, unless the page is actually served and parsed as UTF-8, the BOM will count as an illegal character before the DOCTYPE, which in turn will trigger quirks-mode. This is analogous to XML: unless the document is actually parsed as UTF-8, the BOM counts as illegal characters before the <?xml version="1.0" ?> declaration, and a BOM at the start of a document determined to be in some other encoding should make XML parsers report a fatal error. (A sketch of such a check follows this list.)
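As an illustration of the check this justification has in mind, here is a rough Python sketch; the function name and the messages are hypothetical and not taken from any existing validator.

    import codecs

    def check_declared_charset(body, http_charset=None):
        # Hypothetical conformance-checker rule: a UTF-8 BOM in the body lets
        # the checker flag a Content-Type charset that says otherwise.
        has_utf8_bom = body.startswith(codecs.BOM_UTF8)   # b'\xef\xbb\xbf'
        if has_utf8_bom and http_charset and http_charset.lower() != "utf-8":
            return "error: UTF-8 BOM present, but served as charset=" + http_charset
        if not has_utf8_bom and http_charset is None:
            # Without a BOM (or another declaration) the checker cannot tell
            # whether the bytes are UTF-8 or some legacy 8-bit encoding.
            return "warning: encoding cannot be verified programmatically"
        return "ok"

    print(check_declared_charset(codecs.BOM_UTF8 + b"<!DOCTYPE html>", "iso-8859-1"))
    print(check_declared_charset(b"<!DOCTYPE html>"))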

 2. SIMPLICITY
 * Using an editor which inserts the BOM would be a simplification for the author: he or she would not need to specify the META charset element, and could also drop the HTTP charset parameter.
 * Pages with the BOM could be securely determined to be UTF-8 encoded when stored or moved. In contrast, if the page has no META charset element (which is fully legal) and also no BOM, then one is left to guess.
 * Pages with the BOM would not need the HTTP charset parameter (see the sketch below).
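A small sketch of the editor-side simplification, assuming Python's standard "utf-8-sig" codec and a hypothetical file name: once the BOM is written, the encoding can be recognised from the bytes alone, with no META element and no HTTP charset parameter involved.

    import codecs

    # Save the page the way a BOM-writing editor would.
    with open("page.html", "w", encoding="utf-8-sig") as f:   # writes EF BB BF first
        f.write("<!DOCTYPE html><title>Blåbær</title>")

    # Later, when the file is stored, moved or re-served, the encoding is
    # identifiable from the first three bytes.
    with open("page.html", "rb") as f:
        print(f.read(3) == codecs.BOM_UTF8)    # True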

 3. POLYGLOTNESS
 * The BOM works in both XML and HTML, meaning that the author does not need to use other means that differ with the markup language (XML encoding declaration, META charset element, etc.) - see the sketch below.
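To illustrate the polyglot point, a sketch assuming Python's bundled XML parser and a hypothetical file name: the same BOM-labelled bytes, with no XML encoding declaration and no META charset element, are accepted on the XML side, and the identical file could be served as HTML.

    import xml.etree.ElementTree as ET

    doc = ('<!DOCTYPE html>'
           '<html xmlns="http://www.w3.org/1999/xhtml">'
           '<head><title>Blåbær</title></head>'
           '<body><p>Hei</p></body></html>')

    # The BOM is the only encoding information in the file.
    with open("polyglot.xhtml", "w", encoding="utf-8-sig") as f:
        f.write(doc)

    root = ET.parse("polyglot.xhtml").getroot()   # the parser accepts the UTF-8 BOM
    print(root.tag)                               # {http://www.w3.org/1999/xhtml}html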

POSITIVE EFFECTS:
 * Many editors that default to UTF-8 do not default to using the BOM - this change would encourage them to change.
 * Programmatic detection of HTTP-mislabelled UTF-8 pages.
 * Would encourage moving to UTF-8.
 * Markup-language-independent encoding declaration.
 * We shake off all the myths about the BOM: that it is incompatible with Web browsers, etc. That said, this proposal would also bring attention to the issue, and make XML parsers handle the UTF-8 BOM, to the extent that they do not already handle it.


NOTES:
  * This bug describes an alternative to Bug 12897. That is: instead of making the BOM override the HTTP header, as requested by Bug 12897, this bug suggests making the UTF-8 BOM recommended. This would have some of the same effects as Bug 12897, without introducing changes to RFC 3023.
Comment 1 Aryeh Gregor 2011-06-14 22:53:06 UTC
BOMs consist of invisible characters.  Ensuring you have a BOM, or adding one if you don't, is not trivial -- at a minimum, the procedure will be completely different for different text editors.  Allowing BOMs is fine, but requiring them isn't.  Some people will want to edit HTML as plain text with no magic characters anywhere, and that's a reasonable position.
Comment 2 Leif Halvard Silli 2011-06-15 00:19:20 UTC
(In reply to comment #1)

Regarding "not trivial":   
 * The bug asks *authoring tools* to add the BOM. It should be trivial for editors that are able to support UTF-8 to also add support for the BOM. (And if they don't support the BOM yet, they should add it ASAP.)

Regarding "Allowing BOMs is fine":  
  * If it is fine - a.k.a. "good" - then we should encourage its use. (It *is* good, because it is the simplest way to avoid having to label the encoding via the META element or the HTTP charset parameter. Another reason why it is good is that not serving the file with the right encoding will trigger quirks-mode, which makes it a "soft-draconian" error.)

Regarding: "but requiring them isn't [fine]":
  * It seems to me that you give a reason (namely: the author's will) why the BOM ought to be a SHOULD rather than a MUST. (And you could be right that it should be a SHOULD rather than a MUST. Because, after all, HTML5 allows the encoding to be specified via the HTTP charset parameter as well as the META element.)
Comment 3 Aryeh Gregor 2011-06-15 22:34:16 UTC
"Should" is appropriate for cases where we can say that it's the right thing to do in the large majority of cases, but there may be significant legitimate exceptions.  Whether BOMs are better or worse than <meta> is a matter of opinion, not something we have any reason to take sides on.  Arguably they're worse in most cases, because they're harder for authors to deal with and debug.
Comment 4 Leif Halvard Silli 2011-06-15 23:42:22 UTC
(In reply to comment #3)

> Arguably they're worse in most cases, because they're 
> harder for authors to deal with and debug.

Does this mean that you disagree with HTML5 when it states that tools should default to UTF-8?

]] Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629] [[
   http://dev.w3.org/html5/spec/semantics.html#charset

My expectation is that tools *will* default to UTF-8 and that authors will then *use* UTF-8 - they will not often change to legacy encodings.

Why do you expect that authors will very often change the encoding to a legacy encoding, away from the default? And why do you expect that the tool will not handle the conversion from UTF-8 with BOM to a legacy encoding? And why do you expect that authors will not combine the META charset element and the BOM? Why do you not think that those who omit the META because there is a BOM will know what they are doing?

From my angle, the UTF-8 BOM would make it *simpler* to debug instances where the HTTP charset differs from the encoding of the DOCUMENT, because:

 * There is less reason to debug, because the page will fail less often with respect to encoding.

 * If the UA permits the encoding of a page with a UTF-8 BOM to be changed in the first place (by the user or by HTTP), then it becomes a teachable moment, as this will also simultaneously put the page into quirks-mode.
    (Only Firefox (for HTTP) and Opera (for HTTP and file://) do not permit this, however, and I doubt that more UAs will start to permit the BOM to be overridden, because they would then land in quirks-mode as well as render the page illegible.)
Comment 5 Leif Halvard Silli 2011-06-16 01:24:07 UTC
Example of when the BOM would be useful: http://railssnowman.info

]] This bug exists in IE5, IE6, IE7, and IE8. If the user switches the browser's encoding to Latin-1 (to understand why a user would decide to do something seemingly so crazy, check out this google search), any form submission will be sent in Latin-1.[[

The thing is that when the BOM is used, Internet Explorer does not allow the user to change the encoding. So this bug would not be visible, and the snowman workaround would thus not be necessary.
Comment 6 Henri Sivonen 2011-06-22 07:46:20 UTC
EDITORIAL ASSISTANT'S RESPONSE: This is an Editorial Assistant's Response to 
your comment. If you are satisfied with this response, please change the 
state of this bug to CLOSED. If you have additional information and would 
like the editor to reconsider, please reopen this bug. If you would like to 
escalate the issue to the full HTML Working Group, please add the 
TrackerRequest keyword to this bug, and suggest title and text for the 
tracker issue; or you may create a tracker issue yourself, if you are able 
to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: HTML is designed to be authorable in plain text editors in addition to being authorable using HTML-specific tools. Requiring the UTF-8 BOM would be impractical, because, for backwards compatibility with non-HTML cases, plain text editors generally want the property that the file is all-ASCII if the user has only entered Basic Latin characters, and they need the property that special markers meant for script interpreters (e.g. #! or <?php) remain intact at the start of the file without getting bytes prepended.
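A short sketch of the marker problem mentioned in the rationale, using a hypothetical shell script as the example: interpreters look for markers such as "#!" or "<?php" at byte offset 0, so a prepended BOM hides them, and it also stops an all-ASCII file from staying all-ASCII.

    import codecs

    script = "#!/bin/sh\necho hello\n"
    plain = script.encode("utf-8")
    with_bom = codecs.BOM_UTF8 + plain

    print(plain[:2] == b"#!")       # True: the shebang is recognised
    print(with_bom[:2] == b"#!")    # False: the marker is no longer at offset 0
    print(plain.isascii(), with_bom.isascii())   # True False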
Comment 7 Michael[tm] Smith 2011-08-04 05:06:01 UTC
mass-moved component to LC1