13392 – i18n-ISSUE-72: BOM as preferred encoding declaration

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Leif Halvard Silli
QA Contact:	HTML WG Bugzilla archive list

Duplicates (2):	16908 19931

Reported:	2011-07-27 17:51 UTC by I18n Core WG
Modified:	2013-09-05 15:26 UTC
CC List:	10 users

Description I18n Core WG 2011-07-27 17:51:36 UTC

3. Specifying a Document's Character Encoding
http://www.w3.org/TR/html-polyglot/#character-encoding

"By using the Byte Order Mark (BOM) character (preferred)."

Although the BOM at the start of an HTML file is recognised by most major browsers these days, there are some issues associated with using it:

For example, a BOM can cause problems with files served using the latest version of PHP, it produces quirks mode in IE6, it overrides HTTP encoding declarations in some browsers (which can be problematic in the case of server-based transcoding), and some authoring tools either don't allow you to set the BOM or don't save with/without the BOM properly.

We are happy to leave the above line in the spec, we would just like to see the text "(preferred)" removed.  

Another reason for this is our preference for visible in-document encoding declarations, which you mention a little further down.
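
For illustration, whether a file was saved "with/without the BOM" can be checked at the byte level. The following is a minimal Python sketch, assuming UTF-8 content; the helper names are hypothetical and are not taken from any tool discussed in this bug.

```python
import codecs

# codecs.BOM_UTF8 is the three-byte UTF-8 signature EF BB BF.


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if there is one."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def add_utf8_bom(data: bytes) -> bytes:
    """Return the data with exactly one leading UTF-8 BOM."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data
```

A check like this only inspects the first three bytes; it says nothing about whether the rest of the file is valid UTF-8.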

Comment 1 Leif Halvard Silli 2011-07-27 23:02:56 UTC

Firstly: it is a myth that IE6 goes into quirks mode because of the BOM. It doesn't. (Just verified again.)

Secondly: a human-readable encoding declaration is not as important as "it just works". Moreover: you should pursue visible encoding declarations separately from this issue.

Thirdly: I don't think you realize the benefits of the BOM. (Read bug 12897 and think.) Hint: the BOM makes browsers immune to encoding override, which is a good thing and even an XML-like feature in itself.

Hence I do not support the proposed change. I would rather speak louder in favor of the BOM.

Comment 2 Leif Halvard Silli 2011-07-27 23:22:16 UTC

as for the PHP problem, please document it with an online test, to demo that it still is a problem in current release of PHP and not just another myth. 

That it overrides the HTTP charset is, if anything, a bug in those browsers or in the respective HTTP spec. For polyglot markup it should not be an issue: it has to be served as UTF-8. And it breaks with this spec to send any other charset header.

Server-based transcoding: well, if it is transcoded then it stops being polyglot markup. So not very relevant.

Comment 3 Henri Sivonen 2011-07-28 08:09:17 UTC

I suggest removing " (preferred)" to avoid a long debate on whether to endorse Leif's preference or the i18n Core WG's preference.

Comment 4 Leif Halvard Silli 2011-07-28 11:18:04 UTC

The PHP claim is a myth AFAICS: e.g. the PHP info script runs fine even if the PHP page includes the BOM. 

Further: many things have been dropped from HTML5 because they do not work in browsers, even though they are useful in scripts. Browsers should have priority with regard to the BOM as well.

Comment 5 Leif Halvard Silli 2011-07-28 11:20:21 UTC

(In reply to comment #3)
> I suggest removing " (preferred)" to avoid a long debate on whether to endorse
> Leif's preference or the i18n Core WG's preference.

With the same perspective, I propose that the i18n group simply drop this bug.

Comment 6 Leif Halvard Silli 2011-07-28 11:59:27 UTC

Unless I misremember, Polyglot Markup already points to an I18N text that recommends a visible encoding declaration. So that issue should be covered. 

To not include the BOM means that an HTTP charset from the server can override the (undeclared) UTF-8 encoding without there being a fatal error in the XML parser. That's a bad thing. But when the BOM is present, this mislabeled HTTP charset header can be discovered both in XML parsers (which should get a fatal error) and in HTML5 parsers (which should land in quirks mode).

That some tools do not handle the BOM properly is a problem with those tools. Without more explanation, I fail to see that it is relevant.
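
The mismatch described in this comment (a UTF-8 BOM in the file, but a different charset in the HTTP header) can be detected mechanically. What follows is a rough Python sketch under stated assumptions: the function name is hypothetical, the header value is assumed to be already extracted from Content-Type, and Python's codec registry is used only as a convenient way to normalise labels such as "utf8" and "UTF-8". It does not model what any particular browser or XML parser does on such a mismatch.

```python
import codecs
from typing import Optional


def http_charset_conflicts_with_bom(body: bytes,
                                    http_charset: Optional[str]) -> bool:
    """True when the body has a UTF-8 BOM but HTTP declares another charset."""
    if not body.startswith(codecs.BOM_UTF8) or http_charset is None:
        # Without a BOM, or without an HTTP charset, there is nothing to compare.
        return False
    try:
        declared = codecs.lookup(http_charset).name
    except LookupError:
        # An unknown label cannot be confirmed as UTF-8, so report a conflict.
        return True
    return declared != "utf-8"
```

Without the BOM the function has nothing to compare against, which is the situation the comment calls "a bad thing": the mislabelling goes unnoticed.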

Comment 7 Leif Halvard Silli 2011-07-28 14:01:50 UTC

Of course: an HTTP charset from the server can override not only an undeclared encoding but also a declared encoding. The same is also the case for HTML parsers.

The issue that some (seemingly *most*) parsers give preference to the BOM over the HTTP charset, the meta charset element *and* the XML encoding declaration, is not limited to HTML parsers but is also the case for most XML parsers. This needs to be fixed in the parsers or in the HTTP spec(s).

However, the reason for promoting the BOM as preferred was not related to that bug/feature (which was only discovered after the text "(preferred)" was added) but to the fact that the BOM is the only in-the-file method that applies to both XML and HTML files. This is a fact that IMHO deserves good justification if you say it should be disregarded.

Henri earlier said that Polyglot Markup should be authored to be a specs subset and not a browsers subset. (My rewording.) From that p.o.v. there should be no problems with promoting the BOM: It *is* the subset of both specs when it comes to in-the-file enc decl.

PS: Conforming XML parsers such as Firefox, Opera and Xmllint (from Libxml2) do not permit changing the encoding from the declared (or default) one to another one. For Webkit browsers and IE, the same behavior is currently linked to the use of the BOM (and, at least for Webkit, this is the case for both HTML and XML). This is what I had in mind when in comment #1 I said that this is, quote, an "XML-like feature in itself".
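
The precedence that this comment attributes to most parsers (BOM first, then the HTTP charset, then an in-document declaration) can be written down as a small Python sketch. This is only an illustration of the ordering as described here, not the algorithm of any real parser; the function name, the parameters and the "windows-1252" fallback are assumptions made for the example.

```python
import codecs
from typing import Optional


def choose_encoding(body: bytes,
                    http_charset: Optional[str] = None,
                    in_document_declaration: Optional[str] = None,
                    fallback: str = "windows-1252") -> str:
    """Pick an encoding label, giving the UTF-8 BOM the highest priority."""
    if body.startswith(codecs.BOM_UTF8):
        # The BOM wins over every other declaration.
        return "UTF-8"
    if http_charset:
        # Next comes the charset parameter of the HTTP Content-Type header.
        return http_charset
    if in_document_declaration:
        # Then the meta charset element or the XML encoding declaration.
        return in_document_declaration
    # Nothing declared: fall back to an (assumed) user or locale default.
    return fallback
```

With this ordering, a BOM-carrying file keeps its UTF-8 label even when the server sends a different charset, which is the override immunity referred to in comment #1.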

Comment 8 Leif Halvard Silli 2011-07-28 15:16:55 UTC

One reason why it is important that encoding overriding is difficult, can be seen here: http://railsnowman.info
and here: http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen

Supposedly, since the BOM means that IE does not allow the encoding to be overridden, authors would not need to use the snowman hack in their Web forms.

So the use of the BOM -in my book- is a perfect example of the "HTML with helmets" that Sam once portrayed Polyglot Markup as.

Comment 9 I18n Core WG 2011-07-29 13:07:10 UTC

I have been preparing PHP demos and other stuff, but have run out of time. Will post on my return in 10 days or so.

Comment 10 Henri Sivonen 2011-07-29 13:25:57 UTC

(In reply to comment #7)
> Henri earlier said that Polyglot Markup should be authored to be a specs
> subset and not a browsers subset. (My rewording.)

Yes, but...

> From that p.o.v. there should
> be no problems with promoting the BOM: It *is* the subset of both specs when it
> comes to in-the-file enc decl.

...it doesn't follow that Polyglot Markup should then promote things you like within the subset.

Comment 11 Addison Phillips 2011-07-29 15:37:46 UTC

I don't think the argument is that BOM should be removed altogether. What the I18N WG is asking for is that it not (perhaps erroneously) be considered the "preferred" option. 

Leaving aside whether this or that browser or tool responds well to BOM, the BOM is invisible when properly handled and a problem when visible. Visible encoding declarations (when correct) make page encoding easier to work with for humans.

Specifying which one has priority and how to interpret each is the job of Polyglot, but the "preferred" is unnecessary and may actually depend on the user's tools and environment.

Comment 12 Leif Halvard Silli 2011-08-01 14:03:11 UTC

COMPROMISE PROPOSAL:

* The text "(preferred)" was Eliot's addition which, however, was quite compatible with the arguments I presented along with my original spec text proposal - as such I endorsed it/did not speak against it.
* But I am personally happy to state the facts and let the authors draw the conclusions themselves. As such, I can see that the current text - with its "(preferred)" - states a preference without proper justification within the spec.

Hence, instead of the I18N Group's proposed change, I would like to suggest the following, which helps the reader's understanding more:

 1)	REPLACE: 
"By using the Byte Order Mark (BOM) character (preferred)."
	WITH:
"By using the Byte Order Mark (BOM) character, which is an encoding
 signature that both XML and HTML parsers are required to support."

<!--NOTE: the phrase 'encoding signature' stems from XML 1.0
    http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding -->


 2)	REPLACE:
"By using <meta charset="UTF-8"/> (the HTML encoding declaration)."
	WITH:
"By using <meta charset="UTF-8"/> (the HTML encoding declaration) and
 thus, for XML parsers, rely on XML´s encoding default (see above)."

<!--NOTE: 'XML´s encoding default' is explained in the spec, one para
    above - and was also in my original proposal, see bug 12062. -->

The above changes state the facts about each method, in a minimum amount of text.


Now, some replies to the I18N Group, to Henri and to Addison:


Reply to Comment #9 - the I18N Group: 

It will be great to see the PHP test - and I don't mind putting it in the spec somewhere as long as we can also mention the problems of the <meta charset="UTF-8"/> method. For my own part, I use a PHP based CMS where I had no problems adding the BOM.



Reply to Henri - Comment #10:

> ...it doesn't follow that Polyglot Markup should 
> then promote things you like within the subset.

The polyglot facts/subset/principles say that
   a) the XML and HTML DOMs should be identical, 
   b) the syntax should be legal and necessary in HTML and XML
It follows that a feature (the BOM) that has the same effect in both XML and HTML is a stricter subset of XML and HTML than a feature that has effect only in HTML (and which needs the HTML5 spec´s "permission" to appear in the XML serialization).



In reply to Addison - Comment #11

> I don't think the argument is that BOM should be removed altogether. 

Nonetheless, bug 12062, which is the basis for what the spec currently says, was titled "UTF-8 BOM should not be forbidden in Polyglot Markup". Because as of the 14th of February this year, the BOM was for some reason forbidden (I wonder if the I18N Group had a finger in that).

> What the I18N WG is asking for is that it not (perhaps erroneously) 
> be considered the "preferred" option. 

Perhaps it was an error of you to suggest that it might have been an error? ;-) I don't see that it creates problems for anyone - not even for those tools which do not support the BOM.  But if you can live with my compromise proposal, then I don't need to defend "(preferred)". 

However, I would like to point out that the current spec text explicitly does *not* state that one should only use one of the - several - encoding declaration options. Instead the spec says:

	"… in the following ways, which may be used
	   separately or in combination: …"

From my POV, "(preferred)" is a recommendation to use the BOM - nothing more or less. Thus it would not, as the spec stands, have been a spec violation to not use the BOM or to combine it with the visible declaration or HTTP.

(From my POV, I think I would always - at least for HTML - include both an external encoding declaration in HTTP as well as at least one internal - currently that seems necessary in order to be on the safe side. But the Polyglot Spec currently does not deal with such detailed advice. Should it?)

> Leaving aside whether this or that browser or tool responds well to BOM,

We cannot completely leave that aside, when you yourself bring in (half-true) claims about negative effects.

> the
> BOM is invisible when properly handled and a problem when visible.

Did you mean '… and visible if not properly handled.'? For the record: Opera has a bug in which it swallows the BOM even if the page is ISO-8859-1 encoded. Thus, it is also a problem to not make it visible, when it should be visible.

> Visible
> encoding declarations (when correct) make page encoding easier to 
> work with for humans.

From my POV, there are enough 'visible' notifications of the encoding: browsers report the encoding in one of their menus. And editors report the encoding in a toolbar or otherwise. And they all also tend to read the BOM as an encoding declaration.

Still I have not protested against the fact that the Polyglot spec points to the i18n group's recommendation to use visible encoding declarations. It is, as I understand it, not endorsed by the spec - the spec merely cites the i18n group's claim that it is helpful, and lets the author decide whether this consideration is something he or she wants to take note of. 

This is OK with me also despite the fact that I think it is an advantage to, when possible, only declare the encoding once. Because when something has to be declared more than once, then there is always risk that the multiple declarations get out of sync.

> Specifying which one has priority and how to interpret each is the job of
> Polyglot, but the "preferred" is unnecessary and may actually depend on the
> user's tools and environment.

The BOM is not the only feature that relies on the tools and the environment. All methods - the HTTP charset, the BOM and the meta charset element - depend on those factors.

E.g. I have more than once been using editors which did not understand the new, HTML5 <meta charset=charset > declaration element. 

Examples: 

	* The HTML parser inside libxml2 (try xmllint on the command line) does not understand <meta charset="UTF-8"/> but instead defaults to ISO-8859-1. *However*, if the document includes the BOM *or* the legacy <meta@content-type> encoding declaration, libxml2's HTML parser still succeeds in detecting the encoding as UTF-8.

	* The iCab web browser, before it switched to using Webkit, supported the BOM, but did not support <meta charset="UTF-8"/>.

	* I'm sure I could find several more browsers and tools. 

Thus, the BOM sometimes has better support than the new meta charset element. This is because the BOM is both XML-compatible and HTML-compatible *and* because it is older than the HTML5 encoding declaration. 

	Question: Why isn't the lack of back-compatibility for the new <meta charset="UTF-8"/> a concern for you? Note that the Polyglot spec currently says that polyglot documents do not use the legacy encoding declaration (even though using it in non-polyglot HTML is fully tolerated, without any warning). Perhaps even Polyglot Markup should tolerate the legacy <meta@http-equiv> variant?

	PS: I don't want to hide facts: I so far know about 3 parsers which make the BOM visible: the text browser Lynx inserts an empty paragraph on top of the page. Elinks inserts an empty *line* on top of the page. Links inserts a paragraph with an <unknown> character inside. Also, the very outdated IE5.x for Mac behaves similarly to Links, but without going into quirks mode. (For the record: the text browsers netrik and w3m do not have this problem.)
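
Why a BOM can become visible at all is easy to show at the byte level: the three bytes EF BB BF are ordinary printable characters when the page is decoded with a legacy single-byte encoding. The Python sketch below is only an illustration under that assumption; it does not model how Lynx, Elinks or Links actually render the page, and the sample markup is made up.

```python
import codecs

# A UTF-8 page that starts with the BOM (sample content is hypothetical).
page = codecs.BOM_UTF8 + "<p>æøå</p>".encode("utf-8")

# A BOM-aware UTF-8 decoder consumes the signature silently.
as_utf8_sig = page.decode("utf-8-sig")

# A decoder that treats the bytes as ISO-8859-1 turns the BOM into three
# visible characters (U+00EF, U+00BB, U+00BF) ahead of the markup.
as_latin1 = page.decode("iso-8859-1")
leading_garbage = as_latin1[:3]
```

In the second case the characters "ï»¿" end up as text before the first tag, which is the kind of stray content the text browsers mentioned above display.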

Comment 13 Michael[tm] Smith 2011-08-04 05:06:47 UTC

mass-move component to LC1

Comment 14 Michael[tm] Smith 2011-08-04 05:07:10 UTC

mass-move component to LC1

Comment 15 Leif Halvard Silli 2011-08-05 02:10:22 UTC

Please be aware that, unlike Gecko, at least Webkit and Opera do not fully support XML's encoding default: 

1) manually set your browser's default encoding to - for example - ISO-8859-5, by selecting it in the encoding menu
2) open a bomless HTML page - the page will (as expected) be displayed in ISO-8859-5. 
3) Now open a bomless XHTML page

RESULTS:
 -Gecko = UTF-8
 -Opera = ISO-8859-5
 -Webkit = ISO-8859-5
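
The Gecko result corresponds to XML's encoding default: with no BOM and no XML encoding declaration, the bytes are to be read as UTF-8, whatever the user's default encoding is. As a rough illustration outside the browser, Python's standard xml.etree.ElementTree parser can be used; the sample element and its content are made up for the example.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes with no BOM and no XML encoding declaration.
raw = "<p>Ж</p>".encode("utf-8")

# The XML parser applies the UTF-8 default and recovers the original text.
xml_text = ET.fromstring(raw).text

# Decoding the same bytes as ISO-8859-5 (the "user default" in the test
# above) gives different characters instead of the intended letter.
legacy_text = raw.decode("iso-8859-5")
```

The Opera and Webkit results listed above match the second decoding rather than the first.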

Comment 16 Leif Halvard Silli 2011-08-11 14:21:26 UTC

(In reply to comment #15)

> RESULTS:
>  -Gecko = UTF-8
>  -Opera = ISO-8859-5

Opera bug reported: DSK-339975 (no public interface)

>  -Webkit = ISO-8859-5

Webkit Bug reported: https://bugs.webkit.org/show_bug.cgi?id=66055
(Btw, also related:  https://bugs.webkit.org/show_bug.cgi?id=66056)

Comment 17 Leif Halvard Silli 2011-08-13 20:11:34 UTC

(In reply to comment #16)
> (In reply to comment #15)

> >  -Webkit = ISO-8859-5

> (Btw, also related:  https://bugs.webkit.org/show_bug.cgi?id=66056)

See comment number 13: <https://bugs.webkit.org/show_bug.cgi?id=66056#c13>

As it turns out, if a Polyglot Markup file is interpreted as XHTML, and is the subframe of an HTML page, then Webkit (including Chrome) as well as Opera currently inherit the encoding of the "mother" HTML page. 

TESTS:

* HTML (Win-1252) file with polyglot subframe (w/HTML encoding declaration):
   http://malform.no/testing/html5/bom/frame

* HTML (Win-1252) file with polyglot subframe (w/HTTP charset):
   http://malform.no/testing/html5/bom/frame2

* HTML (Win-1252) file with polyglot subframe (w/BOM):
   http://malform.no/testing/html5/bom/frame3


Only the BOM file works 100% reliably. E.g. if you manually change the encoding of the HTML file, then Webkit/Chrome/Opera will override the encoding - except if you have the BOM, then Webkit/Chrome won't do that.

Of course, the Webkit/Chrome/Opera behaviour is bogus.

Comment 18 Mark Simon 2011-11-03 23:22:20 UTC

According to the Unicode Consortium:

Comment 19 Leif Halvard Silli 2012-05-03 19:00:29 UTC

*** Bug 16908 has been marked as a duplicate of this bug. ***

Comment 20 Henry S. Thompson 2012-07-11 14:58:58 UTC

(In reply to comment #12)
The XML Working Group, who raised 16908, are happy with this proposed resolution.

Comment 21 Addison Phillips 2012-07-11 19:54:08 UTC

(In reply to comment #20)
> (In reply to comment #12)
> The XML Working Group, who raised 16908, are happy with this proposed
> resolution.

Do you mean the proposed resolution in comment #12? It isn't clear from the comment. I believe that I18N will also be happy with that proposal as a resolution. FWIW, we agree with the comments that XML made in 16908.

Comment 22 Henry S. Thompson 2012-07-11 20:26:30 UTC

(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #12)
> > The XML Working Group, who raised 16908, are happy with this proposed
> > resolution.
> 
> Do you mean the proposed resolution in comment #12? It isn't clear from the
> comment. 

Yes, comment #12, as it says at the top of my comment!

Comment 23 Mark Simon 2012-07-12 00:04:56 UTC

(Sorry for the duplication; my original comment seems to have been truncated).

According to the Unicode Consortium:

“Use of a BOM is neither required nor recommended for UTF-8”
http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf

Given the issues surrounding the use of the BOM, I would agree with the
suggestion in Comment 12, that the word (preferred) be removed, but more
explanation be given.

Comment 24 I18n Core WG 2012-07-12 11:32:56 UTC

There seems to be convergence on the proposal to change the first two sub-bullets to say:

- By using the Byte Order Mark (BOM) character, which is an encoding
 signature that both XML and HTML parsers are required to support.
- By using <meta charset="UTF-8"/> (the HTML encoding declaration) and
 thus, for XML parsers, rely on XML´s encoding default (see above).

I just offer an editorial suggestion. There is already a paragraph immediately after the bulleted list that says that "The HTML encoding declaration has no effect in XML. When the HTML encoding declaration is the only encoding declaration, the encoding default from XML makes XML parsers treat content as UTF-8."

So why not just say:

[[
- Within the document:
  - By using the Byte Order Mark (BOM) character
  - By using <meta charset="UTF-8"/> (the HTML encoding declaration)
- Outside the document 
  ...

Both XML and HTML parsers are required to support the byte order mark. The HTML encoding declaration has no effect in XML. When the HTML encoding declaration is the only encoding declaration, the encoding default from XML makes XML parsers treat content as UTF-8. 
]]

Comment 25 Leif Halvard Silli 2012-07-12 13:35:41 UTC

(In reply to comment #24)

Either way is OK for me.

Comment 26 Leif Halvard Silli 2012-11-10 18:49:14 UTC

*** Bug 19931 has been marked as a duplicate of this bug. ***

Comment 27 bugz.ate.my.horse 2012-11-10 19:56:42 UTC

(Sorry for the dup. bug. Apparently my search skills were insufficient.) 
Two more reasons not mentioned here for not preferring the BOM are that it makes otherwise ASCII text not ASCII, and that it is not supported by Java for UTF-8 (see the two links in dup. Bug 19931). 
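
Both points can be illustrated with a few lines of Python (not Java, so this only shows the general effect and makes no claim about Java's own behaviour): prepending the BOM makes a pure-ASCII byte string non-ASCII, and a UTF-8 decoder that does not treat the BOM as a signature hands it back as a leading U+FEFF character. The variable names are made up for the example.

```python
import codecs

plain = b"hello"
with_bom = codecs.BOM_UTF8 + plain

# The BOM bytes EF BB BF are outside the ASCII range.
plain_is_ascii = plain.isascii()
with_bom_is_ascii = with_bom.isascii()

# The plain "utf-8" codec keeps the BOM as U+FEFF in the decoded text;
# the "utf-8-sig" codec strips it.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
```

Code that compares or parses the decoded text then has to cope with the extra U+FEFF unless it picks a BOM-aware decoder.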

As for PHP, a comment on the site suggests that the BOM causes problems:
<http://www.php.net/manual/en/function.header.php#107930> I am investigating.

Comment 28 Leif Halvard Silli 2013-09-04 13:06:02 UTC

I have removed the “preferred” comment, and I basically used the wording that Richard suggested in comment 24. The changes will be visible in my next commit.

Comment 29 Leif Halvard Silli 2013-09-05 15:26:14 UTC

Fixed in the editor's draft.