This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 9962 - Permitted Character Encodings should be UTF-8 (recommended) and UTF-16 (permitted)
Summary: Permitted Character Encodings should be UTF-8 (recommended) and UTF-16 (permi...
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML/XHTML Compat. Authoring Guide (ed: Eliot Graff)
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: FPWD
Assignee: Eliot Graff
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/html-xhtml-au...
Whiteboard:
Keywords:
Depends on: 9963
Blocks:
 
Reported: 2010-06-20 21:35 UTC by Leif Halvard Silli
Modified: 2010-10-05 13:07 UTC


Description Leif Halvard Silli 2010-06-20 21:35:39 UTC
Replace the current section about encodings, with something like this:

(The justification is given below this proposal)

]]
3. Character Encodings

For HTML-compatibility, declaring the encoding via the XML declaration is forbidden: it has no effect in HTML and can trigger Quirks-Mode in some HTML parsers. Thus only the default encodings of XML, UTF-8 and UTF-16, are permitted in polyglots, and of these only UTF-8 is a RECOMMENDED encoding. Most HTML parsers, however, default to Windows-1252 or another 8-bit encoding. Thus, for HTML-compatibility, the choice between UTF-8 and UTF-16 MUST be declared.

There are two ways to declare the choice of encoding. Either via the meta charset element, which only has effect in HTML parsers:

<meta charset="utf-8"/>

Or by using the BOM. The BOM has effect in both HTML and XML parsers. But note that using the BOM is reported to have some legacy issues in very old HTML parsers.  

It is not forbidden to use <meta charset="*"/> in combination with BOM, as long as it specifies the same as the BOM.

Specifying the encoding via the <code>meta</code> <code>http-equiv="Content-Type"</code> element is confusing and NOT RECOMMENDED, and SHOULD trigger a warning in polyglot validators, because this element declares the Content-Type to be <code class="MIME">text/html</code>. In rare cases (for example, if a file read via the file URL protocol lacks an xhtml extension), this could affect whether the document is processed as <code>text/html</code> or <code>application/xhtml+xml</code>.

<span class="taken_from_HTML5">Note: Using non-UTF-8 can have unexpected results on form submission and URL encodings, which use the document's character encoding by default.</span> But the reason why the polyglot spec forbids other encodings than UTF-8 and UTF-16 is that, with the exception of using the BOM (which has some legacy issues and which only can be used to declare UTF-8 and UTF-16 encodings), there does not exist any polyglot way to declare the encoding of a document.

When UTF-16 is used, the document should include the BOM indicating UTF-16LE or UTF-16BE. 
[[
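
As a rough illustration of how a BOM signals the encoding in the proposal above, here is a minimal Python sketch. The helper name sniff_bom is hypothetical, and it only looks for the UTF-8 and UTF-16 marks that the proposal permits; it is not how any particular HTML or XML parser implements encoding detection.

```python
import codecs

# Hypothetical helper (not from any spec or library): map a leading
# byte order mark to an encoding label. Only the UTF-8 and UTF-16 BOMs
# are considered, since those are the encodings the proposal permits.
def sniff_bom(data: bytes):
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    # No BOM: the encoding has to be declared some other way.
    return None
```

A document without a BOM yields None here, which is the case where an HTML parser would fall back on the meta element or on its own default.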



JUSTIFICATION: The above proposal aims to solve the following problems with the current text:
---------------------------------------------------

<q>
3. Character Encoding<ins>s</ins>
</q>

JUSTIFICATION: HTML5 uses the plural in its corresponding heading. *And* you do discuss more than a single encoding.

FOR CONSIDERATION: HTML5 has one section ("Character encodings") where it talks about encodings, and another section where it speaks about "Specifying the document's character encoding". This section is about the latter. It might be worth reflecting this in the title. But I don't have any proposal for now.

<q>
A polyglot document uses either UTF-8 or UTF-16, although generally UTF-8 is preferred.
</q>

COMMENT: At the bottom of this section, you say <q>If a polyglot document uses an encoding other than UTF-8 or UTF-16 […]</q>. If other encodings are an option, then saying that they use either UTF-8 or UTF-16 isn't accurate.

<q>If a polyglot document uses UTF-16, it should include the BOM indicating UTF-16LE or UTF-16BE. In addition, a polyglot document need not include the meta charset declaration, because the parser would have to read UTF-16 in order to parse it by definition.</q>

COMMENT: I get the impression that these 2 sentences speak only about UTF-16. However, it is not very clear that this is the case. Also, in the midst of this, you talk about the meta element, which is part of why it is unclear whether you are talking only about UTF-16 or more generally.

<q>
In short, for correct character encoding, a polyglot document must either:
</q>

COMMENT: I wonder about the use of "MUST", at least when I look at what follows.

<q>
Use UTF-8 or UTF-16 with the appropriate BOM.
</q>

COMMENT: It is unclear whether the advice about "appropriate BOM" also relates to UTF-8. Note that the I18N WG claims that there are compatibility issues with regard to BOM, for some legacy user agents – though I must recheck how legacy those user agents are ...

<blockquote>
OR
Use both the XML Declaration and meta tag to specify the appropriate character encoding.
</blockquote>

COMMENT: Using the XML Declaration triggers quirks-mode in legacy IE - in fact, it may trigger quirks even in IE8! (If you do it right – or, if you wish, if you do it "wrong".) [I can document it if you wish.] Therefore perhaps the need to use the XML declaration should be deleted (= only allow UTF-8/UTF-16). The more I think about it, the more I think we should forbid the XML declaration and only allow UTF-8.

<q>If a polyglot document uses an encoding other than UTF-8 or UTF-16, it must include the XML declaration; however, in this case the document must also include the HTML meta tag specifying the character set. When a polyglot document uses both the XML declaration and the HTML meta tag, these must specify the same character and coding.</q>

COMMENT 1: See previous comment. Encodings other than UTF-8/UTF-16 should be forbidden. However, that does not mean that we do not need to specify the use of the meta charset element. Remember that HTML documents default to an 8-bit encoding - most often to Windows-1252.

COMMENT 2: You do not mention the better option: to send the encoding info as an HTTP header. If one does that, then one may in fact skip the XML declaration also for non-UTF-8 encodings.
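
To make the HTTP header option concrete, here is a small Python sketch of how a checking tool might read the charset parameter from a Content-Type value. The function name charset_from_content_type is hypothetical; the parsing itself relies on the standard library's email.message.Message, whose get_content_charset() returns the parameter in lower case, or None when it is absent.

```python
from email.message import Message

# Hypothetical helper: extract the charset parameter from the value of
# an HTTP Content-Type header, or return None when none is given.
def charset_from_content_type(header_value: str):
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value it returns.
    return msg.get_content_charset()
```

With a value such as "text/html; charset=UTF-8" this gives "utf-8"; with a bare "application/xhtml+xml" it gives None, meaning the encoding would have to be signalled inside the document itself.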
Comment 1 Leif Halvard Silli 2010-06-21 00:02:37 UTC
The XML Declaration should in fact be declared forbidden in the section about "Processing Instructions and the XML Declaration"
Comment 2 Eliot Graff 2010-09-27 21:36:18 UTC
Editor's draft of 27 September contains the fixes. I believe I've covered all of the issues here. Please let me know if I've missed anything. Section 3 now reads as follows:


*****************

3. Specifying a Document's Character Encoding
Polyglot markup uses either UTF-8 or UTF-16. UTF-8 is preferred. When polyglot markup uses UTF-16, it must include the BOM indicating UTF-16LE or UTF-16BE. 

Polyglot markup declares character encoding in one of two ways:

By using the BOM. 
In the HTTP header of the response [HTTP11], as in the following: 
Content-type: text/html; charset=utf-8 
or 
Content-type: text/html; charset=utf-16 

Note that polyglot markup may use either text/html or application/xhtml+xml for the value of the content type. 

Using <meta charset="*"/> has no effect in XML. Therefore, polyglot markup may use <meta charset="*"/> in combination with BOM, as long as the meta element specifies the same character encoding as the BOM. In addition, the meta tag may be used in the absence of a BOM as long as it matches the already specified encoding. Note that the W3C Internationalization (i18n) Group recommends always including a visible encoding declaration in a document, because it helps developers, testers, or translation production managers to check the encoding of a document visually.

*****************
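
For illustration only, the rule in the quoted section that a meta charset element must agree with the BOM could be checked along the following lines. This is a Python sketch under stated assumptions: the name meta_matches_bom and the simple regular expression are hypothetical, and a real validator would use a proper HTML parser rather than pattern matching.

```python
import codecs
import re

# Assumed mapping from BOM bytes to the label a matching meta element would carry.
BOMS = [
    (codecs.BOM_UTF8, "utf-8", "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16", "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16", "utf-16"),
]

# Simplified pattern for <meta charset="..."/>; illustrative, not a full parser.
META_RE = re.compile(r'<meta\s+charset\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def meta_matches_bom(data: bytes):
    """Return True/False when a BOM is present, None when there is no BOM."""
    for bom, label, codec in BOMS:
        if data.startswith(bom):
            # The "utf-8-sig" and "utf-16" codecs consume the BOM while decoding.
            text = data.decode(codec)
            match = META_RE.search(text)
            # A missing meta element is fine; a present one must agree with the BOM.
            return match is None or match.group(1).lower() == label
    return None
```

A None result corresponds to the case where there is no BOM, so the meta element would instead have to match an encoding specified elsewhere, such as the HTTP header.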

In addition, in regard to comment 1, Section 2 now reads as follows:

*****************

2. Processing Instructions and the XML Declaration
Processing Instructions and the XML Declaration are both forbidden in polyglot markup. 

*****************

Thanks for your patience.