10890 – i18n comment : Allow utf-16 meta encoding declarations

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2010-09-30 18:56 UTC by I18n Core WG
Modified:	2014-02-27 21:24 UTC
CC List:	9 users

Description I18n Core WG 2010-09-30 18:56:10 UTC

Comment from the i18n review of:
http://dev.w3.org/html5/spec/

Comment 8
At http://www.w3.org/International/reviews/0802-html5/
Editorial/substantive: S
Tracked by: RI

Location in reviewed document:
4.2.5.5 Specifying the document's character encoding
[http://www.w3.org/TR/2010/WD-html5-20100624/semantics.html#charset]

Comment: Currently you are not allowed to use <meta charset="utf-16"> or the equivalent pragma directive in UTF-16 encoded documents. While it is not logically needed to identify the character encoding, the prohibition introduces a special case for authors to remember, and almost certainly many authors will be unaware that this is disallowed and will do it anyway. In addition, in-document declarations of this kind are particularly useful for developers, testers, or translation production managers who want to visually check the encoding of a document (since the BOM cannot be seen). Furthermore, there would appear to be no risk incurred by allowing this, since the document would be encoded in UTF-16 anyway.

Note that the ask is not that the encoding of the document be determined by the meta element (the BOM remains the way of determining that information), solely that no error or warning be raised if the meta element is used.

Please make an exception in the spec for utf-16 so that it is allowed to use <meta charset="utf-16"> or the equivalent pragma directive in utf-16 encoded documents.

Comment 1 Henri Sivonen 2010-10-01 07:18:38 UTC

I'm strongly opposed to ever making <meta charset="utf-16"> conforming.

When the file is actually encoded in UTF-16, <meta charset="utf-16"> has no effect. If authors think it has an effect, their mental model of how character decoding works is severely faulty. Failing to notify authors of the faulty model will lead them to act on it and shoot themselves or their users in the foot later.

When the file is not actually encoded in UTF-16, <meta charset="utf-16"> means the same as <meta charset="utf-8"> and is a clear authoring error. It would be completely illogical not to flag it as an error. That Web compat requires this UA behavior is evidence of authors getting things wrong when they try to use <meta charset="utf-16">.
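As an illustration, the two cases in this comment can be modelled with a minimal Python sketch. It is a simplification under stated assumptions, not the HTML5 encoding-sniffing algorithm itself: the function name `effective_encoding` and its `meta_label` parameter are hypothetical, and transport-level declarations, UTF-32 and the details of the meta prescan are ignored.

```python
import codecs

def effective_encoding(data, meta_label=None):
    """Hypothetical, simplified model of the behaviour described above."""
    # A BOM takes precedence over any in-document declaration,
    # so a meta element in a real UTF-16 file changes nothing.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    if meta_label is None:
        return "unknown"  # assumption: no fallback or sniffing modelled
    label = meta_label.strip().lower()
    # If the meta element could be read without a BOM, the bytes are in an
    # ASCII-superset encoding, so a UTF-16 label is treated as UTF-8.
    if label in ("utf-16", "utf-16le", "utf-16be"):
        return "utf-8"
    return label

markup = '<meta charset="utf-16">'
real_utf16 = codecs.BOM_UTF16_LE + markup.encode("utf-16-le")
mislabelled = markup.encode("ascii")

print(effective_encoding(real_utf16, "utf-16"))   # meta has no effect
print(effective_encoding(mislabelled, "utf-16"))  # meta is read as UTF-8
```

In neither branch does the declaration itself select UTF-16, which is the point being made here.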

Comment 2 I18n Core WG 2010-10-01 10:07:02 UTC

(In reply to comment #1)

> When the file is not actually encoded in UTF-16, <meta charset="utf-16"> means
> the same as <meta charset="utf-8"> and is a clear authoring error. It would be
> completely illogical not to flag it as an error. That Web compat requires this
> UA behavior is evidence of authors getting things wrong when they try to use
> <meta charset="utf-16">.

I am not proposing any change to the spec where the file is not actually encoded in UTF-16. I was careful to say 'in UTF-16 encoded documents'.

> When the file is actually encoded in UTF-16, <meta charset="utf-16"> has no
> effect. 

Exactly. So it's not an issue for character detection. However, my point is that there are usability issues. (a) Without a meta element you can't tell the encoding by visual inspection. (b) People will continue to use these meta elements for UTF-16. If there is no harm in it, why force them to change their code? (c) Because the UTF-16 rules for meta are different from other encodings, the author has to always remember to handle UTF-16 in a special way. Why force validators to always check and educational materials to always explain a special exception for meta elements in UTF-16 encoded documents when a <meta charset=utf-16> in such a document does no harm?

Comment 3 Henri Sivonen 2010-10-04 11:42:06 UTC

(In reply to comment #2)
> Exactly. So it's not an issue for character detection. However, my point is
> that there are usability issues. (a) Without a meta element you can't tell the
> encoding by visual inspection. 

My point is that you actually can't inspect this visually. If you open a file in a text editor and you see <meta charset="utf-16"> how do you know whether:
 1) The file had a UTF-16 BOM and was encoded in UTF-16 and the meta has no effect.
OR
 2) The file didn't have a UTF-16 BOM and was encoded in an ASCII-superset encoding and the meta would make a UA treat the file as being UTF-8-encoded.
?
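A small Python sketch of why the two cases look the same in an editor. The sample markup and the variable names are illustrative assumptions; the editor is assumed to decode each file correctly and to hide the BOM.

```python
import codecs

markup = '<meta charset="utf-16">'

# Case 1: the file really is UTF-16 and starts with a BOM.
case1 = codecs.BOM_UTF16_LE + markup.encode("utf-16-le")
# Case 2: the file has no BOM and is in an ASCII-superset encoding.
case2 = markup.encode("ascii")

# What a text editor would display after decoding each file.
shown1 = case1.decode("utf-16")  # the "utf-16" codec consumes the BOM
shown2 = case2.decode("utf-8")

same_text = shown1 == shown2
different_bytes = case1 != case2
print(same_text, different_bytes)
```

The displayed text is identical even though the underlying bytes differ, so seeing the meta element on screen does not tell the reader which case applies.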

> (b) People will continue to use these meta elements for UTF-16. 

I think we should try to change things so that people will use UTF-8 and not continue to use UTF-16.

> (c) Because the UTF-16 rules for meta are different from other encodings,
> the author has to always remember to handle UTF-16 in a special way.

This will not be a problem if authors always use UTF-8 and, as a result, don't use UTF-16.

Comment 4 I18n Core WG 2010-10-07 18:22:06 UTC

(In reply to comment #3)
> My point is that you actually can't inspect this visually. If you open a file
> in a text editor and you see <meta charset="utf-16"> how do you know whether:
>  1) The file had a UTF-16 BOM and was encoded in UTF-16 and the meta has no
> effect.
> OR
>  2) The file didn't have a UTF-16 BOM and was encoded in an ASCII-superset
> encoding and the meta would make a UA treat the file as being UTF-8-encoded.

You can't be sure, no. But then you can't be sure that any encoding declaration is correct. That is not a reason to disallow it, since doing so makes life more difficult for people who do the right thing.

> I think we should try to change things so that people will use UTF-8 and not
> continue to use UTF-16.

I agree, but we can't preclude it because it's not forbidden by the spec.

> This will not be a problem if authors always use UTF-8 and, as a result, don't
> use UTF-16.

I agree, but some people will still use UTF-16. Having said that, we're talking about less than 0.01% of web pages here according to a recent Google survey of 6.5 billion pages (against over 50% using UTF-8, and almost 70% using either UTF-8 or ASCII). My guess is that those few people using UTF-16 are technically aware enough to pay attention to things like this. I don't buy that incorrect labelling is such a serious problem.  

On the other hand, if we don't allow charset=utf-16, then every tutorial, every primer, every book, every checker, etc, has to make a detour to explain how UTF-16 is different from anything else when telling people how to use encoding related markup, which is annoying for both the writer and the reader given that we don't want people to use it anyway.

Comment 5 Aryeh Gregor 2010-10-07 21:33:39 UTC

(In reply to comment #4)
> On the other hand, if we don't allow charset=utf-16, then every tutorial, every
> primer, every book, every checker, etc, has to make a detour to explain how
> UTF-16 is different from anything else when telling people how to use encoding
> related markup, which is annoying for both the writer and the reader given that
> we don't want people to use it anyway.

No, all those tutorials/primers/books/checkers just have to tell people to use UTF-8.  Then the problem doesn't arise.  I doubt any of them actually cover character encodings beyond saying "Use UTF-8" -- the complexity of other encodings is just not needed today.

Comment 6 Ian 'Hixie' Hickson 2010-10-12 09:45:22 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: This case is made illegal because "here be dragons": specifying <meta charset=utf-16> either has no effect or is explicitly interpreted as UTF-8. In no case does it have the effect it appears to have, namely to declare the encoding as being UTF-16. Therefore we would not be doing anyone any favours by allowing it.

Comment 7 Julian Reschke 2010-10-12 09:52:46 UTC

For the record: I disagree with the "decision"; we already have special cases around here, and either way a special case is needed here. From my point of view, leaving it conformant when the document indeed is encoded in UTF-16 is more helpful than forbidding it.

I'd recommend that the I18N WG re-open the issue.

Comment 8 Addison Phillips 2014-02-27 21:24:35 UTC

The I18N WG resolved on 2014-02-27 not to reopen the issue. While there is some agreement with the sentiment in Comment 7, we don't feel that further discussion will result in changes and the non-prevalence of UTF-16 makes this issue irrelevant for further discussion.