* According to HTML5, HTML parsers must at minimum support UTF-8 and Windows-1252. http://dev.w3.org/html5/spec/parsing.html#character-encodings-0
* Whereas according to XML, XML parsers must at minimum support UTF-8 and UTF-16.
* Polyglot Markup, though, "prefers" UTF-8 (based on HTML5's UTF-8 preference, one should think), but otherwise follows the XML approach and permits both UTF-8 and UTF-16.

AS A RESULT, it becomes possible to author "polyglot markup" that works fine in XML parsers, but which isn't required to work in each and every HTML parser. We should not declare markup that isn't required to work in an HTML parser to be "polyglot markup". Hence we should conclude that UTF-16 should not be a recommended encoding for Polyglot Markup.

Discussion:

* It was suggested early on, e.g. by Sam Ruby, that UTF-8 should be the only recommended encoding for polyglot markup. This can be a very useful suggestion: for instance, it would be a very useful way to "force" many HTML editing programs to default to UTF-8, one should think. It also meets HTML5, which says that new documents SHOULD default to UTF-8.
* However, the problem is to justify the *exclusion* of UTF-16 by inference from the specs, because the use of UTF-16 does not seem to break with the principles behind Polyglot Markup, as laid out in its introduction: http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#introduction
* Permission to use UTF-16 in polyglot markup is logical, for instance because:
  - UTF-16 can be reliably detected via the BOM, in both XML and HTML5 (see the sketch after this list);
  - though HTML5 says, quote: "Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, which use the document's character encoding by default", the use of non-UTF-8 encodings probably creates form problems in XML-on-the-web too. Thus XML and HTML are probably in the same boat here, and hence it does not seem logical to hold against UTF-16 that some form submission problems could occur.
* That said, the problems with non-UTF-8 encodings *should* carry *some* weight: e.g. those form submission problems could cause greater problems in XML, and it is a small irritation that it is not permitted/possible to use an explicit character encoding declaration in UTF-16 encoded documents. However, the fact that HTML parsers aren't required to support UTF-16 is a more fundamental nail in the coffin. Can it have any real-world effect? Not so much when it comes to "big" browsers, since they support multiple encodings. But for "simplistic" parsers of different kinds, it could probably have an effect.
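As a minimal sketch of the two technical points above (an illustration only, not taken from any spec; the helper name sniff_bom is hypothetical): BOM-based detection of UTF-16, and encoding-dependent form/URL percent-encoding, here shown with Python's urllib.parse.quote, which takes the target encoding as a parameter.

    from urllib.parse import quote

    def sniff_bom(first_bytes):
        """Return the encoding implied by a leading byte-order mark, if any.

        HTML5's encoding sniffing checks exactly these three BOMs before
        consulting any in-document declaration.
        """
        if first_bytes.startswith(b"\xef\xbb\xbf"):
            return "UTF-8"
        if first_bytes.startswith(b"\xff\xfe"):
            return "UTF-16LE"
        if first_bytes.startswith(b"\xfe\xff"):
            return "UTF-16BE"
        return None  # no BOM: a parser falls back to declarations/heuristics

    # 1. UTF-16 is reliably detectable from its BOM, in both XML and HTML5:
    print(sniff_bom(b"\xff\xfe" + "<!DOCTYPE html>".encode("utf-16-le")))  # UTF-16LE

    # 2. Form submission / URL encoding uses the document's encoding by default,
    #    so the same character ("\u00e9") yields different bytes on the wire:
    print(quote("\u00e9", encoding="utf-8"))   # %C3%A9
    print(quote("\u00e9", encoding="cp1252"))  # %E9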
I've removed all references to UTF-16 within the polyglot spec in the 11 March Editor's Draft. If you log a bug against the HTML5 spec to _require_ UTF-16, please notify me.

Thanks, again, for your continued help.

Eliot
(In reply to comment #1)
> I've removed all references to UTF-16 within the polyglot spec in the 11 March
> Editor's Draft.

Super.

> If you log a bug against the HTML5 spec to _require_ UTF-16, please notify me.

;-)
Mark Davis mentioned to me that UTF-16 documents constitute less than 0.1% of the Web, so it's a fairly marginal encoding. According to http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html (which is more than a year old), the web is 70% ASCII and UTF-8, 20% 8859-1/Win1252/8859-15, and 10% for all other encodings. Scrubbing UTF-16 is clearly the Right Thing.
The Internationalization Core WG has discussed this issue (as I18N-ISSUE-17) and has no objection to making the UTF-16 character encoding invalid for Polyglot documents. See also:

http://lists.w3.org/Archives/Public/public-i18n-core/2011JanMar/0132.html
http://www.w3.org/2011/03/23-i18n-minutes.html#item06
mass-move component to LC1