15142 – Define "UNICODE" as a defacto alias for "UTF-16"

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	PC All

Importance:	P3 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/spec/parsing#...
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-12-11 06:33 UTC by Leif Halvard Silli
Modified:	2011-12-14 00:47 UTC
CC List:	7 users

See Also:

Attachments

Description Leif Halvard Silli 2011-12-11 06:33:09 UTC

PROPOSAL:

   Define "UNICODE" as a de facto alias for "UTF-16".

   This has 3 implications for encoding determination:

   (1) When "UNICODE" is found inside a data: URI or inside an HTTP 
        Content-Type: header, then parse the resource as UTF-16 encoded.


   (2) When "UNICODE" occurs inside <meta charset=*> or
        the Content-Type pragma, then treat it as "UTF-8", as 
        specced in the Encoding Sniffing Algorithm's following step:

]] 13. If charset is a UTF-16 encoding, change the value of charset to UTF-8. [[


   (3) When "UNICODE" occurs inside an XML file, then treat it as
        a legal encoding name that nevertheless gets ignored
        (meaning that parsers default instead
        to UTF-16 or UTF-8).
        Of course, HTML5 doesn't say how XML parsing should
        work, but I mention it for completeness.
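
A minimal sketch of cases (1) and (2) of the proposal, in Python. The names `ALIASES`, `UTF16_FAMILY` and `resolve_label` are hypothetical and chosen for illustration only; case (3), XML, is left out because the label is simply ignored there.

```python
# Hypothetical sketch of the proposed label handling; not taken from any spec or browser.
ALIASES = {"unicode": "utf-16"}            # the proposed de facto alias
UTF16_FAMILY = {"utf-16", "utf-16le", "utf-16be"}


def resolve_label(label: str, source: str) -> str:
    """Map a declared charset label to the encoding a parser would use.

    source is "transport" for an HTTP Content-Type header or a data: URI,
    and "meta" for <meta charset> or the Content-Type pragma.
    """
    normalized = label.strip().lower()
    canonical = ALIASES.get(normalized, normalized)
    if source == "meta" and canonical in UTF16_FAMILY:
        # Step 13 of the encoding sniffing algorithm, as quoted above:
        # a UTF-16 label found in <meta> is changed to UTF-8.
        return "utf-8"
    return canonical
```

With this sketch, `resolve_label("UNICODE", "transport")` yields `"utf-16"`, while the same label with source `"meta"` yields `"utf-8"`.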



JUSTIFICATION:

   JUSTIFICATION for HTML:
   ==============================================================

*  <meta charset=UNICODE > works like <meta charset=UTF-8> 
    in Webkit (Chrome/Safari) and IE (I checked IE6-IE9). 
    This makes sense once one realises that they see it as a synonym
    for UTF-16. (Opera/Firefox do not yet behave this way.)

*  IE (MSHTML) may save pages with the following charset declaration,
    either by default or via the user's interaction:
    <META content="text/html; charset=unicode" http-equiv=Content-Type>
   (taken from: <http://lists.whatwg.org/pipermail/help-whatwg.org/attachments/20091203/e117921b/attachment.htm>)
    In the Save menu of IE8, there are two menu items with the value
    "UNICODE" - probably one is UTF-8 and one is UTF-16.

*  There are (thus) numerous pages on the Web which use "charset=UNICODE". 
    - Opera's MAMA project lists 'UNICODE' as the 29th most used value
       http://devfiles.myopera.com/articles/575/metacenc-url.htm
    - 150,000 Google hits: http://tinyurl.com/charset-unicode 
    - scraping the Web would find many, many more


   JUSTIFICATION for Higher protocols (HTTP),  XML and data URIs:
   ==============================================================

   The value "UNICODE" is treated

 * as UTF-16 (for HTTP, in XML and in data: URIs) by IE
 * as UTF-16 (for HTTP and in data: URIs) by Webkit
 * as a legal but ignored encoding name (for XML) by Firefox/Webkit/Opera
 * as unknown, causing the locale default (for HTTP and data: URIs) by Firefox/Opera

    NOTE: For HTTP Content-Type: and for data: URIs, "UNICODE"
               is treated (by IE and Webkit) as "UTF-16" regardless of 
               whether the document serialisation is HTML or XML.


IANA registration ?

    "UNICODE" should probably be registered as an official alias for "UTF-16": http://www.iana.org/assignments/character-sets

Comment 1 Glenn Adams 2011-12-11 14:34:20 UTC

this proposal should be rejected for a variety of reasons:

(1) it is not (or should not be) the prerogative of the HTML specification to define a character encoding scheme (form) or to associate a label with an existing scheme (form); rather, HTML is a consumer (user) of externally defined character encoding schemes (forms);

(2) the Unicode Consortium is the appropriate forum for considering the possible registration of any new character encoding based upon the Unicode Character Set; [1]

(3) the term "UNICODE" is a registered trademark of the Unicode Consortium, governed by the Unicode trademark usage policy [2];

(4) the Unicode Standard expressly DOES NOT equate the character encoding "UTF-16" with the term "UNICODE", and does not give priority to any of the standard three Unicode Encoding Forms (UTF-8, UTF-16, UTF-32); see, e.g., Section 3.9 of [3]

(5) nonetheless, according to Brian Carpenter, former chair of the Internet Architecture Board (IAB), "The IETF has made the Unicode-compatible UTF-8 format of ISO 10646 the basis for its preferred default character encoding for internationalization of Internet application protocols" [4]

[1] http://www.unicode.org/
[2] http://www.unicode.org/policies/logo_policy.html 
[3] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
[4] http://www.unicode.org/announcements/pr-3.0.html

regards,
glenn adams
former technical director, unicode consortium

Comment 2 Julian Reschke 2011-12-11 14:58:07 UTC

(In reply to comment #1)
> this proposal should be rejected for a variety of reasons:

I agree with this conclusion, but...
 
> (2) the Unicode Consortium is the appropriate forum for considering the
> possible registration of any new character encoding based upon the Unicode
> Character Set; [1]

...then how come UTF-8 is defined in an IETF specification?

Best regards, Julian

Comment 3 Glenn Adams 2011-12-11 15:15:57 UTC

(In reply to comment #2)
> (In reply to comment #1)
> > this proposal should be rejected for a variety of reasons:
> 
> I agree with this conclusion, but...
> 
> > (2) the Unicode Consortium is the appropriate forum for considering the
> > possible registration of any new character encoding based upon the Unicode
> > Character Set; [1]
> 
> ...then how come UTF-8 is defined in an IETF specification?
> 
> Best regards, Julian

a historical circumstance... in the 1992-93 time frame, ISO SC2/WG2 first proposed UTF-1 as a transformation encoding of ISO/IEC 10646 UCS-4;  although UTF-1 never caught on, the more efficient alternative, UTF-8, came out of work started at X/Open and concluded at Bell Labs in Plan 9;

later, the Unicode Standard incorporated the normative definition of UTF-8 into The Unicode Standard;
the current IETF RFC 3629 (STD 63) [1] refers to the Unicode Standard for the formal definition of UTF-8:

3.  UTF-8 definition

   UTF-8 is defined by the Unicode Standard [UNICODE].  Descriptions and
   formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]

[1] http://tools.ietf.org/html/rfc3629#section-3

glenn

Comment 4 Julian Reschke 2011-12-11 16:18:28 UTC

(In reply to comment #3)
> a historical circumstance... in the 1992-93 time frame, ISO SC2/WG2 first
> proposed UTF-1 as a transformation encoding of ISO/IEC 10646 UCS-4;  although
> UTF-1 never caught on, the more efficient alternative, UTF-8, came out of work
> started at X/Open and concluded at Bell Labs in Plan 9;
> 
> later, the Unicode Standard incorporated the normative definition of UTF-8 into
> The Unicode Standard;
> the current IETF RFC 3629 (STD 63) [1] refers to the Unicode Standard for the
> formal definition of UTF-8:
> 
> 3.  UTF-8 definition
> 
>    UTF-8 is defined by the Unicode Standard [UNICODE].  Descriptions and
>    formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]
> 
> [1] http://tools.ietf.org/html/rfc3629#section-3
> 
> glenn

Point taken, but not convinced. For all practical purposes, UTF-8 is defined by RFC 3629. That's where people look. Also, RFC 3629 doesn't even link to another definition. So where is the definition by the Unicode consortium, and why isn't it referenced?

Also, a more general point: I would hope that all future definitions of character encoding schemes in the IANA registry are based on the Unicode code points, even those which can not represent all code points. The procedure for IANA charset registrations is in IETF BCP 19, which doesn't even mention Unicode, as far as I can tell.

Comment 5 Glenn Adams 2011-12-11 16:41:17 UTC

(In reply to comment #4)
> (In reply to comment #3)
> > a historical circumstance... in the 1992-93 time frame, ISO SC2/WG2 first
> > proposed UTF-1 as a transformation encoding of ISO/IEC 10646 UCS-4;  although
> > UTF-1 never caught on, the more efficient alternative, UTF-8, came out of work
> > started at X/Open and concluded at Bell Labs in Plan 9;
> > 
> > later, the Unicode Standard incorporated the normative definition of UTF-8 into
> > The Unicode Standard;
> > the current IETF RFC 3629 (STD 63) [1] refers to the Unicode Standard for the
> > formal definition of UTF-8:
> > 
> > 3.  UTF-8 definition
> > 
> >    UTF-8 is defined by the Unicode Standard [UNICODE].  Descriptions and
> >    formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]
> > 
> > [1] http://tools.ietf.org/html/rfc3629#section-3
> > 
> > glenn
> 
> Point taken, but not convinced. For all practical purposes, UTF-8 is defined by
> RFC 3629. That's where people look. Also, RFC 3629 doesn't even link to another
> definition. So where is the definition by the Unicode consortium, and why isn't
> it referenced?

Did you read the first paragraph in RFC 3629 Section 3 [1] (which I quoted above)?

> Also, a more general point: I would hope that all future definitions of
> character encoding schemes in the IANA registry are based on the Unicode code
> points, even those which can not represent all code points. The procedure for
> IANA charset registrations is in IETF BCP 19, which doesn't even mention
> Unicode, as far as I can tell.

Different national administrations have different priorities. There will always remain character encodings not based on the Unicode Character Set, for legacy reasons if no others. 

The Unicode Consortium does not maintain a character encoding scheme registry. IANA does. However, the Unicode Consortium does own the term "UNICODE", so if someone wishes to register this term as a charset value, they need to take it up with the Unicode Consortium, and not with the HTML WG. But I would suggest they would be wasting their time, since it is extremely unlikely the Unicode Consortium would choose to enter such registration (for some of the reasons I have cited as well as others).

Comment 6 Glenn Adams 2011-12-11 16:47:16 UTC

(In reply to comment #4)
> So where is the definition by the Unicode consortium, and why isn't
> it referenced?

Section 3.9 [1]. It is referenced by RFC 3629. See, e.g., sections 3 and 4 of 3629. Also see the note at end of 3629 section 4:

"NOTE -- The authoritative definition of UTF-8 is in [UNICODE].  This
   grammar is believed to describe the same thing Unicode describes, but
   does not claim to be authoritative.  Implementors are urged to rely
   on the authoritative source, rather than on this ABNF."

[1] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
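
For readers who want to see what the definition amounts to, here is a minimal sketch of the UTF-8 bit distribution in Python. The function name `encode_utf8` is hypothetical, and the code is an illustration only; as the note quoted above says, the authoritative source is the Unicode Standard.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        # Surrogates and values above U+10FFFF are not scalar values.
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        # One byte: 0xxxxxxx (identical to ASCII).
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])
```

The result can be compared against Python's built-in encoder, for example `encode_utf8(0x20AC) == "\u20ac".encode("utf-8")`.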

Comment 7 Julian Reschke 2011-12-11 16:53:18 UTC

(In reply to comment #5)

> > Point taken, but not convinced. For all practical purposes, UTF-8 is defined by
> > RFC 3629. That's where people look. Also, RFC 3629 doesn't even link to another
> > definition. So where is the definition by the Unicode consortium, and why isn't
> > it referenced?
> 
> Did you read the first paragraph in RFC 3629 Section 3 [1] (which I quoted
> above)?

Yes, "Unicode" is mentioned, but there's no reference that takes me to the actual definition.

In the meantime I noticed that UTF-8 is indeed defined in <http://unicode.org/versions/Unicode5.2.0/ch03.pdf>, and I believe it would be good to add an erratum to RFC 3629 pointing out that a revision should actually *reference* the Unicode definition.


> > Also, a more general point: I would hope that all future definitions of
> > character encoding schemes in the IANA registry are based on the Unicode code
> > points, even those which can not represent all code points. The procedure for
> > IANA charset registrations is in IETF BCP 19, which doesn't even mention
> > Unicode, as far as I can tell.
> 
> Different national administrations have different priorities. There will always
> remain character encodings not based on the Unicode Character Set, for legacy
> reasons if no others. 
> 
> The Unicode Consortium does not maintain a character encoding scheme registry.
> IANA does. However, the Unicode Consortium does own the term "UNICODE", so if
> someone wishes to register this term as a charset value, they need to take it
> up with the Unicode Consortium, and not with the HTML WG. But I would suggest
> they would be wasting their time, since it is extremely unlikely the Unicode
> Consortium would choose to enter such registration (for some of the reasons I
> have cited as well as others).

I agree that HTML is the wrong place to start. The registry is maintained by IANA, and how to get values into the registries is defined by an IETF BCP. I don't see a requirement to go through the Unicode Consortium.

That being said, I do agree that using the string "Unicode" as character encoding scheme name is a bad idea. I'm not sure about "ownership" of names though, if IANA would need to reject any registration for a "charset" name where somebody claims to "own" the name, the whole process might get very complicated :-).

Comment 8 Leif Halvard Silli 2011-12-11 22:06:23 UTC

(In reply to comment #2)
> (In reply to comment #1)
> > this proposal should be rejected for a variety of reasons:
> 
> I agree with this conclusion, but...

Regarding "this proposal": this bugzilla report could be said to include 3 proposals:

(1) The main proposal is to require the HTML5 parser, when it sees charset="UNICODE" (upper- or lowercase), to replace it with charset="UTF-16" (which in turn gets replaced with "UTF-8" if it occurs inside an HTML document). This is in order to a) be compatible with "the Web", b) support the shift to Unicode in general and UTF-8 in particular, by c) making sure that content that is intended to be Unicode is treated as Unicode by all HTML5 user agents.

(2) Secondly, it suggests that charset="UNICODE" should be non-conforming in HTML5 documents - authors should be allowed to use it. This in fact goes without saying, as it is even forbidden, per HTML5, to use <meta charset="UTF-16" > in a HTML document.

(3) Finally I took up whether the alias should be formally registered. However,  I suppose that even if it became formally registered, the recommended name of this encoding would remain "UTF-16".  For instance, Validator.nu whines if you use <meta charset="ANSI_X3.4-1968"> instead of <meta charset="US-ASCII"> as it is only the latter that is a recommended encoding name. I would expect the same behaviour for <meta charset="UNICODE">, regardless of whether it became registered.

QUESTIONS: Which of these 3 proposals are you disagreeing with?  And what are the pros and cons of registering? Julian, is it only that this is "the wrong place" that is the problem for you? Glenn, why has the Unicode Consortium been quietly watching the spread of "UNICODE" as an unofficial alias for "UTF-16"? 

W.r.t. registering, here are some thoughts: One reason to *not* register "UNICODE" is the fact that it isn't supposed to be conforming anyway. However, this doesn't seem particularly strong, as it would most certainly be non-conforming to use it regardless.

I am very willing to send an e-mail to the right authority to ask them to consider whether "UNICODE" should become an alias (not recommended, but still an alias) for "UTF-16". It is so far unclear to me whom to contact, though.

Comment 9 Leif Halvard Silli 2011-12-11 22:27:39 UTC

Sorry, this:

(In reply to comment #8)
> authors should be allowed to use

was a typo. Please read it as "should not".

Comment 10 Glenn Adams 2011-12-12 00:46:51 UTC

(In reply to comment #8)
> (1) The main proposal is to require the HTML5 parser, when it sees
> charset="UNICODE" (upper- or lowercase), to replace it with charset="UTF-16"
> (which in turn gets replaced with "UTF-8" if it occurs inside an HTML document).
> This is in order to a) be compatible with "the Web", b) support the shift to
> Unicode in general and UTF-8 in particular, by c) making sure that content that
> is intended to be Unicode is treated as Unicode by all HTML5 user agents.

If an HTML representation of an HTML5 document (not an XML representation) specifies either

<meta charset="UTF-16">

or

<meta charset="UNICODE">

it is effectively in violation of 4.2.5.5 [1]:

"If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the Encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding."

[1] http://dev.w3.org/html5/spec/Overview.html#character-encoding-declaration

This is because "a UTF-16 encoding" [2], whether it is labeled explicitly as "UTF-16" or labeled with a hypothetical alias "UNICODE" is not an "ASCII-compatible character encoding" [3].

[2] http://dev.w3.org/html5/spec/Overview.html#a-utf-16-encoding
[3] http://dev.w3.org/html5/spec/Overview.html#ascii-compatible-character-encoding
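
A small illustration, in Python, of why a UTF-16 encoding cannot be ASCII-compatible. The variable names are hypothetical; the point is only that ASCII markup serialized as UTF-16 no longer consists of the ASCII bytes.

```python
# Illustration only: compare how ASCII markup is serialized in UTF-8 and UTF-16.
markup = '<meta charset="UTF-16">'

ascii_bytes = markup.encode("ascii")
utf8_bytes = markup.encode("utf-8")
utf16_bytes = markup.encode("utf-16-le")  # raw little-endian, no BOM

# In UTF-8 the bytes are identical to the ASCII bytes.
utf8_matches_ascii = utf8_bytes == ascii_bytes

# In UTF-16LE every ASCII character is followed by a 0x00 byte,
# so the byte sequence differs from ASCII ...
utf16_matches_ascii = utf16_bytes == ascii_bytes

# ... and a byte-oriented search for b"<meta" finds nothing.
byte_scan_finds_meta = b"<meta" in utf16_bytes
```

Here `utf8_matches_ascii` is true, while `utf16_matches_ascii` and `byte_scan_finds_meta` are both false.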

So, what you appear to be describing is parser behavior when processing an HTML representation of an HTML5 document that violates the constraint cited above in [1]. Is that correct?

If that is the case, then are you suggesting a change in the semantics or language of the "encoding sniffing algorithm" [4]?

[4] http://dev.w3.org/html5/spec/Overview.html#encoding-sniffing-algorithm

Even if you are suggesting a change in [4], it does not appear any change would be necessary in the first case, since any use of <meta charset="UTF-16"> or any logical equivalent would only come into play in step 5. sub-step 13.

"If charset is a UTF-16 encoding, change the value of charset to UTF-8."

However, since this language does not define what is meant by "if charset is a UTF-16 encoding", an implementation could interpret this flexibly.

That is, sub-step 13 does not say something like:

"If the value of the charset attribute is an ASCII case-insensitive match of an IANA-registered name or alias of a UTF-16 encoding, ..."

rather, the language of sub-step 13 simply says:

"If charset is a UTF-16 encoding..."

leaving it to the imagination of the reader (and the vagaries of the implementation) to interpret this as desired, including an interpretation that permits recognizing aliases that are not IANA-registered.
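
The two readings of sub-step 13 can be sketched in Python as follows. All names here are hypothetical, and the set of registered names is a short illustrative list, not a complete copy of the IANA registry.

```python
import string

# Illustrative list only; the IANA registry is the authority on names and aliases.
REGISTERED_UTF16_NAMES = {"utf-16", "utf-16be", "utf-16le"}
# Hypothetical extra labels an implementation might choose to recognise.
UNREGISTERED_UTF16_LABELS = {"unicode"}

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(value: str) -> str:
    """Lowercase only A-Z, which is what an ASCII case-insensitive match needs."""
    return value.translate(_ASCII_LOWER)


def is_utf16_strict(charset: str) -> bool:
    """Strict reading: only registered names count as a UTF-16 encoding."""
    return ascii_lower(charset) in REGISTERED_UTF16_NAMES


def is_utf16_lenient(charset: str) -> bool:
    """Flexible reading: unregistered aliases are accepted as well."""
    return is_utf16_strict(charset) or ascii_lower(charset) in UNREGISTERED_UTF16_LABELS
```

Under the strict reading `"UNICODE"` is not a UTF-16 encoding; under the flexible reading it is.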

Note that any use of step 5 sub-step 13 occurs only when (1) there is no user specified encoding override, (2) there is no transport layer supplied character encoding metadata, and (3) there is no BOM.
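
The order of precedence described above can be sketched in Python. This is a simplified illustration, not the spec's algorithm: the function name `sniff_encoding`, its parameters and the `windows-1252` fallback are assumptions, and the `<meta>` prescan is replaced by a `meta_charset` argument that is assumed to have been extracted already.

```python
import codecs
from typing import Optional

UTF16_FAMILY = {"utf-16", "utf-16le", "utf-16be"}

# Byte order marks checked before any <meta> declaration is considered.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)


def sniff_encoding(
    data: bytes,
    user_override: Optional[str] = None,
    transport_charset: Optional[str] = None,
    meta_charset: Optional[str] = None,
    default: str = "windows-1252",  # assumed fallback; real UAs use a locale default
) -> str:
    if user_override:                      # (1) user-specified override wins
        return user_override.lower()
    if transport_charset:                  # (2) then transport-layer metadata
        return transport_charset.lower()
    for bom, name in BOMS:                 # (3) then a BOM
        if data.startswith(bom):
            return name
    if meta_charset:                       # step 5: value found by the prescan
        label = meta_charset.strip().lower()
        if label in UTF16_FAMILY:          # sub-step 13
            return "utf-8"
        return label
    return default
```

Note that in this sketch a `meta_charset` of `"unicode"` is returned unchanged, because nothing tells the function that the label denotes a UTF-16 encoding.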

Overall, I have to wonder at the utility of your proposal, whether or not such an alias exists de facto or de jure.

If there is a bug here, it is probably that sub-step 13 does not refer to the language in 8.2.2.2 [5], especially the 3rd and 4th paragraphs.

[5] http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

In general, I oppose your proposal on the grounds that it is already inconsistent with the spirit of 4.2.5.5 [1] cited above.

As for registering an alias independently of what HTML5 makes use of it, the Unicode Consortium would be the appropriate party to take up that issue, not the HTML WG. I have forwarded a link to this thread to the Unicode Consortium  in case they wish to address this matter further. I can't comment on their possible position on the issue of registering "UNICODE" as an alias for "UTF-16", but I would speculate that they may not support the idea.

Regards,
Glenn

Comment 11 Leif Halvard Silli 2011-12-12 05:31:48 UTC

(In reply to comment #10)
 
> So, what you appear to be describing is parser behavior when processing an HTML
> representation of an HTML5 document that violates the constraint cited above in
> [1]. Is that correct?

Yes. That is my primary concern.
 
> If that is the case, then are you suggesting a change in the semantics or
> language of the "encoding sniffing algorithm" [4]?
> 
> [4] http://dev.w3.org/html5/spec/Overview.html#encoding-sniffing-algorithm

Yes, either to sub-step 13 or to the link in sub-step 13 - see below.
 
> Even if you are suggesting a change in [4], it does not appear any change would
> be necessary in the first case, since any use of <meta charset="UTF-16"> or any
> logical equivalent would only come into play in step 5. sub-step 13.
> 
> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
> 
> However, since this language does not define what is meant by "if charset is a
> UTF-16 encoding", an implementation could interpret this flexibly.

There is a link, on the wording "UTF-16 encoding", to the following text:

   "The term a UTF-16 encoding refers to any variant of UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without a BOM, raw UTF-16LE, and raw UTF-16BE. [RFC2781]"

Because of the phrase "value of charset", it is natural to think that it *does* refer to valid encoding names, such as "UTF-16", "UTF-16LE" or "UTF-16BE". It does not seem natural to include "UNICODE" in the above unless something explicitly says that one should link it.

It is the charset value that is supposed to be - or represent - "a UTF-16 encoding". And unless one knows and acknowledges that "UNICODE" represents a "UTF-16 encoding", we can only hope that UAs will treat it as such .... 

It has been said about HTML5 that it should be specific enough that it is possible to build a Web compatible parser based on it. And it could seem as if charset="UNICODE" is necessary to mention for that reason.

> That is, sub-step 13 does not say something like:
> 
> "If the value of the charset attribute is an ASCII case-insensitive match of an
> IANA-registered name or alias of a UTF-16 encoding, ..."
> 
> rather, the language of sub-step 13 simply says:
> 
> "If charset is a UTF-16 encoding..."
> 
> leaving it to the imagination of the reader (and the vagaries of the
> implementation) to interpret this as desired, including an interpretation that
> permits recognizing aliases that are not IANA-registered.

That's a possibility. But see my last point above.
 
> Note that any use of step 5 sub-step 13 occurs only when (1) there is no user
> specified encoding override, (2) there is no transport layer supplied character
> encoding metadata, and (3) there is no BOM.

If you download such a page and open it from the hard disk in Firefox, it will default to the locale encoding instead of to UTF-8.

W.r.t. BOM: Hm, yes, it could seem as if MSHTML tends to add the BOM whenever the "UNICODE" charset is used. So that's a thing that perhaps diminishes the problem compared to the alternative - that MSHTML did not add the BOM. 
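
A short Python illustration of why a BOM makes the charset label largely irrelevant. The file content and the helper `has_utf16_bom` are hypothetical; the bytes are constructed here rather than taken from a real MSHTML-saved page.

```python
import codecs

# Hypothetical content of a page saved as UTF-16 with a byte order mark.
html = '<meta http-equiv="Content-Type" content="text/html; charset=unicode">'
saved_with_bom = codecs.BOM_UTF16_LE + html.encode("utf-16-le")
saved_without_bom = html.encode("utf-16-le")


def has_utf16_bom(data: bytes) -> bool:
    """Return True if the data starts with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# Python's "utf-16" codec reads the BOM, picks the byte order and strips it,
# so the declared label never has to be consulted.
decoded = saved_with_bom.decode("utf-16")
```

Without the BOM, a consumer has only the label (or a guess) to go on, which is the case where the meaning of "UNICODE" matters.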

Btw, it seems like e.g. BBEdit/Textwrangler (the famous Macintosh text editor) recognizes "UNICODE" to mean "UTF-16".

For UTF-8 encoded pages, an understanding of what "UNICODE" means allows e.g. Validator.nu to give specific advice, like "Replace UNICODE with UTF-8" instead of only "replace UNICODE with a valid name".
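
As a sketch of that idea in Python: the function `charset_advice` and its messages are hypothetical and do not reproduce Validator.nu's actual behaviour or wording.

```python
def charset_advice(declared_label: str, detected_encoding: str) -> str:
    """Return a hint for a page that declares charset=UNICODE (illustrative only).

    detected_encoding is assumed to be the encoding the checker has
    determined the bytes are really in.
    """
    if declared_label.strip().lower() != "unicode":
        return ""  # this sketch only handles the "UNICODE" label
    if detected_encoding.strip().lower() == "utf-8":
        # Knowing what authors mean by "UNICODE" allows a specific suggestion.
        return 'Replace "UNICODE" with "UTF-8".'
    return 'Replace "UNICODE" with a valid encoding name.'
```

For a UTF-8 page that declares `charset=unicode`, this returns the specific suggestion rather than the generic one.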
 
> Overall, I have to wonder at the utility of your proposal, whether or not such
> an alias exists de facto or de jure.

You will find quite a lot of author confusion around "UNICODE" as an encoding name. But the ultimate proof is of course a page that gets interpreted correctly in Webkit and IE but not in Firefox and Opera. I suppose: seek and you shall find.
 
> If there is a bug here, it is probably that sub-step 13 does not refer to the
> language in 8.2.2.2 [5], especially the 3rd and 4th paragraphs.
> 
> [5] http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

Perhaps "UNICODE" should be added to that Character Encoding Overrides table there ...
 
> In general, I oppose your proposal on the grounds that it is already
> inconsistent with the spirit of 4.2.5.5 [1] cited above.

It is clear that it is already illegal to use charset=UNICODE - we don't need to change anything for that to be clear. But my proposal does not make it any more legal. It instead helps us to have an authoritative answer w.r.t. how to help authors who mistakenly use charset=UNICODE.
 
> As for registering an alias independently of what HTML5 makes use of it, the
> Unicode Consortium would be the appropriate party to take up that issue, not
> the HTML WG. I have forwarded a link to this thread to the Unicode Consortium 
> in case they wish to address this matter further. I can't comment on their
> possible position on the issue of registering "UNICODE" as an alias for
> "UTF-16", but I would speculate that they may not support the idea.

Thank you for notifying them!

Comment 12 Ian 'Hixie' Hickson 2011-12-14 00:47:19 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: The spec defers to IANA's registry for these things. Please register the alias there.