This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 11423 - Character sets not registered with IANA
Summary: Character sets not registered with IANA
Status: RESOLVED WONTFIX
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
Reported: 2010-11-28 18:49 UTC by brian m. carlson
Modified: 2011-08-04 05:02 UTC

Description brian m. carlson 2010-11-28 18:49:19 UTC
In the section "Determining the character encoding", windows-949 is provided as a default, but it is not registered with IANA.  HTML5 should not be encouraging people to use a character set that the creator has not even bothered to register with IANA.  It's not like registering a character set with IANA is a particularly difficult or drawn-out process, and it guarantees that there is some reliable (and hopefully semi-permanent) documentation about the mapping between the character set and Unicode for potential implementers.

Furthermore, the next section states that "User agents must support the preferred MIME name of every character encoding they support, and should support all the IANA-registered names and aliases of every character encoding they support."  It is obviously impossible to comply with this, since windows-949 does not have a preferred MIME name, due to its lack of registration with IANA.

I must therefore object to suggesting or encouraging the use of windows-949 until it has been registered appropriately with IANA.
Comment 1 Benjamin Hawkes-Lewis 2010-11-28 19:16:41 UTC
(In reply to comment #0)
> HTML5 should not be encouraging
> people to use a character set that the creator has not even bothered to
> register with IANA.

It doesn't.

"Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings."

http://dev.w3.org/html5/spec/semantics.html#character-encoding-declaration

I suspect that "valid character encoding name" is supposed to require listing in the IANA registry, so declaring windows-949 may not even be conforming even though applying windows-949 is required for backwards compatibility.

> It's not like registering a character set with IANA is a particularly difficult or drawn-out process 

And yet Microsoft's attempt to do so (back in 2005) seems to have failed:

http://mail.apps.ietf.org/ietf/charsets/msg01510.html

> Furthermore, the next sections states that "User agents must support the
> preferred MIME name of every character encoding they support, and should
> support all the IANA-registered names and aliases of every character encoding
> they support."  It is obviously impossible to comply with this, since
> windows-949 does not have a preferred MIME name, due to its lack of
> registration with IANA.

It's trivial to comply with this, since "preferred MIME name" is defined by the spec as "the name or alias labeled as 'preferred MIME name' in the IANA Character Sets registry, if there is one, or the encoding's name, if none of the aliases are so labeled". The name of windows-949 is "windows-949".

http://dev.w3.org/html5/spec/infrastructure.html#preferred-mime-name
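
The quoted definition can be read as a two-step lookup. The following is a minimal sketch of that reading, not the spec's algorithm: `RegistryEntry`, `SAMPLE_REGISTRY`, and `preferred_mime_name` are hypothetical names, and the registry data is an illustrative stand-in rather than a copy of the IANA registry.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical model of one entry in the IANA Character Sets registry."""
    name: str
    aliases: Tuple[str, ...] = ()
    # The alias labeled "preferred MIME name", if the entry has one.
    preferred: Optional[str] = None


# Illustrative sample data only (assumption), keyed by lowercased name.
SAMPLE_REGISTRY = {
    "iso_8859-1:1987": RegistryEntry(
        "ISO_8859-1:1987", ("ISO-8859-1", "latin1"), preferred="ISO-8859-1"
    ),
    "ks_c_5601-1987": RegistryEntry("KS_C_5601-1987", ("KSC_5601", "korean")),
}


def preferred_mime_name(encoding_name: str) -> str:
    """Return the labeled preferred MIME name, else the encoding's name."""
    entry = SAMPLE_REGISTRY.get(encoding_name.lower())
    if entry is not None and entry.preferred is not None:
        return entry.preferred
    if entry is not None:
        return entry.name
    # Unregistered encoding: the only name available is the one supplied.
    return encoding_name
```

Under this sketch, `preferred_mime_name("windows-949")` returns `"windows-949"` simply because that is the name the caller passed in; nothing in the lookup itself selects one spelling of an unregistered encoding over another.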

Moreover, UAs aren't even required to support Windows-949:

"User agents must at a minimum support the UTF-8 and Windows-1252 encodings, but may support more."

http://dev.w3.org/html5/spec/parsing.html#character-encodings-0
 
> I must therefore object to suggesting or encouraging the use of windows-949
> until it has been registered appropriately with IANA.

Maybe try registering it? Perhaps you'll have better luck than Microsoft.
Comment 2 brian m. carlson 2010-11-28 19:52:14 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > HTML5 should not be encouraging
> > people to use a character set that the creator has not even bothered to
> > register with IANA.
> 
> It doesn't.

When a user agent would otherwise use an encoding given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it *must* instead use the encoding given in the cell in the second column of the same row. When a byte or sequence of bytes is treated differently due to this encoding aliasing, it is said to have been misinterpreted for compatibility. (Emphasis mine.)

EUC-KR and KS_C_5601-1987 are mapped onto windows-949.  I think a "must" directive is definitely an encouragement, even if you don't.
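
To make the effect of that requirement concrete, here is a rough sketch of the aliasing step. It includes only the two rows named in this thread, and it assumes that Python's `cp949` codec is an adequate stand-in for windows-949; `ENCODING_OVERRIDES`, `effective_encoding`, `decode_as_ua`, and `strict_euc_kr_fails` are hypothetical names.

```python
# Only the override rows mentioned in this thread.
ENCODING_OVERRIDES = {
    "euc-kr": "windows-949",
    "ks_c_5601-1987": "windows-949",
}

# Assumption: Python's "cp949" codec stands in for windows-949.
PYTHON_CODECS = {"windows-949": "cp949", "euc-kr": "euc_kr"}


def effective_encoding(label: str) -> str:
    """Apply the override table to a declared encoding label."""
    key = label.strip().lower()
    return ENCODING_OVERRIDES.get(key, key)


def decode_as_ua(data: bytes, label: str) -> str:
    """Decode the way a UA following the override table would."""
    encoding = effective_encoding(label)
    return data.decode(PYTHON_CODECS.get(encoding, encoding))


def strict_euc_kr_fails(data: bytes) -> bool:
    """Report whether a plain EUC-KR decoder rejects the bytes."""
    try:
        data.decode("euc_kr")
    except UnicodeDecodeError:
        return True
    return False


# A Hangul syllable that is not among the 2,350 precomposed syllables
# of KS X 1001, so its cp949 bytes fall outside the EUC-KR byte ranges.
sample = "갂".encode("cp949")
```

With these definitions, `decode_as_ua(sample, "EUC-KR")` yields the original syllable, while `strict_euc_kr_fails(sample)` is true: the same bytes under the same label are readable only when the label is treated as windows-949.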

> > It's not like registering a character set with IANA is a particularly difficult or drawn-out process
> 
> And yet Microsoft's attempt to do so (back in 2005) seems to have failed:
> 
> http://mail.apps.ietf.org/ietf/charsets/msg01510.html

Probably because, as the responses indicate, the specifications for those character sets were insufficient and contradictory.  It doesn't matter what exactly the reason is; it's not registered.  HP, IBM, and Adobe have managed to do it, so I'm sure that it's not impossible or unreasonably difficult.

> It's trivial to comply with this, since "preferred MIME name" is defined by the
> spec as "the name or alias labeled as 'preferred MIME name' in the IANA
> Character Sets registry, if there is one, or the encoding's name, if none of
> the aliases are so labeled". The name of windows-949 is "windows-949".

I believe "if there is one" means "if there is a name or alias labeled as 'preferred MIME name'", not "if there is an entry in the IANA Character Sets registry".  Even if we were to use your suggested interpretation, there are other names for this character set, such as "CP949".  How are we to know what the preferred name is if it's not IANA-registered?

> "User agents must at a minimum support the UTF-8 and Windows-1252 encodings,
> but may support more."

Right, but if they support EUC-KR or KS_C_5601-1987, they are effectively required to.  (Actually, the spec seems to prohibit the useful implementation of EUC-KR, since it's mandated that user agents use something else instead.)  If it's acceptable to support EUC-KR and not windows-949, then the spec should so state.

> > I must therefore object to suggesting or encouraging the use of windows-949
> > until it has been registered appropriately with IANA.
> 
> Maybe try registering it? Perhaps you'll have better luck than Microsoft.

I'm really not interested in registering what amount to platform-specific character sets.  Plus, since I don't use that platform, I have no knowledge about what the mapping should look like or whether it is correct.  Finally, there are numerous character sets in existence that handle Korean just fine, including UTF-8, and I don't see the need to add more.
Comment 3 Benjamin Hawkes-Lewis 2010-11-29 01:00:08 UTC
(In reply to comment #2): 
> EUC-KR and KS_C_5601-1987 are mapped onto windows-949.  I think a "must"
> directive is definitely an encouragement, even if you don't.

Oh, I thought by "encouraging" you referred to things a spec can realistically influence (like future authoring) as opposed to UA behaviour required to process an existing web corpus. HTML5 can't retrospectively change the corpus.

Anyhow, as I read the spec, a conforming UA is free to fail to process documents labeled as EUC-KR and KS_C_5601-1987 on the basis that HTML5 maps them to Windows-949 for backwards compatibility with the web corpus, and it happens not to support Windows-949.

> > > It's not like registering a character set with IANA is a particularly difficult or drawn-out process
> > 
> > And yet Microsoft's attempt to do so (back in 2005) seems to have failed:
> > 
> > http://mail.apps.ietf.org/ietf/charsets/msg01510.html
> 
> Probably because, as the responses indicate, the specifications for those
> character sets were insufficient and contradictory.  It doesn't matter what
> exactly the reason is; it's not registered. HP, IBM, and Adobe have managed to
> do it, so I'm sure that it's not impossible or unreasonably difficult.

Big Blue managed to do it, so it's easy? Your standard of proof may be lower than mine here. ;)

> I believe "if there is one" means "if there is a name or alias labeled as
> 'preferred MIME name'", not "if there is an entry in the IANA Character Sets
> registry".

Hmm. I think your reading is correct. :(

> Even if we were to use your suggested interpretation, there are
> other names for this character set, such as "CP949".  How are we to know what
> the preferred name is if it's not IANA-registered?
> 
> > "User agents must at a minimum support the UTF-8 and Windows-1252 encodings,
> > but may support more."
> 
> Right, but if they support EUC-KR or KS_C_5601-1987, they are effectively
> required to.  (Actually, the spec seems to prohibit the useful implementation
> of EUC-KR, since it's mandated that user agents use something else instead.)

The spec effectively:

   - prohibits implementing EUC-KR or KS_C_5601-1987;
   - allows implementing Windows-949;
   - requires mapping of EUC-KR or KS_C_5601-1987 to Windows-949, but does not require UAs to actually process such documents.
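
The three points above can be sketched as a decoder-selection step. This is one possible reading under stated assumptions, not text from the spec: `OVERRIDES`, `select_decoder`, `MINIMAL_UA`, and `KOREAN_UA` are hypothetical names, and the supported sets are examples.

```python
from typing import FrozenSet, Optional

# Only the override rows discussed in this thread.
OVERRIDES = {"euc-kr": "windows-949", "ks_c_5601-1987": "windows-949"}


def select_decoder(label: str, supported: FrozenSet[str]) -> Optional[str]:
    """Return the encoding the UA decodes with, or None if it cannot
    process a document carrying this label."""
    key = label.strip().lower()
    # The override is applied first, so a label of EUC-KR never selects
    # an EUC-KR decoder, whatever the UA claims to support.
    encoding = OVERRIDES.get(key, key)
    return encoding if encoding in supported else None


# The minimum set quoted earlier in the thread, and a UA that adds windows-949.
MINIMAL_UA = frozenset({"utf-8", "windows-1252"})
KOREAN_UA = MINIMAL_UA | {"windows-949"}
```

In this reading, `select_decoder("EUC-KR", MINIMAL_UA)` returns `None` (the UA declines the document), `select_decoder("EUC-KR", KOREAN_UA)` returns `"windows-949"`, and a UA whose supported set contains only `"euc-kr"` still gets `None`.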
 
> > > I must therefore object to suggesting or encouraging the use of windows-949
> > > until it has been registered appropriately with IANA.
> > 
> > Maybe try registering it? Perhaps you'll have better luck than Microsoft.
> 
> I'm really not interested in registering what amount to platform-specific
> character sets. 

Assuming you're interested in user agents being able to process the existing web corpus using only IANA-registered character sets, you perhaps should have some level of interest in doing so. ;)

> Finally, there are numerous character sets in existence that handle Korean just fine,
> including UTF-8, and I don't see the need to add more.

Which is why the spec recommends authors use UTF-8. :)

http://msdn.microsoft.com/en-gb/goglobal/cc305154.aspx (which the spec references) defines an authoritative mapping of windows-949 to Unicode.

If the spec simply defined the preferred name of Windows-949 as (case-insensitive) "Windows-949", could we close this bug?
Comment 4 Henri Sivonen 2010-11-29 12:34:42 UTC
(In reply to comment #0)
> It's not like registering a character set with IANA is a
> particularly difficult or drawn-out process,

Looks like it is considering that an attempt at registration was made:
http://mail.apps.ietf.org/ietf/charsets/msg01510.html
Comment 5 brian m. carlson 2010-11-29 13:17:04 UTC
(In reply to comment #3)
> Big Blue managed to do it, so it's easy? Your standard of proof may be lower
> than mine here. ;)

Several large corporations have done it, including Microsoft themselves.  I don't see even a cursory response to the listed objections to the registration, or really any follow-up whatsoever on Microsoft's part.  At least I respond to objections. ;-)

> Assuming you're interested in user agents being able to process the existing
> web corpus using only IANA-registered character sets, you perhaps should have
> some level of interest in doing so. ;)

I'm interested in a complete specification.  If the HTML5 Working Group wants to refer to windows-949, then I think it should be appropriately registered with IANA.  I would be just as happy (maybe even happier) if windows-949 were not mentioned at all, since I personally find the "misinterpreted for compatibility" idea revolting.  But that's beside the point: for better or for worse, IANA is *the* authoritative source for character sets (among myriad other things).  If I need to know about a character set, I look there first, and so will pretty much every implementer.

> http://msdn.microsoft.com/en-gb/goglobal/cc305154.aspx (which the spec
> references) defines an authoritative mapping of windows-949 to Unicode.

If you want to register it, I will fully support it.  In fact, once it's done, I'll close this bug report myself.  I am not registering it because I am not willing to be ultimately responsible for the specification of a character set for a language I don't read, write, speak, or understand.  I also believe it's an unnecessary legacy character set.

> If the spec simply defined the preferred name of Windows-949 as
> (case-insensitive) "Windows-949", could we close this bug?

Nope.  I would be happy with (a) windows-949 being registered with IANA, or (b) windows-949 not being mentioned at all.  An additional alternative, which is not at all preferable, is a note in the text to the effect of "The HTML5 Working Group has deliberately chosen to refer to and favor over other, better-specified alternatives (e.g. EUC-KR) the character set 'windows-949', even though it is not registered properly with IANA."
Comment 6 Anne 2010-11-29 13:25:56 UTC
Implementor here. We hardly look at IANA. It is full of holes when it comes to the web. Gecko/WebKit/Trident are a more interesting source.
Comment 7 Benjamin Hawkes-Lewis 2010-11-29 16:22:31 UTC
(In reply to comment #5)
> If I need to know about a character set, I look there first,
> and so will pretty much every implementer.

Anne's provided significant evidence to the contrary.

> > If the spec simply defined the preferred name of Windows-949 as
> > (case-insensitive) "Windows-949", could we close this bug?
> 
> Nope.  I would be happy with (a) windows-949 being registered with IANA, or (b)
> windows-949 not being mentioned at all.  An additional alternative, which is
> not at all preferable, is a note in the text to the effect of "The HTML5
> Working Group has deliberately chosen to refer to and favor over other,
> better-specified alternatives (e.g. EUC-KR) the character set 'windows-949',
> even though it is not registered properly with IANA."

This sounds greatly preferable to me, as we need windows-949 for legacy content, key implementers like Opera don't care about its registration, and nobody who does care about its registration is keen to register it.

As far as I can tell, we've already got a note in the text to the effect of your note:

"The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification, motivated by a desire for compatibility with legacy content. [CHARMOD]"

http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

The referenced document:

http://www.w3.org/TR/charmod/

describes how content should be interpreted according to its declared IANA character set.

What's the practical difference between your suggested additional alternative and the text we already have?

By which I mean: what would your text cause implementors to do differently and why?
Comment 8 Ian 'Hixie' Hickson 2011-01-10 22:21:00 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: The issue I am responding to here is that windows-949 is not registered with IANA. If there is another issue, please file separate bugs -- only one issue per bug, please.

Now, in response to the report that windows-949 is not registered: That the IANA registry is incomplete is a bug with the IANA registry, not with the HTML spec. The HTML spec here is describing reality: an implementation must act as described if it is to be compatible with legacy content, and enabling the creation of user agents that are compatible with legacy content is the entire purpose of the specification. Thus, I recommend raising the issue with IANA. For the purposes of the HTML spec, this bug is invalid.

(I would reassign this bug to IANA but they don't use the same bug system so that's not possible, so instead I'll mark it as invalid.)

Please feel free to file separate bugs for other issues, if any, that the discussion above may have uncovered.
Comment 9 brian m. carlson 2011-01-11 00:24:08 UTC
The IANA character set registry is what the rest of the Internet uses.  It's what W3C uses.  It's what the XHTML serialization uses (via XML).  If you believe it is incomplete, feel free to augment it.  It would be satisfactory to me if the HTML specification decided to define it (as long as the specification is sufficiently precise for interoperability and it can be used as the reference for the IANA registry).  Choosing to specify behavior in terms of some vendor-specific character set that is undefined actually *harms* interoperability.  Since I don't use Windows, how am I to know what byte sequences are valid and what their meanings are in this mystery encoding?  It would be totally acceptable for me to simply define windows-949 as UTF-8, since there is no other specification, and since it is otherwise undefined, I hereby define it as such.

I think it's a little ridiculous to expect conformance to a vague, wishy-washy notion that's not clearly defined and call that interoperability.
Comment 10 Anne 2011-01-11 09:26:56 UTC
You appear to have missed comment 6. I do agree we need to more suitably define labels and the encodings they map to, but that really is out of scope of HTML5.
Comment 11 Ian 'Hixie' Hickson 2011-02-11 22:58:03 UTC
(In reply to comment #9)
> The IANA character set registry is what the rest of the Internet uses.

That's not at all clear. In particular, comment 6 suggests it's not what browser vendors use, which is the target audience of the requirement in question.


> If you believe it is incomplete, feel free to augment it.

I'm happy for the IANA character set registry to be updated, but whether I do it, or you do it, or someone else does it, or nobody does it, it doesn't change what the HTML spec needs to say.


> It would be satisfactory to
> me if the HTML specification decided to define it (as long as the specification
> is sufficiently precise for interoperability and it can be used as the
> reference for the IANA registry).

The HTML spec references a definition of Windows 949.


> Choosing to specify behavior in terms of
> some vendor-specific character set that is undefined actually *harms*
> interoperability.

It isn't undefined, just not registered.


> Since I don't use Windows, how am I to know what byte
> sequences are valid and what their meanings are in this mystery encoding?

Look at its definition, cited in the HTML spec.


> It
> would be totally acceptable for me to simply define windows-949 as UTF-8, since
> there is no other specification, and since it is otherwise undefined, I hereby
> define it as such.

There is another specification.


EDITOR'S RESPONSE:

Status: Rejected
Change Description: no spec change
Rationale: no new information since comment 8.
Comment 12 Michael[tm] Smith 2011-08-04 05:02:43 UTC
mass-moved component to LC1