Bug 20479 - Add a Labelling section
Status: RESOLVED FIXED
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assigned To: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2012-12-21 17:23 UTC by Leif Halvard Silli
Modified: 2013-01-21 14:17 UTC

Description Leif Halvard Silli 2012-12-21 17:23:37 UTC
In order to be "the definitive answer to encodings on the web", the spec should have a section on labelling. My proposal is to do roughly the following:

1. Add a Labelling section

2. Let it say that, when a label is necessary, authors should use "UTF-8" and only "UTF-8" (case-insensitively, I suppose), and not "unicode-1-1-utf-8".

3. Let it further say that authors should not use any of the other encodings; hence this spec does not define rules for how to label them, and authors who want to use them in violation of this spec must check the legacy specifications.
Comment 1 Anne 2013-01-15 11:37:06 UTC
I think if we do this, it should simply be added to the requirement here: <http://encoding.spec.whatwg.org/#the-encoding>. No need for a new section. An encoding consists of various things, one of which is a list of labels. Conformance requirements on those labels can very well be made there, as everything else is flagged legacy already anyway.
Comment 2 Anne 2013-01-15 11:38:43 UTC
I think we should allow both utf-8 and utf8, as Unicode at some point declared those to be identical.

Do you have any thoughts on this Henri?
Comment 3 Leif Halvard Silli 2013-01-15 22:54:35 UTC
(In reply to comment #1)
My idea was that a separate labelling section would make it simpler to *skip* defining labels for the legacy encodings. Citing the third point of my proposal in comment #0:

> 3. Let it further say that for all the other encodings, then authors should
> not use them, hence this spec does not define the rules for how to label
> them and thus, if authors in violation of this spec want to use them, then
> they must check the legacy specifications.

But since you want to place the labelling rules inside the definition of the UTF-8 encoding and since you consider the label as part of the encoding, do you now plan to define what labels to use for the legacy encodings as well?
Comment 4 Leif Halvard Silli 2013-01-16 00:22:24 UTC
(In reply to comment #2)
> I think we should allow both utf-8 and utf8 as Unicode at some point
> declared those to be identical.

For authors, it is simplest if there is just one way to do it. Therefore it doesn't sound very reasonable to me to allow "utf8" in addition to "utf-8".

But now that you mention it, the encoding spec also contains a table of labels <http://encoding.spec.whatwg.org/#concept-encoding-get>. One very simple way to specify which labels authors should use would be to add a third column that gives the canonical/recommended label.

In fact, I guess you could define the first column as the recommended label … This would work OK for all encodings except UTF-16 (unless you want authors to stop using "UTF-16" as a label).
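
To make the three-column idea concrete, here is a minimal Python sketch, not spec text. The rows cover only the three UTF-8 labels mentioned in this bug. The names LABEL_TABLE, ascii_lower and recommended_label are hypothetical, and the whitespace stripping and ASCII case-insensitive matching are assumed to mirror the spec's label lookup.

```python
import string

# Hypothetical three-column table: (label, encoding, recommended label).
# Only the UTF-8 labels discussed in this bug are listed.
LABEL_TABLE = [
    ("unicode-1-1-utf-8", "utf-8", "utf-8"),
    ("utf-8", "utf-8", "utf-8"),
    ("utf8", "utf-8", "utf-8"),
]

# Lowercase A-Z only, so non-ASCII characters never fold into ASCII letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
_ASCII_WHITESPACE = "\t\n\x0c\r "


def ascii_lower(text):
    """Lowercase ASCII letters and leave every other character unchanged."""
    return text.translate(_ASCII_LOWER)


def recommended_label(label):
    """Return the recommended label for a known label, or None if unknown."""
    key = ascii_lower(label.strip(_ASCII_WHITESPACE))
    for known, _encoding, recommended in LABEL_TABLE:
        if known == key:
            return recommended
    return None
```

With such a table, a validator could look up whatever label the author wrote and suggest the third-column value whenever the two differ.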
Comment 5 Masatoshi Kimura 2013-01-16 11:55:57 UTC
(In reply to comment #4)
> (unless you want
> authors to stop using "UTF-16" as a label.

Maybe the desire is that authors stop using anything other than utf-8.
Comment 6 Leif Halvard Silli 2013-01-16 12:13:25 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > (unless you want
> > authors to stop using "UTF-16" as a label.
> 
> Maybe the desire is that authors stop using anything other than utf-8.

Hm … a light dawned on me: HTML5 does in fact encourage authors *not* to use the UTF-16 *label*. In fact, it encourages authors to simply rely on the user agent detecting it as (a flavor of) UTF-16.

And thus, the advice of this spec could in fact - and of course - be that authors don't label 16-bit documents.

And as an extension of that, one could just as well advise authors to always use "UTF-16" for external protocols like HTTP, because "UTF-16" does in fact leave the exact big/little-endian detection to the user agent.

So the advice could be, in general, not to label UTF-16 documents (with anything other than the BOM), but to permit the UTF-16 label for protocols that permit the labelling of UTF-16 content.
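
As an illustration of what "leaving it to the BOM" means, here is a rough Python sketch of byte order mark detection. The function name sniff_bom is hypothetical, and real user agents do considerably more than this.

```python
def sniff_bom(data):
    """Return the encoding implied by a leading byte order mark, or None."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"  # big-endian BOM
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"  # little-endian BOM
    return None  # no BOM: fall back to labels or other detection
```

A document that starts with one of the two UTF-16 marks needs no further label to settle its byte order, which is the point of the advice above.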
Comment 7 Anne 2013-01-16 17:07:29 UTC
(In reply to comment #5)
> Maybe the desire is that authors stop using anything other than utf-8.

Exactly. There's a reason the specification calls every other encoding "Legacy".
Comment 8 Leif Halvard Silli 2013-01-16 22:27:29 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > Maybe the desire is that authors stop using anything other than utf-8.
> 
> Exactly. There's a reason the specification calls every other encoding
> "Legacy".

I can assure you that I have not missed that point.
Comment 9 Henri Sivonen 2013-01-21 08:46:15 UTC
(In reply to comment #2)
> I think we should allow both utf-8 and utf8 as Unicode at some point
> declared those to be identical.
> 
> Do you have any thoughts on this Henri?

Are all three labels for UTF-8 equally compatible with existing software?

WHATWG HTML currently requires the preferred label in various places for conformance. Is there a true need to make “utf8” valid? Do authors use it enough that the preferred label requirement is an annoyance for validating existing content or output of existing tools?

(In reply to comment #0)
> 3. Let it further say that for all the other encodings, then authors should
> not use them, hence this spec does not define the rules for how to label
> them and thus, if authors in violation of this spec want to use them, then
> they must check the legacy specifications.

I disagree. I think the spec should say that even though authors should (can we justify “must”?) not use legacy encodings, if they do, they must use (preferred?) labels defined in this spec.
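
A validator following this position might behave roughly like the Python sketch below. It is an illustration under assumptions, not the spec's wording: the label table is a small subset, the preferred label is assumed to be the encoding's lowercase name, and check_label and its messages are hypothetical.

```python
# Small subset of labels; the spec's real table is much longer.
# Assumption: an encoding's preferred label is its lowercase name.
LABEL_TO_ENCODING = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "unicode-1-1-utf-8": "utf-8",
    "windows-1252": "windows-1252",
    "latin1": "windows-1252",
    "iso-8859-1": "windows-1252",
}


def check_label(label):
    """Return conformance messages; an empty list means the label passes."""
    key = label.strip("\t\n\x0c\r ")
    # ASCII-only lowercasing, so non-ASCII characters are never folded.
    key = "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in key)
    encoding = LABEL_TO_ENCODING.get(key)
    if encoding is None:
        return ["error: unknown label"]
    messages = []
    if key != encoding:
        # The "must use the preferred label" requirement.
        messages.append(f"error: use the preferred label {encoding!r}")
    if encoding != "utf-8":
        # The "should not use legacy encodings" advice.
        messages.append("warning: legacy encoding; authors should use utf-8")
    return messages
```

Under these assumptions, "utf8" is rejected in favour of "utf-8", while "windows-1252" passes the label check but still draws the legacy warning.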
Comment 10 Anne 2013-01-21 14:17:54 UTC
Thanks Henri. I forgot HTML used "preferred" here.

https://github.com/whatwg/encoding/commit/a454d2e543964b8d5432778ff917324e8032b78c