This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 19817 - Point out conflicting default with Unicode for utf-16 (defaults to le rather than be)
Summary: Point out conflicting default with Unicode for utf-16 (defaults to le rather than be)
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-01 14:45 UTC by Anne
Modified: 2013-08-23 11:03 UTC
3 users

See Also:


Attachments

Description Anne 2012-11-01 14:45:58 UTC
Kawabata Taichi (川幡太一) pointed out that we should make it clear here that we conflict with Unicode.
Comment 1 Anne 2012-11-01 15:24:36 UTC
So utf-16 and utf-16le being the same is also a conflict.

(Most of this goes away in practice I suppose if we have utf-16 sniffing.)
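
For illustration, a minimal sketch in Python of the kind of BOM sniffing meant here; the sniff_utf16 name and its default parameter are hypothetical, and this is not the Encoding Standard's exact decode algorithm.

  def sniff_utf16(data: bytes, default: str = "utf-16le") -> str:
      """Pick an endianness from the first two bytes, falling back to a default."""
      if data[:2] == b"\xff\xfe":
          return "utf-16le"   # little-endian BOM
      if data[:2] == b"\xfe\xff":
          return "utf-16be"   # big-endian BOM
      return default          # no BOM: this default is what the bug is about

  payload = "\ufeffhello".encode("utf-16le")   # BOM + text, little-endian
  enc = sniff_utf16(payload)
  print(enc)                                   # utf-16le
  print(payload[2:].decode(enc))               # hello (the two BOM bytes are skipped)

When a BOM is present, the label's default never comes into play, which is why sniffing makes most of the conflict moot in practice.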
Comment 2 Masatoshi Kimura 2012-11-09 17:18:10 UTC
See also Bugzilla@Mozilla #809934.
Comment 3 Anne 2012-11-12 22:11:08 UTC
When doing this, also change the canonical encoding name. It should be utf-16le, not utf-16.
Comment 4 Anne 2012-11-12 22:46:28 UTC
FWIW, I fixed this, but GitHub is slow. Should probably resolve itself in a couple of hours.
Comment 6 Leif Halvard Silli 2012-11-24 07:13:19 UTC
The new text creates a non-existent conflict between Unicode and the Encoding Standard:

]] Note: In violation of the Unicode standard, "utf-16" is a
         label for utf-16le rather than its own standalone
         encoding.[[

PROPOSAL:

  ALT 1: The following text would be more accurate:

]] Note: The get an encoding algorithm does not follow RFC 2781’s
         recommendation to interpret "utf-16" as a label for
         utf-16be. Instead it handles "utf-16" as a label for
         utf-16le.[[

  ALT 2: Or if you would prefer to stay close to current text:
 
]] Note: For the purpose of the get an encoding algorithm, then,
         due to the status of most deployed content, "utf-16" is 
         handled as a label for utf-16le. Unicode’s SHOULD-level
         recommendation to default to utf-16be is thus overruled.[[

COMMENTS: 

(1) RFC2781 does not operate with any MUST-level requirement to treat ambiguous "UTF-16" as big-endian/utf-16be. There is only a SHOULD-level recommendation to do so: <http://tools.ietf.org/html/rfc2781#section-4.3>. I don't think it can be considered a "violation" to deviate from a SHOULD-level recommendation.

(2) The Unicode Standard says that "when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian". But in other places it says that, e.g., a database does not need to declare the encoding if the encoding is known. What "is known" means is, of course, relative. But if "deployed content" generally is UTF-16LE-encoded, then this is a variant of "is known" and a good reason for the Encoding Standard to ask UTF-16 to be interpreted as UTF-16LE when the BOM is lacking. And, at any rate, as far as I can tell, there is no MUST-level requirement to interpret UTF-16 in Unicode.

(3) Currently, when the Encoding Standard says that "UTF-16" is synonymous with UTF-16LE, remember that, per Unicode, it would be an error to include a BOM in a file that is labelled 'UTF-16LE'. And thus, if UTF-16 - outside the context of the "get an encoding" algorithm - were to be seen as a label for UTF-16LE, then one would have to conclude that it would be an error to include a BOM in a file labelled "UTF-16" ... Which sounds kind of upside down, ya know … And for that reason, my replacement text tries to avoid saying anything that can be interpreted as a definition of what "UTF-16" means.
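
For illustration, a rough Python sketch of the kind of label lookup under discussion, using a hypothetical and heavily truncated label table (the Encoding Standard's real table is much larger):

  LABELS = {
      "unicode-1-1-utf-8": "utf-8",   # legacy label for utf-8
      "utf-8": "utf-8",
      "us-ascii": "windows-1252",     # legacy label for windows-1252
      "utf-16be": "utf-16be",
      "utf-16": "utf-16le",           # the mapping this bug is about
      "utf-16le": "utf-16le",
  }

  def get_an_encoding(label: str) -> str:
      """Trim the label, lower-case it, and look it up; unknown labels are failures."""
      normalized = label.strip().lower()
      if normalized not in LABELS:
          raise ValueError("unknown label: " + label)
      return LABELS[normalized]

  print(get_an_encoding("  UTF-16 "))   # utf-16le, not a standalone utf-16 encoding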
Comment 7 Masatoshi Kimura 2012-11-24 07:39:19 UTC
(In reply to comment #6)
> But in other places it says that, e.g., a database does not need to
> declare the encoding if the encoding is known.
Where does it say that?

> What "is known" means, is of
> course relative. But if "deployed content" generally is UTF-16LE-encoded,
> then this is a variant of "is known" and a good reason for the Encoding
> Standard to ask UTF-16 to be interpreted as UTF-16LE when the BOM is
> lacking.
IMO it stretches the concept too far.

> And, at any rate, as far as I can tell, there is no MUST-level
> requirement to interpret UTF-16 in Unicode.
Conformance requirement C11 implicitly refers to D98. So it looks like a MUST-level requirement to me even if the word "MUST" is not used (the word "shall" is used instead).
Comment 8 Leif Halvard Silli 2012-11-24 18:53:07 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > But in other places it says that, e.g., a database does not need to
> > declare the encoding if the encoding is known.
> Where does it say?

About databases:

]] Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters—even at the very beginning of the text—are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces. For example, for strings in an API, the memory architecture of the processor provides the explicit byte order. For databases and similar structures, it is much more efficient and robust to use a uniform byte order for the same field (if not the entire database), thereby avoiding use of the byte order mark. [[

Unicode 6.2 says that there are many contexts where un-labelled UTF-16 is used. Databases are one example. Machine architectures are another that it mentions. These contexts can be seen as "higher-level protocols". And hence it is not an error to leave the encoding undeclared or to omit the BOM, since the database and/or the machine architecture carries this information.
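
A small Python illustration of the passage quoted above: with an explicitly specified byte order such as utf-16le, a leading U+FEFF stays in the text as a zero width no-break space, while a BOM-sniffing decoder treats the same bytes as byte-order metadata. (Python's built-in "utf-16" codec happens to sniff the BOM, which makes it convenient here; it is not the Encoding Standard's algorithm.)

  data = b"\xff\xfeh\x00i\x00"             # FF FE followed by "hi" in LE code units

  explicit = data.decode("utf-16le")       # byte order given explicitly by the label
  print([hex(ord(c)) for c in explicit])   # ['0xfeff', '0x68', '0x69'] - U+FEFF kept

  sniffed = data.decode("utf-16")          # Python sniffs the BOM for "utf-16"
  print([hex(ord(c)) for c in sniffed])    # ['0x68', '0x69'] - U+FEFF consumed as BOM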

> > What "is known" means, is of
> > course relative. But if "deployed content" generally is UTF-16LE-encoded,
> > then this is a variant of "is known" and a good reason for the Encoding
> > Standard to ask UTF-16 to be interpreted as UTF-16LE when the BOM is
> > lacking.
>
> IMO it stretches the concept too far.

Disagree. Just have a look at what D16 (definition 16) says about "higher-level protocol":

]]
  D16 Higher-level protocol: 
  * Any agreement on the interpretation of Unicode characters that
    extends beyond the scope of this standard.
  * Such an agreement need not be formally announced in data; it
    may be implicit in the context.
  * The specification of some Unicode algorithms may limit the
    scope of what a conformant higher-level protocol may do.
[[

Thus, I see no problem with defining that implementations that are to follow the Encoding Standard must treat the "UTF-16" label as little-endian (aka "utf-16le") when the BOM is lacking. Doing such a thing is just an application of the higher-level protocol opportunity that Unicode operates with.

> > And, at any rate, as far as I can tell, there is no MUST-level
> > requirement to interpret UTF-16 in Unicode.
> Conformance requirement C11 implicitly refers to D98. So it looks like a
> MUST-level requirement to me even if the word "MUST" is not used (the word
> "shall" is used instead).

Please excuse my error. I did not mean to say: "to interpret UTF-16 in Unicode". I meant to say: "to interpret UTF-16 _as big-endian_ in Unicode". However, in addition, I also want to modify myself slightly: It is pretty clear that Unicode 6.2 says that "UTF-16" comes in 3 variants: little-endian with BOM, big-endian with BOM and big-endian WITHOUT the BOM. So you are correct - and I am wrong - in that detail.

However, this definition of what "UTF-16" means was present also in Unicode 3.0, which formed the basis for RFC2781. And, like I said above, RFC2781 only has a SHOULD requirement to interpret UTF-16 as big-endian when the BOM is lacking: [1]

]] If the first two octets of the text is not 0xFE followed by
   0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
   interpreted as being big-endian. [[

So how can RFC2781 seemingly "deviate" from Unicode 3 that way? The answer has to be that, from Unicode's point of view, RFC2781 - which really is a MIME charset registration[2] - can be seen as a higher-level protocol. Other examples of higher-level protocols are HTML and XML; see page 565 of Unicode 6.2: "The Unicode Standard recommends the use of higher-level protocols, such as HTML or XML <snip>". (XML 1.0 requires the use of the BOM for UTF-16 texts that are stored in a file system, and hence such texts MUST use the "UTF-16" label.)

The conclusion must be that, no, it is not a violation of Unicode to set up a higher-level protocol that requires "UTF-16" to default to UTF-16LE whenever the BOM is omitted.

[1] http://tools.ietf.org/html/rfc2781#section-4.3
[2] http://tools.ietf.org/html/rfc2781#appendix-A
Comment 9 Leif Halvard Silli 2012-11-24 19:00:05 UTC
(In reply to comment #8)

> The conclusion must be that, no, it is not a violation of Unicode to set up
> a higher-level protocol that requires "UTF-16" to default to UTF-16LE
> whenever the BOM is omitted.

But of course, it can still be a good idea to *bring attention to* the fact that *this* higher-level protocol - the Encoding Standard - operates with a different endianness default than the one found in Unicode and RFC2781. However, it would be untrue - and thus only confuse the matter - to portray it as a *violation*.
Comment 10 Anne 2012-11-24 19:06:48 UTC
The violation is not about whether it defaults to big or little endian; it is that we do not use it for a separate encoding, but rather as a label for a different one.
Comment 11 Leif Halvard Silli 2012-11-25 02:54:34 UTC
(In reply to comment #10)
> The violation is not about whether it defaults to big or little endian; it
> is that we do not use it for a separate encoding, but rather as a label for
> a different one.

(Yeah, and that is a consequence of the Decode algorithm happening before the Get an encoding algorithm - there simply is no use for the official meaning of "UTF-16" once the Decode algorithm has run. OK. Fine.)
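
For illustration, a minimal Python sketch of that ordering, with hypothetical names and a deliberately tiny label table: the BOM is checked before the label is interpreted at all, so only BOM-less input ever reaches the "utf-16" mapping.

  def decode(data: bytes, label: str) -> str:
      """Sketch only: BOM sniffing runs first; the label is just a fallback."""
      boms = {
          b"\xef\xbb\xbf": "utf-8",
          b"\xff\xfe": "utf-16le",
          b"\xfe\xff": "utf-16be",
      }
      for bom, enc in boms.items():
          if data.startswith(bom):
              return data[len(bom):].decode(enc)   # BOM wins; label never consulted
      labels = {"utf-8": "utf-8", "utf-16": "utf-16le",
                "utf-16le": "utf-16le", "utf-16be": "utf-16be"}
      return data.decode(labels[label.strip().lower()])

  print(decode(b"\xfe\xff\x00h\x00i", "utf-16"))   # hi - the BOM says big-endian
  print(decode(b"h\x00i\x00", "utf-16"))           # hi - no BOM, label maps to utf-16le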

But could you at least clarify, in the Encoding Standard, who the "we" are? Is it "we all, including Web authors" (who should now consider "UTF-16" not a label for the standalone encoding with the same name but instead switch to using "utf-16le")? Or is it "we, one or more algorithm(s) of the Encoding Standard"?

At the very minimum, I suggest that you add something like "for the purpose of this algorithm" to the existing note:

]] Note: FOR THE PURPOSE OF THIS ALGORITHM, THEN, In violation of the Unicode
         standard, "utf-16" is a label for utf-16le rather than its own 
         standalone encoding. [[
Comment 12 Anne 2012-12-21 15:06:03 UTC
Since the intent of the Encoding Standard is to be the definitive answer to encodings on the web, I'm not sure why that is necessary.
Comment 13 Leif Halvard Silli 2012-12-21 16:24:00 UTC
(In reply to comment #12)
> Since the intent of the Encoding Standard is to be the definitive answer to
> encodings on the web, I'm not sure why that is necessary.

FIRSTLY: Does the addition that I proposed (namely: "For the purpose of this algorithm, then") change any definitions in the spec? If so, then what? In my view, the addition only clarifies what the spec already says.

SECONDLY: The "definitive answer" attitude of the spec is one reason why it is important to point out to which questions the spec currently does *not* offer any answers. One such field is labelling - how authors should label their documents.

E.g. the spec lists "unicode-1-1-utf-8" as a label for UTF-8, but does not say that authors should only use "UTF-8". And it lists "US-ASCII" as a label for Windows-1252, but it does not say that authors should not use US-ASCII for Windows-1252. For the "UTF-16" label, the spec says that it means "UTF-16LE". However, the spec doesn't say - and doesn't want to say - that it is forbidden to use the label "UTF-16" for big-endian documents with the BOM.

This is why I think the spec should clarify for which context the "violation" claim is valid.
Comment 14 Anne 2012-12-21 16:30:36 UTC
The specification requires authors to use utf-8. It's pretty clear on that. What label is used for utf-8 does not matter, although I suppose we could encourage "utf-8". In any event, that's a separate issue.
Comment 15 Leif Halvard Silli 2012-12-21 17:24:59 UTC
(In reply to comment #14)
> The specification requires authors to use utf-8. It's pretty clear on that.
> What label is used for utf-8 does not matter, although I suppose we could
> encourage "utf-8". In any event, that's a separate issue.

Yes, saying that one should use the label "utf-8" and not "unicode-1-1-utf-8" would be a separate issue.

And the purpose of my proposed addition was to make it absolutely clear - to whoever tries to read this spec - that labelling is a separate issue. After all, it would only be a complication if, for UTF-16 documents with the BOM, authors started to believe that they would have to consider the endianness before choosing a label.

Adding a section (or whatever) which defines which labels authors should use could be a good idea. If you do add a labelling section, then I will withdraw my proposed "For the purpose of" text. See bug 20479 for a proposal for a labelling section.
Comment 16 Anne 2013-01-21 15:45:25 UTC
So labeling is done. I don't think there's anything else.