20049 – Clarify the violation of rfc2781/MIME w.r.t. the meaning of 'utf-16be'/'utf-16le'

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC Windows 3.1

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2012-11-22 22:59 UTC by Leif Halvard Silli
Modified:	2013-09-04 09:26 UTC
CC List:	2 users

Description Leif Halvard Silli 2012-11-22 22:59:23 UTC

The standard is fully correct when it says: 

"In violation of the Unicode standard, "utf-16" is a label for
 utf-16le rather than its own standalone encoding."

However, this correct statement gives the impression that it is only 'UTF-16' that has changed meaning while 'UTF-16LE' and 'UTF-16BE' have kept their old meanings.

But this impression is not correct, since the MIME registration, and RFC 2781 on which the MIME registration is built, say that for files labeled UTF-16LE or UTF-16BE one "MUST NOT prepend a BOM to the text". See 
<http://tools.ietf.org/html/rfc2781#section-3.3>
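
To illustrate the RFC 2781 reading of the three labels, here is a minimal sketch using Python's built-in codecs. It assumes that the standard `utf-16`, `utf-16-le` and `utf-16-be` codecs behave as in current CPython; the variable names are made up for this example and come from neither the RFC nor the Encoding Standard.

```python
import codecs

text = "a"

# With an explicit byte order in the label, the encoder emits no BOM.
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")

# With the plain "utf-16" label, the encoder prepends a BOM
# in the platform's native byte order.
plain_bytes = text.encode("utf-16")

bom_prefixed = codecs.BOM_UTF16_LE + le_bytes

# Decoding as "utf-16-le" gives FF FE no special handling:
# it is kept in the output as U+FEFF.
decoded_le = bom_prefixed.decode("utf-16-le")

# Decoding as plain "utf-16" uses the BOM to pick the byte order
# and drops it from the output.
decoded_plain = bom_prefixed.decode("utf-16")
```

Under the Encoding Standard's decode algorithm, by contrast, the same BOM-prefixed bytes labeled 'utf-16le' would come out without the leading U+FEFF, which is the difference this bug asks the standard to point out.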

Thus, I suggest that to section '14.2 utf-16be', there should be added a note saying something like this:

"In violation of the Unicode standard, 'utf-16le' is a label for files that preferably, but not necessarily, begin with the BOM."

And ditto inside section '14.3 utf-16le'.

Comment 1 Leif Halvard Silli 2012-11-23 01:00:03 UTC

Actually, a more correct understanding of the matter would probably be to say that the Encoding Standard defines 'UTF-16LE' and 'UTF-16BE' as synonyms for 'UTF-16'.

(Why? Because UTF-16 is not required to contain the BOM.)

How does that make sense to you, Anne?

Comment 2 Anne 2012-11-23 10:23:57 UTC

Well, it's really the decode algorithm that causes the non-compliance. We could add in 14.1: ", even though Unicode does not allow special handling of the byte order mark for utf-16be and utf-16le."
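
A minimal sketch of the behaviour described here, assuming the decode algorithm checks for a byte order mark before it honours the supplied label. The function name `sniffing_decode` and the `_BOMS` table are hypothetical and are not taken from the Encoding Standard; Python's codecs stand in for the standard's decoders.

```python
import codecs

# BOM byte sequences and the codec each one selects (illustrative table).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniffing_decode(data: bytes, fallback: str) -> str:
    """Decode data, letting a leading BOM override the fallback label."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            # The BOM wins over the label and is not part of the output.
            return data[len(bom):].decode(codec, errors="replace")
    # No BOM found: decode with the encoding the caller asked for.
    return data.decode(fallback, errors="replace")
```

For example, `sniffing_decode(b"\xff\xfea\x00", "utf-16-be")` decodes as little-endian despite the 'utf-16be' label, which is the special handling of the byte order mark that the note would call out.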

Comment 3 Leif Halvard Silli 2012-11-23 20:11:05 UTC

(In reply to comment #2)
> Well, it's really the decode algorithm that causes the non-compliance. 

Ah. Finesse.

> We could add in 14.1: ", even though Unicode does not allow
> special handling of the byte order mark for utf-16be and 
> utf-16le."

Sounds good. Did you mean to extend the current note? Like so: 

Note: In violation of the Unicode standard, checking for
      a byte order mark happens before an encoding to decode
      a byte stream is chosen, as seen in the decode algorithm,
<INS> EVEN THOUGH UNICODE DOES NOT ALLOW SPECIAL HANDLING OF
      THE BYTE ORDER MARK FOR UTF-16BE AND UTF-16LE. </INS>

If so, then the proposed clarification could perhaps also be moved closer to "violation of the Unicode standard", like so:

Note: In violation of the Unicode standard, <INS> WHICH FOR
      UTF-16BE AND UTF-16LE DOES NOT ALLOW SPECIAL HANDLING
      OF THE BYTE ORDER MARK, </INS> checking for a byte
      order mark happens before an encoding to decode a byte
      stream is chosen, as seen in the decode algorithm.
...

Comment 4 Anne 2013-09-04 09:26:48 UTC

https://github.com/whatwg/encoding/commit/337c27d8adaa39a0bd728e4285ed09e19e531fcd