This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 20049 - Clarify the violation of rfc2781/MIME w.r.t. the meaning of 'utf-16be'/'utf-16le'
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC Windows 3.1
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2012-11-22 22:59 UTC by Leif Halvard Silli
Modified: 2013-09-04 09:26 UTC

Description Leif Halvard Silli 2012-11-22 22:59:23 UTC
The standard is fully correct when it says: 

"In violation of the Unicode standard, "utf-16" is a label for
 utf-16le rather than its own standalone encoding."

However, this correct statement gives the impression that it is only 'UTF-16' that has changed meaning while 'UTF-16LE' and 'UTF-16BE' have kept their old meanings.

But this impression is not correct, since the MIME registration, and RFC 2781 on which the MIME registration is built, say that for files labeled UTF-16LE or UTF-16BE one "MUST NOT prepend a BOM to the text". See 
<http://tools.ietf.org/html/rfc2781#section-3.3>
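
As an illustration of the RFC 2781 reading only (not of the Encoding Standard), Python's built-in codecs happen to behave this way: the explicitly labeled encoders emit no BOM and the matching decoders keep a leading U+FEFF as an ordinary character, while the plain "utf-16" codec writes and consumes a BOM. The helper name `starts_with_bom` is made up for this sketch.

```python
import codecs

def starts_with_bom(data: bytes) -> bool:
    """Return True if the bytes begin with either UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

text = "a"

# Encoders for the explicitly labeled variants do not prepend a BOM.
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")

# The unlabeled "utf-16" encoder prepends a BOM in native byte order.
generic_bytes = text.encode("utf-16")

le_has_bom = starts_with_bom(le_bytes)
be_has_bom = starts_with_bom(be_bytes)
generic_has_bom = starts_with_bom(generic_bytes)

# Decoding BOM-prefixed bytes: "utf-16-le" keeps U+FEFF as text,
# whereas "utf-16" treats it as a signature and removes it.
prefixed = codecs.BOM_UTF16_LE + le_bytes
kept = prefixed.decode("utf-16-le")
stripped = prefixed.decode("utf-16")
```

Under this reading a BOM at the start of a file labeled UTF-16LE is content, which is exactly what the Encoding Standard's handling departs from.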

Thus, I suggest that to section '14.2 utf-16be', there should be added a note saying something like this:

"In violation of the Unicode standard, 'utf-16le' is a label for files that preferably, but not necessarily, begin with the BOM."

And ditto inside section '14.3 utf-16le'.
Comment 1 Leif Halvard Silli 2012-11-23 01:00:03 UTC
Actually, a more correct understanding of the matter would probably be to say that the Encoding Standard defines 'UTF-16LE' and 'UTF-16BE' as synonyms for 'UTF-16'.

(Why? Because UTF-16 is not required to contain the BOM.)

How does that make sense to you, Anne?
Comment 2 Anne 2012-11-23 10:23:57 UTC
Well, it's really the decode algorithm that causes the non-compliance. We could add in 14.1: ", even though Unicode does not allow special handling of the byte order mark for utf-16be and utf-16le."
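
The behavior described here can be sketched roughly as follows. This is a simplified approximation of BOM sniffing ahead of encoding selection, not the specification's actual algorithm text. The function names `sniff_bom` and `decode` and the `fallback` parameter are hypothetical, and Python's codecs stand in for the standard's decoders.

```python
import codecs
from typing import Tuple

def sniff_bom(data: bytes, fallback: str) -> Tuple[str, int]:
    """Pick an encoding from a leading BOM, else use the fallback label.

    Returns the chosen codec name and the number of BOM bytes to skip.
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le", 2
    return fallback, 0

def decode(data: bytes, fallback: str) -> str:
    """Decode bytes, letting a BOM override the label-derived fallback."""
    encoding, skip = sniff_bom(data, fallback)
    return data[skip:].decode(encoding, errors="replace")

# A stream labeled utf-16be that starts with the little-endian BOM
# is decoded as little-endian, and the BOM itself is dropped.
overridden = decode(b"\xff\xfea\x00", "utf-16-be")

# Without a BOM, the label decides.
labeled = decode(b"\x00a", "utf-16-be")
```

Because the check runs before the label is consulted, a BOM gets special handling even when the label is utf-16be or utf-16le, which is the non-compliance in question.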
Comment 3 Leif Halvard Silli 2012-11-23 20:11:05 UTC
(In reply to comment #2)
> Well, it's really the decode algorithm that causes the non-compliance. 

Ah. Finesse.

> We could add in 14.1: ", even though Unicode does not allow
> special handling of the byte order mark for utf-16be and 
> utf-16le."

Sounds good. Did you mean to extend the current note? Like so: 

Note: In violation of the Unicode standard, checking for
      a byte order mark happens before an encoding to decode
      a byte stream is chosen, as seen in the decode algorithm,
<INS> EVEN THOUGH UNICODE DOES NOT ALLOW SPECIAL HANDLING OF
      THE BYTE ORDER MARK FOR UTF-16BE AND UTF-16LE. </INS>

If so, then the proposed clarification could perhaps also be moved closer to "violation of the Unicode standard", like so:

Note: In violation of the Unicode standard, <INS> WHICH FOR 
      UTF-16BE AND UTF-16LE DOES NOT ALLOW SPECIAL HANDLING
      OF THE BYTE ORDER MARK,</INS> checking for a byte
      order mark happens before an encoding to decode a byte
      stream is chosen, as seen in the decode algorithm.
...