15195 – apparently incorrect note about violation of Unicode wrt stripping leading BOM

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2011-12-15 01:15 UTC by Glenn Adams
Modified:	2012-01-30 20:08 UTC
CC List:	5 users

Description Glenn Adams 2011-12-15 01:15:19 UTC

Section 8.2.2.3 [1] includes a Note, quoted below, stating that stripping a BOM is a violation of Unicode.

[1] http://dev.w3.org/html5/spec/Overview.html#preprocessing-the-input-stream

"Note: The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether that character was used to determine the byte order is a willful violation of Unicode, motivated by a desire to increase the resilience of user agents in the face of naïve transcoders."

Firstly, I don't believe stripping BOM in the fashion described here is a violation of any conformance requirement of Unicode. However, if the editor believes this to be the case, then the specific compliance clause of the Unicode Standard Section 3.2 [2] believed to be violated should be cited in the note.

[2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Note that Unicode Section 16.8 [3], under "Byte Order Mark (BOM): U+FEFF" recommends the removal of a leading BOM:

"Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space."

[3] http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf

This language applies to HTML5 (as a system "that use[s] the byte order mark") whether or not the BOM is actually used to determine the byte order on a particular occasion. That is, the language of 16.8 quoted above does not say "if a system does not recognize that an initial U+FEFF signals the byte order (in some particular case), then it (the BOM) must not be removed".

Regards,
Glenn
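
For illustration, the Section 16.8 behaviour quoted in the description can be sketched in Python. This is only a sketch: it assumes Python's standard "utf-16" codec as a stand-in for a system "that use[s] the byte order mark", and the sample bytes are made up for the example rather than taken from the spec or this bug.

```python
# "hi" encoded as little-endian UTF-16, preceded by the BOM bytes FF FE.
data = b"\xff\xfe" + "hi".encode("utf-16-le")

# The unmarked "utf-16" codec reads the initial U+FEFF to pick the byte
# order and removes it from the decoded text.
decoded = data.decode("utf-16")
print(repr(decoded))
```

Here the initial U+FEFF signals the byte order, so it is not part of the textual content and does not appear in the decoded string.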

Comment 1 Ian 'Hixie' Hickson 2012-01-28 18:50:31 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: The "systems" in [3] effectively means character encodings. When using an explicit UTF-16LE, you're not allowed to use a BOM, so a leading BOM isn't a BOM, it's part of the text stream. We strip it anyway.
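
To illustrate the distinction drawn in this rationale, here is a minimal Python sketch. It assumes Python's standard "utf-16-le" codec as a stand-in for an explicitly labelled UTF-16LE stream. The helper strip_leading_bom is hypothetical (it appears in neither the spec nor this bug) and assumes that only a single leading U+FEFF is dropped.

```python
BOM = "\ufeff"  # U+FEFF: BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE


def strip_leading_bom(text: str) -> str:
    """Hypothetical helper: drop one leading U+FEFF, however the encoding was chosen."""
    return text[1:] if text.startswith(BOM) else text


# "hi" in UTF-16LE, preceded by the bytes FF FE.
data = b"\xff\xfe" + "hi".encode("utf-16-le")

# With the byte order given explicitly, the decoder keeps the leading
# U+FEFF as part of the text stream.
decoded = data.decode("utf-16-le")
print(repr(decoded))

# The stripping step under discussion removes it anyway.
print(repr(strip_leading_bom(decoded)))
```

The first result still starts with U+FEFF because the explicit little-endian decoder treats it as text; the second shows the same string once the leading character has been stripped.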

Comment 2 Glenn Adams 2012-01-28 19:19:01 UTC

This issue is going to be continually raised in the future because the prescribed behavior is counterintuitive. The editor should add a comment explaining the rationale for this behavior so as to avoid future comments of this nature.

Comment 3 Ian 'Hixie' Hickson 2012-01-28 21:57:34 UTC

Status: Rejected
Change Description: no spec change
Rationale: You're the first person to raise this in two and a bit years. If we get one comment every two years from people who don't understand the requirements in Unicode and UTF-16, I'm pretty sure we can explain the situation as I did above without any difficulty. Adding additional text to the spec for something like this does not help the majority of readers, who either don't care about this particular topic, or understand it well enough that the warning satisfies their instinctive reaction of pointing out the conflict with Unicode.

Comment 4 Glenn Adams 2012-01-29 05:19:59 UTC

> Rationale: The "systems" in [3] effectively means character encodings. When
> using an explicit UTF-16LE, you're not allowed to use a BOM, so a leading BOM
> isn't a BOM, it's part of the text stream. We strip it anyway.

Are you saying that because HTML5 8.2.2.1 step (2) allows, and may make use of, a transport-specified encoding, and because that encoding may be UTF-16LE, this makes HTML5 a system "using an explicit UTF-16LE"?

And that, consequently, the language (in [3])

"Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters — even at the very beginning of the text — are to be interpreted as zero width no-break spaces."

is violated in the case that HTML5 8.2.2.3 requires an initial U+FEFF to be ignored (as if it were a BOM rather than a ZWNBSP)?

If this is the case, then I believe my comment can be positively resolved by merely adding the following sentence to the end of the Note in question:

<quote>
See [UNICODE] Section 16.8, which specifies that an initial U+FEFF be interpreted as zero width no-break space "where the byte order is explicitly specified".
</quote>

Comment 5 Ian 'Hixie' Hickson 2012-01-30 20:08:23 UTC

Status: Rejected
Change Description: no spec change
Rationale: I don't want to refer to specific sections in Unicode because they change over time, which just means extra work for me maintaining this spec.