This is an archived snapshot of W3C's public Bugzilla bug tracker, which was decommissioned in April 2019.

Bug 17151 - How should UTF-16BE "\xD8\x00" be decoded? This is an ill-formed UTF-16 code unit sequence, but it can be converted to a Unicode code point. Firefox/Opera currently convert it to U+FFFD, which seems like the preferred behaviour.
Summary: How should UTF-16BE "\xD8\x00" be decoded? This is an ill-formed UTF-16 code unit sequence, but it can be converted to a Unicode code point. Firefox/Opera currently convert it to U+FFFD, which seems like the preferred behaviour.
Status: RESOLVED DUPLICATE of bug 16768
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other
OS: Other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-22 17:15 UTC by contributor
Modified: 2012-07-18 18:40 UTC
CC: 3 users

See Also:


Description contributor 2012-05-22 17:15:25 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/
Multipage: http://www.whatwg.org/C#the-input-byte-stream
Complete: http://www.whatwg.org/c#the-input-byte-stream

Comment:
How should UTF-16BE "\xD8\x00" be decoded? This is an ill-formed UTF-16 code
unit sequence, but it can be converted to a Unicode code point. Firefox/Opera
currently convert it to U+FFFD, which seems like the preferred behaviour.
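
For reference, Python's standard UTF-16BE codec illustrates the same replace-and-continue behaviour (used here only as an illustration of that strategy, not as the algorithm the spec mandates):

>>> b"\xD8\x00".decode("utf-16-be", errors="replace")
'�'
>>> b"\xD8\x00\x00\x41".decode("utf-16-be", errors="replace")  # lone lead surrogate, then "A"
'�A'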

Posted from: 81.110.227.73 by geoffers@gmail.com
User agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1130.1 Safari/536.11
Comment 1 Geoffrey Sneddon 2012-05-22 17:23:15 UTC
This pertains to the following:

> Bytes or sequences of bytes in the original byte stream that could not be converted to Unicode code points must be converted to U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is UTF-8, the bytes must be decoded with the error handling defined in this specification.

> Note: Bytes or sequences of bytes in the original byte stream that did not conform to the encoding specification (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report.

"\xD8\x00" can obviously be decoded as if it were UTF-16BE to HTML5's definition of a "Unicode code point" (which include lone surrogates), but according to the Unicode specification it is an invalid UTF-16 code unit sequence.

It would seem preferable that lone surrogates get converted to U+FFFD as they currently are in Opera/Firefox.
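
To make the distinction concrete, here is a small illustrative Python session (not normative): a lone surrogate satisfies the "code point" definition, yet it cannot appear in well-formed UTF-16.

>>> ch = chr(0xD800)        # U+D800 is a valid code point on its own...
>>> ch.encode("utf-16-be")  # ...but not representable as well-formed UTF-16
Traceback (most recent call last):
  ...
UnicodeEncodeError: ... surrogates not allowed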
Comment 2 Simon Pieters 2012-05-23 06:40:30 UTC
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#utf-16-decoder already handles lone surrogates, AFAICT. HTML just needs to reference that spec.
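
A minimal Python sketch of that decoder's lone-surrogate handling (big-endian only, with the spec's BOM and streaming logic omitted; the function name and structure here are illustrative, not the spec's):

def decode_utf16be(data: bytes) -> str:
    out = []
    lead = None  # pending lead surrogate, if any
    for i in range(0, len(data) - 1, 2):
        unit = (data[i] << 8) | data[i + 1]
        if lead is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # well-formed pair: combine into a supplementary code point
                out.append(chr(0x10000 + ((lead - 0xD800) << 10) + (unit - 0xDC00)))
                lead = None
                continue
            out.append("\ufffd")  # lone lead surrogate: replace, then reprocess this unit
            lead = None
        if 0xD800 <= unit <= 0xDBFF:
            lead = unit           # hold the lead and wait for a trail surrogate
        elif 0xDC00 <= unit <= 0xDFFF:
            out.append("\ufffd")  # lone trail surrogate
        else:
            out.append(chr(unit))
    if lead is not None or len(data) % 2:
        out.append("\ufffd")      # dangling lead surrogate or odd trailing byte
    return "".join(out)

With this, decode_utf16be(b"\xD8\x00") returns "\ufffd", matching the Firefox/Opera behaviour described in comment 0.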
Comment 3 Anne 2012-05-23 07:52:49 UTC

*** This bug has been marked as a duplicate of bug 16768 ***