This is an archived snapshot of W3C's public Bugzilla bug tracker, which was decommissioned in April 2019.

Bug 17151 - How should UTF-16BE "\xD8\x00" be decoded? This is an ill-formed UTF-16 code unit sequence, but it can be converted to a Unicode code point. Firefox/Opera currently convert it to U+FFFD, which seems like the preferred behaviour.
Summary: How should UTF-16BE "\xD8\x00" be decoded? This is an ill-formed UTF-16 code unit sequence, but it can be converted to a Unicode code point. Firefox/Opera currently convert it to U+FFFD, which seems like the preferred behaviour.
Status: RESOLVED DUPLICATE of bug 16768
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other
OS: Other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-22 17:15 UTC by contributor
Modified: 2012-07-18 18:40 UTC
CC: 3 users

See Also:


Description contributor 2012-05-22 17:15:25 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/
Multipage: http://www.whatwg.org/C#the-input-byte-stream
Complete: http://www.whatwg.org/c#the-input-byte-stream

Comment:
How should UTF-16BE "\xD8\x00" be decoded? This is an ill-formed UTF-16 code
unit sequence, but it can be converted to a Unicode code point. Firefox/Opera
currently convert it to U+FFFD, which seems like the preferred behaviour.
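
For reference, Python's standard UTF-16BE codec illustrates the same replace-and-continue behaviour (used here only as an illustration of that strategy, not as the algorithm the spec mandates):

>>> b"\xD8\x00".decode("utf-16-be", errors="replace")
'�'
>>> b"\xD8\x00\x00\x41".decode("utf-16-be", errors="replace")  # lone lead surrogate, then "A"
'�A'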

Posted from: 81.110.227.73 by geoffers@gmail.com
User agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1130.1 Safari/536.11
Comment 1 Geoffrey Sneddon 2012-05-22 17:23:15 UTC
This pertains to the following:

> Bytes or sequences of bytes in the original byte stream that could not be converted to Unicode code points must be converted to U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is UTF-8, the bytes must be decoded with the error handling defined in this specification.

> Note: Bytes or sequences of bytes in the original byte stream that did not conform to the encoding specification (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report.

"\xD8\x00" can obviously be decoded as if it were UTF-16BE to HTML5's definition of a "Unicode code point" (which include lone surrogates), but according to the Unicode specification it is an invalid UTF-16 code unit sequence.

It would seem preferable that lone surrogates get converted to U+FFFD as they currently are in Opera/Firefox.
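
To make the distinction concrete, here is a small illustrative Python session (not normative): a lone surrogate satisfies the "code point" definition, yet it cannot appear in well-formed UTF-16.

>>> ch = chr(0xD800)        # U+D800 is a valid code point on its own...
>>> ch.encode("utf-16-be")  # ...but not representable as well-formed UTF-16
Traceback (most recent call last):
  ...
UnicodeEncodeError: ... surrogates not allowed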
Comment 2 Simon Pieters 2012-05-23 06:40:30 UTC
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#utf-16-decoder already handles lone surrogates, AFAICT. HTML just needs to reference that spec.
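
A minimal Python sketch of that decoder's lone-surrogate handling (big-endian only, with the spec's BOM and streaming logic omitted; the function name and structure here are illustrative, not the spec's):

def decode_utf16be(data: bytes) -> str:
    out = []
    lead = None  # pending lead surrogate, if any
    for i in range(0, len(data) - 1, 2):
        unit = (data[i] << 8) | data[i + 1]
        if lead is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # well-formed pair: combine into a supplementary code point
                out.append(chr(0x10000 + ((lead - 0xD800) << 10) + (unit - 0xDC00)))
                lead = None
                continue
            out.append("\ufffd")  # lone lead surrogate: replace, then reprocess this unit
            lead = None
        if 0xD800 <= unit <= 0xDBFF:
            lead = unit           # hold the lead and wait for a trail surrogate
        elif 0xDC00 <= unit <= 0xDFFF:
            out.append("\ufffd")  # lone trail surrogate
        else:
            out.append(chr(unit))
    if lead is not None or len(data) % 2:
        out.append("\ufffd")      # dangling lead surrogate or odd trailing byte
    return "".join(out)

With this, decode_utf16be(b"\xD8\x00") returns "\ufffd", matching the Firefox/Opera behaviour described in comment 0.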
Comment 3 Anne 2012-05-23 07:52:49 UTC

*** This bug has been marked as a duplicate of bug 16768 ***