This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 11298 - Surrogate catching doesn't belong in input stream preprocessing
Summary: Surrogate catching doesn't belong in input stream preprocessing
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: PC Linux
Importance: P1 critical
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
Depends on:
Reported: 2010-11-11 11:55 UTC by Henri Sivonen
Modified: 2011-08-04 05:03 UTC
6 users

See Also:


Description Henri Sivonen 2010-11-11 11:55:57 UTC
The spec says:
"Code points in the range U+D800 to U+DFFF in the input must be replaced by U+FFFD REPLACEMENT CHARACTERs."

This doesn't really belong in the parser: document.write()-inserted UTF-16 text should not be subject to lone-surrogate replacement, because that would add complexity without a backwards-compatibility need.

Instead, the spec should have a note saying that character decoders for UTF-8, UTF-16 and similar encodings (GB18030 maybe?) are required to emit U+FFFD for bogus byte sequences, and that sequences decoding to surrogates in UTF-8, or lone surrogates in UTF-16, are bogus.
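[A sketch of the decoder behavior described above, using Python as an illustration; Python's UTF-8 codec is not what the spec references, but it already follows this rule: byte sequences that would decode to surrogate code points are treated as bogus and replaced with U+FFFD.]

```python
# 0xED 0xA0 0x80 is the (invalid) UTF-8-style encoding of the lone
# surrogate U+D800. A conforming UTF-8 decoder must not emit the
# surrogate; it substitutes U+FFFD REPLACEMENT CHARACTER instead.
bogus = b'\xed\xa0\x80'
decoded = bogus.decode('utf-8', errors='replace')

# Every resulting code point is U+FFFD; no surrogate reaches the consumer.
assert all(ch == '\ufffd' for ch in decoded)
```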
Comment 1 Ian 'Hixie' Hickson 2010-12-29 08:50:31 UTC
The way it's specced is intentional, so as to make the numeric character reference &#xD800; and a raw UTF-8-encoded 0xD800 be treated the same.

As I understand it, if you use document.write(), you're using UTF-16, and thus you can't pass in a lone surrogate that is treated as a Unicode code point; it would have to be UTF-16-decoded first, and there's no way for UTF-16 to represent lone surrogates.

I guess we could change this, though, so that instead of being handled in the HTML parser, it's handled in the "decode a byte string as UTF-8, with error handling" algorithm. Not sure what we'd say for UTF-16 or where we'd say it, exactly.
Comment 2 Henri Sivonen 2011-01-04 09:09:46 UTC
Considering established practice, the spec makes a conceptual error when it pretends that the parser operates on Unicode characters. In the real world, the parser (in applications that support document.write) operates on UTF-16 code units and document.write writes UTF-16 code units. If document.write writes unpaired surrogates, they pass through the parser unchanged and unpaired surrogates end up in the DOM. It's not worthwhile to prevent this as long as scripted DOM manipulation can put unpaired surrogates in the DOM.

The conceptually realistic setup is thus:
 1) The parser operates on UTF-16 code units.
 2) The parser is responsible for munging U+0000 and carriage return.
 3) The parser is *not* responsible for touching unpaired surrogates.
 4) document.write writes UTF-16 code units (with potentially unpaired surrogates).
 5) When the input is a byte stream, the process that converts input bytes into UTF-16 code units is responsible for replacing bogus byte sequences with U+FFFD. When the input byte stream is encoded in a flavor of UTF-16, unpaired surrogates constitute bogus byte sequences.
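[Illustrative sketch of points 3-5 above, using Python strings as an analogy for UTF-16 code-unit sequences: nothing in pure string manipulation stops a lone surrogate from passing through, but a byte-stream decoder replaces it with U+FFFD.]

```python
# A lone surrogate passes through string operations untouched (point 3/4):
lone = '\ud800'            # an unpaired surrogate code point
s = 'a' + lone + 'b'
assert len(s) == 3 and s[1] == '\ud800'

# But when bytes claiming to be UTF-16 contain an unpaired surrogate,
# the decoder treats it as a bogus sequence and emits U+FFFD (point 5):
data = b'\x00\xd8'         # UTF-16-LE bytes for the lone surrogate U+D800
assert data.decode('utf-16-le', errors='replace') == '\ufffd'
```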
Comment 3 James Graham 2011-01-04 09:54:55 UTC
I believe the same considerations apply to innerHTML, i.e. it is unclear whether text set via innerHTML should pass through the input stream preprocessing stage.
Comment 4 Ian 'Hixie' Hickson 2011-02-09 00:29:09 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:

Status: Partially Accepted
Change Description: see diff given below
Rationale: I've changed the parser to not do this stuff with surrogates (except for numeric char refs); however, this is not because I think we should allow surrogates here, but because the DOM APIs shouldn't be letting surrogates into the platform, regardless of whether it's a parser API or some other API.

That JS uses UTF-16 is a design mistake, but not one that we need to propagate to the entire platform, nor one that we need to enforce on other languages should they ever be added to the platform. As such, the DOM should be Unicode-clean, not UTF-16.
Comment 5 contributor 2011-02-09 00:29:27 UTC
Checked in as WHATWG revision r5862.
Check-in comment: Remove the requirement that the parser deal with raw surrogates, since they can't make it this far.
Comment 6 Henri Sivonen 2011-02-14 14:57:13 UTC
The spec change looks OK. Thanks. I disagree with your rationale, though.
Comment 7 Ian 'Hixie' Hickson 2011-03-04 02:54:18 UTC
This change was made with the assumption that the UTF-8 decoder's error handling dealt with surrogates, but it seems the surrogates don't get handled at the moment. I'm going to fix that along with fixing a few other errors in the UTF-8 error handling description. The patch will be below. I would appreciate it if you could check that I didn't screw anything up; I have a sneaking suspicion that this is a mistake but I can't work out why (and spec archeology isn't helping me).
Comment 8 contributor 2011-03-04 02:57:03 UTC
Checked in as WHATWG revision r5942.
Check-in comment: Fix the UTF-8 decoder error handling to handle a few errors I'd missed, including in particular surrogate halves. This may be a mistake; if I'm forgetting something please let me know so I can fix it. (e.g. did we decide not to catch surrogates or something?)
Comment 9 Michael[tm] Smith 2011-08-04 05:03:26 UTC
mass-moved component to LC1