This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 12100 - UAs do not actually convert DOMStrings to sequences of Unicode characters. Test case: data:text/html,<!doctype html><script>document.documentElement.title = "\ud800"; alert(document.documentElement.title.charCodeAt(0));</script> Expected 65533, got 55296
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-16 19:51 UTC by contributor
Modified: 2011-08-04 05:12 UTC

Description contributor 2011-02-16 19:51:43 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html
Section: http://www.whatwg.org/specs/web-apps/current-work/#float-nan

Comment:
UAs do not actually convert DOMStrings to sequences of Unicode characters.
Test case: data:text/html,<!doctype html><script>document.documentElement.title = "\ud800"; alert(document.documentElement.title.charCodeAt(0));</script>
Expected 65533, got 55296, in all tested browsers. Did you mean to not match browsers here?

Posted from: 68.175.61.233
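
The two outcomes named in the report can be compared outside a browser. The following is a minimal sketch, not the spec's algorithm: `toScalarValues` is a hypothetical helper that replaces every unpaired surrogate with U+FFFD, which is the behaviour the report says the spec text implied. A plain JavaScript string, like the DOM in the tested browsers, keeps the lone code unit unchanged.

```javascript
// Hypothetical helper (not from the spec or any browser): replace each
// unpaired surrogate with U+FFFD and leave well-formed pairs untouched.
function toScalarValues(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s[i] + s[i + 1]; // valid pair: keep both code units
        i++;
        continue;
      }
    }
    // Any surrogate reaching this point is unpaired.
    out += isHigh || isLow ? "\ufffd" : s[i];
  }
  return out;
}

const lone = "\ud800";
console.log(lone.charCodeAt(0));                 // 55296: code unit preserved
console.log(toScalarValues(lone).charCodeAt(0)); // 65533: replaced with U+FFFD
```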
Comment 1 Henri Sivonen 2011-02-17 06:57:18 UTC
Can we please have a reality-reflecting spec on this point?
Comment 2 Aryeh Gregor 2011-02-18 19:03:26 UTC
IIRC, Hixie added this line along with my atob/btoa stuff, since my spec for that started by converting the input to a sequence of Unicode characters.  He extended it to everything, not just atob/btoa.  In the case of atob/btoa I was just doing it because it let me pretend that I was dealing with characters instead of code units; anything over U+FF would cause an exception to be thrown anyway, so it made no difference except in terminology.  But that's not necessarily safe at all for arbitrary DOMStrings.

For atob/btoa, it would probably be best to just rephrase it in terms of code units.  It will be more confusing to average authors, but oh well.
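
As an illustration of phrasing the check in code units, here is a rough sketch. `allCodeUnitsFitInByte` is a hypothetical name and this is not the spec's wording. It treats a string as acceptable input only when every UTF-16 code unit is at most 0xFF, so surrogates are rejected by the same comparison, with no conversion to Unicode characters.

```javascript
// Hypothetical check, not the spec's text: true when every UTF-16 code
// unit of the string lies in the range 0x00..0xFF.
function allCodeUnitsFitInByte(s) {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

console.log(allCodeUnitsFitInByte("h\u00e9llo")); // true
console.log(allCodeUnitsFitInByte("\u0100"));     // false
console.log(allCodeUnitsFitInByte("\ud800"));     // false: a surrogate is above 0xFF
```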
Comment 3 Ian 'Hixie' Hickson 2011-05-04 22:55:29 UTC
Agreed that DOM-tree-manipulating APIs should work in 16bit codepoints. The intent of this change was to make sure that APIs like window.alert() worked with Unicode, and that all the various algorithms in the spec worked with Unicode, etc. We don't want algorithms that talk about doing things character by character breaking every time an astral plane character gets involved because they get split in two.

Does anyone have any suggestions for how to do this without having to go down every single method and attribute or every single algorithm saying which ones are operating in Unicode space and which are operating in UTF-16 word space?
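
The splitting concern can be seen in a short sketch. The character U+1D11E is only an example chosen here, not one from the thread. Walking the string by index visits two code units, while the string iterator used by `Array.from` visits one code point.

```javascript
const clef = "\u{1D11E}"; // MUSICAL SYMBOL G CLEF, an astral plane character

// Index-based walk: sees the two halves of the surrogate pair separately.
const byUnit = [];
for (let i = 0; i < clef.length; i++) {
  byUnit.push(clef.charCodeAt(i).toString(16));
}

// Iterator-based walk: sees one code point.
const byPoint = Array.from(clef, (ch) => ch.codePointAt(0).toString(16));

console.log(clef.length); // 2
console.log(byUnit);      // [ 'd834', 'dd1e' ]
console.log(byPoint);     // [ '1d11e' ]
```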
Comment 4 Cameron McCormack 2011-05-04 23:19:55 UTC
At the top of the description of the SVG interface that has the methods that allow indexing into strings (for rendered text length calculations etc.), we have this text:

  For the methods on this interface that refer to an index to a character
  or number of characters, these references are to be interpreted as
  an index to a UTF-16 code unit or a number of UTF-16 code units,
  respectively. This is for consistency with DOM Level 2 Core, where
  methods on the CharacterData interface use UTF-16 code units as indexes
  and counts within the character data. Thus for example, if the text
  content of a text element is a single non-BMP character, such as
  U+10000, then invoking getNumberOfChars on that element will return
  2 since there are two UTF-16 code units (the surrogate pair) used to
  represent that one character.

Something like that might be OK in the HTML spec too, although with the methods spread throughout the spec more, it might be less obvious.
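
The count of 2 for U+10000 follows from the UTF-16 encoding arithmetic, sketched below. `toSurrogatePair` is a hypothetical helper written for this illustration and assumes its argument is in the range U+10000 to U+10FFFF.

```javascript
// Hypothetical helper: compute the UTF-16 surrogate pair for a code point
// in the range U+10000..U+10FFFF (no range checking is done here).
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits of the offset
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [high, low];
}

const s = String.fromCodePoint(0x10000);
console.log(s.length);                 // 2
console.log(toSurrogatePair(0x10000)); // [ 55296, 56320 ], i.e. 0xD800 0xDC00
```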
Comment 5 Ian 'Hixie' Hickson 2011-05-05 06:27:07 UTC
That doesn't solve the problem. The problem is what to do with broken surrogates going in to the API.
Comment 6 Henri Sivonen 2011-05-05 06:30:36 UTC
(In reply to comment #3)
> Does anyone have any suggestions for how to do this without having to go down
> every single method and attribute or every single algorithm saying which ones
> are operating in Unicode space and which are operating in UTF-16 word space?

Are there, in implementation reality, any APIs that don't operate with UTF-16 code units? At least in Firefox, giving an unpaired surrogate to window.alert() doesn't throw. It shows an alert with a box that shows the hex for the surrogate.
Comment 7 Ian 'Hixie' Hickson 2011-05-05 06:34:08 UTC
(In reply to comment #6)
> 
> Are there, in implementation reality, any APIs that don't operate with UTF-16
> code units? At least in Firefox, giving an unpaired surrogate to window.alert()
> doesn't throw. It shows an alert with a box that shows the hex for the
> surrogate.

That's the kind of bug I think we should be trying to fix here. I'll be the first to say it shouldn't be a high priority. But it's the kind of thing that can be fixed as people go through codebases, little fix here, little fix there, until eventually it's all Unicode clean (except the DOM, which is a lost cause, sadly).
Comment 8 Cameron McCormack 2011-05-05 06:52:12 UTC
(In reply to comment #5)
> That doesn't solve the problem. The problem is what to do with broken
> surrogates going in to the API.

I misread.
Comment 9 Henri Sivonen 2011-05-05 06:58:08 UTC
(In reply to comment #7)
> That's the kind of bug I think we should be trying to fix here. I'll be the
> first to say it shouldn't be a high priority.

What's the payoff from "fixing" it?

> But it's the kind of thing that
> can be fixed as people go through codebases, little fix here, little fix there,
> until eventually it's all Unicode clean (except the DOM, which is a lost cause,
> sadly).

In Firefox, e.g. alert text is laid out and painted using the same machinery that is used for laying out and painting general DOM content, so that machinery has to deal with unpaired surrogates anyway without catching fire. How would the Web be better if window.alert() had extra code for doing something else with unpaired surrogates?
Comment 10 Ian 'Hixie' Hickson 2011-06-03 19:37:09 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments.

I went through the spec and tried to rationalise all this to make more sense and more closely match reality. I think there might be some stuff that is borderline undefined, e.g. it's not clear to me what alert() should do if passed a DOMString with an unpaired surrogate. In fact in general I don't really understand what the type of DOMString is now... it's kind of Unicode, it's kind of UTF-16... If I figure out a better way to define this stuff I'll update it again.

Aryeh: BTW, I looked at the atob/btoa stuff and it didn't seem to need this anymore. Please let me know if I broke it.
Comment 11 contributor 2011-06-03 19:40:30 UTC
Checked in as WHATWG revision r6184.
Check-in comment: Try to clean up the stuff about Unicode characters.
http://html5.org/tools/web-apps-tracker?from=6183&to=6184
Comment 12 Michael[tm] Smith 2011-08-04 05:12:46 UTC
mass-move component to LC1