Bug 24104 - Clarify how encoders should deal with lone surrogates
Product: WHATWG
Classification: Unclassified
Component: Encoding
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assigned To: Anne
Reported: 2013-12-15 20:27 UTC by Mathias Bynens
Modified: 2014-04-11 11:58 UTC (History)

Description Mathias Bynens 2013-12-15 20:27:51 UTC
Apparently the intent is to allow only scalar values and error on lone surrogates:

Comment 2 Anne 2014-03-28 11:52:34 UTC
I tested this:

<meta charset=windows-1252>
<form action=http://software.hixie.ch/utilities/cgi/test-tools/echo>
<input name=a> <script> document.querySelector("input").value = "\ud801" </script>
<input type=submit>

Gecko emits U+FFFD, while Chrome gives back U+D801 (both escaped as per the <form> error mode, since windows-1252 can express neither).

Now if I set the encoding to utf-8, both Gecko and Chrome emit U+FFFD (as utf-8 bytes, percent-encoded).

utf-16 results in the same as utf-8 as expected.

So either each encoder's handler needs to catch the surrogate range and return error with U+FFFD (Gecko) or not (Chrome). Gecko's behavior is slightly saner, I suspect. I'll fix utf-8 and utf-16 to do this right away. I'm not sure whom to consult about how we should change the rest.
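To make the two observed behaviors concrete, here is a minimal sketch in JavaScript. It is not either browser's implementation: `encodeLegacy`, `htmlFormEscape`, and the `replaceLoneSurrogates` option are hypothetical names, and the toy encoder treats only ASCII as encodable as a stand-in for a legacy encoding such as windows-1252. It shows only the `&#N;` escape step, not the percent-encoding applied afterwards.

```javascript
// Hypothetical helper: the "&#N;" escape emitted in <form> error mode
// when the encoder cannot represent a code point.
function htmlFormEscape(codePoint) {
  return `&#${codePoint};`;
}

function isSurrogate(codePoint) {
  return codePoint >= 0xd800 && codePoint <= 0xdfff;
}

// Toy stand-in for a legacy encoder (assumption: only ASCII is encodable).
// replaceLoneSurrogates = true  models the Gecko result from the test,
// replaceLoneSurrogates = false models the Chrome result.
function encodeLegacy(codePoints, { replaceLoneSurrogates }) {
  let out = "";
  for (const cp of codePoints) {
    if (cp < 0x80) {
      out += String.fromCharCode(cp);
    } else if (replaceLoneSurrogates && isSurrogate(cp)) {
      out += htmlFormEscape(0xfffd); // error reported with U+FFFD
    } else {
      out += htmlFormEscape(cp); // error reported with the code point itself
    }
  }
  return out;
}

// String iteration yields a lone surrogate as its own code point.
const input = Array.from("a\ud801", (c) => c.codePointAt(0));
console.log(encodeLegacy(input, { replaceLoneSurrogates: true }));  // a&#65533;
console.log(encodeLegacy(input, { replaceLoneSurrogates: false })); // a&#55297;
```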
Comment 3 Anne 2014-03-28 12:01:05 UTC
I analyzed too quickly. In Gecko and Chrome, lone surrogates either never reach the utf-8 encoder (being replaced by U+FFFD beforehand) or are replaced as part of the encoder. They do not result in an error, as that would cause something of the form &#...; to be emitted rather than a straight U+FFFD.

Boris, Henri, Simon, do you have any preferences on how we arrange the encoder setup? Should all encoders replace lone surrogates in the input stream with U+FFFD, or should we make encoders take only Unicode scalar values and let a layer before them handle the lone surrogates?

It seems more pragmatic to have encoders take code points. Maybe I should introduce a special lone surrogate error that does the replacing to U+FFFD?
Comment 4 Simon Pieters 2014-03-28 13:18:57 UTC
No opinion
Comment 5 Boris Zbarsky 2014-03-28 15:48:24 UTC
Are we talking about encoders generally or the specific case of form submission?
Comment 6 Anne 2014-03-28 15:58:56 UTC
Generally. But it affects form submission and URLs of course.

It seems Unicode has the contract as a mapping of Unicode scalar values (code points minus surrogates) to bytes and vice versa. That seems reasonable to me but does mean that everyone using encoders/decoders has to convert their code point sequence to a Unicode scalar value sequence first.
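The "convert first" option could look roughly like the following JavaScript sketch. `toScalarValueString` is a hypothetical name for the layer in front of the encoder; it replaces each lone surrogate with U+FFFD and leaves valid surrogate pairs alone. The last lines use the platform `TextEncoder`, whose input is converted to scalar values before encoding, to show the same result arriving as utf-8 bytes.

```javascript
// Hypothetical pre-encoder layer: turn a code point sequence (a JavaScript
// string that may contain lone surrogates) into a scalar value sequence.
function toScalarValueString(input) {
  let out = "";
  // for...of iterates by code point: a valid surrogate pair arrives as one
  // two-unit string, a lone surrogate arrives on its own.
  for (const ch of input) {
    const cp = ch.codePointAt(0);
    out += cp >= 0xd800 && cp <= 0xdfff ? "\ufffd" : ch;
  }
  return out;
}

console.log(toScalarValueString("a\ud801b") === "a\ufffdb"); // true
console.log(toScalarValueString("\ud83d\ude00") === "\ud83d\ude00"); // true, pair kept

// With that layer in place, the utf-8 encoder only ever sees U+FFFD.
const bytes = new TextEncoder().encode("\ud801");
console.log(Array.from(bytes, (b) => b.toString(16))); // [ 'ef', 'bf', 'bd' ]
```

With this arrangement the encoders themselves never need a surrogate check, at the cost of every caller having to run the conversion.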
Comment 7 Boris Zbarsky 2014-03-28 16:33:47 UTC
So this is just about which exact layer does the lone surrogate replacement with U+FFFD; as a black box, the resulting behavior is the same?
Comment 8 Anne 2014-03-28 18:05:34 UTC
Well, as shown in comment 2, the behavior currently differs for encodings other than utf-8 and utf-16le/be. Chrome will emit lone surrogates escaped (meaning its encoders take code points), whereas Firefox emits lone surrogates as U+FFFD escaped.

Other than that, it is mostly a layering and debugging question, I suppose, yes, but it also affects whether e.g. IDL needs [EnsureUTF16] or some such.