Bugzilla – Bug 24104
Clarify how encoders should deal with lone surrogates
Last modified: 2014-04-11 11:58:59 UTC
Apparently the intent is to allow only scalar values and error on lone surrogates:
I tested this:
<input name=a> <script> document.querySelector("input").value = "\ud801" </script>
Gecko emits U+FFFD, Chrome gives back U+D801 (each encoded as per <form> error mode, as windows-1252 can express neither).
Now if I set the encoding to utf-8, both Gecko and Chrome emit U+FFFD (as utf-8 bytes, percent-encoded).
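For reference, the utf-8 result can be reproduced outside a form. This is only a rough sketch: it assumes a runtime with a global TextEncoder (current browsers, recent Node.js), and the percent-encoding step is a simplified stand-in for what form submission does, not the actual form serializer.

```javascript
// Assumption: a global TextEncoder is available (current browsers, recent Node.js).
// It always encodes to utf-8 and turns a lone surrogate into U+FFFD.
const bytes = new TextEncoder().encode("\ud801");

// Simplified percent-encoding of every byte; real form submission leaves
// some ASCII bytes unescaped, which does not matter for this input.
const percentEncoded = Array.from(bytes, (b) =>
  "%" + b.toString(16).toUpperCase().padStart(2, "0")
).join("");

console.log(percentEncoded); // %EF%BF%BD, the utf-8 bytes of U+FFFD
```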
utf-16 results in the same as utf-8 as expected.
So either each encoder's handler needs to catch the surrogate range and return error with U+FFFD (Gecko) or not (Chrome). Gecko's behavior is slightly saner, I suspect. I'll fix utf-8 and utf-16 to do this right away. Not sure whom to consult on how we should change the rest.
I analyzed too quickly. In Gecko and Chrome, lone surrogates either never reach the utf-8 encoder (they are replaced by U+FFFD beforehand) or are replaced as part of the encoder. They do not result in an error, as that would cause something in the form of &#...; to be emitted rather than a straight U+FFFD.
Boris, Henri, Simon, do you have any preferences on how we arrange the encoder setup? Should all encoders replace lone surrogates in the input stream with U+FFFD, or should we make encoders only take Unicode scalar values and let a layer before them handle the lone surrogates?
It seems more pragmatic to have encoders take code points. Maybe I should introduce a special lone surrogate error that does the replacing to U+FFFD?
Are we talking about encoders generally or the specific case of form submission?
Generally. But it affects form submission and URLs of course.
It seems Unicode defines the contract as a mapping of Unicode scalar values (code points minus surrogates) to bytes and vice versa. That seems reasonable to me, but it does mean that everyone using encoders/decoders has to convert their code point sequence to a Unicode scalar value sequence first.
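To make the "convert first" option concrete, here is a minimal sketch of such a layer sitting in front of the encoders. The name toScalarValueString is hypothetical and not taken from any spec or engine; it only illustrates replacing each lone surrogate with U+FFFD while leaving valid surrogate pairs alone.

```javascript
// Hypothetical helper (name is an assumption): turn a JavaScript string,
// i.e. a sequence of UTF-16 code units, into one that contains only
// Unicode scalar values by replacing every lone surrogate with U+FFFD.
function toScalarValueString(input) {
  let output = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Lead surrogate: valid only if a trail surrogate follows.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        output += input[i] + input[i + 1];
        i++; // skip the trail surrogate we just consumed
      } else {
        output += "\ufffd";
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Trail surrogate without a preceding lead surrogate.
      output += "\ufffd";
    } else {
      output += input[i];
    }
  }
  return output;
}
```

With a layer like this, an encoder never sees U+D801 from the test case; it sees U+FFFD and handles it like any other scalar value.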
So this is just about which exact layer does the lone surrogate replacement with U+FFFD; black-box the resulting behavior is the same?
Well, as shown in comment 2, the behavior currently differs for encodings other than utf-8 and utf-16le/be. Chrome will emit lone surrogates escaped (meaning its encoders take code points), whereas Firefox emits lone surrogates as U+FFFD, escaped.
Other than that it is mostly a layering and debugging question, I suppose, yes, but it also affects whether e.g. IDL needs [EnsureUTF16] or some such.
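The difference can be sketched with a toy model. Everything here is an assumption for illustration: toyFormEncode is a made-up function, "only ASCII is encodable" stands in for a real legacy encoding such as windows-1252, and the &#...; output models the <form> error mode without the later percent-encoding step.

```javascript
// Toy model, not any engine's code. Only ASCII counts as encodable here;
// anything else goes through a form-style error mode: &#<decimal code point>;
function toyFormEncode(input, replaceLoneSurrogatesFirst) {
  let output = "";
  // String iteration goes by code point; a lone surrogate comes out on its own,
  // while a valid pair comes out as one code point above U+FFFF.
  for (const ch of input) {
    let codePoint = ch.codePointAt(0);
    const isLoneSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
    if (isLoneSurrogate && replaceLoneSurrogatesFirst) {
      codePoint = 0xfffd; // Firefox-like: the encoder only ever sees scalar values
    }
    output += codePoint < 0x80
      ? String.fromCodePoint(codePoint)
      : "&#" + codePoint + ";";
  }
  return output;
}

console.log(toyFormEncode("\ud801", false)); // &#55297; (encoder takes code points)
console.log(toyFormEncode("\ud801", true));  // &#65533; (lone surrogate replaced first)
```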