Apparently the intent is to allow only scalar values and error on lone surrogates: http://krijnhoetmer.nl/irc-logs/whatwg/20131214#l-500 http://krijnhoetmer.nl/irc-logs/whatwg/20131215#l-221
http://lists.w3.org/Archives/Public/public-whatwg-archive/2013Sep/0020.html
I tested this:

    <meta charset=windows-1252>
    <form action=http://software.hixie.ch/utilities/cgi/test-tools/echo>
      <input name=a>
      <script> document.querySelector("input").value = "\ud801" </script>
      <input type=submit>
    </form>

Gecko does U+FFFD, Chrome gives back U+D801 (encoded per the <form> error mode, as windows-1252 can express neither). If I set the encoding to utf-8 instead, both Gecko and Chrome emit U+FFFD (as utf-8 bytes, percent-encoded). utf-16 gives the same result as utf-8, as expected. So either each encoder's handler needs to catch the surrogate range and return an error with U+FFFD (Gecko) or not (Chrome). Gecko's behavior is slightly saner, I suspect. I'll fix utf-8 and utf-16 to do this right away. Not sure who to consult on how we should change the rest.
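To make the divergence concrete, here is a minimal TypeScript sketch (all helper names are hypothetical, and the windows-1252 check is simplified to "ASCII only"; this is not the spec algorithm):

    // Hypothetical, simplified: treat only ASCII as encodable in windows-1252.
    function canEncodeWindows1252(cp: number): boolean {
      return cp <= 0x7f;
    }

    // <form> error mode: unencodable code points become decimal
    // character references.
    function formErrorMode(cp: number): string {
      return `&#${cp};`;
    }

    // geckoStyle=true replaces a lone surrogate with U+FFFD before encoding.
    function encodeForForm(cp: number, geckoStyle: boolean): string {
      if (geckoStyle && cp >= 0xd800 && cp <= 0xdfff) {
        cp = 0xfffd;
      }
      return canEncodeWindows1252(cp)
        ? String.fromCharCode(cp)
        : formErrorMode(cp);
    }

    encodeForForm(0xd801, true);  // "&#65533;" (Gecko-like: U+FFFD, itself unencodable)
    encodeForForm(0xd801, false); // "&#55297;" (Chrome-like: the raw lone surrogate)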
I analyzed too quickly. In Gecko and Chrome, lone surrogates either never reach the utf-8 encoder (they are replaced by U+FFFD beforehand) or are replaced as part of the encoder. They do not result in an error, as that would cause something of the form &#...; to be emitted rather than a straight U+FFFD. Boris, Henri, Simon, do you have any preferences on how we arrange the encoder setup? Should all encoders replace lone surrogates in the input stream with U+FFFD, or should we make encoders only take Unicode scalar values and let a layer before them handle the lone surrogates? It seems more pragmatic to have encoders take code points. Maybe I should introduce a special lone-surrogate error that does the replacing to U+FFFD?
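The two arrangements being asked about could be sketched like this (TypeScript, hypothetical names; a sketch of the design space, not proposed spec text):

    const SURROGATE_MIN = 0xd800;
    const SURROGATE_MAX = 0xdfff;

    // Arrangement 1: a layer in front of every encoder maps code points to
    // Unicode scalar values, so encoders never see lone surrogates.
    function scalarValueFilter(codePoints: number[]): number[] {
      return codePoints.map((cp) =>
        cp >= SURROGATE_MIN && cp <= SURROGATE_MAX ? 0xfffd : cp
      );
    }

    // Arrangement 2: encoders take raw code points, and a special
    // lone-surrogate error inside each encoder does the replacing.
    function encodeCodePoint(
      cp: number,
      encodeScalar: (sv: number) => Uint8Array
    ): Uint8Array {
      if (cp >= SURROGATE_MIN && cp <= SURROGATE_MAX) {
        cp = 0xfffd; // lone-surrogate error: replace rather than escape
      }
      return encodeScalar(cp);
    }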
No opinion
Are we talking about encoders generally or the specific case of form submission?
Generally. But it affects form submission and URLs, of course. It seems Unicode defines the contract as a mapping of Unicode scalar values (code points minus surrogates) to bytes and vice versa. That seems reasonable to me, but it does mean that everyone using encoders/decoders has to convert their code point sequence to a Unicode scalar value sequence first.
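As a concrete example of that conversion step, a caller holding a potentially ill-formed UTF-16 string (e.g. a JS string) might do something like this hypothetical sketch before invoking an encoder:

    // Iterating a string with for..of yields code points; a lone surrogate
    // comes through as a single code point in the surrogate range.
    function* toScalarValues(s: string): Generator<number> {
      for (const ch of s) {
        const cp = ch.codePointAt(0)!;
        yield cp >= 0xd800 && cp <= 0xdfff ? 0xfffd : cp;
      }
    }

    console.log([...toScalarValues("a\ud801b")]); // [ 97, 65533, 98 ]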
So this is just about which exact layer does the lone surrogate replacement with U+FFFD; viewed as a black box, the resulting behavior is the same?
Well, as shown in comment 2, the behavior currently differs for encodings other than utf-8 and utf-16le/be: Chrome emits lone surrogates escaped (meaning its encoders take code points), whereas Firefox emits lone surrogates as U+FFFD, escaped. Other than that it is mostly a layering and debugging question, I suppose, yes, but it also affects whether e.g. IDL needs [EnsureUTF16] or some such.
https://github.com/whatwg/encoding/commit/4abe74d1400c5ab8913c5f229b59b237ae5aac51