Bug 24104: Clarify how encoders should deal with lone surrogates

Created: 2013-12-15 20:27:51 +0000
Last changed: 2014-04-11 11:58:59 +0000
Classification: Unclassified
Product: WHATWG
Component: Encoding
Version: unspecified
Platform: PC
OS: All
Status: RESOLVED FIXED
Priority: P2
Severity: normal
Target milestone: Unsorted
Reporter: mathias
Assigned to: annevk
CC: bzbarsky, hsivonen, mathias, mike, simon.sapin, www-international, zcorpan
QA contact: sideshowbarker+encodingspec

Comment 0 (id 97632), mathias, 2013-12-15 20:27:51 +0000

Apparently the intent is to allow only scalar values and error on lone surrogates:
http://krijnhoetmer.nl/irc-logs/whatwg/20131214#l-500
http://krijnhoetmer.nl/irc-logs/whatwg/20131215#l-221

Comment 1 (id 97633), annevk, 2013-12-16 01:05:20 +0000

http://lists.w3.org/Archives/Public/public-whatwg-archive/2013Sep/0020.html

Comment 2 (id 103085), annevk, 2014-03-28 11:52:34 +0000

I tested this:

    <meta charset=windows-1252>
    <form action=http://software.hixie.ch/utilities/cgi/test-tools/echo>
    <input name=a>
    <script>
    document.querySelector("input").value = "\ud801"
    </script>
    <input type=submit>
    </form>

Gecko submits U+FFFD, Chrome gives back U+D801 (each encoded as per <form> error mode, as windows-1252 can express neither). Now if I set the encoding to utf-8, both Gecko and Chrome emit U+FFFD (as utf-8 bytes, percent-encoded). utf-16 results in the same as utf-8, as expected.

So either each encoder's handler needs to catch the surrogate range and return error with U+FFFD (Gecko) or not (Chrome). Gecko's behavior is slightly saner, I suspect.

I'll fix utf-8 and utf-16 to do this right away. Not sure whom to consult about how we should change the rest.

Comment 3 (id 103086), annevk, 2014-03-28 12:01:05 +0000

I analyzed too quickly. In Gecko and Chrome, lone surrogates either never reach the utf-8 encoder (they are replaced by U+FFFD beforehand) or are replaced as part of the encoder. They do not result in an error, as that would cause something in the form of &#...; to be emitted rather than a straight U+FFFD.

Boris, Henri, Simon, do you have any preferences how we arrange the encoder setup?
Should all encoders replace lone surrogates in the input stream with U+FFFD, or should we make encoders take only Unicode scalar values and let a layer before them handle the lone surrogates? It seems more pragmatic to have encoders take code points. Maybe I should introduce a special lone surrogate error that does the replacing with U+FFFD?

Comment 4 (id 103096), zcorpan, 2014-03-28 13:18:57 +0000

No opinion

Comment 5 (id 103101), bzbarsky, 2014-03-28 15:48:24 +0000

Are we talking about encoders generally or the specific case of form submission?

Comment 6 (id 103103), annevk, 2014-03-28 15:58:56 +0000

Generally. But it affects form submission and URLs, of course.

Unicode seems to define the contract as a mapping of Unicode scalar values (code points minus surrogates) to bytes and vice versa. That seems reasonable to me, but it does mean that everyone using encoders/decoders has to convert their code point sequence to a Unicode scalar value sequence first.

Comment 7 (id 103105), bzbarsky, 2014-03-28 16:33:47 +0000

So this is just about which exact layer does the lone surrogate replacement with U+FFFD; treated as a black box, the resulting behavior is the same?

Comment 8 (id 103106), annevk, 2014-03-28 18:05:34 +0000

Well, as shown in comment 2, the behavior currently differs for encodings other than utf-8 and utf-16le/be. Chrome will emit lone surrogates escaped (meaning its encoders take code points), whereas Firefox emits lone surrogates as U+FFFD, escaped.

Other than that it is mostly a layering and debugging question, I suppose, yes, but it also affects whether e.g. IDL needs [EnsureUTF16] or some such.

Comment 9 (id 103736), annevk, 2014-04-11 11:58:59 +0000

https://github.com/whatwg/encoding/commit/4abe74d1400c5ab8913c5f229b59b237ae5aac51
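
For readers following the layering question above, here is a minimal sketch of the two arrangements being compared. It is not taken from the Encoding Standard or from either browser. The names `toScalarValues` and `toyEncode` are hypothetical, and `toyEncode` is a stand-in that can express only ASCII and assumes the form error mode writes unencodable code points as decimal `&#N;` references.

```javascript
// Hypothetical helper: turn a JavaScript string (UTF-16 code units) into
// a list of Unicode scalar values, replacing each lone surrogate with
// U+FFFD. This models the "layer before the encoder" option.
function toScalarValues(input) {
  const out = [];
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Lead surrogate: only valid when a trail surrogate follows.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // consume the trail surrogate as well
      } else {
        out.push(0xFFFD);
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      out.push(0xFFFD); // trail surrogate without a preceding lead
    } else {
      out.push(unit);
    }
  }
  return out;
}

// Toy encoder (an assumption for illustration, not a real legacy
// encoding): ASCII passes through, everything else is written as a
// decimal numeric character reference, mimicking the form error mode.
function toyEncode(codePoints) {
  let out = "";
  for (const cp of codePoints) {
    out += cp < 0x80 ? String.fromCharCode(cp) : "&#" + cp + ";";
  }
  return out;
}

// Encoder fed raw code points (the Chrome-like result in comment 2):
const rawCodePoints = Array.from("\ud801", (c) => c.codePointAt(0));
const escapedSurrogate = toyEncode(rawCodePoints);

// Encoder fed scalar values only (the Gecko-like result in comment 2):
const escapedReplacement = toyEncode(toScalarValues("\ud801"));
```

With this sketch, `escapedSurrogate` is `&#55297;` (U+D801 escaped) while `escapedReplacement` is `&#65533;` (U+FFFD escaped), which mirrors the difference reported for windows-1252. For an encoding that can express U+FFFD directly, such as utf-8, both arrangements would produce the same bytes once the replacement happens somewhere before output.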