23927 2013-11-26 15:45:02 +0000 ASCII-incompatible encoder error handling 2014-03-26 18:36:13 +0000 1 1 1 Unclassified WHATWG Encoding unspecified PC Linux RESOLVED FIXED P2 normal Unsorted 16688 1 simon.sapin annevk addison mike www-international sideshowbarker+encodingspec oldest_to_newest

96834 0 simon.sapin 2013-11-26 15:45:02 +0000

http://encoding.spec.whatwg.org/#encodings

[[
Otherwise, if encoder's error handling mode is URL, emit byte 0x3F.
Otherwise, emit the result of running utf-8 encode on U+0026, U+0023, followed by the shortest sequence of ASCII digits representing c in base ten, followed by U+003B.
]]

Is it intentional to emit bytes for the ASCII representation of `?` or `&#nnn;`, even if the encoding being used is not ASCII-compatible?

rust-encoding’s current implementation instead uses the current encoder to encode `?` or `&#nnn;` to bytes, and aborts if that fails (which I’m not convinced can ever happen, even in the weird non-web encodings that this implementation supports).

If this is intentional, I’ll file a bug on rust-encoding.

96836 1 simon.sapin 2013-11-26 15:46:52 +0000

*** Bug 23926 has been marked as a duplicate of this bug. ***

96837 2 annevk 2013-11-26 15:57:15 +0000

Example? Is an encoding not switching modes correctly?

96839 3 simon.sapin 2013-11-26 16:14:04 +0000

Although this "should not happen", the UTF-16 encoder is specified to emit an error for surrogate code points in the input:

http://encoding.spec.whatwg.org/#utf-16-encoder

96840 4 addison 2013-11-26 16:33:39 +0000

If the mode is URL, emitting 0x3F might make some sense. Normally, though, a utf-16-encoder would emit U+FFFD when it errors in this way. I think I would prefer if the resulting UTF-16 actually had U+FFFD instead of 0x3F (and actually, if this is a UTF-16 *encoder*, emitting the single byte 0x3F would result in the string not being valid UTF-16).
Emitting an HTML entity makes sense when encoding HTML text (the resulting isolated surrogate code point still shows in the output, but the text is now valid UTF-16).

96843 5 annevk 2013-11-26 17:47:35 +0000

I don't think you can even get to the utf-16 encoder from the web platform stack. You'll end up using utf-8 instead. And it's not entirely clear to me if the utf-16 encoder should deal with non-Unicode-scalar-value input.

96845 6 addison 2013-11-26 18:06:11 +0000

You're probably right about not being able to get to the UTF-16 encoder directly. I'm trying to think of cases and the only one that occurs to me offhand would be reading data into a JS string? Or maybe writing an XML document (**NOT** XHTML, please note).

A UTF-16 encoder should deal with non-Unicode-scalar-value input: that is one of its edge conditions. Bad data exists everywhere and the failure conditions should be well-described. It's easy enough to chop a UTF-16 buffer between two surrogate code points (if your code is surrogate stupid). Similarly, someone might use it as a form of attack ("?" has a meaning in syntaxes such as URL, but U+D800 might look like a tofu box and not arouse suspicion).

In any case, don't you agree that the "error" instructions are for ASCII-compatible encodings and, as written, aren't quite right for a UTF-16 encoder? If you changed the word "byte" to "code unit", that might fix it (at the cost of confusion for all other encodings).

96910 7 annevk 2013-11-28 12:38:25 +0000

Well, the question is whether the encoder needs to deal with lone surrogates or whether lone surrogates need to be handled before the encoder is invoked. I guess I could see the former making sense, but that would mean we need some special rules for utf-8 and utf-16, as they should always emit the byte sequence for U+FFFD for lone surrogates and never anything else.
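The concern raised in comments 0 through 6 can be illustrated with a small Python sketch. It is not the spec algorithm or any library's behaviour: `encode_utf16le_raw_fallback` is a hypothetical helper that mimics the quoted spec text by writing the ASCII bytes of `?` or `&#nnn;` straight into a UTF-16LE byte stream whenever it meets a lone surrogate.

```python
def encode_utf16le_raw_fallback(text: str, mode: str = "html") -> bytes:
    """Hypothetical encoder that emits raw ASCII bytes on error.

    Mirrors the quoted spec text: byte 0x3F in URL mode, otherwise the
    utf-8 (here: ASCII) bytes of "&#nnn;". Nothing is re-encoded as UTF-16.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Lone surrogate: the UTF-16 encoder signals an error here.
            if mode == "url":
                out += b"\x3f"
            else:
                out += f"&#{cp};".encode("utf-8")
        else:
            out += ch.encode("utf-16-le")
    return bytes(out)


if __name__ == "__main__":
    sample = "a\ud800b"  # 'a', a lone high surrogate, 'b'
    print(encode_utf16le_raw_fallback(sample, mode="html"))
    print(encode_utf16le_raw_fallback(sample, mode="url"))
```

In HTML mode the eight ASCII bytes of `&#55296;` are read back by a UTF-16LE decoder as four unrelated code units, so the character reference is lost. In URL mode the single byte 0x3F leaves the stream with an odd length, which is the "not valid UTF-16" case described in comment 4.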
97516 8 annevk 2013-12-12 16:08:09 +0000

I think the correct fix here is for the encoder error algorithm to push code points on the stream that is being converted. Currently state switching for iso-2022-jp and such does not happen correctly either. Fixing bug 16688 would make this easier, I think.

102938 9 annevk 2014-03-26 18:36:13 +0000

Fixed as part of bug 16688.

https://github.com/whatwg/encoding/commit/dc8e4c10c9b4a91f188f3145c2e31ddec4d52a78

This is a massive change, review appreciated!
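One way to read the suggestion in comment 8 is sketched below in Python. This is an illustration under stated assumptions, not the text of the linked commit: `utf16le_handler` and `encode` are hypothetical names, and the handler is a toy per-code-point encoder that returns `None` to signal an error. On error, the replacement code points are pushed back onto the front of the input stream so that they pass through the same encoder as everything else.

```python
from collections import deque
from typing import Callable, Optional


def utf16le_handler(cp: int) -> Optional[bytes]:
    """Toy per-code-point UTF-16LE encoder; None signals an error."""
    if 0xD800 <= cp <= 0xDFFF:
        return None  # lone surrogate
    return chr(cp).encode("utf-16-le")


def encode(text: str,
           handler: Callable[[int], Optional[bytes]],
           mode: str = "html") -> bytes:
    """Encode text; on error, prepend replacement code points to the stream."""
    stream = deque(ord(ch) for ch in text)
    out = bytearray()
    while stream:
        cp = stream.popleft()
        result = handler(cp)
        if result is not None:
            out += result
            continue
        replacement = "?" if mode == "url" else f"&#{cp};"
        # extendleft reverses its input, so feed it the reversed string
        # to keep the replacement in reading order.
        # Assumption: the handler can always encode ASCII; otherwise
        # this loop would never terminate.
        stream.extendleft(ord(ch) for ch in reversed(replacement))
    return bytes(out)


if __name__ == "__main__":
    data = encode("a\ud800b", utf16le_handler)
    print(data.decode("utf-16-le"))
```

Because the replacement goes through the encoder, the output stays well-formed in the target encoding, and a stateful encoder (the iso-2022-jp case mentioned above) would get the chance to switch state before emitting it.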